If you're building a job board, a sales intelligence tool, or any product that needs job posting data, you've probably already found that ATS platforms like Greenhouse, Lever, and Ashby expose public APIs. They're free, they return JSON, and they work in five lines of curl:
```bash
curl "https://api.greenhouse.io/v1/boards/stripe/jobs?content=true"
```

It feels like you could just wire this up and move on. Most teams start here. The question is what happens next — when you need data from not one ATS but dozens of sources, not one company but thousands, and not just raw job titles but normalized, enriched, deduplicated records your product can actually use.
This guide walks through the real engineering scope of building a job data pipeline, compares it to the cost of buying from an aggregator, and gives you a decision framework to choose the right path for your product.
What "building" actually looks like
Calling an ATS API is the easy part. The hard part is everything that comes after. Based on what it takes to run a production-grade job data pipeline, here are the six layers you'd need to build and maintain.
1. Crawling and ingestion
ATS APIs don't list their clients. To get job postings, you first need to discover which companies use which ATS — there's no directory for that. You'll resort to Google site: searches, scraping career pages, or manually collecting slugs and subdomains.
Once you have a list of companies, you need to poll each one, across every ATS you support. Here's the math:
- 10,000 companies across 9 ATS platforms = 90,000 API calls per polling cycle
- Each ATS has different rate limits: Greenhouse allows 50 requests per 10 seconds, Lever has no published limits, Ashby uses a different auth model entirely
- You need retries, exponential backoff, and error handling per ATS
- You need to detect whether a job disappeared because it was filled or because the API had a temporary outage
At scale, this becomes a full crawling infrastructure — schedulers, queues, proxies, and monitoring dashboards. The kind of infrastructure where you need at least one engineer maintaining it full-time.
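To make the retry and backoff requirement concrete, here is a minimal sketch in Python. The `fetch_with_backoff` helper and its parameters are hypothetical names for illustration, not part of any ATS SDK:

```python
import random
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Call fetch() and retry on failure with exponential backoff plus jitter.

    `fetch` is a stand-in for one ATS API request; a real poller would also
    enforce per-ATS rate limits (e.g. Greenhouse's 50 requests per 10 seconds).
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # retries exhausted: surface the error to the scheduler
            # Sleep base, 2x base, 4x base, ... with jitter so workers don't
            # hammer a recovering API in lockstep.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

Multiply a wrapper like this by 90,000 calls per cycle, plus per-ATS rate limiting and outage detection, and the scope of the crawler becomes clear.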
2. Cleaning and standardization
Every ATS returns data differently. The same information lives in different fields with different formats:
| Field | Greenhouse | Lever | Ashby |
|---|---|---|---|
| Location | location.name | categories.location | address |
| Department | departments[].name | categories.team | department |
| Description | HTML with inline styles | HTML | HTML or Markdown |
| Compensation | Not in public API | Not in public API | ?includeCompensation=true |
You'll need to build a normalizer per ATS. But it doesn't stop at field mapping:
- Job descriptions arrive as messy HTML — inline styles, escaped entities, broken tags. You need to convert them to clean, consistent Markdown or plain text so your product can render and search them reliably.
- Job titles contain noise: `★★ MARKETING MANAGER ★★ - Apply Now!!!` or `Sr. Software Engineer (Remote) - #12345`. You need regex cleanup and standardization per source.
- Locations are the most frustrating: "San Francisco, CA" on one platform, "SF, California" on another, "US-CA-San Francisco" on a third. You'll need a geocoding pipeline that parses freeform location strings into structured city, state, country, and coordinates.
- Company logos from source URLs break constantly due to token expiration, hotlinking restrictions, or CORS policies. You'll either host them yourself or live with broken images.
- Industry labels vary across sources: the same company might be "Software Development" on LinkedIn, "Technology" on Indeed, and "SaaS" on their own website. You need a taxonomy to normalize them.
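As a sketch of what per-ATS normalizers look like, here are two mappers into one shared schema. The field paths follow the table above; any payload detail beyond that (for example, Lever exposing the title as `text`) is an assumption for illustration:

```python
def normalize_greenhouse(raw):
    """Map a Greenhouse board-API job into a canonical record."""
    departments = raw.get("departments") or []
    return {
        "title": raw["title"],
        "location": raw["location"]["name"],
        "department": departments[0]["name"] if departments else None,
    }

def normalize_lever(raw):
    """Map a Lever postings-API job into the same canonical record."""
    categories = raw.get("categories") or {}
    return {
        "title": raw["text"],  # Lever's title field name: an assumption here
        "location": categories.get("location"),
        "department": categories.get("team"),
    }
```

Every new ATS adds another mapper like these, plus tests, plus maintenance whenever the upstream schema shifts.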
3. Entity and signal extraction
Raw job postings are just text. To make them useful, you need to extract structured entities:
- Technology keywords: detect programming languages, frameworks, databases, and cloud platforms mentioned in the title and description. This powers technographic filtering — finding companies that use React, Kubernetes, or Snowflake.
- Hiring manager identification: some ATS metadata includes the hiring contact. Extracting and associating this with the job lets your users reach the decision-maker directly.
- Buying signals: "migrating from on-premise to cloud", "implementing SOC 2 compliance", "scaling the data team" — these phrases in job descriptions reveal what a company is investing in. Extracting them requires NLP or at minimum a curated keyword taxonomy.
This is where job data becomes valuable intelligence rather than just a list of openings. It's also where engineering effort compounds — each new signal type needs its own extraction pipeline, validation, and testing.
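At its simplest, technology-keyword extraction is a curated taxonomy plus tokenization. A minimal sketch, with an illustrative (and deliberately tiny) taxonomy:

```python
import re

# Illustrative entries only; a production taxonomy has thousands of terms
# with aliases ("k8s" -> "Kubernetes") and disambiguation rules.
TECH_TAXONOMY = {
    "react": "React",
    "kubernetes": "Kubernetes",
    "snowflake": "Snowflake",
    "python": "Python",
    "postgres": "PostgreSQL",
}

def extract_tech(text):
    """Return the canonical names of taxonomy terms found in a posting."""
    found = set()
    for token in re.findall(r"[a-z0-9+#]+", text.lower()):
        if token in TECH_TAXONOMY:
            found.add(TECH_TAXONOMY[token])
    return sorted(found)
```

Hiring-manager extraction and buying-signal detection follow the same pattern but need richer matching (phrases, not single tokens) and per-signal validation.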
4. Deduplication
This is the problem teams most often underestimate. When a company posts a job through their ATS, that listing typically appears on 3 to 5 platforms simultaneously — their career page, LinkedIn, Indeed, Glassdoor, and possibly niche job boards. Job boards also scrape each other, creating further duplication.
If you don't deduplicate, your users see the same role multiple times, your analytics are inflated, and your sales signals fire repeatedly for the same opening.
Building deduplication requires:
- Fuzzy matching on title + company + location (exact matching won't work because the same job has different titles across sources)
- A canonical job ID system that links all references to a single posting
- Handling near-duplicates: is "Senior Software Engineer" and "Sr. Software Engineer" at the same company the same role, or two different openings?
- Cross-source timing: a job might appear on Indeed three days after it was posted on Greenhouse — your system needs to handle this gracefully
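The fuzzy-matching piece can be sketched with Python's standard-library `difflib`; the abbreviation map and the 0.9 threshold are illustrative choices, not tuned values:

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation map; real systems need a much longer list.
ABBREVIATIONS = {"sr.": "senior", "sr": "senior", "jr.": "junior", "jr": "junior"}

def normalize_title(title):
    """Lowercase, strip punctuation, and expand common abbreviations."""
    words = re.sub(r"[^\w\s.]", " ", title.lower()).split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

def same_role(a, b, threshold=0.9):
    """Guess whether two postings describe the same opening: company and
    location must match exactly, normalized titles must be near-identical."""
    if a["company"] != b["company"] or a["location"] != b["location"]:
        return False
    ratio = SequenceMatcher(
        None, normalize_title(a["title"]), normalize_title(b["title"])
    ).ratio()
    return ratio >= threshold
```

A production system layers a canonical job ID and cross-source timing windows on top of a matcher like this, and pairwise comparison itself needs blocking or indexing to scale past a few thousand postings.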
5. Enrichment
Job postings alone are surprisingly thin. A typical ATS API response gives you: title, location, description, and maybe a department. That's it.
To make the data useful for sales intelligence, lead generation, or market research, you need to enrich it with company-level data:
- Company size (headcount)
- Industry
- Revenue estimates
- Funding details (stage, amount, investors)
- Headquarters location
- LinkedIn profile and domain
This means integrating with one or more company data providers, building an entity resolution pipeline (matching "Stripe" on Lever to "Stripe, Inc." in your company database), and handling cases where enrichment data is missing or conflicting.
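Entity resolution usually starts with canonicalizing company names before matching. A minimal sketch; the legal-suffix list is illustrative and far from complete:

```python
import re

# Illustrative suffixes; production lists cover dozens of jurisdictions.
LEGAL_SUFFIXES = r"\b(inc|llc|ltd|gmbh|corp|co)\b\.?"

def canonical_company(name):
    """Lowercase, strip legal suffixes and punctuation, collapse whitespace."""
    n = name.lower()
    n = re.sub(LEGAL_SUFFIXES, "", n)
    n = re.sub(r"[^\w\s]", " ", n)
    return " ".join(n.split())
```

With this, "Stripe" on Lever and "Stripe, Inc." in a company database canonicalize to the same key; ambiguous matches still need fuzzy scoring and often manual review.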
You also need salary normalization. Some postings include compensation in local currencies — £45,000 in London, €55,000 in Berlin, $65,000 in New York. To enable meaningful filtering and comparison, you need real-time currency conversion and a system that preserves both the original and normalized amounts.
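The dual-amount requirement looks like this in miniature; the exchange rates below are static placeholders, where production code would pull a live FX feed:

```python
# Illustrative rates only (units: USD per 1 unit of foreign currency).
USD_RATES = {"GBP": 1.27, "EUR": 1.08, "USD": 1.0}

def normalize_salary(amount, currency):
    """Keep the original figure and attach a USD-normalized one,
    so filtering works globally without losing the source value."""
    return {
        "original_amount": amount,
        "original_currency": currency,
        "amount_usd": round(amount * USD_RATES[currency], 2),
    }
```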
6. Quality control
Every layer above will break. ATS platforms change their API schemas. Sources change their HTML structure. Geocoding services return wrong coordinates. Enrichment providers go down or return stale data.
You need:
- Automated monitoring that tracks anomalies: unexpected drops in job volume, spikes in duplicates, missing fields, broken source integrations
- Alerting that triggers immediately when thresholds are breached
- Manual review processes where analysts audit samples to verify normalization, deduplication, and enrichment accuracy
- A reprocessing pipeline to fix affected records when issues are found
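The volume-drop check in the first bullet can start as simply as comparing today's count against a trailing average. A sketch, with an assumed 50% drop threshold:

```python
def volume_anomaly(history, today, drop_threshold=0.5):
    """Flag when today's job volume falls below `drop_threshold` of the
    trailing average: a common symptom of a silently broken source."""
    if not history:
        return False  # no baseline yet; nothing to compare against
    baseline = sum(history) / len(history)
    return today < baseline * drop_threshold
```

Real monitoring adds per-source baselines, seasonality adjustments (weekends, holidays), and checks for duplicate spikes and missing fields, but the shape is the same: baseline, threshold, alert.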
Greenhouse, for example, is deprecating Harvest API v1 and v2 by August 2026. If you've built on those endpoints, you'll need to migrate to v3 — and that's just one ATS. Multiply this by every source you support.
Total cost of ownership: build vs buy
Let's put numbers to it. These estimates are based on industry research on data pipeline costs and the engineering scope described above.
Building in-house (24-month TCO, 3 ATS sources)
| Cost dimension | Low estimate | High estimate |
|---|---|---|
| Integration development (per ATS, 6-10 weeks each) | $75,000 (3 x $25k) | $225,000 (3 x $75k) |
| Pipeline infrastructure (queues, schedulers, storage) | $15,000 | $30,000 |
| Normalization and enrichment pipeline | $20,000 | $50,000 |
| Deduplication system | $10,000 | $25,000 |
| Quality assurance and monitoring | $10,000 | $20,000 |
| Hosting and compute (24 months) | $12,000 | $36,000 |
| Data engineer maintenance (44% of 1 FTE, 24 months) | $105,000 | $175,000 |
| Total | $247,000 | $561,000 |
That 44% figure isn't made up — research shows data engineers spend nearly half their time building and maintaining pipelines rather than doing higher-value work.
And this estimate covers only 3 ATS sources. Real-world coverage often requires 10+ sources (Greenhouse, Lever, Ashby, Workable, SmartRecruiters, BambooHR, plus LinkedIn, Indeed, Glassdoor, and others). Each additional source adds development, normalization, and ongoing maintenance cost.
Buying from an aggregator (24-month TCO with TheirStack)
| Volume | Monthly cost | 24-month total |
|---|---|---|
| 1,500 jobs/month | $59 | $1,416 |
| 10,000 jobs/month | $169 | $4,056 |
| 100,000 jobs/month | $600 | $14,400 |
| 1,000,000 jobs/month | $1,500 | $36,000 |
That includes data from 326k sources across 195 countries, with deduplication, company enrichment, normalization, and quality control already handled. Zero infrastructure to build or maintain.
Side-by-side
| | Build (3 ATS, 24 months) | Buy (all sources, 24 months) |
|---|---|---|
| Cost | $247,000 - $561,000 | $1,416 - $36,000 |
| Time to first data | 3-6 months | Same day |
| Sources covered | 3 ATS | 326k+ sources |
| Ongoing maintenance | 1-2 engineers | None |
| Deduplication | You build it | Built-in |
| Company enrichment | You build it | Built-in |
| Salary normalization | You build it | Built-in |
When building makes sense
There are legitimate reasons to build your own pipeline. Be honest about whether they apply to you:
- **You need candidate write-back.** If your product submits applications back to ATS platforms (not just reads job data), you need direct ATS integrations. Aggregators like TheirStack are read-only — they provide job posting data, not candidate management.
- **You need exactly one ATS.** If all your customers use Greenhouse and you just need their job postings, a single direct integration might be simpler than adding a provider dependency.
- **You need custom parsing that no provider offers.** If you're doing specialized NLP on job descriptions that requires the raw, unprocessed text in a specific format, a direct integration gives you that control. This is rare, though, since most providers return the full description.
- **Regulatory requirements mandate full pipeline control.** Some industries require you to own the entire data processing chain for compliance reasons.
- **You have idle data engineering capacity.** If you already have data engineers with nothing else to work on, building can be a reasonable use of their time. But consider the opportunity cost — could they be building features that directly move your product forward?
When buying makes sense
For most teams, buying is the faster and cheaper path:
- **You need read-only job data.** The vast majority of use cases — job boards, sales intelligence, market research, recruiting tools — only need to read job postings, not write back to ATS platforms.
- **You need multiple sources.** As soon as you go beyond 2-3 ATS platforms, the integration and maintenance burden grows faster than linearly. An aggregator gives you hundreds of sources from day one.
- **Speed to market matters.** If you'd rather ship your product this month than spend three months building plumbing, an API call gets you there immediately.
- **You want company-level filters.** Filtering jobs by company size, industry, funding stage, or tech stack requires enrichment that's a project in itself. Aggregators that include company data give you this out of the box.
- **You need deduplicated data.** Cross-source deduplication is a harder engineering problem than most teams expect. If your users see the same job three times, that's a product quality issue.
- **You want historical data.** Building a pipeline only gives you data from the day you start collecting. Aggregators like TheirStack have archives going back to 2021.
Decision scorecard
Answer these five questions. If you answer "yes" to three or more, you should probably buy.
| # | Question | Yes → Buy | No → Consider building |
|---|---|---|---|
| 1 | Do you need job data from more than 2 sources? | You'll spend months building and maintaining connectors | A single ATS integration may suffice |
| 2 | Is your core product something other than a data pipeline? | Your engineers should build your product, not plumbing | If data infrastructure is your product, build it |
| 3 | Do you need company attributes (size, funding, tech stack) alongside job data? | Enrichment is a separate engineering project | Raw job data without enrichment may be enough |
| 4 | Do you need deduplicated data across sources? | Cross-source dedup is harder than it sounds | Single-source data doesn't need dedup |
| 5 | Do you need to ship within weeks, not months? | First API call in minutes vs months of building | If timeline is flexible, building is an option |
The hybrid approach
Build vs buy doesn't have to be binary. The most pragmatic teams do both:
- **Start with an aggregator** to get broad coverage immediately. Use it to validate your product, acquire customers, and understand what data you actually need.
- **Add direct ATS integrations only where you truly need them** — typically for candidate write-back or deep ATS-specific features that an aggregator doesn't cover.
- **Use the aggregator as your baseline** and layer custom integrations on top for high-value edge cases.
This gives you the best of both worlds: broad coverage and fast time-to-market from the aggregator, plus deep control where it matters most. Our guide on ATS platforms with public APIs covers the specific endpoints you'd use for direct integrations.
Getting started
Here's what it looks like to go from zero to job data with TheirStack's API:
```bash
curl -X POST "https://api.theirstack.com/v1/jobs/search" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "job_title_or": ["Software Engineer"],
    "posted_at_gte": "2026-04-01",
    "limit": 10
  }'
```

One endpoint. 326k+ sources. Deduplicated, normalized, enriched. No crawlers, no normalizers, no dedup pipeline to build.
Compare that to the alternative: discovering company slugs across 9+ ATS platforms, building a poller with rate limiting and retries, normalizing schemas, deduplicating across sources, enriching with company data, converting salaries across currencies, and monitoring it all 24/7.
Your engineers' time is your most expensive resource. Spend it on your product, not on plumbing.
Start with 200 free API credits — or explore the data first in our web app without writing any code. See our full pricing for higher volumes.

