Job Data

Data workflow

Learn how we transform raw job postings into high-quality, normalized data through our six-step workflow: crawling, cleaning, extraction, deduplication, enrichment, and quality control.

How it works

When you use our API, App, or Datasets, you're accessing data directly from our database, ensuring you always have the most up-to-date information available. Instead of making live calls to sources for each request, we proactively handle critical steps like data extraction, normalization, company enrichment, and quality assurance. This process guarantees both high-quality data and fast response times.

To guarantee the highest possible quality of data, we follow this process:

Crawling & ingestion

We continuously crawl different data sources, such as job boards, ATSs, and company websites. Scraper frequency varies by source: some run as often as every 10 minutes, others hourly, so our data stays up to date.

More details:

  • Jobs from 327k different websites
  • Data freshness and scraping frequency
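The tiered scheduling described above can be sketched as follows. This is an illustrative sketch only: the tier boundaries and intervals are assumptions, not TheirStack's actual crawler configuration.

```python
from dataclasses import dataclass

@dataclass
class Source:
    name: str
    jobs_per_day: int  # rough measure of how active the source is

def crawl_interval_minutes(source: Source) -> int:
    """Assign a polling interval: hot sources every 10 minutes, the long tail hourly."""
    if source.jobs_per_day >= 10_000:
        return 10
    return 60

print(crawl_interval_minutes(Source("big-job-board", 50_000)))  # 10
print(crawl_interval_minutes(Source("small-ats", 120)))          # 60
```

Tiering by source activity keeps high-volume boards fresh without wasting crawl capacity on sources that rarely change.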

Cleaning & standardization

To ensure data quality and consistency across the platform, we normalize all extracted fields.
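As a rough illustration of what a normalization pass looks like, the sketch below cleans up whitespace, casing, and salary formatting. The field names and rules are assumptions for the example, not TheirStack's actual cleaning logic.

```python
import re

def normalize_job(raw: dict) -> dict:
    """Normalize a raw job record: collapse whitespace, fix casing, parse salary."""
    title = re.sub(r"\s+", " ", raw.get("title", "")).strip()
    location = raw.get("location", "").strip().title() or None
    # Strip currency symbols and thousands separators from the salary string.
    digits = re.sub(r"[^\d]", "", raw.get("salary", ""))
    salary = int(digits) if digits else None
    return {"title": title, "location": location, "salary": salary}

print(normalize_job({"title": "  Senior   Engineer ", "location": "berlin", "salary": "€85,000"}))
# {'title': 'Senior Engineer', 'location': 'Berlin', 'salary': 85000}
```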

Entity & signal extraction

For every job we collect, we extract key entities including the job posting, company, and, when available, the hiring manager. We also identify and extract keywords mentioned in the job title, description, and URL. All detected keywords are indexed in the keyword_slugs field, enabling filtering by tech stack or buying intent signals via the API and Datasets. They also roll up to the company level, powering company-level keyword search.
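A minimal sketch of keyword detection like the above: match a dictionary of known technologies against the job text and emit normalized slugs, mirroring how detected keywords end up in the `keyword_slugs` field. The keyword dictionary here is illustrative, not TheirStack's taxonomy.

```python
import re

# Tiny illustrative dictionary; the real taxonomy is far larger.
KEYWORDS = {"PostgreSQL": "postgresql", "React": "react", "Kubernetes": "kubernetes"}

def extract_keyword_slugs(title: str, description: str) -> list[str]:
    """Return sorted slugs for every known keyword found in the job text."""
    text = f"{title} {description}".lower()
    slugs = {slug for kw, slug in KEYWORDS.items()
             if re.search(rf"\b{re.escape(kw.lower())}\b", text)}
    return sorted(slugs)

print(extract_keyword_slugs("React Developer", "Experience with PostgreSQL required."))
# ['postgresql', 'react']
```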

Deduplication

Most companies use an Applicant Tracking System (ATS) to manage their hiring process. ATSs help streamline the candidate lifecycle, power career pages listing open roles, and sync job postings with major job boards like LinkedIn and Indeed.

When a company posts a job through an ATS, it's common for that listing to appear on multiple job boards simultaneously. As a result, a single position may end up with 3–5 different references across various platforms. Additionally, job boards often scrape and repost listings from each other, further increasing duplication.

Job posting deduplication is a crucial step: without it, the same job would appear multiple times in our database. If you use our data as a sales signal, you won't trigger the same signal repeatedly; if you use it to build a job board, you won't list the same job twice.

We apply both algorithmic techniques and manual checks to eliminate duplicates effectively.
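One common algorithmic approach (an assumption for illustration, not TheirStack's exact method) is to fingerprint each posting on normalized company, title, and location, so the same role scraped from several boards collapses into one record:

```python
import hashlib
import re

def fingerprint(company: str, title: str, location: str) -> str:
    """Hash a normalized (company, title, location) key so duplicates collide."""
    key = "|".join(re.sub(r"\s+", " ", s).strip().lower()
                   for s in (company, title, location))
    return hashlib.sha256(key.encode()).hexdigest()

postings = [
    {"company": "Acme", "title": "Data Engineer", "location": "Berlin", "source": "linkedin"},
    {"company": "ACME ", "title": "Data  Engineer", "location": "berlin", "source": "indeed"},
]
unique = {fingerprint(p["company"], p["title"], p["location"]): p for p in postings}
print(len(unique))  # 1 — both rows collapse into a single job
```

Exact fingerprints catch the syndication case; fuzzier matching (and the manual checks mentioned above) handle reposts where the text has been altered.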

Enrichment

Our job posting collection gives you a solid foundation, but we don't stop there. We enhance every piece of data through a comprehensive enrichment process that adds valuable context and verified extra details.

This transformation turns basic job listings into actionable intelligence that helps you make better decisions and build more powerful applications.

Quality control

Data quality isn't a one-time check—it's an ongoing process. Our team combines automated monitoring with manual review to catch issues before they reach your application.

  • Automated monitoring – Pipelines track anomalies in real time: unexpected drops in job volume, spikes in duplicates, missing fields, or broken source integrations. Alerts trigger immediately when thresholds are breached.
  • Manual review – Data analysts audit samples daily, verifying that normalization, deduplication, and enrichment are producing accurate results across sources and regions.
  • Issue resolution – When a problem is detected—whether a source changes its HTML structure or an enrichment provider returns stale data—our team investigates, fixes the root cause, and reprocesses affected records.
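The volume-anomaly check from the first bullet can be sketched as a simple threshold rule. The threshold value and alerting shape are illustrative assumptions:

```python
def volume_alert(today: int, trailing_avg: float, drop_threshold: float = 0.5) -> bool:
    """Alert when today's job volume falls below a fraction of the recent average."""
    return trailing_avg > 0 and today < trailing_avg * drop_threshold

print(volume_alert(today=1_200, trailing_avg=10_000))  # True — investigate the source
print(volume_alert(today=9_400, trailing_avg=10_000))  # False — within normal range
```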

FAQ

Do you track expired jobs?

Not yet. Today, our focus is on collecting and ingesting every job we find.

If your use case needs "active-only" jobs, our current recommendation is:

  • Filter to jobs posted in the last 1–2 weeks.
  • Use a shorter cutoff (around 1 week since first seen) if you want to minimize publishing jobs that may already be closed.
  • Use a longer cutoff (up to 1 month since first seen) if you want to maximize inventory, accepting that some jobs may already be closed.

Tracking expirations reliably requires additional follow-up crawling requests and more proxy traffic, which increases infrastructure cost. That extra cost would need to be passed on to customers who need this capability.

If this is important for your use case, email us at hi@theirstack.com and we'll take your request into account when prioritizing this feature.
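The cutoff recommendation above can be applied locally like this. The field names are assumptions about how you store your own records, not a documented API parameter:

```python
from datetime import date, timedelta

def active_jobs(jobs: list[dict], max_age_days: int = 14) -> list[dict]:
    """Keep only jobs first seen within the last `max_age_days` days."""
    cutoff = date.today() - timedelta(days=max_age_days)
    return [j for j in jobs if j["first_seen"] >= cutoff]

jobs = [
    {"id": 1, "first_seen": date.today() - timedelta(days=3)},
    {"id": 2, "first_seen": date.today() - timedelta(days=40)},
]
print([j["id"] for j in active_jobs(jobs)])  # [1]
```

Raising `max_age_days` toward 30 trades a lower miss rate for a higher chance of surfacing already-closed roles, as described above.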


