---
title: Data workflow
description: Learn how we transform raw job postings into high-quality, normalized data through our 6-step workflow covering crawling, cleaning, extraction, deduplication, enrichment, and quality control
url: https://theirstack.com/en/docs/data/job/data-workflow
---

## How it works

When you use our [API](/en/docs/api-reference), [App](/en/docs/app), or [Datasets](/en/docs/datasets), you're accessing data directly from our database, ensuring you always have the most up-to-date information available. Instead of making live calls to sources for each request, we proactively handle critical steps like data extraction, normalization, company enrichment, and quality assurance. This process guarantees both high-quality data and fast response times.

To guarantee the highest possible quality of data, we follow this process:

1.  ### Crawling & ingestion
    
    We continuously crawl different data sources, such as job boards, ATSs, and company websites. The frequency of our scrapers varies, with some running as often as every 10 minutes and others hourly, to ensure our data is always up-to-date.
    
    More details:
    
    -   [Jobs from 336k different websites](/en/docs/data/job/sources)
    -   [Data freshness and scraping frequency](/en/docs/data/job/freshness)
2.  ### Cleaning & standardization
    
    To ensure data quality and consistency across the platform, we normalize all extracted fields.
    
    #### Permanent logo URLs
    
    When scraping job postings, we collect company logos from the original sources. However, these external URLs are unreliable—they frequently break due to token expiration, hotlinking restrictions, or CORS policies. If you tried to use them directly in your application, you'd end up with broken images.
    
    To solve this, we download and permanently host every company logo on our own infrastructure. Each logo gets a stable URL in the format:
    
    ```
    https://media.theirstack.com/company/logo/{company_id}
    ```
    
    These URLs are permanent and publicly accessible, so you can use them directly in your own applications—whether you're building a [job board](/en/docs/guides/backfill-job-board), a CRM integration, or an internal tool. No proxying, no caching on your side, no broken images.
    
    The hosted logo is available via the `logo_url` field in both the [API](/en/docs/api-reference) response and [dataset](/en/docs/datasets) exports.
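
    As a minimal sketch, the URL format above can be filled in from a company identifier. The helper name `build_logo_url` and the example ID `acme` are illustrative assumptions, and the exact shape of `company_id` is not specified here. In practice, prefer the `logo_url` value that comes with each record.

    ```python
    from urllib.parse import quote

    # URL format taken from the text above.
    LOGO_URL_TEMPLATE = "https://media.theirstack.com/company/logo/{company_id}"


    def build_logo_url(company_id: str) -> str:
        """Fill the documented template, percent-encoding the ID defensively."""
        return LOGO_URL_TEMPLATE.format(company_id=quote(company_id, safe=""))


    # "acme" is a made-up ID used only for illustration.
    print(build_logo_url("acme"))
    ```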
    
    #### Description formatting & cleanup
    
    Job descriptions come in all shapes and sizes across the web. Some job boards use fancy HTML formatting with bold headers and bullet points, while others keep it simple with plain text. You'll often find random special characters, weird spacing, or formatting quirks that make the data hard to work with.
    
    We clean this up by converting everything into consistent Markdown format, removing HTML tags while keeping important formatting like bullet points, headers, and emphasis. Whether you're building a [job search](/en/docs/app/job-search) tool or analyzing hiring trends, you'll get clean, structured descriptions that are ready to use right away.
    
    **Before:**
    
    ```
    <div class="desc"><b>About the role</b><br><br>We are looking for a
    <span style="color:red">Software Engineer</span> to join our team!!!<br>
    <ul><li>3+ years of experience</li>
    <li>Strong knowledge of&nbsp;&nbsp;React</li></ul></div>
    ```
    
    **After:**
    
    ```
    ## About the role
    
    We are looking for a Software Engineer to join our team!
    
    - 3+ years of experience
    - Strong knowledge of React
    ```
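
    To give a feel for this kind of cleanup, here is a rough sketch built on Python's standard `html.parser`. It is not our production converter: the class and function names are hypothetical, only a handful of tags are handled, and bold text stays bold instead of being promoted to a heading as in the example above.

    ```python
    import re
    from html.parser import HTMLParser


    class MarkdownSketch(HTMLParser):
        """Tiny HTML-to-Markdown converter covering only a few tags."""

        def __init__(self):
            # convert_charrefs=True decodes entities such as &nbsp; for us.
            super().__init__(convert_charrefs=True)
            self.parts = []

        def handle_starttag(self, tag, attrs):
            if tag in ("b", "strong"):
                self.parts.append("**")
            elif tag == "li":
                self.parts.append("\n- ")
            elif tag == "br":
                self.parts.append("\n")

        def handle_endtag(self, tag):
            if tag in ("b", "strong"):
                self.parts.append("**")
            elif tag in ("ul", "ol", "p", "div"):
                self.parts.append("\n")

        def handle_data(self, data):
            # Collapse whitespace runs (including non-breaking spaces) to one space.
            self.parts.append(re.sub(r"\s+", " ", data))


    def html_to_markdown(html: str) -> str:
        parser = MarkdownSketch()
        parser.feed(html)
        parser.close()
        text = "".join(parser.parts)
        text = re.sub(r"[ \t]+\n", "\n", text)    # drop trailing spaces
        text = re.sub(r"\n{3,}", "\n\n", text)    # at most one blank line in a row
        text = re.sub(r"([!?])\1+", r"\1", text)  # "!!!" becomes "!"
        return text.strip()
    ```

    Styling-only tags such as `<span>` are dropped because the parser ignores any tag it has no rule for, while their text content is kept.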
    
    #### Title standardization & cleanup
    
    Job titles vary wildly across sources—some include special characters, excessive capitalization, internal codes, or extra whitespace. A single role might appear as "Sr. Software Engineer (Remote) - #12345" on one board and "Senior Software Engineer" on another.
    
    We standardize titles by stripping special characters, removing internal references, and normalizing spacing to make them uniform and searchable. This ensures consistent filtering and comparison across all job postings regardless of their original source.
    
    | Raw title | Cleaned title |
    | --- | --- |
    | `Sr. Software Engineer (Remote) - #12345` | `Sr. Software Engineer` |
    | `★★ MARKETING MANAGER ★★ - Apply Now!!!` | `Marketing Manager` |
    | `Data Analyst / Business Intelligence` | `Data Analyst / Business Intelligence` |
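
    The rows above can be reproduced with a few regular expressions. This is only a sketch under stated assumptions: the function name `clean_title` and the specific rules (dropping parentheticals, `#123`-style references, and "Apply Now" phrases) are illustrative guesses, not the actual rule set we run.

    ```python
    import re


    def clean_title(raw: str) -> str:
        """Apply a few illustrative cleanup rules to a raw job title."""
        title = re.sub(r"\([^)]*\)", " ", raw)             # parentheticals, e.g. "(Remote)"
        title = re.sub(r"#\s*\d+", " ", title)             # internal references, e.g. "#12345"
        title = re.sub(r"(?i)\bapply now\b", " ", title)   # call-to-action phrases
        title = re.sub(r"[^\w\s./&+,-]", " ", title)       # decorative symbols such as stars
        title = re.sub(r"\s+", " ", title).strip(" -,/")   # spacing and dangling separators
        if title.isupper():
            title = title.title()                          # tame ALL-CAPS titles
        return title


    for raw in (
        "Sr. Software Engineer (Remote) - #12345",
        "★★ MARKETING MANAGER ★★ - Apply Now!!!",
        "Data Analyst / Business Intelligence",
    ):
        print(clean_title(raw))
    ```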
    
    #### Industry classification
    
    Here's a common problem: the same company gets labeled differently across platforms. A tech startup might be called "Software Development" on LinkedIn, "Technology" on Indeed, and "SaaS" on their own website. This makes it really hard to filter companies by industry or analyze hiring patterns.
    
    We solve this by mapping equivalent industry terms to one another. So when you search for "Software Development" companies, you'll also catch those labeled as "Technology" or "SaaS" elsewhere. This gives you complete coverage no matter how the original source categorized the company.
    
    Our industry normalization includes:
    
    -   **Standardized categories** – All companies get mapped to consistent industry labels
    -   **Multi-source validation** – We check industry data from multiple sources for accuracy
    -   **Hierarchical organization** – Industries are organized into main categories and subcategories for flexible filtering, using the same standardized values as LinkedIn. You can explore our complete industry catalog through our [API catalog endpoint](/en/docs/api-reference/catalog/get_catalog_industries_v0).
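
    Conceptually, this kind of normalization can be pictured as an alias table that maps source-specific labels to one canonical value. The sketch below uses the three labels from the example above. The table contents, the choice of canonical label, and the function name `normalize_industry` are assumptions for illustration only.

    ```python
    from typing import Optional

    # Hypothetical alias table: source label (lowercased) -> canonical industry.
    INDUSTRY_ALIASES = {
        "software development": "Software Development",
        "technology": "Software Development",
        "saas": "Software Development",
    }


    def normalize_industry(label: str) -> Optional[str]:
        """Return the canonical industry for a source label, or None if unknown."""
        return INDUSTRY_ALIASES.get(label.strip().lower())


    print(normalize_industry("SaaS"))
    ```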
    
    #### Location geocoding
    
    Location data is probably the most frustrating part of job posting aggregation. You'll see "San Francisco, CA" on one platform, "SF, California" on another, and "US-CA-San Francisco" on a third. Some sources use country codes, others use full names, and some throw in postal codes in unpredictable ways. This makes geographic analysis really difficult.
    
    We've built a location parsing system that works like a universal translator for addresses. It doesn't just standardize formats—it also adds missing geographic coordinates and validates addresses to create a complete location profile for every job posting. This turns scattered location fragments into precise, searchable geographic data.
    
    Each geocoded location includes the following fields:
    
    -   **City** – The specific city where the job is located
    -   **State** – The full state name (e.g., "California")
    -   **State Code** – The abbreviated state code (e.g., "CA")
    -   **Country Code** – The standardized country code (e.g., "US")
    -   **Address** – The complete street address when available
    -   **Postal Code** – The ZIP code or postal code for the location
    -   **Latitude** – The geographic latitude coordinate for precise mapping
    -   **Longitude** – The geographic longitude coordinate for precise mapping
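
    To illustrate the idea, the sketch below resolves the three San Francisco spellings mentioned earlier to a single record. Everything in it is a simplifying assumption: the `Location` field names, the toy lookup tables, and the naive splitting on commas and hyphens (which would break hyphenated city names) stand in for a real geocoder.

    ```python
    import re
    from dataclasses import dataclass
    from typing import Optional


    @dataclass
    class Location:
        city: Optional[str] = None
        state: Optional[str] = None
        state_code: Optional[str] = None
        country_code: Optional[str] = None
        latitude: Optional[float] = None
        longitude: Optional[float] = None


    # Toy lookup tables; a real system would rely on a full gazetteer.
    CITIES = {"sf": "San Francisco", "san francisco": "San Francisco"}
    STATES = {
        "ca": ("California", "CA", "US"),
        "california": ("California", "CA", "US"),
    }
    COORDINATES = {("San Francisco", "CA"): (37.7749, -122.4194)}  # approximate


    def parse_location(raw: str) -> Location:
        """Match each token of the raw string against the lookup tables."""
        loc = Location()
        for token in re.split(r"[,-]", raw):
            key = token.strip().lower()
            if key in CITIES:
                loc.city = CITIES[key]
            elif key in STATES:
                loc.state, loc.state_code, loc.country_code = STATES[key]
        coords = COORDINATES.get((loc.city, loc.state_code))
        if coords:
            loc.latitude, loc.longitude = coords
        return loc


    for raw in ("San Francisco, CA", "SF, California", "US-CA-San Francisco"):
        print(parse_location(raw))
    ```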
    
3.  ### Entity & signal extraction
    
    For every job we collect, we extract key entities including the job posting, company, and, when available, the [hiring manager](/en/docs/app/contact-data/hiring-manager). We also identify and extract keywords mentioned in the job title, description, and URL. All detected keywords are indexed in the `keyword_slugs` field, enabling filtering by tech stack or buying intent signals via the [API](/en/docs/api-reference) and [Datasets](/en/docs/datasets). They also roll up to the company level, powering company-level keyword search.
    
    #### Hiring manager identification
    
    When available, we identify the [hiring manager](/en/docs/app/contact-data/hiring-manager) behind each job posting. We extract this information from the job listing metadata, ATS data, and other signals to connect each role with the person responsible for the hire. This lets you reach out directly to the decision-maker instead of going through generic HR channels.
    
    #### Tech stack detection
    
    We detect [technology keywords](/en/docs/data/technographic) such as programming languages, frameworks, databases, cloud platforms, and developer tools mentioned across the job posting. This powers technographic filtering—letting you find companies hiring for specific technologies like [React](/en/technology/react), [Kubernetes](/en/technology/kubernetes), or [Snowflake](/en/technology/snowflake).
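
    A simplified way to picture keyword detection is pattern matching over the title, description, and URL, producing a list of slugs. The patterns, the slug values, and the function name `detect_keyword_slugs` below are hypothetical; the real catalog and matching logic are far larger and are not described here.

    ```python
    import re

    # Hypothetical slug -> pattern table covering the three technologies above.
    KEYWORD_PATTERNS = {
        "react": r"\breact(?:\.js|js)?\b",
        "kubernetes": r"\b(?:kubernetes|k8s)\b",
        "snowflake": r"\bsnowflake\b",
    }


    def detect_keyword_slugs(title: str, description: str, url: str = "") -> list:
        """Return the sorted slugs whose pattern appears in any of the inputs."""
        text = " ".join([title, description, url]).lower()
        return sorted(
            slug for slug, pattern in KEYWORD_PATTERNS.items()
            if re.search(pattern, text)
        )


    print(detect_keyword_slugs(
        "Senior React Developer",
        "We run on Kubernetes (k8s) and Snowflake.",
    ))
    ```

    Word boundaries keep "reactive" from being counted as a React mention, which is the sort of false positive a naive substring search would produce.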
    
    #### Buying signal detection
    
    We extract [buying intent topics](/en/docs/data/buying-intent) including regulations, strategic initiatives, equipment, and business themes. These signals reveal what a company is investing in or preparing for, enabling you to identify prospects based on their purchasing intent rather than just their tech stack.
    
4.  ### Deduplication
    
    Most companies use an Applicant Tracking System (ATS) to manage their hiring process. ATSs help streamline the candidate lifecycle, power career pages listing open roles, and sync job postings with major job boards like LinkedIn and Indeed.
    
    When a company posts a job through an ATS, it's common for that listing to appear on multiple job boards simultaneously. As a result, a single position may end up with 3–5 different references across various platforms. Additionally, job boards often scrape and repost listings from each other, further increasing duplication.
    
    Deduplication keeps the same job from appearing multiple times in our database. If you use our data as a sales signal, you won't trigger the same signal more than once. If you use our data to build a job board, you won't list the same job more than once.
    
    We apply both algorithmic techniques and manual checks to eliminate duplicates effectively.
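
    The algorithm itself is not detailed here, but one common approach, shown purely as an illustration, is to build a normalized key per posting and keep the first posting seen for each key. The record fields (`company_domain`, `title`, `city`) and function names are assumptions.

    ```python
    import re


    def dedup_key(job: dict) -> tuple:
        """Build a comparison key from normalized company, title, and city."""
        def norm(value):
            return re.sub(r"\s+", " ", value or "").strip().lower()

        return (
            norm(job.get("company_domain")),
            norm(job.get("title")),
            norm(job.get("city")),
        )


    def deduplicate(jobs: list) -> list:
        """Keep the first posting seen for each key, preserving input order."""
        seen = set()
        unique = []
        for job in jobs:
            key = dedup_key(job)
            if key not in seen:
                seen.add(key)
                unique.append(job)
        return unique
    ```

    Exact-key matching like this misses near-duplicates whose titles differ slightly, which is one reason manual checks remain useful.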
    
5.  ### Enrichment
    
    Our job posting collection gives you a solid foundation, but we don't stop there. We enhance every piece of data through a comprehensive enrichment process that adds valuable context and verified extra details.
    
    This transformation turns basic job listings into actionable intelligence that helps you make better decisions and build more powerful applications.
    
    #### Company enrichment
    
    Each time we collect a job posting, we also extract all available company information. However, in most cases, this initial data is quite limited—usually just the company's name, logo, and domain. While useful as a starting point, this basic information isn't enough to deliver meaningful insights or enable advanced filtering.
    
    To unlock the full potential of our platform, we enrich these company records with a broader and deeper profile—a process we call company enrichment. This allows you to search, filter, and segment companies based on valuable attributes, helping you identify high-quality prospects faster and with greater precision.
    
    Our enrichment process adds the following data points to each company:
    
    -   **Industry** – Understand the market in which the company operates.
    -   **Company Size (Headcount)** – Target organizations based on their scale, from startups to large enterprises.
    -   **Estimated Revenue** – Gauge the financial size of a company to prioritize outreach.
    -   **Funding Details** – See how much capital the company has raised and from which investors.
    -   **Headquarters Location** – Identify geographic focus areas for your go-to-market strategy.
    -   **LinkedIn URL** – Access their LinkedIn profile for further context and contact discovery.
    -   **Company domain** – Find the company's official website.
    -   **Possible domains** – Find all the possible domains for the company.
    -   **Company Description** – Gain a quick overview of the company's mission, products, or services.
    
    To ensure accuracy and broad coverage, we aggregate and validate this data from multiple trusted providers. While we strive for comprehensive enrichment, please note that coverage may vary depending on the availability of external data—some fields may not be available for every company.
    
    #### Salary enrichment
    
    Salary data in job postings presents a unique challenge: it's scattered across different currencies, formats, and regions. A job posting from London might show £45,000, while one from New York shows $65,000, and another from Berlin shows €55,000. Without proper standardization, comparing these salaries becomes nearly impossible for global hiring teams and compensation analysts.
    
    Our salary enrichment system transforms this fragmented data into actionable insights. We automatically detect salary information in job postings, convert all amounts to USD using real-time exchange rates, and preserve the original currency data. This dual approach gives you the flexibility to analyze global compensation trends while maintaining local market context.
    
    Here's what our salary enrichment delivers:
    
    -   **Multi-currency support** – Handle salaries in 50+ currencies with automatic USD conversion
    -   **Dual filtering options** – Search by original currency or standardized USD amounts
    -   **Live exchange rates** – Conversions use the exchange rate in effect when the job was posted
    -   **Range preservation** – Capture both minimum and maximum salary when available
    
    Each job posting with salary information includes minimum and maximum salary fields, with amounts available in USD alongside the original currency. This lets you filter and compare compensation across regions in either currency.
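
    The conversion step can be sketched as follows. The rates in the table are placeholders, not real market data, and the `Salary` field names and `enrich_salary` function are hypothetical rather than our actual schema.

    ```python
    from dataclasses import dataclass

    # Placeholder rates for illustration only; not real market data.
    RATES_TO_USD = {"USD": 1.0, "GBP": 1.25, "EUR": 1.1}


    @dataclass
    class Salary:
        min_amount: float   # as found in the posting
        max_amount: float
        currency: str       # original currency is preserved
        min_usd: float      # converted values for cross-region comparison
        max_usd: float


    def enrich_salary(min_amount, max_amount, currency, rates=RATES_TO_USD) -> Salary:
        """Keep the original range and add USD equivalents."""
        rate = rates[currency]
        return Salary(
            min_amount,
            max_amount,
            currency,
            round(min_amount * rate, 2),
            round(max_amount * rate, 2),
        )


    print(enrich_salary(45_000, 55_000, "GBP"))
    ```

    Storing both the original and converted values is what makes filtering possible in either currency.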
    
6.  ### Quality control
    
    Data quality isn't a one-time check—it's an ongoing process. Our team combines automated monitoring with manual review to catch issues before they reach your application.
    
    -   **Automated monitoring** – Pipelines track anomalies in real time: unexpected drops in job volume, spikes in duplicates, missing fields, or broken source integrations. Alerts trigger immediately when thresholds are breached.
    -   **Manual review** – Data analysts audit samples daily, verifying that normalization, deduplication, and enrichment are producing accurate results across sources and regions.
    -   **Issue resolution** – When a problem is detected—whether a source changes its HTML structure or an enrichment provider returns stale data—our team investigates, fixes the root cause, and reprocesses affected records.
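
    As one example of what an automated check might look like, the sketch below flags a source whose daily job volume falls well below its recent average. The 50% threshold and the function name `volume_dropped` are assumptions chosen for illustration, not our actual alerting rules.

    ```python
    from statistics import mean


    def volume_dropped(recent_daily_counts, todays_count, max_drop=0.5):
        """Return True if today's count is more than `max_drop` below the recent mean."""
        baseline = mean(recent_daily_counts)
        return todays_count < baseline * (1 - max_drop)


    # With a baseline of 1,000 jobs per day, 400 jobs would trigger an alert.
    print(volume_dropped([1000, 1100, 900], 400))
    ```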

## FAQ

### Do you track expired jobs?

Not yet. Today, our focus is on collecting and ingesting every job posting we find.

If your use case needs "active-only" jobs, our current recommendation is:

-   Filter to jobs posted in the last 1-2 weeks.
-   Use a shorter cutoff (around 1 week since first seen) if you want to minimize publishing jobs that may already be closed.
-   Use a longer cutoff (up to 1 month since first seen) if you want to maximize inventory, accepting that some jobs may already be closed.

Tracking expirations reliably requires additional follow-up crawling requests and more proxy traffic, which increases infrastructure cost. That extra cost would need to be passed to customers who need this capability.

If this is important for your use case, email us at [hi@theirstack.com](mailto:hi@theirstack.com?subject=Expired%20jobs%20tracking%20interest&body=Hi%20TheirStack%20team%2C%0A%0AWe%20are%20interested%20in%20expired%20jobs%20tracking.%0AWe%20would%20be%20willing%20to%20pay%20an%20extra%20___%25%20for%20this%20capability.%0A%0AUse%20case%3A%20___%0A%0ACompany%3A%20___%0A) and we'll take your request into consideration to prioritize that feature.
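
The cutoff recommendations above amount to a simple date filter on your side. Here is a minimal sketch, assuming each job is a dict with a timezone-aware `first_seen` datetime. That field name and the function `filter_recent` are hypothetical and may not match the names in your export.

```python
from datetime import datetime, timedelta, timezone


def filter_recent(jobs, max_age_days, now=None):
    """Keep jobs first seen within the last `max_age_days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [job for job in jobs if job["first_seen"] >= cutoff]


# 7 days favors freshness; 30 days favors inventory.
sample = [{"id": 1, "first_seen": datetime.now(timezone.utc) - timedelta(days=3)}]
print(len(filter_recent(sample, max_age_days=7)))
```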