Semantic Job Matching AI: 3× Application-to-Hire Rate

The situation before we got involved

Recruitment platforms live and die on match quality. When a candidate searches for a role and the results feel irrelevant, they don't complain - they leave. When a recruiter posts a vacancy and the applicants coming through are poorly suited, they question whether the platform is worth the subscription. Match quality is the product.

This platform was running a keyword-based search layer built on Elasticsearch's standard BM25 scoring. It had worked fine at an earlier stage of the business - when the job listing catalogue was small enough that a handful of irrelevant results in a set of twenty still left plenty of good options. At 40,000 live listings and 200,000 active candidates, the problem had become structural. A candidate searching for "data analyst with Python experience" was receiving listings for data entry clerks, Python developer roles that required a decade of seniority they didn't have, and analyst roles in unrelated sectors - because BM25 matched on the presence of tokens, not on the meaning behind them.

The client's product team had tracked a steady six-month decline in what they called their "application quality score" - an internal metric combining application-to-interview rate and application-to-hire rate across the platform. Both had been falling. Exit survey data pointed directly at search relevance as the primary frustration. The platform had started losing paying recruiter accounts to a competitor that had shipped a semantic search layer the previous year.

"Keyword search worked when we had three thousand listings. At forty thousand, candidates were drowning in noise. We were losing people at the search results page - the most critical moment in the whole product experience."

Why this was harder than it looked

Semantic search in recruitment sounds like a well-trodden path. General-purpose embedding models exist, Elasticsearch has native vector search support, and plenty of tutorials walk through the basic setup. The challenge here was in three specifics that made a generic deployment inadequate.

Domain vocabulary gap in general-purpose models

General-purpose sentence embedding models are trained on broad internet text. They understand that "software engineer" and "developer" are semantically similar. They do not reliably understand that "Class 2 HGV driver" and "category C licence holder" refer to the same role, that "Salesforce administrator" implies CRM platform experience, or that "NMC PIN required" in a healthcare listing signals a regulatory prerequisite that filters out most of the candidate pool. The labour market has a dense domain-specific vocabulary that general models handle poorly. We needed a model fine-tuned on actual job posting and CV data.

Asymmetric query and document length

A candidate search query is typically 3 to 12 words. A job listing is 200 to 800 words. Standard bi-encoder models produce a single embedding for each, which means a short query embedding needs to match against a document embedding that represents a much richer semantic space. Without careful fine-tuning on this asymmetric pairing, the model systematically undervalued short, specific queries - exactly the type most candidates use.

Personalisation without enough signal

The platform wanted matching that improved over time based on candidate engagement - not just returning semantically similar listings to a query, but learning from which results each candidate clicked, applied to, or dismissed. The challenge was that most candidates had limited interaction history: the median candidate had made fewer than four searches before the new system launched. Cold-start personalisation with sparse signal required a different approach to the warm-state personalisation the product team originally had in mind.

How the system works

The production matching pipeline runs in four stages, with a separate personalisation layer that operates in parallel.

Matching Pipeline - Production Simplified

Query Understanding & Expansion

Incoming search queries go through a lightweight NLP pre-processing step before embedding. Named entity recognition identifies role titles, skill names, location references, and seniority signals. A query expansion module appends normalised synonyms - "HGV" becomes "HGV · LGV · large goods vehicle" before the query reaches the encoder. This step alone reduced null-result searches by 34% in A/B testing.

spaCy NER · Query expansion · Synonym normalisation
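As a rough illustration of the expansion step, the sketch below appends normalised synonyms to a query before it would be embedded. The `SYNONYMS` table and the `expand_query` function are hypothetical names introduced here; the production vocabulary and the NER step are not shown.

```python
import re

# Hypothetical synonym table; the platform's real vocabulary is not given in the text.
SYNONYMS = {
    "hgv": ["HGV", "LGV", "large goods vehicle"],
}

def expand_query(query: str, synonyms: dict[str, list[str]] = SYNONYMS) -> str:
    """Append normalised synonyms for any known term found in the query."""
    lowered = query.lower()
    tokens = re.findall(r"[a-z0-9+#]+", lowered)
    extra: list[str] = []
    for token in tokens:
        for variant in synonyms.get(token, []):
            # Skip variants already present in the query or already queued.
            if variant.lower() not in lowered and variant not in extra:
                extra.append(variant)
    return query if not extra else f"{query} {' '.join(extra)}"

print(expand_query("HGV driver Leeds"))
```

A query with no known terms passes through unchanged, so the step is safe to apply to every search.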

Bi-Encoder Semantic Retrieval

The pre-processed query is encoded by a Sentence-BERT model fine-tuned on 1.2 million (query, listing) pairs from the platform's own historical data - labelled by application outcome. The model produces a 768-dimensional query embedding. An approximate nearest-neighbour search across the Elasticsearch vector index returns the top 200 listings by cosine similarity. At 40k listings, this step completes in under 30ms.

Sentence-BERT fine-tuned · HNSW ANN index · 768-dim embeddings
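The retrieval step ranks listings by cosine similarity to the query embedding. The sketch below shows that ranking with an exact brute-force scan over toy 3-dimensional vectors. It is a stand-in for illustration only: the production system uses an approximate HNSW index over 768-dimensional embeddings, and the names `cosine` and `top_k` are assumptions.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: list[float],
          listing_vecs: dict[str, list[float]],
          k: int = 200) -> list[tuple[str, float]]:
    """Return the k listings most similar to the query, best first (exact scan)."""
    scored = [(lid, cosine(query_vec, vec)) for lid, vec in listing_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for real listing embeddings.
listings = {
    "data-analyst": [0.9, 0.1, 0.0],
    "data-entry": [0.1, 0.9, 0.0],
    "python-dev": [0.6, 0.0, 0.8],
}
results = top_k([1.0, 0.0, 0.0], listings, k=2)
print([lid for lid, _ in results])
```

An exact scan like this costs one similarity computation per listing, which is why an approximate index becomes attractive as the catalogue grows.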

Cross-Encoder Re-Ranking

The top 200 listings from bi-encoder retrieval are re-scored by a cross-encoder model that takes the full (query, listing) pair as input - giving it access to richer interaction signals between query tokens and listing content. The cross-encoder is slower (its scores depend on the query, so they cannot be pre-computed) but more accurate for ranking. The top 20 results from this stage are what the candidate actually sees.

Cross-encoder re-rank · Top-200 → Top-20 · Pair-level scoring
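The shape of the re-ranking stage can be sketched as a function that scores each (query, listing) pair and keeps the best few. The `token_overlap` scorer below is a deliberately crude placeholder, not a cross-encoder; in a real system a trained pair-scoring model would be passed in its place. All names here are assumptions.

```python
from typing import Callable

def rerank(query: str,
           retrieved: list[str],
           pair_score: Callable[[str, str], float],
           keep: int = 20) -> list[str]:
    """Re-score every retrieved listing against the query and keep the best `keep`."""
    ordered = sorted(retrieved, key=lambda text: pair_score(query, text), reverse=True)
    return ordered[:keep]

def token_overlap(query: str, listing: str) -> float:
    """Placeholder scorer: share of query tokens that appear in the listing."""
    q_tokens = set(query.lower().split())
    l_tokens = set(listing.lower().split())
    return len(q_tokens & l_tokens) / len(q_tokens) if q_tokens else 0.0

retrieved = [
    "senior python developer",
    "data entry clerk",
    "junior data analyst python sql",
]
top = rerank("python data analyst", retrieved, token_overlap, keep=2)
print(top)
```

Because the scorer sees the query and the listing together, it has to run once per pair at request time, which is why the size of `retrieved` drives latency.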

Engagement-Signal Personalisation

A lightweight personalisation layer adjusts the final ranking based on a candidate's engagement history. Listings from sectors, seniority bands, and skill clusters where a candidate has previously clicked or applied carry a small ranking boost. For new candidates with fewer than three interactions, the layer falls back to population-level signals - what candidates with similar profile characteristics engaged with - rather than risking cold-start over-personalisation.

Engagement history · Cold-start fallback · Redis signal store
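A minimal sketch of the boost-with-fallback logic, assuming sector is the only signal used. The three-interaction cut-off comes from the description above; the boost size, the data shapes, and the name `personalise` are hypothetical.

```python
MIN_INTERACTIONS = 3   # from the text: fewer than three interactions triggers the fallback
SECTOR_BOOST = 0.05    # hypothetical boost size

def personalise(ranked, history, population_sectors):
    """Re-order (listing_id, sector, score) tuples using engagement signals.

    history: sectors this candidate has clicked or applied in.
    population_sectors: sectors favoured by similar candidates (cold-start fallback).
    """
    if len(history) >= MIN_INTERACTIONS:
        preferred = set(history)
    else:
        preferred = set(population_sectors)
    boosted = [
        (listing_id, score + (SECTOR_BOOST if sector in preferred else 0.0))
        for listing_id, sector, score in ranked
    ]
    return sorted(boosted, key=lambda pair: pair[1], reverse=True)

ranked = [("a", "finance", 0.80), ("b", "healthcare", 0.78), ("c", "logistics", 0.70)]
warm = personalise(ranked, ["healthcare"] * 3, ["logistics"])
cold = personalise(ranked, ["healthcare"], ["logistics"])
print([lid for lid, _ in warm], [lid for lid, _ in cold])
```

Keeping the boost small means personalisation can reorder close results without overriding a clearly stronger semantic match.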

Results Delivery & Feedback Loop

Ranked results are served via the FastAPI layer into the existing platform frontend - the product team retained full control over display logic. Every impression, click, application, and dismissal is logged back into the training pipeline via Celery tasks. The fine-tuned model is retrained on a rolling 90-day window every two weeks, meaning match quality continues to improve as the platform accumulates more outcome data.

FastAPI results layer · 90-day retraining · Outcome feedback loop
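The rolling window amounts to a timestamp filter applied to the logged events before each retraining run. A sketch, assuming each event is a dict with a `ts` datetime field (the real event schema is not described):

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=90)  # rolling window length stated in the text

def training_window(events: list[dict], now: datetime) -> list[dict]:
    """Keep only events logged within the 90 days up to `now`."""
    cutoff = now - WINDOW
    return [event for event in events if cutoff <= event["ts"] <= now]

# Hypothetical logged events.
events = [
    {"type": "application", "ts": datetime(2024, 9, 15)},
    {"type": "click", "ts": datetime(2024, 7, 10)},
    {"type": "dismissal", "ts": datetime(2024, 6, 1)},
]
recent = training_window(events, now=datetime(2024, 10, 1))
print(len(recent))
```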

What went wrong, and what we learned

Two decisions in the early build phase cost us time and produced worse results than the approach we eventually settled on.

False start #1 - Off-the-shelf embeddings without domain fine-tuning

The first prototype used a general-purpose Sentence-BERT model with no domain adaptation - the logic being that we'd validate the architecture before investing in fine-tuning. In offline evaluation against a held-out labelled set, the general model scored a Normalised Discounted Cumulative Gain (NDCG@10) of 0.61. After fine-tuning on the platform's own historical application data, that figure rose to 0.79 - a substantial and practically meaningful difference at the scale the platform operates. The lesson here isn't surprising in retrospect: recruitment language is specific enough that generic embeddings leave a lot of match quality on the table. Domain fine-tuning should have been a week-one activity, not a later iteration.
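For readers unfamiliar with the metric, NDCG@10 rewards placing relevant listings near the top of the first ten results. The sketch below uses the common log2 discount and, as a simplifying assumption, normalises against the ideal ordering of the returned list only. The relevance labels are invented for illustration.

```python
import math

def dcg(relevances: list[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k results."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg(relevances: list[float], k: int = 10) -> float:
    """DCG divided by the DCG of the best possible ordering of the same labels."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal else 0.0

# 1 = the candidate applied, 0 = no application (illustrative labels).
print(round(ndcg([1, 0, 1, 0, 0]), 4))
```

A score of 1.0 means the relevant listings are already in the best possible order; pushing a relevant listing further down the list lowers the score.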

False start #2 - Re-ranking the full candidate set

An early design passed all matching listings - sometimes 800 to 1,200 for broad queries - through the cross-encoder re-ranker. The latency was unacceptable: P95 response times exceeded 4 seconds for popular search terms. The two-stage architecture (bi-encoder retrieves 200, cross-encoder re-ranks to 20) reduced P95 latency to 78ms. The quality loss from retrieving 200 rather than all matching listings was negligible in practice - the listings that the bi-encoder failed to surface in the top 200 were not ones the cross-encoder would have promoted to the top 20 in any case.
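P95 here means the response time that 95% of requests come in under. One common way to compute it is the nearest-rank method, sketched below on made-up samples; the measurement method actually used on the platform is not stated.

```python
import math

def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Illustrative latencies in milliseconds, 1..100.
latencies_ms = list(range(1, 101))
print(percentile(latencies_ms, 95))
```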

The job listing data quality problem

Roughly 18% of the platform's job listings had titles or descriptions that were poorly written - truncated, missing key skill information, or formatted in ways that confused the encoder. A listing titled "Driver needed ASAP - good pay" carries almost no semantic signal. Rather than trying to fix the encoder's handling of low-quality listings, we built a listing quality scoring system that flags poor listings to the recruiter dashboard with specific improvement suggestions. Listings with quality scores below a threshold were excluded from semantic indexing and surfaced only for exact keyword matches. Recruiter completion rates for flagged listings improved by 41% within six weeks of launch.
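A listing quality check of this kind can be sketched as a handful of rules that deduct points and emit a suggestion each. Every rule, weight, and threshold below is hypothetical; the text does not describe the actual scoring criteria.

```python
VAGUE_TERMS = {"asap", "urgent", "good pay"}  # hypothetical filler phrases
INDEX_THRESHOLD = 50                          # hypothetical cut-off for semantic indexing

def listing_quality(title: str, description: str, skills: list[str]) -> tuple[int, list[str]]:
    """Return a 0-100 score and improvement suggestions for the recruiter."""
    score, tips = 100, []
    if any(term in title.lower() for term in VAGUE_TERMS):
        score -= 30
        tips.append("Replace filler in the title with the specific role name.")
    if len(description.split()) < 50:
        score -= 40
        tips.append("Expand the description; it is too short to describe the role.")
    if not skills:
        score -= 30
        tips.append("List the key skills or licences required.")
    return score, tips

def index_mode(score: int) -> str:
    """Low-scoring listings are served for exact keyword matches only."""
    return "semantic" if score >= INDEX_THRESHOLD else "keyword_only"

score, tips = listing_quality("Driver needed ASAP - good pay", "Start immediately.", [])
print(score, index_mode(score), len(tips))
```

Returning the suggestions alongside the score is what lets the same check drive both the indexing decision and the recruiter-facing prompts.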

The results, six months in

The semantic matching layer went live in Q2 2024. The figures below cover the six months through to Q4 2024, compared to the six-month period immediately before deployment.

| Metric | Before (6-month baseline) | After (6 months live) |
| --- | --- | --- |
| Application-to-hire rate | 8.2% | 24.7% (3×) |
| Application-to-interview rate | 19.4% | 41.1% |
| Candidate search session length | 2.1 min avg | 4.8 min avg |
| Zero-result search rate | 11.3% | 7.4% |
| Recruiter account churn (monthly) | 4.1% | 1.8% |
| Listing quality score (platform avg) | 61 / 100 | 74 / 100 |
| P95 search response time | 210ms (keyword only) | 78ms (full pipeline) |

3× application-to-hire rate improvement (↑ from 8.2% to 24.7%)

78ms P95 search response for the full semantic pipeline (↓ from 210ms keyword baseline)

1.8% monthly recruiter churn, down from 4.1% (↓ 56% reduction)
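The headline figures follow directly from the before and after values reported above. A quick check of the arithmetic (the helper names are ours):

```python
def uplift(before: float, after: float) -> float:
    """How many times larger the 'after' value is than the 'before' value."""
    return after / before

def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from 'before' to 'after'."""
    return (before - after) / before

hire_uplift = uplift(8.2, 24.7)                  # application-to-hire rate
churn_reduction = relative_reduction(4.1, 1.8)   # monthly recruiter churn

print(round(hire_uplift, 2))      # 3.01
print(round(churn_reduction, 2))  # 0.56
```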

What the product team did with the improvement

The match quality gains freed up product bandwidth that had previously been consumed by complaint triage and recruiter retention firefighting. With churn falling and candidate satisfaction scores recovering, the product team shifted focus toward features they'd been unable to prioritise - saved search alerts, proactive candidate outreach for recently posted listings, and a recruiter-facing analytics dashboard surfacing match quality data at the listing level.

The listing quality scoring system - which started as an engineering fix to a data cleanliness problem - became a product feature in its own right. Recruiters responded well to concrete, actionable feedback on their listings. The platform now positions it as a competitive differentiator: "we tell you why your listing isn't converting, and what to change."

The personalisation layer also created a new data asset. The engagement signal store now contains 18 months of timestamped, outcome-labelled interaction data across 200,000 candidates. That dataset has been used to train a salary expectation model and a career trajectory prediction model that the product team plans to surface to candidates as guidance features in 2025.

Three things we'd do differently

Fine-tune on domain data in the first week. The gap between general-purpose embeddings and domain-fine-tuned embeddings in recruitment is large enough that any prototype built without fine-tuning produces misleading quality signals. We wasted two weeks validating architecture decisions against a model that was never going to be the production model. For any NLP project with a large, labelled interaction dataset available, domain adaptation should precede prototyping.

Build the listing quality layer into the original scope. We discovered the data quality problem during development and solved it under time pressure. A listing quality system that feeds recruiter-facing improvement prompts is a better long-term solution than trying to make the model robust to poor input - but it needed to be designed deliberately, not bolted on. For future projects, a data quality audit of the document corpus should happen before architecture decisions are made.

Set engagement-based retraining expectations with the product team early. The model improves as it accumulates outcome data - but the improvement is gradual, and the first two months of live data add the most signal. The product team expected quality to plateau after launch; in practice, application-to-hire rates continued climbing for four months post-launch as the retraining pipeline accumulated better signal. Communicating that trajectory clearly and early would have reduced pressure on the project team during the first weeks of live operation.

Semantic job matching that tripled the application-to-hire rate - across 200k candidates.

Have a matching or search product where relevance is the core problem?
