The situation before we got involved
Recruitment platforms live and die on match quality. When a candidate searches for a role and the results feel irrelevant, they don't complain - they leave. When a recruiter posts a vacancy and the applicants coming through are poorly suited, they question whether the platform is worth the subscription. Match quality is the product.
This platform was running a keyword-based search layer built on Elasticsearch's standard BM25 scoring. It had worked fine at an earlier stage of the business - when the job listing catalogue was small enough that a handful of irrelevant results in a set of twenty still left plenty of good options. At 40,000 live listings and 200,000 active candidates, the problem had become structural. A candidate searching for "data analyst with Python experience" was receiving listings for data entry clerks, Python developer roles that required a decade of seniority they didn't have, and analyst roles in unrelated sectors - because BM25 matched on the presence of tokens, not on the meaning behind them.
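To make the token-matching failure concrete, here is a minimal sketch of textbook BM25 scoring in plain Python. It is not Elasticsearch's implementation: the whitespace tokeniser, the function name `bm25_scores` and the three sample listings are assumptions for illustration. Only the defaults `k1=1.2` and `b=0.75` mirror Elasticsearch's documented BM25 defaults.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with textbook BM25."""
    tokenised = [d.lower().split() for d in docs]  # naive whitespace tokeniser (assumption)
    n_docs = len(tokenised)
    avg_len = sum(len(d) for d in tokenised) / n_docs
    # Number of documents containing each term
    doc_freq = Counter(term for d in tokenised for term in set(d))
    scores = []
    for doc in tokenised:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # only exact token overlap contributes to the score
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Hypothetical listings, invented for illustration only
listings = [
    "data entry clerk",
    "senior python developer",
    "warehouse operative night shift",
]
scores = bm25_scores("data analyst with python experience", listings)
```

The data entry listing scores above zero purely because it shares the token "data" with the query, and the senior developer listing scores because it shares "python". Nothing in the formula knows that neither is an analyst role at the right level.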
The client's product team had tracked a steady six-month decline in what they called their "application quality score" - an internal metric combining application-to-interview rate and application-to-hire rate across the platform. Both had been falling. Exit survey data pointed directly at search relevance as the primary frustration. The platform had started losing paying recruiter accounts to a competitor that had shipped a semantic search layer the previous year.
"Keyword search worked when we had three thousand listings. At forty thousand, candidates were drowning in noise. We were losing people at the search results page - the most critical moment in the whole product experience."
Why this was harder than it looked
Semantic search in recruitment sounds like a well-trodden path. General-purpose embedding models exist, Elasticsearch has native vector search support, and plenty of tutorials walk through the basic setup. The challenge here was in three specifics that made a generic deployment inadequate.
Domain vocabulary gap in general-purpose models
General-purpose sentence embedding models are trained on broad internet text. They understand that "software engineer" and "developer" are semantically similar. They do not reliably understand that "Class 2 HGV driver" and "category C licence holder" refer to the same role, that "Salesforce administrator" implies CRM platform experience, or that "NMC PIN required" in a healthcare listing signals a regulatory prerequisite that filters out most of the candidate pool. The labour market has a dense domain-specific vocabulary that general models handle poorly. We needed a model fine-tuned on actual job posting and CV data.
Asymmetric query and document length
A candidate search query is typically 3 to 12 words. A job listing is 200 to 800 words. Standard bi-encoder models produce a single embedding for each, which means a short query embedding needs to match against a document embedding that represents a much richer semantic space. Without careful fine-tuning on this asymmetric pairing, the model systematically undervalued short, specific queries - exactly the type most candidates use.
Personalisation without enough signal
The platform wanted matching that improved over time based on candidate engagement - not just returning semantically similar listings to a query, but learning from which results each candidate clicked, applied to, or dismissed. The challenge was that most candidates had limited interaction history: the median candidate had made fewer than four searches before the new system launched. Cold-start personalisation with sparse signal required a different approach from the warm-state personalisation the product team originally had in mind.
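One common way to handle sparse signal is to shrink the personal component towards the non-personalised score until a candidate has enough history. The sketch below illustrates that general idea only. It is an assumption, not a description of the approach used on this platform, and the names `blend_weight`, `personalised_score` and `prior_strength` are hypothetical.

```python
def blend_weight(n_interactions, prior_strength=10.0):
    """Weight given to the personal signal; 0 with no history, approaching 1 as history grows."""
    return n_interactions / (n_interactions + prior_strength)

def personalised_score(semantic_score, personal_score, n_interactions, prior_strength=10.0):
    """Blend a query-only relevance score with a per-candidate preference score."""
    w = blend_weight(n_interactions, prior_strength)
    return (1 - w) * semantic_score + w * personal_score

# A brand-new candidate gets the pure semantic score
cold = personalised_score(0.8, 0.2, n_interactions=0)
# After 10 interactions (equal to prior_strength) the two signals are weighted equally
warm = personalised_score(0.8, 0.2, n_interactions=10)
```

With a median of fewer than four searches per candidate, a scheme like this would leave most rankings dominated by the semantic score at launch, with the personal signal phasing in as interactions accumulate.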
How the system works
The production matching pipeline runs in four stages, with a separate personalisation layer that operates in parallel.
What went wrong, and what we learned
Two decisions in the early build phase cost us time and produced worse results than the approach we eventually settled on.
False start #1 - Off-the-shelf embeddings without domain fine-tuning
The first prototype used a general-purpose Sentence-BERT model with no domain adaptation - the logic being that we'd validate the architecture before investing in fine-tuning. In offline evaluation against a held-out labelled set, the general model scored a Normalised Discounted Cumulative Gain (NDCG@10) of 0.61. After fine-tuning on the platform's own historical application data, that figure rose to 0.79 - a substantial and practically meaningful difference at the scale the platform operates. The lesson here isn't surprising in retrospect: recruitment language is specific enough that generic embeddings leave a lot of match quality on the table. Domain fine-tuning should have been a week-one activity, not a later iteration.
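For readers unfamiliar with the metric, the following is a minimal sketch of NDCG@k using the linear-gain form of DCG. The graded relevance labels in the example are invented, and the ideal ranking is computed from the returned list alone, which is a simplification: a full evaluation would compute it over every labelled listing for the query.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: relevance discounted by log2 of the 1-based rank plus one."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the ranking as returned, divided by the DCG of the best possible ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels in the order the system returned the listings (hypothetical)
perfect = ndcg_at_k([3, 2, 1, 0])    # already in ideal order
swapped = ndcg_at_k([0, 1])          # the only relevant listing is ranked second
```

A score of 1.0 means the ranking matches the ideal ordering, and the score falls as relevant listings slip down the page. On that scale, moving from 0.61 to 0.79 is an absolute gain of 0.18, or roughly 30% relative to the baseline.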
False start #2 - Re-ranking the full candidate set
An early design passed every matching listing - sometimes 800 to 1,200 for broad queries - through the cross-encoder re-ranker. The latency was unacceptable: P95 response times exceeded 4 seconds for popular search terms. The two-stage architecture (bi-encoder retrieves 200 listings, cross-encoder re-ranks them and returns the top 20) reduced P95 latency to 78ms. The quality loss from retrieving 200 listings rather than the full match set was negligible in practice: the listings that the bi-encoder failed to surface in the top 200 were not ones the cross-encoder would have promoted into the top 20 anyway.
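The retrieve-then-re-rank pattern can be sketched in a few lines. This is an illustration under stated assumptions rather than the production code: `two_stage_search` and `rerank_fn` are hypothetical names, the embeddings are toy two-dimensional vectors, the first stage is a brute-force dot product where a real deployment would use a vector index, and the cross-encoder is replaced by a stand-in scoring function.

```python
import heapq

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_stage_search(query_vec, listing_vecs, rerank_fn, retrieve_k=200, final_k=20):
    """Return listing indices: cheap similarity shortlist, then expensive re-ranking."""
    # Stage 1: bi-encoder similarity against every listing embedding (cheap per listing)
    shortlist = heapq.nlargest(
        retrieve_k,
        range(len(listing_vecs)),
        key=lambda i: dot(query_vec, listing_vecs[i]),
    )
    # Stage 2: the expensive scorer runs only on the shortlist, never the full match set
    reranked = sorted(shortlist, key=rerank_fn, reverse=True)
    return reranked[:final_k]

# Toy data: four listings, shortlist of two, final result of one
listing_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
calls = []

def fake_cross_encoder(listing_index):
    """Stand-in for a cross-encoder: records each call and returns a fixed score."""
    calls.append(listing_index)
    return {0: 0.1, 1: 0.9}.get(listing_index, 0.0)

top = two_stage_search([1.0, 0.0], listing_vecs, fake_cross_encoder, retrieve_k=2, final_k=1)
```

The latency saving comes from the size of the second stage: the expensive scorer is called `retrieve_k` times per query regardless of how many listings match, which is why capping the shortlist at 200 bounds response time even for broad queries.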
The job listing data quality problem
Roughly 18% of the platform's job listings had titles or descriptions that were poorly written - truncated, missing key skill information, or formatted in ways that confused the encoder. A listing titled "Driver needed ASAP - good pay" carries almost no semantic signal. Rather than trying to fix the encoder's handling of low-quality listings, we built a listing quality scoring system that flags poor listings to the recruiter dashboard with specific improvement suggestions. Listings with quality scores below a threshold were excluded from semantic indexing and surfaced only for exact keyword matches. Recruiter completion rates for flagged listings improved by 41% within six weeks of launch.
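The shape of such a quality gate can be sketched as follows. The specific rules, the word-count cut-offs, the 0.5 threshold and the names `score_listing` and `index_target` are all hypothetical; the source describes only that listings are scored, flagged with suggestions, and routed to keyword-only matching when they fall below a threshold.

```python
def score_listing(title, description, skills):
    """Return a quality score in [0, 1] plus recruiter-facing improvement suggestions."""
    suggestions = []
    # Rule 1 (hypothetical): very short or urgency-only titles carry little semantic signal
    if len(title.split()) < 3 or "asap" in title.lower():
        suggestions.append("Use a specific job title that names the role.")
    # Rule 2 (hypothetical): very short descriptions are likely truncated
    if len(description.split()) < 50:
        suggestions.append("Expand the description; it looks truncated or too short.")
    # Rule 3 (hypothetical): no skills or qualifications listed
    if not skills:
        suggestions.append("List the key skills or qualifications required.")
    score = 1.0 - len(suggestions) / 3
    return score, suggestions

def index_target(score, threshold=0.5):
    """Route low-quality listings to keyword-only matching instead of the semantic index."""
    return "semantic" if score >= threshold else "keyword_only"

poor_score, poor_tips = score_listing("Driver needed ASAP - good pay", "", [])
good_score, good_tips = score_listing(
    "Class 2 HGV driver", " ".join(["word"] * 60), ["Category C licence"]
)
```

Returning the suggestions alongside the score is what lets the same check serve two purposes: the score decides how the listing is indexed, and the suggestions give the recruiter something specific to change.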
The results, six months in
The semantic matching layer went live in Q2 2024. The figures below cover the six months through to Q4 2024, compared to the six-month period immediately before deployment.
What the product team did with the improvement
The match quality gains freed up product bandwidth that had previously been consumed by complaint triage and recruiter retention firefighting. With churn falling and candidate satisfaction scores recovering, the product team shifted focus toward features they'd been unable to prioritise - saved search alerts, proactive candidate outreach for recently posted listings, and a recruiter-facing analytics dashboard surfacing match quality data at the listing level.
The listing quality scoring system - which started as an engineering fix to a data cleanliness problem - became a product feature in its own right. Recruiters responded well to concrete, actionable feedback on their listings. The platform now positions it as a competitive differentiator: "we tell you why your listing isn't converting, and what to change."
The personalisation layer also created a new data asset. The engagement signal store now contains 18 months of timestamped, outcome-labelled interaction data across 200,000 candidates. That dataset has been used to train a salary expectation model and a career trajectory prediction model that the product team plans to surface to candidates as guidance features in 2025.
Three things we'd do differently
Fine-tune on domain data in the first week. The gap between general-purpose embeddings and domain-fine-tuned embeddings in recruitment is large enough that any prototype built without fine-tuning produces misleading quality signals. We wasted two weeks validating architecture decisions against a model that was never going to be the production model. For any NLP project with a large, labelled interaction dataset available, domain adaptation should precede prototyping.
Build the listing quality layer into the original scope. We discovered the data quality problem during development and solved it under time pressure. A listing quality system that feeds recruiter-facing improvement prompts is a better long-term solution than trying to make the model robust to poor input - but it needed to be designed deliberately, not bolted on. For future projects, a data quality audit of the document corpus should happen before architecture decisions are made.
Set engagement-based retraining expectations with the product team early. The model improves as it accumulates outcome data - but the improvement is gradual, and the first two months of live data add the most signal. The product team expected quality to plateau after launch; in practice, application-to-hire rates continued climbing for four months post-launch as the retraining pipeline accumulated better signal. Communicating that trajectory clearly and early would have reduced pressure on the project team during the first weeks of live operation.