The situation before we got involved
Law firms do a lot of work that doesn't feel like legal work. Due diligence in M&A transactions, private equity deals, and commercial finance means reading hundreds - sometimes thousands - of contracts and extracting specific clause-level information: indemnity caps, change of control triggers, assignment restrictions, termination rights, governing law.
For this particular firm, that meant a team of roughly 40 junior associates and paralegals whose primary job was reading contracts and filling in review matrices. Each document took an average of 3.2 hours to review. At 80,000 contracts annually across 14 jurisdictions, the maths was punishing: the manual review function cost $4.2 million per year and was still a bottleneck during busy deal periods.
The quality problem was arguably worse than the cost problem. Different reviewers interpreted the same clause language differently. A limitation of liability clause structured one way in English law looks materially different to the same economic protection drafted under New York law - and junior reviewers, especially those less experienced in a particular jurisdiction, made errors that seniors had to catch during QA. The QA layer itself consumed significant partner time.
"We weren't looking for AI to do the legal thinking. We were looking for AI to handle the reading - so our lawyers could focus on the judgment calls that actually needed a lawyer."
Why this was harder than it looked
Contract review sounds like a natural fit for a language model. You have a document; you want to extract specific information. In practice, three things make it genuinely difficult at production scale.
Jurisdiction sensitivity
The same clause concept - say, "material adverse change" - is defined, interpreted, and litigated differently across English law, US law, German law, Hong Kong law, and UAE law. A system that treats all contracts as if they follow the same conventions will produce unreliable output. The model needed to understand not just what a clause said, but what it meant in the context of the governing law.
Document structure variation
Contracts don't have a standard format. A Share Purchase Agreement from a magic circle firm looks nothing like one drafted by a mid-market US firm, which looks nothing like a deal document from a Gulf-based practice. Chunking strategy that works well on one document type fails on another. We'd seen this in earlier experiments where naive chunking lost clause context that only became clear when reading the surrounding provisions.
Precision over recall
In most RAG applications, missing a result is annoying. In legal review, missing a change of control trigger in an acquisition target's key contracts can cost a client millions. The system needed to be tuned for precision - and where it wasn't confident, it needed to say so clearly rather than generating a plausible-sounding but incorrect answer.
How the system works
The architecture has five distinct stages, each of which we iterated on significantly before reaching the production version.
What went wrong, and what we learned
We had two false starts before getting the architecture right. Documenting them is more useful than pretending the path was straight.
False start #1 - Single vector store for all jurisdictions
The first version used one Pinecone index for everything. Retrieval quality was acceptable for English law contracts (which dominated the training data) and noticeably worse for UAE and Hong Kong law documents. The model was essentially answering questions about DIFC contracts by retrieving semantically similar English law precedents - plausible-looking answers that were jurisdictionally wrong. Splitting into 14 jurisdiction-specific indexes resolved this at the cost of four weeks of re-indexing and additional monthly infrastructure spend.
False start #2 - Semantic chunking
Standard semantic chunking breaks a document into roughly equal-sized chunks based on embedding similarity. For contract review this was a mistake: a limitation of liability clause is often split across three or four sentences that individually look like boilerplate but collectively define the economic protection. The clause-boundary detection model we trained resolved this, but it required annotating 50,000 contracts by hand - a six-week effort that we'd underscoped in the original project plan.
The escalation threshold calibration
Setting the confidence threshold for human escalation required careful calibration against the client's risk tolerance. At 90% confidence (escalate anything below), 22% of documents were going to human review - too high for the system to provide meaningful cost savings. At 98%, the escalation rate dropped to 1.8%, but the system was occasionally auto-closing contracts with genuinely ambiguous indemnity structures. We settled on 94% with clause-type-specific overrides: change of control triggers and limitation of liability clauses always go to human review if confidence is below 97%, regardless of the overall document score.
The results, twelve months in
The system went live in Q2 2024. The numbers below are from the twelve months ending Q1 2025.
What happened to the 40-person team
This question comes up in every conversation about this project, and it's worth addressing directly.
The 40 reviewers were not made redundant. The firm's deal volume had been constrained by review capacity - they were turning away work during busy periods because the team couldn't process it fast enough. With that constraint removed, the firm took on more transactions. The team shifted from mechanical document review to the higher-value advisory work that the AI system's escalation queue generates: interpreting genuinely ambiguous clauses, advising clients on risk tolerance, and drafting negotiation positions on flagged provisions.
The 4% of contracts that escalate to human review are the hardest 4% - the documents with unusual structures, novel clause language, or cross-jurisdictional complexity that genuinely benefit from an experienced lawyer's judgment. The team that previously spent 96% of their time doing work a computer can now do spends all of their time doing work that actually needs them.
Three things we'd do differently
Start the fine-tuning data collection earlier. The clause-boundary annotation work and the generation fine-tune data both required significant client involvement to get right. We started those conversations late and they became the critical path. For a project like this, annotation effort should be scoped and resourced in week one.
Build the escalation UI in parallel with the core system. We treated the lawyer-facing escalation portal as a phase two deliverable and built it after the core pipeline was complete. In practice, the escalation UX directly affected how lawyers calibrated their trust in the system - the better the portal, the more useful the feedback loop for threshold tuning. It should have been a day-one priority.
Price the infrastructure more conservatively at the outset. 14 Pinecone indexes plus GPT-4o fine-tune inference at 80,000 contracts per year adds up to meaningful monthly spend. The ROI is compelling at scale, but the infrastructure cost estimate in the initial proposal was optimistic. We'd price this more carefully now.