Legal Contract Analysis RAG System | Sequere AI Case Study

The situation before we got involved

Law firms do a lot of work that doesn't feel like legal work. Due diligence in M&A transactions, private equity deals, and commercial finance means reading hundreds - sometimes thousands - of contracts and extracting specific clause-level information: indemnity caps, change of control triggers, assignment restrictions, termination rights, governing law.

For this particular firm, that meant a team of roughly 40 junior associates and paralegals whose primary job was reading contracts and filling in review matrices. Each document took an average of 3.2 hours to review. At 80,000 contracts annually across 14 jurisdictions, the maths was punishing: the manual review function cost $4.2 million per year and was still a bottleneck during busy deal periods.

The quality problem was arguably worse than the cost problem. Different reviewers interpreted the same clause language differently. A limitation of liability clause structured one way in English law looks materially different to the same economic protection drafted under New York law - and junior reviewers, especially those less experienced in a particular jurisdiction, made errors that seniors had to catch during QA. The QA layer itself consumed significant partner time.

"We weren't looking for AI to do the legal thinking. We were looking for AI to handle the reading - so our lawyers could focus on the judgment calls that actually needed a lawyer."

Why this was harder than it looked

Contract review sounds like a natural fit for a language model. You have a document; you want to extract specific information. In practice, three things make it genuinely difficult at production scale.

Jurisdiction sensitivity

The same clause concept - say, "material adverse change" - is defined, interpreted, and litigated differently across English law, US law, German law, Hong Kong law, and UAE law. A system that treats all contracts as if they follow the same conventions will produce unreliable output. The model needed to understand not just what a clause said, but what it meant in the context of the governing law.

Document structure variation

Contracts don't have a standard format. A Share Purchase Agreement from a magic circle firm looks nothing like one drafted by a mid-market US firm, which looks nothing like a deal document from a Gulf-based practice. A chunking strategy that works well on one document type fails on another. We'd seen this in earlier experiments, where naive chunking lost clause context that only became clear when reading the surrounding provisions.

Precision over recall

In most RAG applications, a missing or wrong result is annoying. In legal review, missing or misreporting a change of control trigger in an acquisition target's key contracts can cost a client millions. The system needed to be tuned for precision: every answer it gave had to be reliable, and where it wasn't confident, it needed to say so clearly rather than generate a plausible-sounding but incorrect answer.

How the system works

The architecture has five distinct stages, each of which we iterated on significantly before reaching the production version.

System Architecture - Production Simplified

Document Ingestion & Pre-processing

Contracts arrive as PDFs, Word documents, or scanned images via secure S3 upload. OCR (Tesseract + AWS Textract) handles scanned documents. A document classifier identifies contract type and governing law jurisdiction before anything else happens.

S3 + Textract | Tesseract OCR | Jurisdiction classifier
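The ordering in this stage (get text out, falling back to OCR for scans, then classify before anything else runs) can be sketched roughly as follows. This is a minimal illustration, not the production code: `extract_text`, `run_ocr`, and `classify` are hypothetical callables standing in for the real parsers, the Tesseract/Textract calls, and the document classifier, and the list of image suffixes is an assumption.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Tuple

# Assumed set of file types that have no text layer and always need OCR.
SCANNED_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


@dataclass
class IngestedDocument:
    path: str
    text: str
    contract_type: str
    jurisdiction: str
    used_ocr: bool


def ingest(
    path: str,
    extract_text: Callable[[str], str],            # hypothetical PDF/Word parser
    run_ocr: Callable[[str], str],                 # hypothetical OCR wrapper
    classify: Callable[[str], Tuple[str, str]],    # returns (contract_type, jurisdiction)
) -> IngestedDocument:
    """Extract text (OCR for scans), then classify before any later stage runs."""
    suffix = Path(path).suffix.lower()
    needs_ocr = suffix in SCANNED_SUFFIXES
    text = run_ocr(path) if needs_ocr else extract_text(path)
    # A PDF with an empty text layer is treated as a scan and retried via OCR.
    if not needs_ocr and suffix == ".pdf" and not text.strip():
        text, needs_ocr = run_ocr(path), True
    contract_type, jurisdiction = classify(text)
    return IngestedDocument(path, text, contract_type, jurisdiction, needs_ocr)
```

Passing the helpers in as arguments keeps the sketch runnable without any cloud credentials and makes the classification step easy to test in isolation.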

Jurisdiction-Aware Chunking

Documents are chunked using a clause-boundary detection model we trained on 50,000 annotated contracts. Unlike semantic chunking, this preserves entire clauses as atomic units - critical for legal accuracy. Each chunk is tagged with its jurisdiction, contract type, and clause category.

Clause-boundary model | Jurisdiction tags | Metadata enrichment
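To make the idea of clauses as atomic units concrete, here is a small sketch. It swaps the trained clause-boundary model for a plain regular expression that treats a line starting with a top-level clause number as a boundary, which is a much cruder rule than the one described above. The names `ClauseChunk` and `chunk_by_clause` are hypothetical, and clause-category tagging is left out.

```python
import re
from dataclasses import dataclass
from typing import List

# Stand-in boundary rule: a line beginning with a top-level clause number,
# e.g. "12. Limitation of Liability". Sub-clauses such as "12.1" do not match,
# so they stay inside their parent clause. The production system uses a
# trained model; this regex is for illustration only.
CLAUSE_START = re.compile(r"^(\d+)\.\s+\S", re.MULTILINE)


@dataclass
class ClauseChunk:
    text: str
    clause_number: str
    jurisdiction: str
    contract_type: str


def chunk_by_clause(text: str, jurisdiction: str, contract_type: str) -> List[ClauseChunk]:
    """Split at clause starts so each chunk holds one whole clause.

    Text before the first detected clause (title, recitals) is ignored here.
    """
    matches = list(CLAUSE_START.finditer(text))
    chunks = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append(
            ClauseChunk(
                text=text[match.start():end].strip(),
                clause_number=match.group(1),
                jurisdiction=jurisdiction,
                contract_type=contract_type,
            )
        )
    return chunks
```

The point of the sketch is the shape of the output: each chunk is a complete clause carrying its own jurisdiction and contract-type tags, whatever method finds the boundaries.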

14 Jurisdiction-Specific Vector Stores

Each jurisdiction has its own Pinecone index, populated with embeddings from that jurisdiction's chunked clauses plus a curated library of precedent language. Queries are routed to the relevant index based on the governing law tag assigned during ingestion.

Pinecone (14 indexes) | text-embedding-3-large | Precedent library
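The routing step itself is simple, and a sketch helps show one design choice: an unrecognised jurisdiction should fail loudly rather than fall back to another jurisdiction's index. The index names and the `route_query` function below are hypothetical (the case study does not name its indexes), and no vector database client is called.

```python
from typing import Dict

# Hypothetical index names, one per supported jurisdiction (only four shown).
INDEX_BY_JURISDICTION: Dict[str, str] = {
    "english": "clauses-english",
    "new_york": "clauses-new-york",
    "difc": "clauses-difc",
    "hong_kong": "clauses-hong-kong",
}


class UnknownJurisdiction(Exception):
    """Raised when a governing-law tag has no dedicated index."""


def route_query(jurisdiction_tag: str) -> str:
    """Return the index name for the governing-law tag assigned at ingestion."""
    key = jurisdiction_tag.strip().lower().replace(" ", "_")
    try:
        return INDEX_BY_JURISDICTION[key]
    except KeyError:
        # No silent fallback: retrieving from the wrong jurisdiction gives
        # plausible-looking but jurisdictionally wrong precedents.
        raise UnknownJurisdiction(jurisdiction_tag) from None
```

For example, `route_query("Hong Kong")` returns `"clauses-hong-kong"`, while a tag outside the mapping raises `UnknownJurisdiction`.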

Fine-Tuned GPT-4o Generation

Retrieval results feed into a GPT-4o model fine-tuned on 12,000 human-reviewed clause extractions. The fine-tune teaches the model legal precision: when to flag uncertainty, how to describe clause risk in legal terms, and, critically, when to escalate. Each output includes a confidence score and the source clause with an exact page and paragraph reference.

GPT-4o fine-tuned | Confidence scoring | Source attribution
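One way to picture an output that "includes a confidence score and the source clause" is as a small validated record. The field names below are assumptions made for illustration; the case study does not publish its actual output schema.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SourceRef:
    page: int
    paragraph: str  # e.g. "14.2(b)"


@dataclass(frozen=True)
class ClauseExtraction:
    clause_type: str   # e.g. "change_of_control" (hypothetical label)
    summary: str       # the model's description of the clause
    confidence: float  # 0.0 to 1.0
    source_text: str   # the clause text the answer is based on
    source: SourceRef

    def __post_init__(self) -> None:
        # Reject outputs that could not be checked against the contract.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if not self.source_text.strip():
            raise ValueError("every extraction must quote its source clause")


example = ClauseExtraction(
    clause_type="change_of_control",
    summary="Counterparty may terminate on a change of control.",
    confidence=0.95,
    source_text="14.2 Either party may terminate this Agreement if ...",
    source=SourceRef(page=31, paragraph="14.2"),
)
record = asdict(example)  # nested dict, ready to serialise for a review matrix
```

Validating at construction time means an answer without a quoted source or a usable confidence score never reaches the downstream escalation logic.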

Human-in-the-Loop Escalation

Contracts where any clause extraction falls below a confidence threshold (or where the model detects genuinely ambiguous language) are routed to a human reviewer queue with pre-populated context. This turns out to be about 4% of all documents - the rest close automatically.

Escalation queue | Confidence thresholds | Lawyer review portal

What went wrong, and what we learned

We had two false starts before getting the architecture right. Documenting them is more useful than pretending the path was straight.

False start #1 - Single vector store for all jurisdictions

The first version used one Pinecone index for everything. Retrieval quality was acceptable for English law contracts (which dominated the training data) and noticeably worse for UAE and Hong Kong law documents. The model was essentially answering questions about DIFC contracts by retrieving semantically similar English law precedents - plausible-looking answers that were jurisdictionally wrong. Splitting into 14 jurisdiction-specific indexes resolved this at the cost of four weeks of re-indexing and additional monthly infrastructure spend.

False start #2 - Semantic chunking

Standard semantic chunking splits a document wherever embedding similarity between adjacent passages drops, with no regard for clause boundaries. For contract review this was a mistake: a limitation of liability clause often spans three or four sentences that individually look like boilerplate but collectively define the economic protection, and semantic chunking separated them. The clause-boundary detection model we trained resolved this, but it required annotating 50,000 contracts by hand - a six-week effort that we'd underscoped in the original project plan.

The escalation threshold calibration

Setting the confidence threshold for human escalation required careful calibration against the client's risk tolerance. At 98% confidence (escalate anything below), 22% of documents were going to human review - too high for the system to provide meaningful cost savings. At 90%, the escalation rate dropped to 1.8%, but the system was occasionally auto-closing contracts with genuinely ambiguous indemnity structures. We settled on 94% with clause-type-specific overrides: change of control triggers and limitation of liability clauses always go to human review if confidence is below 97%, regardless of the overall document score.
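The final policy (a 94% default with 97% overrides for two clause types) can be written down in a few lines. This is a sketch under one assumption: confidence is compared per clause extraction, and a document escalates as soon as any single clause falls below its threshold. The clause-type labels and the function name are hypothetical.

```python
from typing import Dict, Iterable, Tuple

DEFAULT_THRESHOLD = 0.94

# Clause types that use a stricter threshold than the default.
CLAUSE_OVERRIDES: Dict[str, float] = {
    "change_of_control": 0.97,
    "limitation_of_liability": 0.97,
}


def needs_human_review(extractions: Iterable[Tuple[str, float]]) -> bool:
    """Return True if any (clause_type, confidence) pair is below its threshold."""
    return any(
        confidence < CLAUSE_OVERRIDES.get(clause_type, DEFAULT_THRESHOLD)
        for clause_type, confidence in extractions
    )


# A change of control clause at 96% escalates even though 96% clears the default.
print(needs_human_review([("governing_law", 0.95), ("change_of_control", 0.96)]))  # True
print(needs_human_review([("governing_law", 0.95), ("change_of_control", 0.98)]))  # False
```

Keeping the overrides in a plain mapping makes the thresholds easy to recalibrate per clause type without touching the routing logic.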

The results, twelve months in

The system went live in Q2 2024. The numbers below are from the twelve months ending Q1 2025.

| Metric | Before (baseline) | After (12 months live) |
|---|---|---|
| Average review time per contract | 3.2 hours | 8 minutes |
| Human reviewer requirement | 40 FTE | 4 FTE (escalations only) |
| Annual review cost | $4.2M | $0.6M total (inc. infra) |
| Contracts without human touch | 0% | 96% |
| Reviewer consistency score | 71% inter-rater | 94% vs gold standard |
| Escalation rate | 100% | 4% |
| Throughput during peak deal periods | Bottleneck | Elastic - no ceiling |

$3.6M net cost saving in year one (vs $4.2M baseline cost)

96% of contracts auto-closed without escalation (up from 0%)

8 minutes average end-to-end review time (down from 3.2 hours)
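For readers who want to check the headline figures, the arithmetic behind them is short. The snippet below uses only numbers reported in this section; the percentage saving and the speed-up factor are derived values, not figures quoted by the firm.

```python
baseline_cost_m = 4.2   # annual review cost before, in $M
current_cost_m = 0.6    # annual cost after, including infrastructure, in $M
saving_m = baseline_cost_m - current_cost_m
saving_pct = saving_m / baseline_cost_m * 100

baseline_minutes = 3.2 * 60  # 3.2 hours per contract, in minutes
current_minutes = 8
speedup = baseline_minutes / current_minutes

print(f"Saving: ${saving_m:.1f}M ({saving_pct:.0f}% of baseline)")  # Saving: $3.6M (86% of baseline)
print(f"Review time: {speedup:.0f}x faster")                        # Review time: 24x faster
```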

What happened to the 40-person team

This question comes up in every conversation about this project, and it's worth addressing directly.

The 40 reviewers were not made redundant. The firm's deal volume had been constrained by review capacity - they were turning away work during busy periods because the team couldn't process it fast enough. With that constraint removed, the firm took on more transactions. The team shifted from mechanical document review to the higher-value advisory work that the AI system's escalation queue generates: interpreting genuinely ambiguous clauses, advising clients on risk tolerance, and drafting negotiation positions on flagged provisions.

The 4% of contracts that escalate to human review are the hardest 4% - the documents with unusual structures, novel clause language, or cross-jurisdictional complexity that genuinely benefit from an experienced lawyer's judgment. The team that previously spent 96% of their time doing work a computer can now do spends all of their time doing work that actually needs them.

Three things we'd do differently

Start the fine-tuning data collection earlier. The clause-boundary annotation work and the generation fine-tune data both required significant client involvement to get right. We started those conversations late and they became the critical path. For a project like this, annotation effort should be scoped and resourced in week one.

Build the escalation UI in parallel with the core system. We treated the lawyer-facing escalation portal as a phase two deliverable and built it after the core pipeline was complete. In practice, the escalation UX directly affected how lawyers calibrated their trust in the system - the better the portal, the more useful the feedback loop for threshold tuning. It should have been a day-one priority.

Price the infrastructure more conservatively at the outset. Fourteen Pinecone indexes plus GPT-4o fine-tune inference at 80,000 contracts per year add up to meaningful monthly spend. The ROI is compelling at scale, but the infrastructure cost estimate in the initial proposal was optimistic. We'd price this more carefully now.

Legal contract review at 80,000 documents a year - without 40 analysts.
