AI & LLMs 14 min read

RAG vs Fine-tuning: A Production Engineer's Guide to Choosing the Right Approach

Based on 23 production deployments - including three where we chose the wrong approach and had to change course. The practical guide nobody wrote because most people haven't deployed enough of these systems to know what actually matters.

LLM RAG Fine-tuning Production AI GPT-4o Vector Search
S
Sequere
ADMIN · Sequere

The RAG-vs-fine-tuning question has accumulated a lot of confident, contradictory answers. Most are based on benchmarks from narrow tasks in controlled settings. This article is based on 23 production deployments - including three where we chose the wrong approach, had to explain it to a client, and rebuilt accordingly.

Why the initial choice matters more than most people acknowledge

The common framing is that you can always swap one approach for the other if the first doesn't work. That's technically true in the same way that you can always refactor a shared-schema database to schema-per-tenant - possible, but significantly more expensive once there's real data and real users involved.

RAG and fine-tuning make different assumptions about where knowledge lives and how it gets updated. Those assumptions propagate through your data pipeline, your deployment architecture, and your evaluation setup. Changing direction mid-project typically means rebuilding three to four weeks of work, not one.

Key Insight

Of the 23 deployments we analysed, 19 were correctly served by RAG alone, 2 required fine-tuning, and 2 benefited from a hybrid approach. The fine-tuning cases both involved narrow, high-frequency tasks where output format consistency was critical.

What RAG actually does (and what it doesn't)

Retrieval Augmented Generation keeps model weights unchanged. Instead, at inference time, relevant context is retrieved from an external knowledge store - usually a vector database - and injected into the prompt alongside the user's query. The model reasons over the retrieved content rather than relying exclusively on what it learned during training.

The key property: the knowledge store can be updated without touching the model. Add a new document, update a policy, remove outdated content - all of it happens at the data layer. The model stays the same.

PYTHON - Basic RAG pipeline Copy
# Simplified RAG retrieval step
def rag_query(user_query: str, k: int = 5) -> str:
    # 1. Embed the query
    query_embedding = embedder.encode(user_query)

    # 2. Retrieve top-k relevant chunks
    results = vector_store.similarity_search(
        query_embedding, top_k=k
    )

    # 3. Build context-augmented prompt
    context = "\n\n".join([r.text for r in results])
    prompt = build_prompt(user_query, context)

    # 4. Generate response
    return llm.generate(prompt)

What RAG doesn't fix: output style and format consistency. If you need the model to consistently produce JSON in a specific schema, respond in a particular voice, or follow domain-specific reasoning conventions, retrieval won't enforce that. The model still behaves according to its training.

What fine-tuning actually does (and what it costs)

Fine-tuning updates the model's weights on a curated dataset of examples. The goal is to shift the model's behaviour - its default tone, output format, task-specific reasoning patterns - without changing the underlying architecture. It's the right tool when you need the model to behave differently, not just know more.

Fine-tuning is not a knowledge transfer mechanism. It's a behaviour modification mechanism. If your problem is that the model doesn't know something, fine-tuning on examples of that knowledge will have limited effect compared to just retrieving it at inference time.

— Sequere, ADMIN, Sequere

The real costs of fine-tuning that don't appear in most benchmarks:

  • Data collection and annotation. You need labelled examples of the specific task - typically 500–5,000 high-quality pairs for most fine-tuning use cases. Collecting and cleaning these takes time and domain expertise.
  • Evaluation infrastructure. A fine-tuned model needs a proper evaluation harness that reflects real task performance. The off-the-shelf benchmarks almost never match your actual use case.
  • Retraining when things change. When your requirements change - and they will - you're rebuilding the training pipeline from scratch, not updating a document in a vector store.

Side-by-side: what actually differentiates them in production

Factor RAG Fine-tuning
Knowledge updates Update document store, no model change Requires retraining; high iteration cost
Output format consistency Prompt engineering required; inconsistent Enforced at weight level; very consistent
Domain-specific reasoning style Possible with careful prompting; fragile Baked into model behaviour; robust
Time to first deployment Days to weeks; no training required Weeks to months including data collection
Source attribution Natural - retrieved sources can be cited Not available; model can hallucinate sources
Inference cost Higher per-call cost due to context length Lower if using smaller fine-tuned model
Private data handling Data stays in your store; not in weights Data is encoded in weights; harder to remove

When to choose RAG: the five indicators

These aren't rules. They're patterns we've observed across enough deployments to have some confidence in them.

  1. Your knowledge changes frequently. If the facts your system needs to access change weekly or daily - product documentation, policies, pricing, support knowledge bases - RAG is almost certainly the right call. Retraining a model to update knowledge is the wrong tool for this.
  2. You need source attribution. Legal, compliance, healthcare, and financial applications frequently require that answers be traceable to specific source documents. RAG makes this natural. Fine-tuning makes it essentially impossible.
  3. Your knowledge corpus is large and grows. Vector search handles large corpora gracefully. Fine-tuning on a 100,000-document knowledge base is not practical.
  4. You're working with private or sensitive data. Data embedded in model weights is difficult to audit and impossible to surgically remove. If you need the ability to forget - for GDPR right-to-erasure requests, or security reasons - keep the data in a store you control.
  5. You need to ship quickly and iterate. A basic RAG pipeline can be running in days. Fine-tuning from scratch typically takes 6–10 weeks including data collection, training, and evaluation.

When fine-tuning earns its cost

Fine-tuning is the right tool in a narrower set of circumstances than most people assume, but it's the right tool decisively in those circumstances.

  1. You need strict, consistent output format. If your system needs to produce JSON in a specific schema - or respond in a domain-specific format that breaks with base model behaviour - fine-tuning enforces this at the weight level. Prompt engineering achieves it unreliably.
  2. The task is narrow, high-frequency, and stable. Medical coding, contract clause classification, financial category assignment - narrow classification tasks with stable requirements and high volume are ideal fine-tuning candidates. The cost is justified once, and you get consistent performance without paying for expensive model calls.
  3. Domain-specific reasoning patterns are required. If the task requires reasoning in a way that differs materially from how the base model reasons - clinical logic, legal analysis, financial risk assessment - fine-tuning on expert-labelled examples can internalise those patterns more reliably than prompting.
Common Mistake

Using fine-tuning to teach a model facts it should be retrieving. If your evaluation shows the model is hallucinating domain-specific facts, the answer is almost never more fine-tuning data - it's adding retrieval to the pipeline.

When you need both

Two of our 23 deployments used a hybrid approach - and both were cases where the use case genuinely required it, not cases where we were hedging.

The pattern that justifies hybrid: a narrow, high-frequency classification or formatting task layered on top of a broad knowledge retrieval requirement. In our legal document analysis system, we used RAG to retrieve relevant clauses and jurisdiction-specific rules, then passed the retrieved context to a fine-tuned classification model that assigned risk categories consistently. The fine-tuned model didn't need to know the law - it needed to categorise well-formatted inputs reliably.

Three mistakes we made, and what they cost

These are real - the projects have been anonymised but the mistakes haven't.

Mistake 1: Fine-tuning to solve a retrieval problem. A knowledge management system for a 14,000-person engineering organisation. We fine-tuned a model on the company's documentation corpus rather than building a RAG pipeline, because the initial brief emphasised "making the model feel like it knows the company." Documentation changed faster than we anticipated. Six weeks after launch, the model was consistently citing outdated procedures. Rebuilt with RAG in four weeks. Cost: three months of a senior engineer's time.

Mistake 2: Underestimating evaluation complexity for fine-tuning. Medical coding assistance system for an NHS trust. We got the model choice right (fine-tuning was correct for this use case) but severely underestimated the time required to build an evaluation framework that reflected real clinical performance. Our training metrics looked good; our real-world accuracy was 12 percentage points lower. Lesson: build the evaluation harness before you start training, not after.

Mistake 3: RAG without chunk strategy. A contract analysis system where we implemented naive fixed-size chunking. The retrieval was returning chunks that split important clause context across boundaries, leading to inconsistent outputs on clauses that happened to fall at chunk boundaries. Entirely avoidable with semantic chunking from the start; cost two weeks of debugging that looked like a model problem before we found it in the retrieval layer.

A practical decision framework

If you're standing at the beginning of an LLM project and trying to make this choice, here's how we'd walk through it:

  1. Does your knowledge change more often than you can tolerate retraining? → RAG.
  2. Do you need source attribution for compliance, legal, or audit reasons? → RAG.
  3. Is the task narrow, high-frequency, and does it require strict output format? → Fine-tuning candidate.
  4. Is the core problem that the model doesn't know enough, or that it doesn't behave the way you need it to? → Knowledge problem = RAG. Behaviour problem = fine-tuning.
  5. Do you have the time and data budget to collect 1,000+ high-quality labelled examples? If not → start with RAG, revisit fine-tuning once you have the data.

The default in most cases should be RAG, with fine-tuning introduced when you have a specific, well-understood reason for it and the data to support it. The hybrid approach is genuinely useful in a narrow set of cases, but don't reach for it as a way of avoiding the decision.

If you're uncertain which approach fits your use case, get in touch. We run free 60-minute technical sessions specifically for LLM architecture decisions - the kind of conversation that's worth having before you've committed to an approach.