Production LLM Architecture: Latency, Cost, and the Reliability Triangle

Contents preview

01 The Reliability Triangle - Framing the Trade-off

02 Latency Benchmarks: 12 Task Categories, 3 Models

03 Model Routing by Task Complexity

04 Prompt Caching: Implementation & Measured Savings

05 Fallback Chain Architecture & Failure Modes

06 Cost Attribution per Business Unit

Abstract

Deploying a large language model behind a prototype endpoint is straightforward. Deploying one - or several - in a production environment that serves real users, generates real revenue, and carries real SLA obligations is a materially different problem. This paper documents the architecture decisions, tooling choices, and measured outcomes from doing the latter across a range of enterprise deployments between mid-2023 and late 2024.

The central tension in production LLM systems is what we call the reliability triangle: latency, cost, and output quality are three variables that cannot all be optimised at the same time. Optimising for one moves one or both of the others in an unfavourable direction. Improving output quality by using a larger model increases cost and frequently increases latency. Reducing cost by routing simpler tasks to a smaller model risks quality degradation if the routing logic misjudges task complexity. The architecture decisions documented in this paper are, at root, decisions about where to sit on this triangle for a given use case - and how to build systems that can adjust that position dynamically as usage patterns and model capabilities evolve.

The paper presents p50 and p99 latency benchmarks for GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro across 12 enterprise task categories, measured from our production infrastructure rather than from provider benchmarking pages. It then covers model routing, prompt caching, fallback chain design, and cost attribution in the depth required to implement each in a production system.

"Every team we work with underestimates the cost of LLM inference at scale and overestimates how often they need the frontier model. The routing problem is the highest-leverage engineering decision in an enterprise LLM deployment."

The reliability triangle explained

The three vertices of the reliability triangle are not equal in how they affect different types of LLM features. A legal document review tool has a fundamentally different tolerance profile from a real-time chat assistant. Understanding which vertex your product feature prioritises is the starting point for every architecture decision that follows.

Latency-sensitive features - real-time chat, autocomplete, interactive Q&A - require p99 response times below 2–3 seconds to maintain usable UX. These features frequently tolerate slightly lower output quality and are typically poor candidates for frontier models without significant caching.
Quality-sensitive features - contract analysis, regulatory compliance checking, clinical summarisation - can tolerate latency in the 5–30 second range but require output quality high enough to be used without systematic expert review of every result. These are the features most likely to justify frontier model cost.
Cost-sensitive features - bulk document processing, batch classification, report generation run overnight - have neither strict latency requirements nor the user-facing quality bar of real-time interactions. Routing these tasks to capable smaller models typically reduces cost by 70–85% with no meaningful degradation in measured output quality.

Most enterprise LLM products contain all three categories of feature. The mistake we most commonly see is a single model choice applied uniformly across the product, usually the highest-quality model the team has evaluated - which produces a system that is correct on quality-sensitive tasks but unnecessarily expensive and sometimes unnecessarily slow on everything else.

Latency benchmarks across 12 task categories

The benchmark data in this section was collected from our own production infrastructure over a 90-day measurement window ending November 2024. All measurements are end-to-end latency from the point the API request leaves our application server to the point the complete response is received - inclusive of network transit, queue wait, and provider-side inference time. Streaming responses are measured to last-token receipt.
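As a minimal sketch of how p50 and p99 figures can be derived from raw end-to-end timings, the following uses the nearest-rank percentile method. The function name and the sample values are illustrative assumptions; this is not the measurement pipeline used for this paper.

```python
import math

def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical end-to-end latencies in ms (request sent -> last token received)
latencies = [610, 640, 655, 690, 700, 720, 760, 810, 950, 1610]

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(f"p50={p50}ms p99={p99}ms")
```

With only a handful of samples the p99 is simply the slowest request, which is why a long measurement window matters for tail figures.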

p50 / p99 Latency by Model & Task Category (ms) · 90-day production data · Nov 2024

Task Category	GPT-4o p50 / p99	Claude 3.5 Sonnet p50 / p99	Gemini 1.5 Pro p50 / p99
Short-form generation (<200 tok output)	820 / 1,940	690 / 1,610	740 / 1,780
Long-form generation (>1k tok output)	4,200 / 9,800	3,600 / 8,400	3,900 / 9,100
Document summarisation (5–15 pages)	3,100 / 7,200	2,700 / 6,500	2,900 / 6,900
Structured data extraction (JSON)	1,100 / 2,600	940 / 2,200	1,050 / 2,500
Code generation (function-level)	1,800 / 4,200	1,500 / 3,600	1,700 / 4,000
RAG-augmented Q&A (4-chunk context)	1,600 / 3,800	1,350 / 3,200	1,480 / 3,600
Classification (multi-label)	580 / 1,380	490 / 1,180	540 / 1,300
Sentiment analysis (paragraph)	430 / 1,020	370 / 890	400 / 960
Translation (1,000-word document)	2,600 / 6,100	2,200 / 5,300	2,400 / 5,700
Entity extraction (legal contract)	2,100 / 5,000	1,800 / 4,300	1,950 / 4,700
Tool/function calling (single turn)	1,300 / 3,100	1,100 / 2,700	1,200 / 2,900
Multi-turn conversation (10-turn avg)	920 / 2,200	780 / 1,900	860 / 2,050

In the table above, Claude 3.5 Sonnet has the lowest p50 and p99 latency in every one of the 12 task categories for our production measurement period. In absolute terms, the margin over GPT-4o was most pronounced on document-heavy tasks - summarisation, entity extraction, and long-form generation - and smallest on short classification and sentiment tasks, where the gap between the three models was close to measurement noise. The paper includes discussion of the methodological limitations of these measurements and the conditions under which provider performance profiles have shifted during the measurement window.

Model routing by task complexity

Model routing is the practice of directing each LLM request to the most appropriate model for that request's specific characteristics - rather than sending everything to a single model. Done well, it produces a system that spends frontier-model budget only on tasks that genuinely require frontier-model capability, while handling the majority of volume on capable smaller models at a fraction of the cost.

The routing framework documented in this paper uses a four-signal classification layer that runs ahead of the LLM call itself. Each incoming request is scored on task complexity, output quality requirement, latency budget, and context length. The combination of these four scores maps to one of three routing tiers.

Model Routing Decision Flow (production architecture)

Request Classification (pre-routing layer)

Every incoming request is scored across four dimensions before any LLM call is made: task complexity (1–5 scale, inferred from prompt structure and context length), output quality requirement (inferred from feature type and any explicit quality signals), latency budget (from the calling feature's SLA class), and context length. This classification step runs in ~12ms median using a lightweight fine-tuned encoder - fast enough to be invisible to the user.

Encoder classifier · ~12ms overhead · 4 dimensions scored

Tier Assignment

Scores map deterministically to one of three tiers. Tier 1 (complexity ≤2, standard quality, latency-insensitive) routes to capable small models - currently Claude 3.5 Haiku and GPT-4o-mini. Tier 2 (complexity 3, or any elevated quality signal) routes to mid-tier models. Tier 3 (complexity 4–5, or explicit high-quality requirement, or context >32k tokens) routes to frontier models. In our production mix, 58% of volume routes to Tier 1, 29% to Tier 2, and 13% to Tier 3.

58% → Tier 1 · 29% → Tier 2 · 13% → Tier 3
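The tier rules above can be sketched as a small deterministic function. This is a hedged illustration only: the `RequestScores` type and the quality labels ("standard", "elevated", "high") are hypothetical names, and the latency budget is left out because the text does not state how it changes the tier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RequestScores:
    complexity: int       # 1-5 scale from the classifier
    quality: str          # "standard", "elevated" or "high" (hypothetical labels)
    context_tokens: int   # prompt + context length in tokens

def assign_tier(s: RequestScores) -> int:
    """Map classifier scores to a routing tier (1 = small, 3 = frontier)."""
    # Tier 3 conditions are checked first so that they take precedence
    if s.complexity >= 4 or s.quality == "high" or s.context_tokens > 32_000:
        return 3
    if s.complexity == 3 or s.quality == "elevated":
        return 2
    return 1

print(assign_tier(RequestScores(complexity=2, quality="standard", context_tokens=1_500)))
```

Checking the Tier 3 conditions first is a design choice of this sketch: it means a simple request with a very long context still reaches a frontier model.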

Quality Gate (post-inference)

A subset of Tier 1 and Tier 2 responses passes through an automated quality gate before being returned to the calling feature. The gate uses a fast scoring model to check for completeness, factual consistency against any provided context, and format compliance. Responses scoring below threshold are automatically re-routed to the next tier up. Re-routing adds latency (typically 800–1,200ms) but occurs on fewer than 2.1% of requests in steady state.

<2.1% re-route rate · Consistency check · Format compliance
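A minimal sketch of the re-route step, assuming the gate returns a score between 0 and 1. The function names, the 0.7 threshold, and the stub model and scorer are all hypothetical; the sampling that selects which responses are gated is omitted for brevity.

```python
def run_with_quality_gate(prompt, tier, call_tier, score, threshold=0.7, max_tier=3):
    """Call the assigned tier, then escalate one tier at a time while the
    gate score stays below the threshold. Top-tier responses are not gated."""
    response = call_tier(tier, prompt)
    while tier < max_tier and score(prompt, response) < threshold:
        tier += 1
        response = call_tier(tier, prompt)
    return response, tier

# Stubs standing in for real model calls and the real scoring model
def fake_call(tier, prompt):
    return f"tier{tier}-answer"

def fake_score(prompt, response):
    return 0.4 if response.startswith("tier1") else 0.9

result, used_tier = run_with_quality_gate("Summarise clause 4", 1, fake_call, fake_score)
print(result, used_tier)
```

Each escalation costs one extra model call, which is where the added re-route latency comes from.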

Response Delivery + Cost Logging

Responses are returned to the calling feature with a routing metadata header that includes the tier used, model used, token counts (prompt + completion + cached), and cost attribution tags. Every request is logged to a cost attribution store keyed by business unit, feature name, and user segment. This data feeds the cost dashboard covered in Section 6 and the retraining signal for the routing classifier.

Per-request cost log · BU attribution tags · Routing metadata header
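One way to picture the per-request record is the sketch below. The field names, the `X-LLM-*` header names, and the in-memory store are assumptions made for illustration; a production system would write to a durable store.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class RoutingRecord:
    business_unit: str
    feature: str
    user_segment: str
    tier: int
    model: str
    prompt_tokens: int
    completion_tokens: int
    cached_tokens: int
    cost_usd: float

# In-memory stand-in for the cost attribution store
cost_store = defaultdict(float)

def log_request(rec: RoutingRecord) -> dict:
    """Accumulate cost under the attribution key and build metadata headers."""
    key = (rec.business_unit, rec.feature, rec.user_segment)
    cost_store[key] += rec.cost_usd
    return {
        "X-LLM-Tier": str(rec.tier),        # hypothetical header names
        "X-LLM-Model": rec.model,
        "X-LLM-Tokens": f"{rec.prompt_tokens}+{rec.completion_tokens}+{rec.cached_tokens}",
    }

rec = RoutingRecord("legal", "contract_qa", "enterprise", 1, "small-model-a", 1200, 300, 900, 0.25)
headers = log_request(rec)
log_request(RoutingRecord("legal", "contract_qa", "enterprise", 3, "frontier-model-b", 4000, 800, 0, 0.5))
print(headers, dict(cost_store))
```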

Fallback Chain (on provider error or timeout)

Provider failures and timeouts trigger the fallback chain rather than returning an error. The fallback chain is ordered by cost-per-token ascending within the same tier - so a Tier 3 frontier model failure falls back to the other Tier 3 provider rather than dropping a tier and risking quality. If all providers in a tier are unavailable, the request is held in a 30-second retry queue before falling to the next tier down with a degraded-quality flag.

Same-tier fallback first · 30s retry queue · Degraded-quality flag
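The fallback order can be sketched roughly as follows. This is a simplification under stated assumptions: `ProviderError`, the provider callables, and the per-token costs are hypothetical, the retry queue is modelled as a single wait followed by one retry, and "next tier" is taken to mean the next lower tier.

```python
import time

class ProviderError(Exception):
    """Raised by a provider call on error or timeout (hypothetical)."""

def call_with_fallback(prompt, tier, providers_by_tier, retry_wait_s=30, sleep=time.sleep):
    """providers_by_tier maps tier -> list of (cost_per_token, call_fn).
    Returns (response, tier_used, degraded_flag)."""

    def try_tier(t):
        # Try every provider in the tier, cheapest first; None means all failed
        for _cost, call in sorted(providers_by_tier.get(t, []), key=lambda p: p[0]):
            try:
                return call(prompt)
            except ProviderError:
                continue
        return None

    result = try_tier(tier)
    if result is None:
        sleep(retry_wait_s)              # stand-in for the 30-second retry queue
        result = try_tier(tier)
    if result is not None:
        return result, tier, False       # served in-tier, no degraded flag

    for lower in range(tier - 1, 0, -1):
        result = try_tier(lower)
        if result is not None:
            return result, lower, True   # degraded-quality flag set
    raise ProviderError("all providers in all tiers unavailable")

def down(prompt):
    raise ProviderError("simulated outage")

def up(prompt):
    return f"answer to: {prompt}"

no_wait = lambda seconds: None  # skip the real wait in this demo
same_tier = call_with_fallback("q", 3, {3: [(1e-5, down), (1.5e-5, up)], 2: [(3e-6, up)]}, sleep=no_wait)
dropped = call_with_fallback("q", 3, {3: [(1e-5, down)], 2: [(3e-6, up)]}, sleep=no_wait)
print(same_tier, dropped)
```

Passing `sleep` as a parameter keeps the wait out of tests without changing the production default.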

Prompt caching - implementation and measured savings

Prompt caching allows large, repeated sections of a prompt - system instructions, document context, tool definitions - to be stored provider-side so that subsequent requests referencing the same content pay only for the incremental tokens, not for re-processing the cached portion. Both Anthropic and OpenAI support prompt caching variants with different minimum cache block sizes and cache lifetime policies. The implementation details matter: a poorly structured prompt that changes frequently in its cached portion will achieve near-zero cache hit rate and no cost benefit.

Where caching produces the largest savings

In our deployments, caching delivers the largest savings on three request patterns: document Q&A sessions where the same document is referenced across multiple turns, RAG-augmented generation where the same retrieved chunks appear repeatedly within a session, and agentic tool use where the full tool specification is included in every request in a multi-step chain. For a 20-page document Q&A product, structuring prompts to maximise cache hit rate reduced per-session token cost by 48% in our measurement period.
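To make the arithmetic behind session-level savings concrete, here is a rough input-token cost model for a multi-turn document Q&A session. The multipliers (cache writes at 1.25x and cache reads at 0.10x the base input rate) are assumptions that should be checked against current provider pricing, and the token counts are invented. The model ignores output tokens and growing conversation history, so it will not reproduce the measured 48% figure.

```python
def session_input_cost(doc_tokens, question_tokens, turns, price_per_token,
                       write_mult=1.25, read_mult=0.10, cached=True):
    """Input-token cost of a session that re-sends the same document each turn."""
    if not cached:
        return turns * (doc_tokens + question_tokens) * price_per_token
    # Turn 1 writes the document to the cache; later turns read it back
    first_turn = doc_tokens * write_mult + question_tokens
    later_turns = (turns - 1) * (doc_tokens * read_mult + question_tokens)
    return (first_turn + later_turns) * price_per_token

# Hypothetical session: 10,000-token document, 100-token questions, 5 turns
uncached = session_input_cost(10_000, 100, 5, price_per_token=1.0, cached=False)
with_cache = session_input_cost(10_000, 100, 5, price_per_token=1.0)
saving = 1 - with_cache / uncached
print(f"uncached={uncached:.0f} cached={with_cache:.0f} saving={saving:.1%}")
```

Under these assumptions a single-turn session costs more with caching than without, because the cache write is billed at a premium and is never read back.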

Python · Anthropic prompt caching - cache control blocks (simplified)

# Structure your messages so the stable, expensive context
# comes first and is tagged for caching.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder inputs - replace with your own values
system_prompt = "You answer questions about the attached document."
document_text = "<full document text goes here>"
user_question = "What is the termination notice period?"

# System prompt + large document context - both cacheable.
# Very short blocks fall below the provider's minimum cacheable length
# and are processed as normal input.
system_with_doc = [
    {
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"},  # cache this block
    },
    {
        "type": "text",
        "text": document_text,
        "cache_control": {"type": "ephemeral"},  # cache the doc too
    },
]

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=system_with_doc,
    messages=[{"role": "user", "content": user_question}],
)

# Check cache performance in usage stats
usage = response.usage
cache_hit = usage.cache_read_input_tokens        # tokens read from cache (cheap)
cache_miss = usage.cache_creation_input_tokens   # tokens written to cache (billed above the standard input rate)
fresh_tok = usage.input_tokens                   # uncached input tokens

print(f"Cache hit: {cache_hit} · Miss: {cache_miss} · Fresh: {fresh_tok}")

Measured outcomes - six months of production data

The figures below represent aggregate outcomes across the deployments covered in this paper, measured over the six-month period from June to November 2024. Individual deployment results varied; the paper includes per-deployment breakdowns and discusses the factors that produced above- and below-average outcomes.

61% median reduction in inference cost via model routing (vs single frontier model baseline)

38% average token spend reduction from prompt caching (across document-heavy deployments)

99.6% feature availability over 6 months with fallback chains active (up from 97.1% single-provider baseline)

Cost attribution per business unit

Without per-request cost attribution, LLM inference becomes an unmanaged shared cost - and in enterprises running multiple products or business units on the same LLM infrastructure, that creates predictable problems. Engineering leadership can see total monthly spend climbing but cannot identify which features, teams, or user segments are driving the growth. Features with poor prompt efficiency or unusually high re-route rates remain invisible until a monthly invoice triggers an investigation.

The attribution architecture documented in Section 6 tags every API call at the request level with a structured cost allocation key - combining business unit, product feature, environment (production/staging), and a usage tier that maps to the model routing tier. These tags flow into a cost reporting store that feeds a live Grafana dashboard showing rolling daily cost by business unit, per-token cost trends by model, cache hit rates by feature, and re-route rate as a proxy for prompt quality issues.

In three of the deployments covered by this paper, the cost attribution dashboard surfaced a feature consuming 28–41% of total LLM spend while accounting for under 3% of user-facing traffic. In each case, the underlying cause was a prompt that included the full document corpus rather than only the retrieved chunks for a given query - a prompt engineering error that the team had not previously had visibility into. The cost attribution layer made the anomaly identifiable in days rather than months.
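The kind of anomaly described above can be detected by comparing each feature's share of spend with its share of traffic. The sketch below is illustrative only: the feature names, figures, and thresholds are invented, not taken from the deployments in this paper.

```python
def flag_cost_outliers(features, spend_share_min=0.25, traffic_share_max=0.05):
    """Return features with a large share of spend but a small share of traffic.
    `features` maps feature name -> {"cost_usd": float, "requests": int}."""
    total_cost = sum(f["cost_usd"] for f in features.values())
    total_requests = sum(f["requests"] for f in features.values())
    flagged = []
    for name, f in features.items():
        spend_share = f["cost_usd"] / total_cost
        traffic_share = f["requests"] / total_requests
        if spend_share >= spend_share_min and traffic_share <= traffic_share_max:
            flagged.append(name)
    return flagged

# Hypothetical monthly totals per feature
monthly = {
    "corpus_search": {"cost_usd": 3500.0, "requests": 2_000},
    "chat": {"cost_usd": 4500.0, "requests": 70_000},
    "batch_tagging": {"cost_usd": 2000.0, "requests": 28_000},
}
flagged = flag_cost_outliers(monthly)
print(flagged)
```

Here `corpus_search` accounts for 35% of spend on 2% of requests, so it is the only feature flagged for a prompt review.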

What we would do differently

Build the routing classifier before you need it. Every team that has implemented model routing after go-live has found it harder than expected - not because the technology is difficult, but because retrofitting routing logic into an existing prompt structure that was designed for a single model often requires re-engineering the prompt itself. Design your prompts and feature interfaces from the start to be routing-compatible: keep the task-specification portion of the prompt separable from the context, and build the classifier training infrastructure in the first sprint.

Instrument cache performance from day one. Cache hit rates degrade silently as product features evolve - a prompt change that moves system instructions below the cached document context can collapse cache performance overnight. Monitoring cache hit rate as a first-class metric, with alerts on degradation, prevents this from becoming a surprise cost event.
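A minimal sketch of such a metric, assuming usage records shaped like the Anthropic usage fields in the caching example (`cache_read_input_tokens`, `cache_creation_input_tokens`, `input_tokens`). The alert threshold and baseline are hypothetical values to tune per feature.

```python
def cache_hit_rate(usages):
    """Share of all input tokens that were served from the cache.
    `usages` is a list of dicts, one per request."""
    read = sum(u["cache_read_input_tokens"] for u in usages)
    total = sum(
        u["cache_read_input_tokens"] + u["cache_creation_input_tokens"] + u["input_tokens"]
        for u in usages
    )
    return read / total if total else 0.0

def should_alert(current_rate, baseline_rate, max_drop=0.20):
    """Alert when the hit rate falls more than max_drop below the baseline."""
    return baseline_rate - current_rate > max_drop

# Hypothetical window: one request hit the cache, one rewrote it
window = [
    {"cache_read_input_tokens": 9000, "cache_creation_input_tokens": 0, "input_tokens": 1000},
    {"cache_read_input_tokens": 0, "cache_creation_input_tokens": 9000, "input_tokens": 1000},
]
rate = cache_hit_rate(window)
alert = should_alert(rate, baseline_rate=0.80)
print(f"hit rate={rate:.2f} alert={alert}")
```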

Test your fallback chain before you need it. Fallback chains that are configured but never exercised tend to have subtle bugs that only surface under real provider failure conditions - incorrect model name strings, missing API key configuration for the fallback provider, or quality-gate logic that rejects fallback responses because they don't match the format expected from the primary model. Run monthly chaos tests against your fallback chain in a staging environment that mirrors production configuration.
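A chaos test of this kind can start as a wrapper that injects failures into the primary provider call in staging. The sketch below is an assumption-laden illustration: `InjectedFailure`, the wrapper, and the stub providers are hypothetical, and a real test would also assert on response format and quality-gate behaviour.

```python
import random

class InjectedFailure(Exception):
    """Simulated provider outage raised by the fault-injection wrapper."""

def with_fault_injection(call, failure_rate, rng):
    """Wrap a provider call so that it fails with the given probability."""
    def wrapped(prompt):
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated provider outage")
        return call(prompt)
    return wrapped

def exercise_fallback(primary, fallback, prompts):
    """Send each prompt to the primary; count how many the fallback had to serve."""
    served_by_fallback = 0
    for prompt in prompts:
        try:
            primary(prompt)
        except InjectedFailure:
            fallback(prompt)   # any exception raised here fails the chaos test
            served_by_fallback += 1
    return served_by_fallback

# Stub providers for a staging run; failure_rate=1.0 forces every call to fail over
flaky_primary = with_fault_injection(lambda p: f"primary: {p}", 1.0, random.Random(0))
fallback_count = exercise_fallback(flaky_primary, lambda p: f"fallback: {p}", ["a", "b", "c"])
print(fallback_count)
```

Running with a failure rate of 1.0 exercises the fallback path on every request, which is the quickest way to surface a wrong model name or a missing API key.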