Abstract
Deploying a large language model behind a prototype endpoint is straightforward. Deploying one - or several - in a production environment that serves real users, generates real revenue, and carries real SLA obligations is a materially different problem. This paper documents the architecture decisions, tooling choices, and measured outcomes from doing the latter across a range of enterprise deployments between mid-2023 and late 2024.
The central tension in production LLM systems is what we call the reliability triangle: latency, cost, and output quality are three variables that cannot all be optimised at once. Improving one moves one or both of the others in an unfavourable direction. Raising output quality by using a larger model increases cost and frequently increases latency. Reducing cost by routing simpler tasks to a smaller model risks quality degradation if the routing logic misjudges task complexity. The architecture decisions documented in this paper are, at root, decisions about where to sit on this triangle for a given use case - and how to build systems that can adjust that position dynamically as usage patterns and model capabilities evolve.
The paper presents p50 and p99 latency benchmarks for GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro across 12 enterprise task categories, measured from our production infrastructure rather than from provider benchmarking pages. It then covers model routing, prompt caching, fallback chain design, and cost attribution in the depth required to implement each in a production system.
"Every team we work with underestimates the cost of LLM inference at scale and overestimates how often they need the frontier model. The routing problem is the highest-leverage engineering decision in an enterprise LLM deployment."
The reliability triangle explained
The three vertices of the reliability triangle are not equal in how they affect different types of LLM features. A legal document review tool has a fundamentally different tolerance profile from a real-time chat assistant. Understanding which vertex your product feature prioritises is the starting point for every architecture decision that follows.
- Latency-sensitive features - real-time chat, autocomplete, interactive Q&A - require p99 response times below 2–3 seconds to maintain usable UX. These features frequently tolerate slightly lower output quality and are typically poor candidates for frontier models without significant caching.
- Quality-sensitive features - contract analysis, regulatory compliance checking, clinical summarisation - can tolerate latency in the 5–30 second range but require output quality high enough to be used without systematic expert review of every result. These are the features most likely to justify frontier model cost.
- Cost-sensitive features - bulk document processing, batch classification, report generation run overnight - have neither strict latency requirements nor the user-facing quality bar of real-time interactions. Routing these tasks to capable smaller models typically reduces cost by 70–85% with no meaningful degradation in measured output quality.
Most enterprise LLM products contain all three categories of feature. The mistake we most commonly see is a single model choice applied uniformly across the product, usually the highest-quality model the team has evaluated - which produces a system that is correct on quality-sensitive tasks but unnecessarily expensive and sometimes unnecessarily slow on everything else.
Latency benchmarks across 12 task categories
The benchmark data in this section was collected from our own production infrastructure over a 90-day measurement window ending November 2024. All measurements are end-to-end latency from the point the API request leaves our application server to the point the complete response is received - inclusive of network transit, queue wait, and provider-side inference time. Streaming responses are measured to last-token receipt.
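The measurement definition above can be sketched in a few lines. This is a minimal illustration, not the harness used to collect the benchmark data: `timed_call` and `percentile` are hypothetical helper names, the percentile uses the nearest-rank method (an assumption; the paper does not state which method was used), and the sample values are synthetic.

```python
import math
import time
from typing import Callable, Sequence


def timed_call(send_request: Callable[[], object]) -> float:
    """Return end-to-end latency in milliseconds for one request.

    `send_request` is a hypothetical callable that must block until the
    complete response (the last token, for streaming) has been received.
    """
    start = time.perf_counter()
    send_request()
    return (time.perf_counter() - start) * 1000.0


def percentile(samples_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Synthetic latencies (1 ms .. 100 ms), purely for illustration.
samples = [float(ms) for ms in range(1, 101)]
print("p50:", percentile(samples, 50), "p99:", percentile(samples, 99))
```

Measuring at the application server, rather than reading provider-reported timings, is what makes network transit and queue wait visible in the p99.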
| Task Category | GPT-4o p50 / p99 (ms) | Claude 3.5 Sonnet p50 / p99 (ms) | Gemini 1.5 Pro p50 / p99 (ms) |
|---|---|---|---|
| Short-form generation (<200 tok output) | 820 / 1,940 | 690 / 1,610 | 740 / 1,780 |
| Long-form generation (>1k tok output) | 4,200 / 9,800 | 3,600 / 8,400 | 3,900 / 9,100 |
| Document summarisation (5–15 pages) | 3,100 / 7,200 | 2,700 / 6,500 | 2,900 / 6,900 |
| Structured data extraction (JSON) | 1,100 / 2,600 | 940 / 2,200 | 1,050 / 2,500 |
| Code generation (function-level) | 1,800 / 4,200 | 1,500 / 3,600 | 1,700 / 4,000 |
| RAG-augmented Q&A (4-chunk context) | 1,600 / 3,800 | 1,350 / 3,200 | 1,480 / 3,600 |
| Classification (multi-label) | 580 / 1,380 | 490 / 1,180 | 540 / 1,300 |
| Sentiment analysis (paragraph) | 430 / 1,020 | 370 / 890 | 400 / 960 |
| Translation (1,000-word document) | 2,600 / 6,100 | 2,200 / 5,300 | 2,400 / 5,700 |
| Entity extraction (legal contract) | 2,100 / 5,000 | 1,800 / 4,300 | 1,950 / 4,700 |
| Tool/function calling (single turn) | 1,300 / 3,100 | 1,100 / 2,700 | 1,200 / 2,900 |
| Multi-turn conversation (10-turn avg) | 920 / 2,200 | 780 / 1,900 | 860 / 2,050 |
Claude 3.5 Sonnet returned the lowest p50 and p99 latency in every one of the 12 task categories tabulated above during our production measurement period. The absolute margin over GPT-4o was most pronounced on document-heavy tasks - summarisation, entity extraction, and long-form generation - and smallest on short classification and sentiment tasks, where all three models performed within measurement noise of each other. The paper includes discussion of the methodological limitations of these measurements and the conditions under which provider performance profiles shifted during the measurement window.
Model routing by task complexity
Model routing is the practice of directing each LLM request to the most appropriate model for that request's specific characteristics - rather than sending everything to a single model. Done well, it produces a system that spends frontier-model budget only on tasks that genuinely require frontier-model capability, while handling the majority of volume on capable smaller models at a fraction of the cost.
The routing framework documented in this paper uses a four-signal classification layer that runs ahead of the LLM call itself. Each incoming request is scored on task complexity, output quality requirement, latency budget, and context length. The combination of these four scores maps to one of three routing tiers.
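A minimal sketch of how four signals might map to three tiers is shown below. Everything specific in it is an assumption made for illustration: the tier names, the 0-to-1 score scales, the thresholds, and the rule that a tight latency budget demotes a request are not taken from the framework itself, and a production classifier would typically be learned rather than hand-written.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    # Hypothetical tier names; the source only states that there are three.
    SMALL = "small"
    MID = "mid"
    FRONTIER = "frontier"


@dataclass(frozen=True)
class RequestSignals:
    task_complexity: float      # 0.0 (trivial) to 1.0 (hard); assumed scale
    quality_requirement: float  # 0.0 to 1.0; assumed scale
    latency_budget_ms: int      # maximum acceptable end-to-end latency
    context_tokens: int         # prompt plus retrieved context


def route(signals: RequestSignals) -> Tier:
    """Map the four signals to a tier. All thresholds are illustrative."""
    demand = max(signals.task_complexity, signals.quality_requirement)
    if demand >= 0.7:
        tier = Tier.FRONTIER
    elif demand >= 0.4 or signals.context_tokens > 32_000:
        tier = Tier.MID
    else:
        tier = Tier.SMALL
    # A tight latency budget demotes a frontier request, unless the
    # quality requirement itself is what pushed it to the frontier tier.
    if (
        tier is Tier.FRONTIER
        and signals.latency_budget_ms < 2_000
        and signals.quality_requirement < 0.7
    ):
        tier = Tier.MID
    return tier


print(route(RequestSignals(0.1, 0.2, 1_500, 500)))
print(route(RequestSignals(0.9, 0.9, 30_000, 20_000)))
```

Keeping the routing decision a pure function of the signals makes it straightforward to replay logged requests against a new threshold set before deploying it.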
Prompt caching - implementation and measured savings
Prompt caching allows large, repeated sections of a prompt - system instructions, document context, tool definitions - to be stored provider-side so that subsequent requests referencing the same content pay only for the incremental tokens, not for re-processing the cached portion. Both Anthropic and OpenAI support prompt caching variants with different minimum cache block sizes and cache lifetime policies. The implementation details matter: a poorly structured prompt that changes frequently in its cached portion will achieve near-zero cache hit rate and no cost benefit.
Where caching produces the largest savings
In our deployments, caching delivers the largest savings on three request patterns: document Q&A sessions where the same document is referenced across multiple turns, RAG-augmented generation where the same retrieved chunks appear repeatedly within a session, and agentic tool use where the full tool specification is included in every request in a multi-step chain. For a 20-page document Q&A product, structuring prompts to maximise cache hit rate reduced per-session token cost by 48% in our measurement period.
```python
# Structure your messages so the stable, expensive context
# comes first and is tagged for caching.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder inputs; substitute your own values. Blocks shorter than the
# provider's minimum cacheable length are processed without caching.
system_prompt = "You answer questions about the supplied document."
document_text = "..."  # full document text goes here
user_question = "What is the notice period for termination?"

# System prompt + large document context - both cacheable
system_with_doc = [
    {
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"},  # cache this block
    },
    {
        "type": "text",
        "text": document_text,
        "cache_control": {"type": "ephemeral"},  # cache the doc too
    },
]

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=system_with_doc,
    messages=[{"role": "user", "content": user_question}],
)

# Check cache performance in usage stats
usage = response.usage
cache_hit = usage.cache_read_input_tokens       # tokens read from cache (cheap)
cache_miss = usage.cache_creation_input_tokens  # tokens written to cache (billed above the base input rate)
fresh_tok = usage.input_tokens                  # uncached input tokens
print(f"Cache hit: {cache_hit} · Miss: {cache_miss} · Fresh: {fresh_tok}")
```
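The three usage counters can be turned into an approximate input cost, which is how a saving such as the one quoted above would be checked per session. The sketch below is illustrative only: the function names are hypothetical, the default multipliers (cache writes at 1.25x and cache reads at 0.1x the base input price) are assumptions based on Anthropic's published pricing for the default cache lifetime at the time of writing, and the base price is a made-up figure. Check current provider pricing before relying on any of these numbers.

```python
def effective_input_cost(
    input_tokens: int,
    cache_creation_tokens: int,
    cache_read_tokens: int,
    base_price_per_mtok: float,
    write_multiplier: float = 1.25,  # assumed cache-write premium
    read_multiplier: float = 0.10,   # assumed cache-read discount
) -> float:
    """Input-side cost of one request, in the currency of base_price_per_mtok."""
    weighted_tokens = (
        input_tokens
        + cache_creation_tokens * write_multiplier
        + cache_read_tokens * read_multiplier
    )
    return weighted_tokens * base_price_per_mtok / 1_000_000


def cache_savings_fraction(
    input_tokens: int,
    cache_creation_tokens: int,
    cache_read_tokens: int,
    base_price_per_mtok: float,
) -> float:
    """Fraction of input cost saved versus sending every token uncached."""
    total_tokens = input_tokens + cache_creation_tokens + cache_read_tokens
    if total_tokens == 0:
        return 0.0
    uncached = total_tokens * base_price_per_mtok / 1_000_000
    cached = effective_input_cost(
        input_tokens, cache_creation_tokens, cache_read_tokens, base_price_per_mtok
    )
    return 1.0 - cached / uncached


# Hypothetical follow-up turn: 20,000 tokens read from cache, 200 fresh.
print(cache_savings_fraction(200, 0, 20_000, base_price_per_mtok=3.0))
```

Note that under these assumptions the first request in a session costs more than an uncached request, because the write premium applies; the saving only appears once the cached block is read back on later turns.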
Measured outcomes - six months of production data
The figures below represent aggregate outcomes across the deployments covered in this paper, measured over the six-month period from June to November 2024. Individual deployment results varied; the paper includes per-deployment breakdowns and discusses the factors that produced above- and below-average outcomes.
Cost attribution per business unit
Without per-request cost attribution, LLM inference becomes an unmanaged shared cost - and in enterprises running multiple products or business units on the same LLM infrastructure, that creates predictable problems. Engineering leadership can see total monthly spend climbing but cannot identify which features, teams, or user segments are driving the growth. Features with poor prompt efficiency or unusually high re-route rates remain invisible until a monthly invoice triggers an investigation.
The attribution architecture documented in Section 6 tags every API call at the request level with a structured cost allocation key - combining business unit, product feature, environment (production/staging), and a usage tier that maps to the model routing tier. These tags flow into a cost reporting store that surfaces a live Grafana dashboard showing rolling daily cost by business unit, per-token cost trends by model, cache hit rates by feature, and re-route rate as a proxy for prompt quality issues.
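A minimal sketch of request-level tagging and roll-up follows. The four key fields come from the description above; the class names, the `as_tag` format, the record layout, and the example values are hypothetical, and a real deployment would write records to a reporting store rather than aggregate them in memory.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class CostKey:
    business_unit: str
    feature: str
    environment: str  # "production" or "staging"
    usage_tier: str   # mirrors the model routing tier

    def as_tag(self) -> str:
        # Hypothetical flat encoding for log lines or metric labels.
        return "/".join(
            (self.business_unit, self.feature, self.environment, self.usage_tier)
        )


@dataclass(frozen=True)
class CallRecord:
    key: CostKey
    model: str
    cost_usd: float


def cost_by_business_unit(records: Iterable[CallRecord]) -> Dict[str, float]:
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.key.business_unit] += record.cost_usd
    return dict(totals)


def spend_share_by_feature(records: Iterable[CallRecord]) -> Dict[str, float]:
    """Each feature's fraction of total spend (empty if there is no spend)."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.key.feature] += record.cost_usd
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {feature: cost / grand_total for feature, cost in totals.items()}


# Made-up records, purely for illustration.
records = [
    CallRecord(CostKey("legal", "contract-review", "production", "frontier"), "model-a", 3.0),
    CallRecord(CostKey("support", "chat", "production", "small"), "model-b", 1.0),
]
print(cost_by_business_unit(records))
print(spend_share_by_feature(records))
```

Comparing a feature's spend share with its traffic share is the kind of check that exposes a feature whose cost is out of proportion to its usage.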
In three of the deployments covered by this paper, the cost attribution dashboard surfaced a feature consuming 28–41% of total LLM spend while accounting for under 3% of user-facing traffic. In each case, the underlying cause was a prompt that included the full document corpus rather than only the retrieved chunks for a given query - a prompt engineering error that the team had not previously had visibility into. The cost attribution layer made the anomaly identifiable in days rather than months.
What we would do differently
Build the routing classifier before you need it. Every team that has implemented model routing after go-live has found it harder than expected - not because the technology is difficult, but because retrofitting routing logic into an existing prompt structure that was designed for a single model often requires re-engineering the prompt itself. Design your prompts and feature interfaces from the start to be routing-compatible: keep the task-specification portion of the prompt separable from the context, and build the classifier training infrastructure in the first sprint.
Instrument cache performance from day one. Cache hit rates degrade silently as product features evolve - a prompt change that moves system instructions below the cached document context can collapse cache performance overnight. Monitoring cache hit rate as a first-class metric, with alerts on degradation, prevents this from becoming a surprise cost event.
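One way to treat cache hit rate as a first-class metric is a rolling token-weighted ratio with an alert threshold, sketched below. The class name, window size, threshold, and minimum sample count are all hypothetical; the three token counts are assumed to come from the per-response usage fields reported by the provider.

```python
from collections import deque
from typing import Deque, Tuple


class CacheHitMonitor:
    """Rolling token-weighted cache hit rate over the most recent requests."""

    def __init__(self, window: int = 500, alert_below: float = 0.5,
                 min_requests: int = 5) -> None:
        self._samples: Deque[Tuple[int, int]] = deque(maxlen=window)
        self.alert_below = alert_below    # illustrative threshold
        self.min_requests = min_requests  # avoid alerting on tiny samples

    def record(self, cache_read_tokens: int, cache_creation_tokens: int,
               input_tokens: int) -> None:
        total = cache_read_tokens + cache_creation_tokens + input_tokens
        self._samples.append((cache_read_tokens, total))

    def hit_rate(self) -> float:
        read = sum(r for r, _ in self._samples)
        total = sum(t for _, t in self._samples)
        return read / total if total else 0.0

    def should_alert(self) -> bool:
        return (
            len(self._samples) >= self.min_requests
            and self.hit_rate() < self.alert_below
        )


monitor = CacheHitMonitor()
for _ in range(10):
    # Synthetic usage: 900 tokens read from cache, 100 fresh, per request.
    monitor.record(cache_read_tokens=900, cache_creation_tokens=0, input_tokens=100)
print(monitor.hit_rate(), monitor.should_alert())
```

Tracking one monitor per product feature, rather than one global figure, keeps a regression in a low-traffic feature from being averaged away.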
Test your fallback chain before you need it. Fallback chains that are configured but never exercised tend to have subtle bugs that only surface under real provider failure conditions - incorrect model name strings, missing API key configuration for the fallback provider, or quality-gate logic that rejects fallback responses because they don't match the format expected from the primary model. Run monthly chaos tests against your fallback chain in a staging environment that mirrors production configuration.
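The failure modes listed above are easiest to catch when the chain is a small, testable function. The sketch below is a minimal illustration under stated assumptions: `call_with_fallback`, the provider tuple layout, and the default quality gate are hypothetical, and the two provider functions are stand-ins that simulate an outage rather than real API clients.

```python
from typing import Callable, List, Sequence, Tuple

Provider = Tuple[str, Callable[[str], str]]  # (name, function that calls the model)


class AllProvidersFailed(RuntimeError):
    """Raised when every provider errored or was rejected by the quality gate."""


def call_with_fallback(
    prompt: str,
    providers: Sequence[Provider],
    quality_gate: Callable[[str], bool] = lambda text: bool(text.strip()),
) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, response_text)."""
    errors: List[str] = []
    for name, call in providers:
        try:
            text = call(prompt)
        except Exception as exc:  # provider outage, bad model name, missing key
            errors.append(f"{name}: {exc!r}")
            continue
        if quality_gate(text):
            return name, text
        errors.append(f"{name}: rejected by quality gate")
    raise AllProvidersFailed("; ".join(errors))


# Chaos-test style check with simulated providers (no network calls).
def failing_primary(prompt: str) -> str:
    raise TimeoutError("simulated provider outage")


def healthy_fallback(prompt: str) -> str:
    return f"echo: {prompt}"


used, text = call_with_fallback(
    "ping", [("primary", failing_primary), ("fallback", healthy_fallback)]
)
print(used, text)
```

Passing the quality gate in as a parameter lets the same chaos test confirm that the gate accepts the fallback model's output format, not only the primary's.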