LLMs, GenAI, agentic systems - the signal through the noise.

Practical analysis of where AI engineering is actually heading - written by the engineers using these models in production every week, not analysts reading press releases. Updated monthly. Honest about what works, what doesn't, and what's hype.

Coverage
LLMs & Foundation Models
GenAI in Production
Agentic Systems
AI Safety & Alignment

May 2025 - five things our team is talking about.

Signal
Context Window Race Slowing

After two years of doubling, window sizes are no longer the differentiator. Retrieval quality is.

Signal
Claude Sonnet 4 Coding Lead

Consistent top-3 across all coding benchmarks for 3 months. Clients are noticing.

Watch
Agent Reliability Still Fragile

Multi-step agents are hitting 85% task success in controlled settings, but success rates are much lower in the wild.

Watch
Gemini 2.0 Catching Up

Google's long-context performance is now competitive. Worth re-evaluating if you ruled it out 6 months ago.

Noise
"AGI by 2026" Headlines

Marketing cycle noise. Ignore the claims; watch the benchmark trajectories.

How the leading models compare on the tasks that matter in production.

Updated monthly. Benchmarks are supplemented with our own internal test suite across real enterprise tasks - classification, extraction, code generation, and multi-step reasoning - not just public leaderboard numbers.

Model (Provider) · Coding · Reasoning · Long Context · Cost Efficiency · Latency
Claude Sonnet 4.5 (Anthropic)
GPT-4o (2025-05) (OpenAI)
Gemini 2.0 Flash (Google)
DeepSeek R2 (DeepSeek)
Llama 3.1 405B (Meta, hosted)
Claude Haiku 4.5 (Anthropic)
GPT-4o-mini (OpenAI)
Mistral Large 3 (Mistral)
★★★★★ = Excellent for production · ★★★★☆ = Good · ★★★☆☆ = Adequate · ★★☆☆☆ = Use with caution · Based on internal test suite + public benchmarks, updated May 2025.

Where AI agents are genuinely useful - and where they're oversold.

The gap between "autonomous agent" as a demo category and "autonomous agent" as something you'd run with production data is larger than most vendor marketing suggests. Here's our honest assessment by agent pattern.

Agent maturity varies wildly by pattern.

The word "agent" covers everything from a simple tool-calling loop to a fully autonomous system making consequential decisions. The maturity and reliability of these patterns are not equal.

Pattern Maturity - Our Assessment
Single-step tool calling 92%
Document Q&A (RAG) 88%
Classification & routing 90%
Code generation assist 85%
Multi-step structured workflows 72%
Autonomous research agents 58%
Cross-system orchestration 64%
Fully autonomous decision-making 35%
RAG & Document Intelligence

The most production-ready agentic pattern. Retrieval-augmented generation for document Q&A, contract analysis, knowledge base querying, and research summarisation is genuinely reliable at scale - when the retrieval layer is well-engineered.

Production Ready
Chunking strategy matters more than model choice - 80% of RAG quality problems are retrieval problems, not generation problems.
Hybrid search (semantic + BM25) consistently outperforms vector-only retrieval on real-world document corpora.
Reranking with a cross-encoder before generation is worth the added latency on most document Q&A applications.
Metadata filtering is underused - filtering to relevant document subsets before semantic search dramatically improves precision.
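The retrieval points above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: it assumes you already have two ranked lists of document IDs (one from a BM25 index, one from a vector index) and merges them with reciprocal rank fusion, one common way to combine rankings without calibrating their scores against each other. The function names, the k=60 constant, and the sample corpus are assumptions made for the example.

```python
from collections import defaultdict


def filter_by_metadata(docs, **required):
    """Keep only documents whose metadata matches every required key/value."""
    return [d for d in docs if all(d["meta"].get(k) == v for k, v in required.items())]


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked highly by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; tie-break on doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical corpus and rankings, for illustration only.
docs = [
    {"id": "a", "meta": {"type": "contract"}},
    {"id": "b", "meta": {"type": "contract"}},
    {"id": "c", "meta": {"type": "invoice"}},
    {"id": "d", "meta": {"type": "contract"}},
]

# In a real system this filter would be pushed into the index query itself,
# so the search only ever sees the relevant subset.
allowed = {d["id"] for d in filter_by_metadata(docs, type="contract")}

bm25_ranking = [i for i in ["c", "a", "d", "b"] if i in allowed]    # keyword hits
vector_ranking = [i for i in ["b", "a", "c", "d"] if i in allowed]  # semantic hits

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

A cross-encoder reranker would then rescore only the top few entries of the fused list, which keeps the added latency bounded.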
Structured Workflow Automation

Agents that execute multi-step workflows - extract data from a document, validate it, call an API, update a database - are reliable when steps are well-defined and tool interfaces are stable.

Early Stage
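As a rough sketch of what "well-defined steps" can look like in code: each step either returns a typed result or raises a named failure, and nothing is written downstream unless validation passes. Everything here is hypothetical (the Invoice shape, the StepFailed exception, the dict standing in for a database), and the extraction step is a stub where an LLM call would normally sit.

```python
from dataclasses import dataclass


class StepFailed(Exception):
    """Raised when a workflow step cannot complete; carries the step name."""

    def __init__(self, step, reason):
        super().__init__(f"{step}: {reason}")
        self.step = step
        self.reason = reason


@dataclass
class Invoice:
    number: str
    total: float


def extract(document: dict) -> Invoice:
    # Stand-in for an LLM extraction call; here the "document" is already a dict.
    try:
        return Invoice(number=str(document["number"]), total=float(document["total"]))
    except (KeyError, TypeError, ValueError) as exc:
        raise StepFailed("extract", repr(exc)) from exc


def validate(invoice: Invoice) -> Invoice:
    if not invoice.number:
        raise StepFailed("validate", "missing invoice number")
    if invoice.total <= 0:
        raise StepFailed("validate", "total must be positive")
    return invoice


def run_workflow(document: dict, store: dict) -> str:
    """Extract, validate, then persist. Nothing is written unless validation passes."""
    try:
        invoice = validate(extract(document))
    except StepFailed as failure:
        return f"needs_review ({failure.step})"
    store[invoice.number] = invoice.total  # stand-in for the API call / DB update
    return "completed"


store = {}
ok = run_workflow({"number": "INV-1", "total": "120.50"}, store)
bad_total = run_workflow({"number": "INV-2", "total": "-5"}, store)
missing_field = run_workflow({"number": "INV-3"}, store)
print(ok, bad_total, missing_field, store)
```

Recording which step failed is the design choice that matters here: it lets a human reviewer pick the item up at the right point instead of rerunning the whole workflow.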
Classification & Smart Routing

LLM-based classification for email, ticket, and document routing is now reliable enough for production. The key is confidence scoring - auto-route high-confidence cases, surface low-confidence to humans.

Production Ready
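The confidence-scoring idea reduces to a small routing rule. The sketch below assumes the classifier already returns a label and a confidence between 0 and 1; the 0.9 threshold, the queue names, and the route function itself are placeholders rather than recommended values. How the confidence is produced is out of scope here, and it should be checked against labelled data before a threshold is chosen.

```python
def route(label: str, confidence: float, threshold: float = 0.9) -> dict:
    """Auto-route confident predictions; send the rest to a human queue."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= threshold:
        return {"queue": label, "handled_by": "auto"}
    # Keep the model's guess so the reviewer sees a suggestion, not a blank form.
    return {"queue": "human_review", "handled_by": "human", "suggested": label}


confident = route("billing", 0.97)
uncertain = route("billing", 0.62)
print(confident)
print(uncertain)
```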
Autonomous Research Agents

Agents that browse the web, synthesise findings, and produce research reports are impressive in demos. In production, factual accuracy, citation quality, and consistent behaviour across runs are still unreliable enough to require human review on all outputs.

Experimental
Code Generation & Review

Copilot-style assistance and targeted code generation for well-specified functions are production-ready. Full-feature autonomous development (write a complete service, test it, debug it) remains early - useful for accelerating engineers, not replacing them.

Early Stage
Autonomous Decision-Making

Agents making consequential decisions - financial transactions, medical triage, legal recommendations - are not ready for unsupervised production use regardless of benchmark performance. The failure modes are too unpredictable and the cost of errors too high.

Experimental

Six developments worth following over the next 12 months.

Not predictions - these are directions we're tracking closely because they have real engineering implications, even if the timeline and magnitude are uncertain.

Now
Reasoning Models Crossing Into Production

OpenAI o3-mini and DeepSeek R1 have made reasoning models cheap enough for production. The use case is narrow - complex multi-step problems where chain-of-thought matters - but within that narrow band, performance is meaningfully better.

Cost per reasoning token has dropped 10× in 18 months
Latency still too high for latency-sensitive applications
Best fit: complex analysis tasks where 5–15s response is acceptable
Now
Retrieval Quality as the LLM Differentiator

As generation quality converges across frontier models, retrieval quality is becoming the actual differentiator in RAG applications. The companies investing in hybrid search, reranking, and metadata-filtered retrieval are pulling ahead.

ColBERTv2 and BGE-M3-based rerankers are worth the added infra complexity
Metadata filtering reduces token consumption significantly
Chunking strategy variance accounts for more perf spread than model variance
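To make "chunking strategy" concrete, here is one of the simplest strategies: fixed-size word windows with overlap, so that a sentence cut at a boundary still appears whole in a neighbouring chunk. It is a baseline sketch only; the chunk_words name and the default sizes are assumptions, and structure-aware splitting (by heading, clause, or table) is often what actually moves retrieval quality.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word windows of `size`, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, size=4, overlap=1)
print(chunks)
```

Varying size and overlap on a fixed evaluation set is a cheap way to measure how much of the quality spread comes from chunking before changing models.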
6–18 Months
On-Device Frontier Models for Privacy-Sensitive Applications

Llama 3.2 running on Apple Silicon M-series chips is already capable enough for many enterprise document processing tasks. The trajectory toward capable on-device inference has significant implications for healthcare, legal, and financial applications.

Apple's MLX framework for Apple Silicon is reducing deployment friction
Privacy-preserving inference becoming a competitive selling point
Expect on-device models to displace 30–40% of cloud inference by 2027
6–18 Months
MCP (Model Context Protocol) Standardisation

Anthropic's MCP is gaining traction as a standard for connecting LLMs to external tools and data sources. If it becomes the industry standard, it significantly simplifies multi-vendor agent architectures and reduces integration overhead.

Early adoption by major IDE vendors and developer tools
Reduces bespoke tool-calling code in agent implementations
Worth building new agent infrastructure to MCP from the start
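For a sense of what building to MCP means at the wire level: as we understand the specification, MCP messages are JSON-RPC 2.0, tools are described with a name, a description, and a JSON Schema for their input, and a tool is invoked with the tools/call method. The sketch below only builds such a request with the standard library. The search_contracts tool and the tool_call_request helper are hypothetical, and transport, the initialisation handshake, and response handling are all omitted.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per connection


def tool_call_request(tool_name: str, arguments: dict) -> str:
    """Serialise a JSON-RPC 2.0 request for MCP's `tools/call` method."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# Hypothetical tool descriptor, shaped like an entry in a `tools/list` response.
search_tool = {
    "name": "search_contracts",
    "description": "Full-text search over a contract repository.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

request = tool_call_request(search_tool["name"], {"query": "termination clause"})
print(request)
```

In practice an MCP SDK would handle this framing; the point is that the same descriptor and call shape work regardless of which vendor's model sits on the other side.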
6–18 Months
Multimodal Models in Document Processing

Vision-language models processing documents as images rather than extracted text are handling edge cases that traditional OCR pipelines can't. For complex layouts - invoices, contracts, technical diagrams - the gap is meaningful.

Claude's vision capabilities now viable for structured document extraction
Reduces preprocessing pipeline complexity significantly
Higher token cost but lower total cost when OCR pipeline maintenance is factored in
2+ Years
AI-Generated Training Data at Scale

Synthetic data generation for fine-tuning is maturing from experimental to practical for specific domains. Whether models trained on AI-generated data degrade over successive iterations (model collapse) is still actively debated, but early evidence is more encouraging than the initial pessimistic predictions suggested.

Distillation from large to small models showing strong results
Domain-specific synthetic data quality now competitive with human annotation
Enterprise adoption will accelerate once legal clarity on data provenance improves

Building something that needs to actually work in production?

The gap between AI demos and AI that runs reliably in production is where most projects get stuck. We've closed that gap for 20+ teams - from LLM feature design through to production infrastructure and monitoring. If you're at that stage, let's talk.

Analysis updated monthly by practitioners
Model benchmarks from production deployments
No paywalls, no registration