LLMs, GenAI, agentic systems - the signal through the noise.
Practical analysis of where AI engineering is actually heading - written by the engineers using these models in production every week, not analysts reading press releases. Updated monthly. Honest about what works, what doesn't, and what's hype.
May 2025 - five things our team is talking about.
After two years of doubling, context window sizes are no longer the differentiator. Retrieval quality is.
Consistent top-3 across all coding benchmarks for 3 months. Clients are noticing.
Multi-step agents hitting 85% task success in controlled settings, much lower in the wild.
Google's long-context performance is now competitive. Worth re-evaluating if you ruled it out 6 months ago.
Marketing cycle noise. Ignore the claims; watch the benchmark trajectories.
What we're thinking about, and why it matters to engineers.
These aren't conference-circuit summaries. They're written when something genuinely changes in our production systems or our thinking - which is roughly once a month.
The gap between "our agent demo worked perfectly" and "our agent handles 80% of tasks reliably in production" is where most teams are stuck right now. Twelve months of running agents in production across six client deployments taught us more about failure modes than any benchmark ever did.
GPT-4o costs 20× more per token than Llama 3.1 on most hosting providers. For classification and extraction tasks, the performance delta doesn't justify it. We benchmarked 14 common enterprise tasks and found the sweet spot is more nuanced than the "best model for everything" argument suggests.
How the leading models compare on the tasks that matter in production.
Updated monthly. Benchmarks are supplemented with our own internal test suite across real enterprise tasks - classification, extraction, code generation, and multi-step reasoning - not just public leaderboard numbers.
Where AI agents are genuinely useful - and where they're oversold.
The gap between "autonomous agent" as a demo category and "autonomous agent" as something you'd run with production data is larger than most vendor marketing suggests. Here's our honest assessment by agent pattern.
Agent maturity varies wildly by pattern.
The word "agent" covers everything from a simple tool-calling loop to a fully autonomous system making consequential decisions. The maturity and reliability of these patterns are not equal.
The most production-ready agentic pattern. Retrieval-augmented generation for document Q&A, contract analysis, knowledge base querying, and research summarisation is genuinely reliable at scale - when the retrieval layer is well-engineered.
Agents that execute multi-step workflows - extract data from a document, validate it, call an API, update a database - are reliable when steps are well-defined and tool interfaces are stable.
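A minimal sketch of the extract, validate, call API, update database sequence described above, assuming an invoice-style record. The names `run_workflow`, `validate_invoice`, and `WorkflowResult`, and the `invoice_id` and `total` fields, are hypothetical and chosen only for illustration. The point it tries to show is that each step is an explicit, checkable unit, and that a failure stops the run and hands the item to a human rather than letting later steps act on bad data.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class WorkflowResult:
    status: str  # "completed" or "needs_review"
    completed_steps: List[str] = field(default_factory=list)
    error: Optional[str] = None


def validate_invoice(record: dict) -> None:
    """Raise ValueError if the extracted record fails basic checks (hypothetical schema)."""
    if not record.get("invoice_id"):
        raise ValueError("missing invoice_id")
    total = record.get("total")
    if not isinstance(total, (int, float)) or total <= 0:
        raise ValueError("total must be a positive number")


def run_workflow(
    document: str,
    extract: Callable[[str], dict],         # e.g. an LLM extraction call (assumed)
    call_api: Callable[[dict], dict],       # e.g. a downstream service call (assumed)
    update_db: Callable[[dict, dict], None],
) -> WorkflowResult:
    """Run the steps in order; stop at the first failure and flag for review."""
    result = WorkflowResult(status="completed")
    try:
        record = extract(document)
        result.completed_steps.append("extract")

        validate_invoice(record)
        result.completed_steps.append("validate")

        response = call_api(record)
        result.completed_steps.append("call_api")

        update_db(record, response)
        result.completed_steps.append("update_db")
    except Exception as exc:  # any failed step routes the item to a human
        result.status = "needs_review"
        result.error = f"{type(exc).__name__}: {exc}"
    return result
```

Recording which steps completed makes it clear where a run stopped, which helps when a tool interface changes underneath the agent.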
LLM-based classification for email, ticket, and document routing is now reliable enough for production. The key is confidence scoring - auto-route high-confidence cases, surface low-confidence to humans.
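The confidence-scoring idea above can be sketched in a few lines. This is an illustration under assumptions, not a description of any specific deployment: the `Prediction` type, the `route` function, and the 0.90 threshold are all hypothetical, and it assumes the classifier exposes a confidence value between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str         # e.g. "billing", "support"
    confidence: float  # assumed to be a score in [0, 1]


# Hypothetical starting point; in practice this would be tuned on labelled data.
AUTO_ROUTE_THRESHOLD = 0.90


def route(prediction: Prediction, threshold: float = AUTO_ROUTE_THRESHOLD) -> str:
    """Auto-route high-confidence cases; send everything else to a human queue."""
    if not 0.0 <= prediction.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if prediction.confidence >= threshold:
        return f"auto:{prediction.label}"
    return "human_review"
```

For example, `route(Prediction("billing", 0.97))` returns `"auto:billing"`, while `route(Prediction("billing", 0.55))` returns `"human_review"`. The threshold sets the trade-off between automation rate and the error rate on auto-routed items.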
Agents that browse the web, synthesise findings, and produce research reports are impressive in demos. In production, factual accuracy, citation quality, and consistent behaviour across runs are still unreliable enough to require human review on all outputs.
Copilot-style assistance and targeted code generation for well-specified functions are production-ready. Full-feature autonomous development (write a complete service, test it, debug it) remains early - useful for accelerating engineers, not replacing them.
Agents making consequential decisions - financial transactions, medical triage, legal recommendations - are not ready for unsupervised production use regardless of benchmark performance. The failure modes are too unpredictable and the cost of errors too high.
Six developments worth following over the next 12 months.
Not predictions - these are directions we're tracking closely because they have real engineering implications, even if the timeline and magnitude are uncertain.
OpenAI o3-mini and DeepSeek R1 have made reasoning models cheap enough for production. The use case is narrow - complex multi-step problems where chain-of-thought matters - but within that narrow band, performance is meaningfully better.
As generation quality converges across frontier models, retrieval quality is becoming the actual differentiator in RAG applications. The companies investing in hybrid search, reranking, and metadata-filtered retrieval are pulling ahead.
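One common way to combine keyword and vector results in hybrid search is reciprocal rank fusion (RRF). The sketch below is an example of that technique, not a claim about what any particular team uses. The function name and document IDs are hypothetical, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    where rank starts at 1. Documents ranked well by several retrievers
    accumulate the highest scores.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["a", "b", "c"]  # e.g. from a keyword (BM25-style) index
vector_hits = ["b", "d", "a"]   # e.g. from an embedding index
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
# fused == ["b", "a", "d", "c"]
```

RRF uses only rank positions, so it does not require keyword and embedding scores to be on the same scale. A reranker or metadata filter would typically run on the fused list afterwards.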
Llama 3.2 running on Apple Silicon M-series chips is already capable enough for many enterprise document processing tasks. The trajectory toward capable on-device inference has significant implications for healthcare, legal, and financial applications.
Anthropic's MCP is gaining traction as a standard for connecting LLMs to external tools and data sources. If it becomes the industry standard, it significantly simplifies multi-vendor agent architectures and reduces integration overhead.
Vision-language models processing documents as images rather than extracted text are handling edge cases that traditional OCR pipelines can't. For complex layouts - invoices, contracts, technical diagrams - the gap is meaningful.
Synthetic data generation for fine-tuning is maturing from experimental to practical for specific domains. Whether models trained on AI-generated data degrade over successive iterations (model collapse) is still actively debated, but early evidence is more encouraging than the initial predictions suggested.
Building something that needs to actually work in production?
The gap between AI demos and AI that runs reliably in production is where most projects get stuck. We've closed that gap for 20+ teams - from LLM feature design through to production infrastructure and monitoring. If you're at that stage, let's talk.