
LLMs, GenAI, agentic systems - the signal through the noise.

Practical analysis of where AI engineering is actually heading - written by the engineers using these models in production every week, not analysts reading press releases. Updated monthly. Honest about what works, what doesn't, and what's hype.


Coverage

LLMs & Foundation Models

GenAI in Production

Agentic Systems

AI Safety & Alignment

Sequere Model Tracker - Production Benchmarks May 2025

Coding · Reasoning · Long Context · Cost/Token · Latency

Coding Performance - HumanEval + Our Internal Suite

Claude Sonnet 4.5: 92/100 - our internal test suite (282 tasks)

GPT-4o (2025-05): 89/100 - our internal test suite (290 tasks)

Gemini 1.5 Pro: 87/100 - our internal test suite (295 tasks)

DeepSeek: 85/100 - our internal test suite (314 tasks)

Llama 3.1 405B: 82/100 - our internal test suite (305 tasks)

Cost per 1M tokens (Input / Output) - May 2025

Claude S4.5: $3 / $15

GPT-4o: $2.50 / $10

Gemini 1.5: $1.25 / $5

DeepSeek R2: $0.55 / $2.19

Llama 3.1: $0.90 / $0.90

Mistral L3: $0.45 / $0.45
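Per-token prices only become meaningful once they are multiplied by a real workload. A minimal sketch of that arithmetic, using the May 2025 prices listed above; the workload figures (100,000 requests a month, 2,000 input and 300 output tokens per request) and the function name `monthly_cost` are illustrative assumptions, not measurements from our deployments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float   # USD per 1M input tokens
    output_per_m: float  # USD per 1M output tokens


# Prices copied from the May 2025 list on this page.
PRICES = {
    "Claude S4.5": Price(3.00, 15.00),
    "GPT-4o": Price(2.50, 10.00),
    "Gemini 1.5": Price(1.25, 5.00),
    "DeepSeek R2": Price(0.55, 2.19),
    "Llama 3.1": Price(0.90, 0.90),
    "Mistral L3": Price(0.45, 0.45),
}


def monthly_cost(model: str, requests: int, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of `requests` calls with the given token counts per call."""
    price = PRICES[model]
    per_request = (
        input_tokens * price.input_per_m + output_tokens * price.output_per_m
    ) / 1_000_000
    return requests * per_request


if __name__ == "__main__":
    # Hypothetical workload: 100k requests, 2,000 tokens in and 300 tokens out each.
    for name in PRICES:
        print(f"{name:12s} ${monthly_cost(name, 100_000, 2_000, 300):,.2f}")
```

Because output tokens are priced higher than input tokens on most of these models, the ranking can shift depending on how output-heavy the workload is, so it is worth running the numbers with your own token counts.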

May 2025 - five things our team is talking about.

Signal

Context Window Race Slowing

After two years of doubling, window sizes are no longer the differentiator. Retrieval quality is.

Signal

Claude Sonnet 4 Coding Lead

Consistent top-3 across all coding benchmarks for 3 months. Clients are noticing.

Watch

Agent Reliability Still Fragile

Multi-step agents hitting 85% task success in controlled settings, much lower in the wild.

Watch

Gemini 2.0 Catching Up

Google's long-context performance is now competitive. Worth re-evaluating if you ruled it out 6 months ago.

Noise

"AGI by 2026" Headlines

Marketing cycle noise. Ignore the claims; watch the benchmark trajectories.

Latest Analysis

What we're thinking about, and why it matters to engineers.

These aren't conference-circuit summaries. They're written when something genuinely changes in our production systems or our thinking - which is roughly once a month.

May 2025 · 14 min read

Deep Analysis · Agentic Systems

Why Agentic AI Is Harder Than It Looks - and What's Actually Making It Work in 2025

The gap between "our agent demo worked perfectly" and "our agent handles 80% of tasks reliably in production" is where most teams are stuck right now. Twelve months of running agents in production across six client deployments taught us more about failure modes than any benchmark ever did.

Key points

Tool calling reliability varies wildly by model - Claude and GPT-4o are meaningfully better than alternatives on multi-step chains.

Structured output enforcement (Instructor, Outlines) is non-negotiable for production; parsing free-form JSON out of raw LLM responses fails at scale.

The most successful agents we've built use deterministic routing for known patterns and LLM reasoning only for genuine ambiguity.

Human-in-the-loop checkpoints at confidence thresholds are a feature, not a failure - they're what makes 85% automation achievable.
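The "deterministic routing first, LLM only for ambiguity" point can be sketched in a few lines. This is an illustration under stated assumptions rather than our production router: the rule table, the handler names, and `fake_llm_classify` (a stand-in for a real model call) are all hypothetical.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical rule table: known patterns map straight to a handler, no model call.
RULES = [
    (re.compile(r"\breset (my )?password\b", re.I), "password_reset"),
    (re.compile(r"\b(invoice|receipt)\b", re.I), "billing"),
]


def route_deterministic(text: str) -> Optional[str]:
    """Return a handler name if a known pattern matches, else None."""
    for pattern, handler in RULES:
        if pattern.search(text):
            return handler
    return None


def route(text: str, llm_classify: Callable[[str], str]) -> Tuple[str, str]:
    """Try the rules first; fall back to the LLM only when nothing matches.

    Returns (handler, source), where source is "rule" or "llm".
    """
    handler = route_deterministic(text)
    if handler is not None:
        return handler, "rule"
    return llm_classify(text), "llm"


def fake_llm_classify(text: str) -> str:
    """Stand-in for a real model call; always returns the same label."""
    return "general_enquiry"


if __name__ == "__main__":
    for message in ["Please reset my password", "My widget makes a strange noise"]:
        print(message, "->", route(message, fake_llm_classify))
```

Recording whether a decision came from a rule or from the model also makes it easy to measure how much traffic actually needs LLM reasoning.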

LLM Engineering

The Cost of Context: Why Smaller Models Win More Often Than You'd Think

April 2025

GPT-4o costs 20× more per token than Llama 3.1 on most hosting providers. For classification and extraction tasks, the performance delta doesn't justify it. We benchmarked 14 common enterprise tasks and found the sweet spot is more nuanced than the "best model for everything" argument suggests.

Model Selection · Cost Analysis · LLM Routing


Model Comparison

How the leading models compare on the tasks that matter in production.

Updated monthly. Benchmarks are supplemented with our own internal test suite across real enterprise tasks - classification, extraction, code generation, and multi-step reasoning - not just public leaderboard numbers.

Model · Coding · Reasoning · Long Context · Cost Efficiency · Latency

Claude Sonnet 4.5 - Anthropic

GPT-4o (2025-05) - OpenAI

Gemini 2.0 Flash - Google

DeepSeek R2 - DeepSeek

Llama 3.1 405B - Meta (hosted)

Claude Haiku 4.5 - Anthropic

GPT-4o-mini - OpenAI

Mistral Large 3 - Mistral

★★★★★ = Excellent for production · ★★★★☆ = Good · ★★★☆☆ = Adequate · ★★☆☆☆ = Use with caution · Based on internal test suite + public benchmarks, updated May 2025.

Agentic Systems

Where AI agents are genuinely useful - and where they're oversold.

The gap between "autonomous agent" as a demo category and "autonomous agent" as something you'd run with production data is larger than most vendor marketing suggests. Here's our honest assessment by agent pattern.

Agent maturity varies wildly by pattern.

The word "agent" covers everything from a simple tool-calling loop to a fully autonomous system making consequential decisions. The maturity and reliability of these patterns are not equal.

Pattern Maturity - Our Assessment

Single-step tool calling 92%

Document Q&A (RAG) 88%

Classification & routing 90%

Code generation assist 85%

Multi-step structured workflows 72%

Autonomous research agents 58%

Cross-system orchestration 64%

Fully autonomous decision-making 35%
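One way to see why multi-step patterns sit so far below single-step ones is compounding. A rough sketch, assuming (hypothetically) that every step succeeds independently with the same probability; real agents violate both assumptions, and the maturity scores above are our assessments rather than measured per-step success rates. The 0.92 figure is borrowed purely as an illustrative per-step rate.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` succeed, assuming independent, identical steps."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability between 0 and 1")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return per_step ** steps


if __name__ == "__main__":
    # Illustrative only: a 92%-reliable step chained 1, 2, 4 and 8 times.
    for n in (1, 2, 4, 8):
        print(f"{n} step(s): {chain_success(0.92, n):.3f}")
```

Under these assumptions a four-step chain of 92%-reliable steps completes about 72% of the time, which is why checkpoints and retries matter more as workflows get longer.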

RAG & Document Intelligence

The most production-ready agentic pattern. Retrieval-augmented generation for document Q&A, contract analysis, knowledge base querying, and research summarisation is genuinely reliable at scale - when the retrieval layer is well-engineered.

Production Ready

Chunking strategy matters more than model choice - 80% of RAG quality problems are retrieval problems, not generation problems.

Hybrid search (semantic + BM25) consistently outperforms vector-only retrieval on real-world document corpora.

Reranking with a cross-encoder before generation is worth the added latency on most document Q&A applications.

Metadata filtering is underused - filtering to relevant document subsets before semantic search dramatically improves precision.
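The retrieval points above can be sketched without any search engine at all. This example narrows the corpus by metadata first, then merges a semantic ranking and a BM25 ranking with reciprocal rank fusion. RRF is one common fusion method chosen for this illustration, not necessarily the one we run in production; the documents, the rankings, and the constant `k=60` are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def filter_by_metadata(docs: List[Dict], **required) -> List[Dict]:
    """Keep only documents whose metadata matches every required key/value."""
    return [
        d for d in docs
        if all(d["meta"].get(key) == value for key, value in required.items())
    ]


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of doc ids; each hit contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    docs = [
        {"id": "a", "meta": {"type": "contract"}},
        {"id": "b", "meta": {"type": "contract"}},
        {"id": "c", "meta": {"type": "contract"}},
        {"id": "d", "meta": {"type": "invoice"}},
    ]
    allowed = {d["id"] for d in filter_by_metadata(docs, type="contract")}
    # Hypothetical rankings, as if both searches ran over the filtered subset only.
    semantic = ["a", "b", "c"]
    bm25 = ["b", "c", "a"]
    assert set(semantic) <= allowed and set(bm25) <= allowed
    print(reciprocal_rank_fusion([semantic, bm25]))
```

A cross-encoder reranker would then rescore only the top few fused results, which keeps the added latency bounded.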

Structured Workflow Automation

Agents that execute multi-step workflows - extract data from a document, validate it, call an API, update a database - are reliable when steps are well-defined and tool interfaces are stable.

Early Stage
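A minimal sketch of the extract, validate, call, update shape described above, using only the standard library. It assumes the model has been asked to return an invoice as JSON; the `Invoice` fields, `post_to_api`, and `save_record` are hypothetical stand-ins, and libraries such as Instructor or Outlines would replace the hand-written validation here.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Invoice:
    invoice_id: str
    amount: float
    currency: str


class ValidationError(Exception):
    """Raised when model output does not match the expected schema."""


def parse_invoice(raw: str) -> Invoice:
    """Validate raw model output before anything downstream sees it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValidationError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValidationError("expected a JSON object")
    missing = {"invoice_id", "amount", "currency"} - data.keys()
    if missing:
        raise ValidationError(f"missing fields: {sorted(missing)}")
    invoice_id, amount, currency = data["invoice_id"], data["amount"], data["currency"]
    if not isinstance(invoice_id, str) or not invoice_id:
        raise ValidationError("invoice_id must be a non-empty string")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount <= 0:
        raise ValidationError("amount must be a positive number")
    if not isinstance(currency, str) or len(currency) != 3:
        raise ValidationError("currency must be a 3-letter code")
    return Invoice(invoice_id, float(amount), currency.upper())


def run_workflow(
    raw: str,
    post_to_api: Callable[[Invoice], dict],
    save_record: Callable[[Invoice, dict], None],
) -> dict:
    """Extract and validate, then call the API, then update the database."""
    invoice = parse_invoice(raw)    # steps 1-2: extract + validate
    receipt = post_to_api(invoice)  # step 3: external call (stubbed by the caller)
    save_record(invoice, receipt)   # step 4: persist the result
    return receipt
```

The point of the ordering is that a malformed model response fails at validation, before any API call or database write has happened.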

Classification & Smart Routing

LLM-based classification for email, ticket, and document routing is now reliable enough for production. The key is confidence scoring - auto-route high-confidence cases, surface low-confidence to humans.

Production Ready
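Confidence-based routing comes down to a single comparison plus a way to measure how much traffic it automates. A sketch under stated assumptions: the `Prediction` type, the 0.85 threshold, and the sample scores are hypothetical, and in practice the threshold should be tuned against labelled data for each queue.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Prediction:
    label: str
    confidence: float  # classifier confidence between 0 and 1


def dispatch(pred: Prediction, threshold: float = 0.85) -> Tuple[str, str]:
    """Auto-route confident predictions; send the rest to a human queue."""
    if pred.confidence >= threshold:
        return ("auto", pred.label)
    return ("human_review", pred.label)


def automation_rate(preds: List[Prediction], threshold: float = 0.85) -> float:
    """Fraction of predictions that would be auto-routed at this threshold."""
    if not preds:
        return 0.0
    auto = sum(1 for p in preds if p.confidence >= threshold)
    return auto / len(preds)


if __name__ == "__main__":
    sample = [
        Prediction("billing", 0.97),
        Prediction("support", 0.91),
        Prediction("legal", 0.60),
        Prediction("billing", 0.85),
    ]
    for p in sample:
        print(p, "->", dispatch(p))
    print("automation rate:", automation_rate(sample))
```

Raising the threshold trades automation rate for fewer misroutes, so both numbers are worth tracking together.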

Autonomous Research Agents

Agents that browse the web, synthesise findings, and produce research reports are impressive in demos. In production, factual accuracy, citation quality, and consistent behaviour across runs are still unreliable enough to require human review on all outputs.

Experimental

Code Generation & Review

Copilot-style assistance and targeted code generation for well-specified functions are production-ready. Full-feature autonomous development (write a complete service, test it, debug it) remains early - useful for accelerating engineers, not replacing them.

Early Stage

Autonomous Decision-Making

Agents making consequential decisions - financial transactions, medical triage, legal recommendations - are not ready for unsupervised production use regardless of benchmark performance. The failure modes are too unpredictable and the cost of errors too high.

Experimental

On Our Radar

Six developments worth following over the next 12 months.

Not predictions - these are directions we're tracking closely because they have real engineering implications, even if the timeline and magnitude are uncertain.

Now

Reasoning Models Crossing Into Production

OpenAI o3-mini and DeepSeek R1 have made reasoning models cheap enough for production. The use case is narrow - complex multi-step problems where chain-of-thought matters - but within that narrow band, performance is meaningfully better.

Cost per reasoning token has dropped 10× in 18 months

Latency still too high for latency-sensitive applications

Best fit: complex analysis tasks where 5–15s response is acceptable

Now

Retrieval Quality as the LLM Differentiator

As generation quality converges across frontier models, retrieval quality is becoming the actual differentiator in RAG applications. The companies investing in hybrid search, reranking, and metadata-filtered retrieval are pulling ahead.

ColBERTv2 and BGE-M3 rerankers are worth the added infra complexity

Metadata filtering reduces token consumption significantly

Chunking strategy variance accounts for more perf spread than model variance

6–18 Months

On-Device Frontier Models for Privacy-Sensitive Applications

Llama 3.2 running on Apple Silicon M-series chips is already capable enough for many enterprise document processing tasks. The trajectory toward capable on-device inference has significant implications for healthcare, legal, and financial applications.

Apple's MLX framework for Apple Silicon is reducing deployment friction

Privacy-preserving inference becoming a competitive selling point

Expect on-device models to displace 30–40% of cloud inference by 2027

6–18 Months

MCP (Model Context Protocol) Standardisation

Anthropic's MCP is gaining traction as a standard for connecting LLMs to external tools and data sources. If it becomes the industry standard, it significantly simplifies multi-vendor agent architectures and reduces integration overhead.

Early adoption by major IDE vendors and developer tools

Reduces bespoke tool-calling code in agent implementations

Worth building new agent infrastructure to MCP from the start
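For a sense of what "building to MCP" replaces, here is a sketch of a single tool invocation on the wire. It assumes the JSON-RPC 2.0 framing and the `tools/call` method as we understand them from the public MCP specification; the tool name `search_contracts` and its arguments are hypothetical, and the current spec should be checked before relying on the exact message shape.

```python
import json
from itertools import count
from typing import Any, Dict

# Simple incrementing request ids; a real client would track these per connection.
_request_ids = count(1)


def tools_call_request(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request asking an MCP server to run one tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


if __name__ == "__main__":
    # Hypothetical tool and arguments, for illustration only.
    message = tools_call_request("search_contracts", {"query": "termination clause"})
    print(json.dumps(message, indent=2))
```

Because every tool is invoked through the same envelope, swapping the model vendor or the tool server should not require rewriting the calling code.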

6–18 Months

Multimodal Models in Document Processing

Vision-language models processing documents as images rather than extracted text are handling edge cases that traditional OCR pipelines can't. For complex layouts - invoices, contracts, technical diagrams - the gap is meaningful.

Claude's vision capabilities now viable for structured document extraction

Reduces preprocessing pipeline complexity significantly

Higher token cost but lower total cost when OCR pipeline maintenance is factored in

2+ Years

AI-Generated Training Data at Scale

Synthetic data generation for fine-tuning is maturing from experimental to practical for specific domains. Whether models trained on AI-generated data degrade over successive iterations (model collapse) is still actively debated, but the early evidence is more encouraging than the initial predictions suggested.

Distillation from large to small models showing strong results

Domain-specific synthetic data quality now competitive with human annotation

Enterprise adoption will accelerate once legal clarity on data provenance improves

Work With Us

Building something that needs to actually work in production?

The gap between AI demos and AI that runs reliably in production is where most projects get stuck. We've closed that gap for 20+ teams - from LLM feature design through to production infrastructure and monitoring. If you're at that stage, let's talk.

Start a Conversation

Analysis updated monthly by practitioners

Model benchmarks from production deployments

No paywalls, no registration
