Why Agentic AI Is Harder Than It Looks - and What's Actually Making It Work in 2025

Key Takeaways - 4 things you need to know

Tool calling reliability varies significantly by model. Claude Sonnet and GPT-4o are meaningfully more reliable than alternatives on multi-step chains - the difference shows up at scale, not in demos.

Structured output enforcement (Instructor, Outlines, or native structured output APIs) is non-negotiable for production. Parsing free-form LLM JSON fails in ways that are difficult to reproduce and expensive to debug.

The most reliable production agents use deterministic routing for known patterns and LLM reasoning only for genuine ambiguity - not LLMs everywhere by default.

Human-in-the-loop checkpoints at confidence thresholds are a feature, not a fallback. They're what makes 80%+ automation achievable without unacceptable error rates in high-stakes workflows.

Twelve months ago, a client demoed their AI agent to us on a call. It parsed an email, identified the relevant database record, updated the status, and sent a confirmation - flawlessly, in about four seconds. They were ready to roll it out to their entire operations team the following week.

Six weeks later, they called us. The agent was failing on roughly 35% of real-world inputs. Not catastrophically - it wasn't deleting records or sending wrong confirmations. It was just quietly routing ambiguous emails to a fallback queue that nobody was monitoring. By the time they noticed, the queue had 1,400 unprocessed items.

This story is not unusual. The gap between "the agent works in our controlled demo environment" and "the agent handles real user inputs reliably at scale" is where most teams get stuck - and where the interesting engineering happens. This article is about that gap.

Why Agent Demos Lie to You

Demos are optimised for the happy path. The email is well-formatted, the intent is unambiguous, the database record exists. In production, none of these are guaranteed simultaneously.

Real-world input distributions have a long tail that demos never surface. In every agent deployment we've run, the top 20 input patterns cover about 60% of volume - and they work fine. It's the remaining 40% that matters, because that's where the agent either handles ambiguity gracefully or fails in ways that require human cleanup.

"The top 20 input patterns cover about 60% of volume and they work fine. It's the remaining 40% that matters."

— From six production deployments, aggregated

The specific failure modes we see repeatedly are:

Ambiguous intent: the user's request could mean multiple things, and the agent picks one without flagging uncertainty.
Missing preconditions: the agent assumes a record exists, an API is available, or a previous step succeeded - and proceeds when it hasn't.
Format variation: real-world data doesn't look like training examples. Dates, names, product codes, and reference numbers come in formats the agent wasn't tested against.
Cascading failures: in multi-step agents, a partial failure in step 2 can corrupt the context passed to step 4 in ways that produce plausible-looking but incorrect outputs.
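
One way to guard against missing preconditions and cascading failures is to make every step return an explicit result and stop the chain at the first failure, so partial state never reaches later steps. The sketch below is illustrative only: `StepResult`, `run_chain`, the example steps, and the in-memory `RECORDS` store are hypothetical names, not taken from any deployment described here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class StepResult:
    ok: bool
    value: Any = None
    error: Optional[str] = None


Step = Callable[[Any], StepResult]


def run_chain(steps: list[tuple[str, Step]], initial: Any) -> StepResult:
    """Run steps in order; stop at the first failure instead of
    passing partial state downstream."""
    current = initial
    for name, step in steps:
        result = step(current)
        if not result.ok:
            return StepResult(ok=False, error=f"{name}: {result.error}")
        current = result.value
    return StepResult(ok=True, value=current)


# Hypothetical in-memory stand-in for a real database.
RECORDS = {"ORD-1001": {"status": "open"}}


def find_record(order_id: str) -> StepResult:
    # Precondition check: do not assume the record exists.
    if order_id not in RECORDS:
        return StepResult(ok=False, error=f"no record for {order_id!r}")
    return StepResult(ok=True, value=(order_id, RECORDS[order_id]))


def close_record(found: tuple[str, dict]) -> StepResult:
    order_id, record = found
    record["status"] = "closed"
    return StepResult(ok=True, value=order_id)


CHAIN = [("find_record", find_record), ("close_record", close_record)]
```

Calling `run_chain(CHAIN, "ORD-9999")` returns a failed result naming the step that stopped the chain, which is the piece of information a human reviewer needs.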

The silent failure problem: The most dangerous production failure mode isn't a crash - it's an agent that quietly routes items to a fallback queue that nobody is monitoring. Always instrument your fallback paths as aggressively as your happy paths.
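
A minimal sketch of what instrumenting a fallback path could look like: the queue itself reports its depth and the age of its oldest item, and flags when either crosses a limit. The `FallbackQueue` class and both thresholds are assumptions for illustration. A real system would more likely export these numbers to its existing metrics and alerting stack.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FallbackQueue:
    """In-memory stand-in for a real queue, with health checks built in."""
    max_depth: int = 100              # assumed alert threshold
    max_age_seconds: float = 3600.0   # assumed alert threshold
    items: list = field(default_factory=list)

    def put(self, item: str, now: Optional[float] = None) -> None:
        enqueued_at = time.time() if now is None else now
        self.items.append((enqueued_at, item))

    def health(self, now: Optional[float] = None) -> dict:
        """Report depth and oldest-item age, plus any threshold breaches."""
        current = time.time() if now is None else now
        depth = len(self.items)
        oldest_age = max((current - ts for ts, _ in self.items), default=0.0)
        alerts = []
        if depth > self.max_depth:
            alerts.append(f"depth {depth} exceeds {self.max_depth}")
        if oldest_age > self.max_age_seconds:
            alerts.append(f"oldest item is {oldest_age:.0f}s old")
        return {"depth": depth, "oldest_age_seconds": oldest_age, "alerts": alerts}
```

Checking `health()` on a schedule and paging on any non-empty `alerts` list would surface a growing backlog long before it reached 1,400 items.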

Tool Calling: Where Model Differences Actually Show Up

In single-tool, single-step scenarios, most frontier models perform comparably. The divergence shows up when you chain multiple tool calls together, particularly when:

The correct tool isn't obvious from the input alone
Tool outputs need to be interpreted before the next tool call
The agent needs to decide whether to continue or surface a result to the user

The key takeaway: single-step accuracy is high across all frontier models, which is why demos look good. Accuracy over a five-step chain tells a different story - there's a meaningful gap between the top two models and the rest, and that gap compounds with each additional step in a complex workflow.
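
To see why the gap compounds, here is a rough back-of-the-envelope calculation. It assumes each step succeeds independently with the same probability, which is a simplification, and the accuracy figures are illustrative inputs rather than measured results for any model.

```python
def chain_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming
    independent steps with identical accuracy."""
    return per_step_accuracy ** steps


# Illustrative per-step accuracies, not benchmark data.
for accuracy in (0.99, 0.95, 0.90):
    rate = chain_success_rate(accuracy, steps=5)
    print(f"{accuracy:.2f} per step -> {rate:.3f} over 5 steps")
```

Under these assumptions, a model that is right 95% of the time per step completes a five-step chain correctly only about 77% of the time, while 99% per step still gives about 95%.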

Structured Outputs Are Not Optional in Production

This is the most consistently underestimated problem in production AI engineering, so I'll state it plainly: do not parse free-form LLM JSON output in production systems.

The failure modes are subtle and time-delayed. An LLM might output valid JSON 99.7% of the time in testing. In production, that 0.3% failure rate scales with volume: at, say, one million requests a week, it means roughly 3,000 parsing errors per week. Worse, the failures are often correlated - certain input patterns reliably produce malformed output, so they tend to cluster.

Three approaches work well in production. In order of our preference:

Instructor (Python): wraps the Anthropic/OpenAI clients with Pydantic validation and automatic retry on validation failure. The cleanest DX and the most mature library in this space.
Native structured outputs: Anthropic and OpenAI both support structured output modes in their APIs. Less flexible than Instructor but removes a dependency.
Outlines (for open-source models): constrained generation that masks any token which would violate the schema or grammar, so the model cannot emit structurally invalid output. Most reliable of the three, but requires hosting the model yourself.

Note on retry strategies

Instructor's default retry-on-validation-error behaviour adds latency when the model produces invalid output. For latency-sensitive applications, cap retries at 1 and route failures to human review rather than retrying indefinitely. In our production deployments, retry-induced latency spikes are more damaging to user experience than the occasional fallback to human review.
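
The capped-retry pattern can be sketched with the standard library alone. This is a stand-in for the behaviour described above, not Instructor's API: `call_model` is a hypothetical callable that takes a prompt and returns the raw model text, and the `StatusUpdate` schema and allowed statuses are invented for the example.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

ALLOWED_STATUSES = {"open", "pending", "closed"}  # assumed schema


@dataclass
class StatusUpdate:
    record_id: str
    status: str


def parse_status_update(raw: str) -> StatusUpdate:
    """Parse and validate raw model output; raise ValueError on any violation."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    record_id = data.get("record_id")
    status = data.get("status")
    if not isinstance(record_id, str) or not record_id:
        raise ValueError("record_id must be a non-empty string")
    if not isinstance(status, str) or status not in ALLOWED_STATUSES:
        raise ValueError(f"status must be one of {sorted(ALLOWED_STATUSES)}")
    return StatusUpdate(record_id=record_id, status=status)


def extract_with_capped_retry(
    call_model: Callable[[str], str],
    prompt: str,
    max_retries: int = 1,
) -> tuple[Optional[StatusUpdate], Optional[str]]:
    """Return (result, None) on success, or (None, last_error) so the
    caller can route the input to human review."""
    last_error: Optional[str] = None
    for _ in range(1 + max_retries):
        attempt_prompt = prompt
        if last_error is not None:
            # Tell the model what was wrong with its previous attempt.
            attempt_prompt = f"{prompt}\n\nPrevious output was rejected: {last_error}"
        try:
            return parse_status_update(call_model(attempt_prompt)), None
        except ValueError as exc:
            last_error = str(exc)
    return None, last_error
```

With `max_retries=1` the worst case is two model calls per input, which keeps tail latency bounded. Anything still failing goes to review with the validation error attached.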

The Hybrid Architecture: Deterministic Routing + LLM Reasoning

One of the most impactful changes we've made across multiple agent deployments is moving away from "LLM decides everything" toward "LLM decides only what needs deciding."

Most production workflows have two types of inputs: those with unambiguous intent that can be routed deterministically, and those with genuine ambiguity that benefit from LLM reasoning. Treating all inputs as the latter is slower, more expensive, and less reliable than separating them.

In one deployment, this hybrid architecture reduced LLM calls by 64% without affecting automation rate - because 64% of inputs had unambiguous, pattern-matchable intent. The remaining 36% benefited from LLM reasoning, but routing the full 100% through the LLM was adding cost and latency with no benefit.
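
A minimal sketch of the idea, assuming regex rules are enough to recognise the unambiguous cases. The patterns, route names, and the `llm_classify` callable are all hypothetical. In practice the rules would come from the top input patterns seen in production.

```python
import re
from typing import Callable

# Hypothetical deterministic rules: (compiled pattern, route name).
RULES = [
    (re.compile(r"\bcancel (my )?order\b", re.I), "cancel_order"),
    (re.compile(r"\b(reset|forgot) (my )?password\b", re.I), "password_reset"),
    (re.compile(r"\btracking (number|status)\b", re.I), "order_tracking"),
]


def route(text: str, llm_classify: Callable[[str], str]) -> tuple[str, str]:
    """Return (route_name, decided_by).

    Exactly one matching rule is treated as unambiguous intent. Zero or
    several matches count as ambiguity and are passed to the LLM.
    """
    matches = {name for pattern, name in RULES if pattern.search(text)}
    if len(matches) == 1:
        return matches.pop(), "rule"
    return llm_classify(text), "llm"
```

Returning who made the decision alongside the route makes it straightforward to measure what share of traffic actually needs the LLM.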

Human-in-the-Loop: Design It In, Not Out

The framing of "human-in-the-loop" as a fallback - something that happens when the agent fails - is counterproductive. It leads to under-instrumented fallback queues and operations teams who are surprised by the volume that ends up there.

The better framing: human review is a first-class part of the workflow for inputs that exceed an uncertainty threshold. The agent's job is to classify and prepare those inputs as efficiently as possible, not to avoid human involvement entirely.

In practice, this means:

Define confidence thresholds explicitly before deployment, not reactively after errors appear.
Surface the agent's reasoning in the human review interface - the reviewer should understand why the agent was uncertain, not just that it was.
Monitor the human review queue as carefully as automated outcomes. If the queue is growing, something upstream has changed.
Feed human review decisions back into the system - either as training data for fine-tuning or as new deterministic rules if patterns emerge.
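
The first two points can be sketched as a small gate that either lets a decision through or queues it for review together with the agent's reasoning. The types and the `gate` function are hypothetical, and the sketch assumes the agent produces a confidence score in [0, 1]. How that score is obtained and how well calibrated it is are left open.

```python
from dataclasses import dataclass


@dataclass
class AgentDecision:
    action: str
    confidence: float  # assumed to lie in [0, 1]
    reasoning: str


@dataclass
class ReviewItem:
    input_text: str
    proposed_action: str
    confidence: float
    reasoning: str  # shown to the reviewer so they can see why the agent was unsure


def gate(
    input_text: str,
    decision: AgentDecision,
    threshold: float,
    review_queue: list,
) -> str:
    """Automate confident decisions; queue the rest with full context."""
    if decision.confidence >= threshold:
        return "automated"
    review_queue.append(
        ReviewItem(
            input_text=input_text,
            proposed_action=decision.action,
            confidence=decision.confidence,
            reasoning=decision.reasoning,
        )
    )
    return "human_review"
```

Because the threshold is an explicit argument rather than a value buried in a prompt, it can be set before deployment and changed without touching the agent itself.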

"80% automation rate with a well-monitored human review queue is better than 95% automation rate with an unmonitored one."

What's Actually Working - Patterns Worth Adopting Now

To close this out, here are the specific patterns we've found most consistently valuable across production deployments, stripped of any caveats about what might work someday:

1. Confidence-gated routing with explicit thresholds

Set confidence thresholds before deployment, then adjust them based on real production data. A threshold such as 0.80 is only a starting point - the right number depends on the cost of a false positive in your specific domain.
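
One possible way to adjust a threshold from production data, assuming you have a sample of past decisions that humans have labelled as correct or incorrect. The `pick_threshold` helper and the idea of a maximum acceptable error rate are illustrative assumptions, not a prescribed method.

```python
from typing import Optional


def pick_threshold(
    samples: list[tuple[float, bool]],
    max_error_rate: float,
) -> Optional[float]:
    """Return the lowest confidence threshold at which the decisions that
    would be automated (confidence >= threshold) stay within the error budget.

    samples: (confidence, was_correct) pairs from human-reviewed decisions.
    Returns None if no threshold in the sample meets the budget.
    """
    for threshold in sorted({confidence for confidence, _ in samples}):
        accepted = [ok for confidence, ok in samples if confidence >= threshold]
        error_rate = accepted.count(False) / len(accepted)
        if error_rate <= max_error_rate:
            return threshold
    return None
```

A domain where a false positive is expensive would pass a small `max_error_rate` and accept a higher threshold, which means more items in the review queue.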

2. Pydantic models for every LLM output boundary

Define a Pydantic model for every structured output you expect from an LLM call. Use Instructor to enforce it. This sounds like overhead until the first time a free-form JSON parser fails silently in production.

3. Observable agents from day one

Every agent call should log: the input, the structured output, the confidence score, the routing decision, and the final outcome. Langfuse is our current tool of choice for this - it's lightweight and the trace viewer is genuinely useful for debugging. You cannot debug a production agent without traces.
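
As a tool-agnostic sketch, the five fields can be captured in one record and emitted as a JSON log line. This uses only the standard library and is not the Langfuse API. The `AgentTrace` shape, the logger name, and the extra `trace_id` and `timestamp` fields are assumptions.

```python
import json
import logging
import time
import uuid
from dataclasses import asdict, dataclass, field

logger = logging.getLogger("agent.trace")  # assumed logger name


@dataclass
class AgentTrace:
    input_text: str
    structured_output: dict
    confidence: float
    routing_decision: str
    outcome: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


def log_trace(trace: AgentTrace) -> str:
    """Serialise the trace as one JSON line, log it, and return the line."""
    line = json.dumps(asdict(trace), sort_keys=True)
    logger.info(line)
    return line
```

One JSON object per agent call keeps traces easy to filter by routing decision or outcome, whichever backend ends up storing them.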

4. Separate your retrieval quality from your generation quality

For RAG-based agents, the majority of quality problems are retrieval problems, not generation problems. Before changing the model or the prompt, verify that the right documents are actually being retrieved. For most enterprise document corpora, hybrid search (BM25 + semantic) with a cross-encoder reranker will usually improve results more than better embeddings alone.
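
A small sketch of both halves of that advice: measuring retrieval on its own with recall@k, and combining a BM25 ranking with a semantic ranking. The fusion method here is reciprocal rank fusion, chosen for the example because it needs only ranks. The article does not say which fusion method its deployments use, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one.

    Each document scores 1 / (k + rank) per list it appears in.
    k = 60 is a commonly used constant for this method.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the known-relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


# Hypothetical rankings from two retrievers for the same query.
bm25_ranking = ["a", "b", "c"]
semantic_ranking = ["c", "a", "d"]
fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
```

Running `recall_at_k` over a small hand-labelled set of queries, before and after a retrieval change, separates retrieval regressions from generation problems.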

None of this is glamorous. It's not the part of AI engineering that gets covered in conference keynotes or generates LinkedIn engagement. But it's the difference between an agent that runs reliably in production and one that's permanently in "almost ready to deploy" status.

If you're working through any of these challenges on a specific system, the engineering team at Sequere is available for a free technical call - no pitch, just an honest look at what's likely to move the needle for your specific situation.