Why Agentic AI Is Harder Than It Looks - and What's Actually Making It Work in 2025

Key Takeaways - 4 things worth knowing before your next model decision

For structured classification, named entity extraction, and intent routing - Llama 3.1 70B and Mistral Medium match GPT-4o accuracy within 3–5% on well-defined tasks, at a fraction of the cost.

The frontier model advantage is real but narrow - it shows up in open-ended reasoning, ambiguous multi-step generation, and tasks with no clean ground truth. Not everywhere.

LLM routing - sending easy tasks to small cheap models and hard tasks to frontier models - reduced inference costs by 61% on one client deployment with no measurable drop in output quality.

The benchmark you run internally on your data matters more than public leaderboard rankings. Task-specific evals on representative samples are the only reliable guide to model selection.

There's a default assumption that has cost several of our clients a significant amount of money: that the best model for any given task is the most capable frontier model available. It's understandable - when GPT-4o or Claude Opus gets the job done reliably, reaching for a cheaper alternative feels like a risk not worth taking.

The reality we've observed across production deployments is different. Most enterprise AI workloads are not "use the best possible model" problems. They're "use the right model for this specific task" problems - and getting that distinction right is worth real money. At 40 million tokens per day across a typical enterprise deployment, the difference between GPT-4o and a well-configured Llama 3.1 70B is not a rounding error.

This article documents what we found when we benchmarked 14 common enterprise tasks across six models - not on public leaderboards, but on real client data with ground truth labels.

The Cost Gap Is Larger Than Most Teams Realise

Let's start with the numbers, because the magnitude of the difference is easy to underestimate when you're thinking in cents per thousand tokens rather than dollars per production month.

At the time of writing, GPT-4o (input) costs approximately $5 per million tokens via the OpenAI API. Llama 3.1 70B self-hosted on a reasonable cloud GPU configuration costs roughly $0.20–0.35 per million tokens including compute. Llama 3.1 8B drops that to under $0.10. That's a 15–50× cost difference depending on the model and how you count infrastructure, so the 20× headline number can understate the gap.

Model	Provider / Hosting	Approx. Input Cost (per 1M tokens)	Relative to GPT-4o
GPT-4o	OpenAI API	$5.00	1× (baseline)
Claude 3.5 Sonnet	Anthropic API	$3.00	0.6×
Claude 3.5 Haiku	Anthropic API	$0.80	0.16×
Llama 3.1 70B	Self-hosted / Together AI	~$0.25–0.40	~0.06×
Llama 3.1 8B	Self-hosted / Together AI	~$0.08–0.12	~0.02×
Mistral 7B Instruct	Self-hosted / Mistral API	~$0.10–0.18	~0.03×
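
To turn the per-token prices above into a budget figure, a minimal sketch follows. It counts input tokens only, at the table's approximate prices. The function names, the midpoint price used for Llama 3.1 70B, and the daily volume in the usage lines are illustrative assumptions rather than figures from a deployment.

```python
# Approximate input prices in USD per 1M tokens, taken from the table above.
# The Llama 3.1 70B value is the midpoint of the quoted ~$0.25-0.40 range.
PRICE_PER_M_INPUT = {
    "gpt-4o": 5.00,
    "claude-3.5-sonnet": 3.00,
    "claude-3.5-haiku": 0.80,
    "llama-3.1-70b": 0.325,
}


def annual_input_cost(model: str, tokens_per_day: float, days: int = 365) -> float:
    """Annual input-token spend in USD for a given daily token volume."""
    return PRICE_PER_M_INPUT[model] * tokens_per_day / 1_000_000 * days


def relative_to_baseline(model: str, baseline: str = "gpt-4o") -> float:
    """Price ratio against the baseline model (the table's last column)."""
    return PRICE_PER_M_INPUT[model] / PRICE_PER_M_INPUT[baseline]


if __name__ == "__main__":
    volume = 1_000_000_000  # hypothetical daily input tokens; use your own
    for name in PRICE_PER_M_INPUT:
        cost = annual_input_cost(name, volume)
        print(f"{name:20s} ${cost:>12,.0f}/yr  {relative_to_baseline(name):.2f}x")
```

Output tokens, which API providers price separately, and self-hosting overheads such as idle GPU time are left out, so treat the result as a first approximation.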

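At 40 million tokens per day - realistic for a mid-size enterprise with a deployed AI product - input tokens alone at the prices above cost roughly $73,000 a year on GPT-4o against roughly $4,000–6,000 on Llama 3.1 70B. The gap scales linearly with volume: an annual difference in the range of $2–3M corresponds to roughly 1.2–1.8 billion input tokens per day at these prices. That's not a technical footnote. It's a business case for a serious engineering investment in model selection and routing.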

"At production scale, model selection is a finance decision as much as an engineering one. Most teams treat it as the latter and ignore the former until the AWS bill arrives."

— From a post-deployment review with a Series C client

What Our Benchmark Actually Found

We tested six models across 14 task categories using 500 labelled examples per task drawn from real client workloads. The ground truth was established by human reviewers, not by another model. The tasks were designed to represent the realistic distribution of enterprise AI work, not synthetic benchmarks.

The 14 tasks across three broad groups:

Classification tasks: intent classification, sentiment analysis, document categorisation, urgency triage, topic tagging
Extraction tasks: named entity extraction, structured data from unstructured text, contract clause identification, date/number normalisation
Generation tasks: customer-facing response drafting, internal summary generation, code review comments, multi-step reasoning, open-ended Q&A from documents

Task Category	GPT-4o	Claude Sonnet	Claude Haiku	Llama 3.1 70B	Llama 3.1 8B
Intent classification	96.2%	95.8%	94.1%	93.7%	88.4%
Sentiment analysis	97.1%	96.9%	96.2%	95.4%	91.8%
Named entity extraction	94.8%	94.3%	91.2%	90.6%	83.1%
Structured data extraction	93.4%	92.8%	89.3%	88.9%	80.2%
Document categorisation	97.6%	97.3%	96.8%	96.1%	93.2%
Customer response drafting	91.2%	90.8%	85.4%	84.1%	74.3%
Multi-step reasoning	88.7%	87.2%	76.1%	74.8%	61.2%
Open-ended Q&A from docs	90.3%	89.6%	81.2%	80.4%	68.7%

The table above shows eight of the 14 tasks. The pattern is consistent across all 14: classification and well-scoped extraction tasks show minimal performance variation between frontier models and Llama 3.1 70B. The gap opens significantly for open-ended generation, nuanced reasoning, and tasks with ambiguous or poorly defined success criteria.

On benchmark methodology

Accuracy figures above are F1 scores for classification/extraction tasks and human preference ratings (scale of 1–5, reported as % scoring 4 or 5) for generation tasks. All evaluations used separate held-out test sets not seen during any prompt engineering. Task instructions were standardised across models, and prompt formatting was then optimised independently for each model to avoid penalising models with suboptimal default formatting.
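
For readers who want to reproduce these two metrics on their own data, here is a minimal sketch. It assumes macro-averaged F1, which the methodology note does not specify, and the function names are illustrative.

```python
from collections import Counter
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Macro-averaged F1: per-class F1, then an unweighted mean over classes."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def preference_rate(ratings: Sequence[int], threshold: int = 4) -> float:
    """Share of 1-5 human ratings at or above the threshold (4 or 5 here)."""
    return sum(r >= threshold for r in ratings) / len(ratings)
```

If your label distribution is heavily skewed, macro and micro averaging can diverge noticeably, so state which one you report.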

Where Smaller Models Actually Win

The headline finding from our benchmark: for six of the fourteen tasks, Llama 3.1 70B scores within 3 percentage points of GPT-4o on F1, at approximately 6% of the cost. That's not a marginal efficiency gain - for the right workload, it's a structural cost reduction that doesn't require accepting lower quality.

Classification is the clearest case

Intent classification, sentiment analysis, and document categorisation all show the same pattern: once you have a well-defined label set and good examples in the prompt, the performance difference between models narrows dramatically. The task is fundamentally well-constrained. There's a right answer, the model doesn't need to reason beyond the input, and a 70B parameter model has more than enough capacity to handle it reliably.

We ran a production deployment replacing GPT-4o with Llama 3.1 70B on a 12-class intent classification task - 95,000 requests per day. F1 dropped from 94.1% to 93.7%. Monthly inference cost dropped from approximately $18,400 to $1,100. The client chose to reinvest the savings in adding a human review loop for the 6.3% of low-confidence classifications, which improved effective accuracy above the GPT-4o baseline.
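
A review loop of this kind can be sketched as a simple confidence gate. The `Prediction` type, the 0.8 threshold, and the assumption that the classifier exposes a confidence score are all illustrative; the client's actual implementation is not described here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prediction:
    request_id: str
    label: str
    confidence: float  # assumed to lie in [0, 1], e.g. a top-class probability


def split_for_review(
    predictions: List[Prediction], threshold: float = 0.8
) -> Tuple[List[Prediction], List[Prediction]]:
    """Return (auto_accepted, needs_human_review) based on confidence."""
    accepted = [p for p in predictions if p.confidence >= threshold]
    review = [p for p in predictions if p.confidence < threshold]
    return accepted, review
```

In practice the threshold is tuned so that the share of requests sent to review matches what the review team can absorb.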

Extraction with well-defined schemas

Named entity extraction and structured data extraction show a slightly larger gap, but the 70B results are still competitive for well-defined schemas. The performance decline is most pronounced when the extraction schema is complex, the source text is ambiguous, or the entities are highly domain-specific. For cleaner extraction problems - dates, amounts, standard entity types - smaller models are adequate.

Don't benchmark on your easiest examples. The performance gap between models tends to be small on clean, typical input and large on edge cases. The distribution that matters is your real production distribution, including the 10% of inputs that are ambiguous, truncated, or formatted unusually. Test on those specifically.

Where Frontier Models Are Worth the Premium

The performance advantage of GPT-4o and Claude Sonnet is real. It's just not uniformly distributed across task types. Understanding where the premium is justified makes the decision much cleaner.

Multi-step reasoning

The 14-point accuracy gap between GPT-4o and Llama 70B on multi-step reasoning tasks is the largest in our benchmark and the most consistent across different task formulations. When the model needs to hold multiple intermediate conclusions in working memory, identify which information is relevant to each sub-question, and integrate them into a coherent answer - the frontier model advantage is real and meaningful.

Open-ended generation where quality is subjective

Customer-facing response drafting shows a consistent preference for frontier model outputs in human evaluation. The margin is smaller than multi-step reasoning, but the direction is reliable. Users notice the difference between a well-crafted response and a merely correct one - particularly for anything customer-facing, where tone and specificity affect satisfaction scores.

Ambiguous tasks with no clear ground truth

Any task where the definition of "correct" is itself contested benefits from frontier model capabilities. Legal and compliance interpretation, editorial judgment, nuanced recommendations - these are tasks where the model's ability to represent uncertainty and reason about competing interpretations matters more than raw accuracy on a labelled test set.

LLM Routing: Getting the Best of Both

Once you accept that different tasks warrant different models, the next question is how to operationalise that in a production system. The approach we've found most effective is dynamic LLM routing: a lightweight classifier that scores each incoming request and directs it to the appropriate model tier.

The routing logic doesn't need to be complex. In the simplest version, it's two signals:

Task type: a deterministic mapping from task category to model tier. Classification and extraction tasks → small model. Multi-step reasoning and generation tasks → frontier model.
Input complexity: a simple feature - input length, presence of domain-specific vocabulary, or a fast uncertainty signal from the small model itself - that overrides the task-type mapping for edge cases.

from enum import Enum

class ModelTier(Enum):
    SMALL    = "llama-3.1-70b"
    FRONTIER = "gpt-4o"

# Task-type → tier mapping (deterministic)
TASK_TIER_MAP = {
    "intent_classification":   ModelTier.SMALL,
    "sentiment_analysis":      ModelTier.SMALL,
    "document_categorisation": ModelTier.SMALL,
    "entity_extraction":       ModelTier.SMALL,
    "customer_response":       ModelTier.FRONTIER,
    "multi_step_reasoning":    ModelTier.FRONTIER,
    "open_ended_qa":           ModelTier.FRONTIER,
}

# Inputs longer than this are escalated (long inputs → more ambiguity)
MAX_SMALL_TIER_WORDS = 1200

def is_low_confidence(input_text: str) -> bool:
    """Placeholder for the fast uncertainty probe.

    The probe is deployment-specific and not defined in this article;
    replace this stub with your own signal.
    """
    return False

def route_request(task_type: str, input_text: str) -> ModelTier:
    # Unknown task types default to the frontier tier
    base_tier = TASK_TIER_MAP.get(task_type, ModelTier.FRONTIER)

    # Escalate to frontier if input is unusually complex
    if base_tier == ModelTier.SMALL:
        # Whitespace word count is a cheap proxy for the true token count
        word_count = len(input_text.split())
        if word_count > MAX_SMALL_TIER_WORDS:
            return ModelTier.FRONTIER
        if is_low_confidence(input_text):  # Fast uncertainty probe
            return ModelTier.FRONTIER

    return base_tier

In practice, the routing signal that matters most varies by use case. For one financial services client processing loan applications, input length was the strongest predictor of cases that needed frontier model accuracy - short inputs were almost always routine, long inputs contained unusual structures that the small model handled poorly. For a customer support deployment, the presence of certain vocabulary (legal terminology, escalation language, product return requests) was a better signal than length.
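
A vocabulary-based signal like the one in the support deployment might look like the sketch below. The term list and the function name are hypothetical; a real list would be built from requests the small model has mishandled.

```python
import re
from typing import Iterable

# Hypothetical escalation terms; derive the real list from your own misroutes.
ESCALATION_TERMS = ("legal", "lawsuit", "escalate", "refund", "return")


def has_escalation_vocabulary(
    input_text: str, terms: Iterable[str] = ESCALATION_TERMS
) -> bool:
    """True if any escalation term appears as a whole word (case-insensitive)."""
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.search(pattern, input_text, flags=re.IGNORECASE) is not None
```

Whole-word matching keeps false positives down but misses inflected forms such as "returning", so stemming or a slightly longer term list may be needed.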

Results from a live LLM routing deployment

A B2B SaaS client processing 12 million requests per month moved from GPT-4o for all tasks to a routing architecture directing 68% of requests to Llama 70B. Monthly inference cost reduced from $74,000 to $28,800 - a 61% reduction. Human evaluation of 1,000 randomly sampled outputs showed no statistically significant quality difference. The 32% of requests routed to GPT-4o were the multi-step reasoning and open-ended generation tasks where the frontier advantage is real.
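
The arithmetic behind that reduction, plus a rough estimator for other traffic splits, is sketched below. The estimator assumes cost scales linearly with request share and that a small-tier request costs about 6% of a frontier request (the ratio from the cost comparison earlier). Both are simplifying assumptions, since request sizes differ by task.

```python
def percent_reduction(before: float, after: float) -> float:
    """Cost reduction as a percentage of the original spend."""
    return (before - after) / before * 100


def estimated_routed_cost(
    all_frontier_cost: float, small_share: float, small_cost_ratio: float = 0.06
) -> float:
    """Rough monthly cost after routing `small_share` of requests to the small tier."""
    frontier_part = all_frontier_cost * (1 - small_share)
    small_part = all_frontier_cost * small_share * small_cost_ratio
    return frontier_part + small_part


if __name__ == "__main__":
    print(f"{percent_reduction(74_000, 28_800):.1f}% reduction")
    print(f"${estimated_routed_cost(74_000, 0.68):,.0f} estimated")
```

The linear estimate comes out somewhat below the observed $28,800, which is why it should only be used to size the opportunity before measuring real traffic.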

When Fine-Tuning a Small Model Beats Prompting a Large One

The benchmark above tests base models with well-engineered prompts. Fine-tuning changes the comparison materially, and it's underused in most enterprise deployments.

A fine-tuned Llama 3.1 8B on a specific classification task will typically outperform a prompted Llama 70B and often match a prompted GPT-4o - at a fraction of the inference cost. The cases where this makes sense share a few characteristics:

High volume: fine-tuning has upfront cost in data and training compute. It pays off at scale. Below roughly 500,000 requests per month, the economics often don't justify it.
Stable task definition: fine-tuning encodes the task into the model weights. If your classification schema changes frequently, you're re-training frequently. That changes the cost model.
Clean training data available: 1,000–5,000 high-quality labelled examples is the typical minimum for fine-tuning to meaningfully outperform prompt engineering. If you don't have reliable labels, start with prompt engineering and accumulate data from human review.
Low latency requirements: a fine-tuned 8B model will be 3–5× faster at inference than a 70B model. For real-time applications with strict latency budgets, this matters independently of cost.
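
The volume argument can be made concrete with a break-even calculation. This is a minimal sketch: the function name is illustrative, and every cost in the usage lines is a hypothetical placeholder to be replaced with your own numbers.

```python
import math


def breakeven_months(
    upfront_cost: float,
    current_cost_per_request: float,
    finetuned_cost_per_request: float,
    requests_per_month: float,
) -> float:
    """Months until fine-tuning's upfront cost is repaid by inference savings."""
    monthly_saving = (
        current_cost_per_request - finetuned_cost_per_request
    ) * requests_per_month
    if monthly_saving <= 0:
        return math.inf  # fine-tuning never pays back
    return upfront_cost / monthly_saving


if __name__ == "__main__":
    # Hypothetical inputs: $10,000 for data and training, $0.002 vs $0.0005
    # per request, at two different monthly volumes.
    for volume in (500_000, 5_000_000):
        months = breakeven_months(10_000, 0.002, 0.0005, volume)
        print(f"{volume:>9,} requests/month: {months:.1f} months to break even")
```

If the task definition changes often, add the expected re-training cost to `upfront_cost` for each change rather than counting it once.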

Why You Need Your Own Evals

The most important practical implication of this benchmark is the least glamorous one: public leaderboard rankings are a poor guide to model selection for specific enterprise tasks. The models that rank highest on MMLU or HumanEval are not reliably the models that perform best on your specific classification taxonomy, your specific document types, or your specific customer query distribution.

Running your own task-specific evals is not optional. The good news is it doesn't have to be expensive. A thoughtfully constructed evaluation set of 200–500 examples with human-verified labels will tell you more about relative model performance on your workload than any public benchmark.

The minimum viable eval framework for model selection:

Sample representatively: include examples from across the full input distribution, deliberately oversampling edge cases and low-frequency input types that have historically caused issues.
Establish ground truth independently: have human reviewers label examples before running any model, not after. Post-hoc labelling is biased toward whichever model output the labeller sees first.
Use the same prompts across models: or spend equivalent effort optimising prompts for each model separately. Either approach is defensible. Mixing the two produces uninterpretable results.
Report variance, not just means: a model with 92% average accuracy and 2% standard deviation is preferable to one with 93% average and 8% standard deviation for most production systems.
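
One way to get a spread estimate from a single eval set is bootstrap resampling, sketched below with the standard library only. Choosing the bootstrap is an assumption here; the spread could equally be measured across repeated runs or across input slices.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def bootstrap_accuracy(
    correct: Sequence[bool], n_resamples: int = 1000, seed: int = 0
) -> Tuple[float, float]:
    """Mean and standard deviation of accuracy over bootstrap resamples.

    `correct` holds one boolean per eval example: True if the model's
    output matched the human-verified label.
    """
    rng = random.Random(seed)  # seeded so reports are reproducible
    n = len(correct)
    accuracies: List[float] = []
    for _ in range(n_resamples):
        sample = rng.choices(correct, k=n)  # draw n examples with replacement
        accuracies.append(sum(sample) / n)
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

Run it once per model on the same eval set and compare both numbers; with only 200–500 examples, a one-point difference in means is often within the spread.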

The broader point is this: the industry narrative around frontier models is shaped heavily by capability announcements, benchmark results, and use cases where the difference is most visible. The everyday reality of enterprise AI workloads - classification, extraction, triage, summarisation - is a different distribution. For that distribution, smaller and cheaper models deserve a fair evaluation before assuming the frontier premium is warranted.

If you're building a production AI system and haven't run task-specific evals across model tiers, you're probably paying more than you need to for part of your workload. The Sequere engineering team runs model selection engagements as part of most production AI builds - if you'd like to talk through the specifics of your workload, reach out.

The Cost of Context: Why Smaller Models Win More Often Than You'd Think