Customer Support Automation: 78% Ticket Deflection

The situation before we got involved

The client had built a strong brand selling premium homeware across three markets - roughly 3,000 SKUs covering furniture, lighting, kitchen, and bedding. Their product experience was excellent: quality packaging, reliable delivery, a returns process that actually worked. But their support operation had not kept pace with the business.

At 40,000 incoming tickets per month, the team was running approximately 60 full-time agents across two shifts. The overwhelming majority of tickets - delivery status checks, return initiations, product sizing questions, compatibility queries - were entirely routine. The same questions arrived hundreds of times each day, pulling agents away from the genuine problems that actually required human judgment.

Direct support headcount was costing $1.4M annually. The less visible problem was morale: experienced agents were spending 70% of their time on queries that a well-written template could answer, which drove attrition in a team that had become genuinely good at the more complex tickets.

"We don't want to remove the human element - that's part of our brand. We want our agents to actually be there for customers when it matters. Right now they're mostly copying tracking links into reply boxes."

Why building this properly was harder than expected

E-commerce support automation has been tried and abandoned by many businesses because generic chatbots perform poorly and actively damage brand perception when they fail. Our brief explicitly ruled out anything that felt robotic or deflecting. Three constraints shaped everything that followed.

Product catalogue depth

With 3,000 SKUs across categories with different sizing conventions, material properties, and compatibility requirements, the system needed actual product knowledge - not keyword matching. A customer asking whether a specific extendable dining table would seat eight in a particular room needed a real answer, not a redirect to the product page they'd already read.

Brand voice consistency

The client's written tone is warm, specific, and unhurried. Previous chatbot experiments had failed not because they were technically wrong, but because they sounded nothing like the brand. Fine-tuning on 200k historical agent responses solved this - the model learned not just what to say, but how this company talks to its customers.

Escalation that doesn't feel like a wall

The failure mode we were most determined to avoid: a system that confidently produced a wrong answer and left the customer with nowhere to go. When the system wasn't confident, it needed to say something honest and human, collect enough context, and hand off to an agent who was already briefed - not start the conversation from scratch.

How the system works

The architecture runs in five stages. We rebuilt stages two and three twice before reaching the production version.

System Architecture - Production Simplified

Ticket Ingestion & Intent Classification

Tickets arrive via Zendesk integration. A fine-tuned DistilBERT classifier routes each ticket into one of 14 intent categories - order status, return request, sizing query, compatibility, complaint - before the main model sees it. This classification step is cheap and significantly improves downstream accuracy by narrowing the generation context.

Zendesk API DistilBERT classifier 14 intent categories

Context Retrieval - Orders + Product Catalogue

For order-related tickets, the system pulls live order status and returns history from the client's OMS via a read-only API. For product queries, it retrieves the relevant SKU data from a structured catalogue index built on PostgreSQL with pgvector - dimensions, materials, care instructions, compatibility notes. Order data is cached in Redis with a 15-minute TTL to keep latency flat under traffic spikes.

OMS API (read-only) PostgreSQL + pgvector Redis (15min TTL)

Fine-Tuned GPT-4o - Streaming Generation

The fine-tuned model receives the classified intent, the retrieved context, and the customer's original message. It was fine-tuned on 200,000 historical agent responses selected for a CSAT score above 4.2 out of 5 - so it generates in brand voice from token one. Responses stream directly to the customer interface via SSE with median TTFB of 280ms. Each response includes an internal confidence score not visible to the customer.

GPT-4o fine-tuned Streaming SSE Confidence scoring

Confidence Gate & Automated Resolution

If the confidence score clears the threshold - tuned separately for each intent category - the response is sent and the ticket is closed automatically with a follow-up CSAT prompt. Order status tickets use a stricter threshold (92%) than general product queries (85%), because incorrect order information causes direct customer harm in a way that an imprecise product description does not. Threshold calibration took two weeks of A/B testing against agent baselines.

Per-intent thresholds Auto-close + CSAT prompt A/B calibration

Human Escalation with Full Context Handoff

The roughly 22% of tickets that fall below the confidence threshold are escalated to the human queue. The agent receives the AI's draft response, the retrieved order and product context, and the specific reason for escalation - they don't start from scratch. Average agent handle time on escalated tickets dropped from 8 minutes to 3 minutes because the retrieval and drafting work is already done.

Zendesk queue routing Pre-populated context Draft + edit workflow

What went wrong, and what we learned

Two significant rebuilds happened between the initial prototype and production. Documenting them is more useful than pretending the path was straight.

False start #1 - Vanilla GPT-4o with a system prompt

The first version used stock GPT-4o with a detailed system prompt describing the brand, tone, and catalogue. Technically accurate, but customers noticed something was off. The language was too thorough, occasionally too formal, and would sometimes invent plausible-sounding product care instructions for SKUs it hadn't been given specific data on. Fine-tuning on 200k real agent responses fixed the tone entirely. Coupling the fine-tune with structured catalogue retrieval eliminated hallucinations by giving the model actual data rather than asking it to synthesise from training.

False start #2 - A single confidence threshold across all intent types

Early testing used one threshold for everything. Order status tickets were escalating so frequently that the cost savings evaporated - the whole point was volume deflection. Meanwhile, product compatibility tickets were auto-resolving at a confidence level that produced occasional wrong answers. Per-intent thresholds, calibrated over two weeks of live testing, brought both error rates into acceptable range and unlocked the economics that made the project viable.

The TTFB constraint

The client's UX team had a hard requirement: first visible characters within 400ms. Early sequential API calls - classify, then retrieve, then generate - were producing 600–800ms TTFB. Parallelising the retrieval steps (order data and catalogue lookup run simultaneously while classification completes) and switching to streaming SSE delivery brought median TTFB to 280ms. CloudFront edge caching of catalogue data handled tail latency during peak periods.

The results, nine months in

The system went live in Q3 2024. Numbers below cover the nine months ending Q2 2025.

Metric

Before (baseline)

After (9 months live)

Ticket deflection rate

0% (fully manual)

78%

Average first response time

4.2 hours

<1 min (automated) / 28 min (escalated)

Customer satisfaction (CSAT)

88% (human agents)

95% (AI-handled tickets)

Support headcount

60 agents

18 agents (escalations only)

Annual support cost

$1.4M

$0.38M (inc. infrastructure)

Agent handle time (escalated tickets)

8 minutes average

3 minutes average

Time to first byte (TTFB)

N/A

<400ms (median 280ms)

$1.02M Net cost saving in year one

↑ vs $1.4M baseline cost

78% Tickets closed with zero agent involvement

↑ from 0%

95% Customer satisfaction - above human baseline

↑ from 88%

What happened to the 60-person support team

This comes up every time we present this project. It deserves a direct answer.

The client did not make 42 agents redundant. They had been running at capacity and turning down inbound volume during product launches and sale events because the team couldn't process it fast enough. With the volume constraint removed, the commercial team invested more aggressively in customer acquisition. Ticket volume grew 35% in the nine months after launch. The 18-agent team handling escalations is dealing with more tickets than the old team - but the tickets are genuinely complex: complaints, high-value order issues, edge cases in returns policy that the AI correctly identified as requiring judgment.

Attrition in the remaining team dropped significantly. The agents who stayed are spending their time on the work that needs them.

Three things we'd do differently

Start fine-tune data curation in week one. We spent three weeks debating whether fine-tuning was necessary or whether prompt engineering alone would suffice. It wasn't a close call in hindsight - the brand voice problem was real, and only the fine-tune solved it. Those three weeks were wasted. The data filtering, anonymisation, and format work should have started immediately.

Build the agent escalation view before launch, not after. We delivered escalation as a thin Zendesk integration. Agents found the pre-populated context cluttered and hard to scan. We rebuilt the escalation card in week eight - a clean view showing order history, the AI's draft, and the confidence reason in one viewport. It should have been designed properly from the start. It directly affected how quickly agents trusted the handoff.

Build a formal 30-day calibration phase into the contract. The first month of live operation is calibration, not performance. Confidence thresholds need tuning against real production data. Edge cases that didn't appear in testing start surfacing. The client's team needs time to trust the handoff flow. We now write an explicit calibration phase with defined metrics targets into every project of this type - the go-live date is not the performance measurement date.

Customer support automation: 78% ticket deflection rate

The situation before we got involved

Why building this properly was harder than expected

Product catalogue depth

Brand voice consistency

Escalation that doesn't feel like a wall

How the system works

What went wrong, and what we learned

False start #1 - Vanilla GPT-4o with a system prompt

False start #2 - A single confidence threshold across all intent types

The TTFB constraint

The results, nine months in

What happened to the 60-person support team

Three things we'd do differently

Have a document-heavy process
that needs real intelligence applied to it?

Customer support automation: 78% ticket deflection rate

The situation before we got involved

Why building this properly was harder than expected

Product catalogue depth

Brand voice consistency

Escalation that doesn't feel like a wall

How the system works

What went wrong, and what we learned

False start #1 - Vanilla GPT-4o with a system prompt

False start #2 - A single confidence threshold across all intent types

The TTFB constraint

The results, nine months in

What happened to the 60-person support team

Three things we'd do differently

Have a document-heavy processthat needs real intelligence applied to it?

Have a document-heavy process
that needs real intelligence applied to it?