The situation before we got involved
The client had built a strong brand selling premium homeware across three markets - roughly 3,000 SKUs covering furniture, lighting, kitchen, and bedding. Their product experience was excellent: quality packaging, reliable delivery, a returns process that actually worked. But their support operation had not kept pace with the business.
At 40,000 incoming tickets per month, the team was running approximately 60 full-time agents across two shifts. The overwhelming majority of tickets - delivery status checks, return initiations, product sizing questions, compatibility queries - were entirely routine. The same questions arrived hundreds of times each day, pulling agents away from the problems that required human judgment.
Direct support headcount was costing $1.4M annually. The less visible problem was morale: experienced agents were spending 70% of their time on queries that a well-written template could answer, which drove attrition in a team that had become genuinely good at the more complex tickets.
"We don't want to remove the human element - that's part of our brand. We want our agents to actually be there for customers when it matters. Right now they're mostly copying tracking links into reply boxes."
Why building this properly was harder than expected
E-commerce support automation has been tried and abandoned by many businesses because generic chatbots perform poorly and actively damage brand perception when they fail. Our brief explicitly ruled out anything that felt robotic or deflecting. Three constraints shaped everything that followed.
Product catalogue depth
With 3,000 SKUs across categories with different sizing conventions, material properties, and compatibility requirements, the system needed actual product knowledge - not keyword matching. A customer asking whether a specific extendable dining table would seat eight in a particular room needed a real answer, not a redirect to the product page they'd already read.
Brand voice consistency
The client's written tone is warm, specific, and unhurried. Previous chatbot experiments had failed not because they were technically wrong, but because they sounded nothing like the brand. Fine-tuning on 200k historical agent responses solved this - the model learned not just what to say, but how this company talks to its customers.
Escalation that doesn't feel like a wall
The failure mode we were most determined to avoid: a system that confidently produced a wrong answer and left the customer with nowhere to go. When the system wasn't confident, it needed to say something honest and human, collect enough context, and hand off to an agent who was already briefed - not start the conversation from scratch.
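One way to picture a "briefed" handoff is as a small record that travels with the ticket. The sketch below is illustrative only: the `Handoff` class, its field names, and the sample values are assumptions, not the client's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """Hypothetical escalation record passed to a human agent."""
    customer_message: str
    intent: str
    confidence: float
    reason: str                      # why the system chose not to answer
    collected_context: dict = field(default_factory=dict)
    ai_draft: str = ""               # unsent draft, if one was produced

    def briefing(self) -> str:
        """Render a short plain-text summary the agent can scan quickly."""
        lines = [
            f"Intent: {self.intent} (confidence {self.confidence:.2f})",
            f"Why escalated: {self.reason}",
            f"Customer wrote: {self.customer_message}",
        ]
        for key, value in self.collected_context.items():
            lines.append(f"{key}: {value}")
        if self.ai_draft:
            lines.append(f"AI draft (unsent): {self.ai_draft}")
        return "\n".join(lines)


handoff = Handoff(
    customer_message="Will this table fit eight people in a 3m room?",
    intent="product_compatibility",
    confidence=0.61,
    reason="Room dimensions outside the data on file",
    collected_context={"sku": "TBL-EXT-180", "room_length_m": 3.0},
)
print(handoff.briefing())
```

The point of the design is that the agent starts from the summary rather than from a blank conversation.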
How the system works
The architecture runs in five stages. We rebuilt stages two and three twice before reaching the production version.
What went wrong, and what we learned
Two significant rebuilds happened between the initial prototype and production. Documenting them is more useful than pretending the path was straight.
False start #1 - Vanilla GPT-4o with a system prompt
The first version used stock GPT-4o with a detailed system prompt describing the brand, tone, and catalogue. Answers were usually accurate, but customers noticed something was off. The language was overly thorough and occasionally too formal, and the model would sometimes invent plausible-sounding product care instructions for SKUs it hadn't been given specific data on. Fine-tuning on 200k real agent responses fixed the tone entirely. Coupling the fine-tune with structured catalogue retrieval eliminated hallucinations by giving the model actual data rather than asking it to synthesise answers from its training data.
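A minimal sketch of the grounding idea, assuming a simple keyed catalogue: the prompt is built only from fields that exist for the SKU, and a missing field returns `None` so the caller can escalate instead of letting the model guess. The `CATALOGUE` contents, SKU codes, and the `build_grounded_prompt` helper are hypothetical.

```python
from typing import Optional

# Hypothetical in-memory catalogue; a production system would query a product store.
CATALOGUE = {
    "TBL-EXT-180": {
        "name": "Extendable oak dining table",
        "care": "Wipe with a damp cloth; avoid abrasive cleaners.",
    },
    "LMP-ARC-02": {"name": "Arc floor lamp"},  # no care data on file
}


def build_grounded_prompt(sku: str, field_name: str, question: str) -> Optional[str]:
    """Return a prompt containing real catalogue data, or None if there is none."""
    record = CATALOGUE.get(sku)
    if record is None or field_name not in record:
        return None  # nothing to ground on: escalate rather than generate
    return (
        "Answer using only the product data below. "
        "If the data does not cover the question, say so.\n"
        f"Product: {record['name']} ({sku})\n"
        f"{field_name}: {record[field_name]}\n"
        f"Customer question: {question}"
    )


print(build_grounded_prompt("TBL-EXT-180", "care", "How do I clean the table?"))
print(build_grounded_prompt("LMP-ARC-02", "care", "How do I clean the lamp?"))
```

The second call prints `None`, which is the case where the stock model had previously been inventing care instructions.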
False start #2 - A single confidence threshold across all intent types
Early testing used one threshold for everything. Order status tickets were escalating so frequently that the cost savings evaporated - the whole point was volume deflection. Meanwhile, product compatibility tickets were auto-resolving at a confidence level that produced occasional wrong answers. Per-intent thresholds, calibrated over two weeks of live testing, brought both error rates into an acceptable range and unlocked the economics that made the project viable.
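Per-intent thresholds can be sketched as a lookup consulted before auto-resolving. The intent names and threshold values below are invented for illustration; the calibrated production numbers are not given here.

```python
from dataclasses import dataclass

# Hypothetical values: low-risk intents tolerate a lower bar,
# high-risk intents require more confidence before auto-resolving.
THRESHOLDS = {
    "order_status": 0.70,
    "return_initiation": 0.80,
    "product_compatibility": 0.92,
}
DEFAULT_THRESHOLD = 0.95  # unrecognised intents escalate unless very confident


@dataclass
class Decision:
    intent: str
    confidence: float
    auto_resolve: bool


def route(intent: str, confidence: float) -> Decision:
    """Auto-resolve only if confidence clears the threshold for this intent."""
    threshold = THRESHOLDS.get(intent, DEFAULT_THRESHOLD)
    return Decision(intent, confidence, confidence >= threshold)


# The same confidence score leads to different outcomes per intent.
print(route("order_status", 0.75))
print(route("product_compatibility", 0.75))
```

With a single global threshold, one of these two tickets would be handled wrongly whichever value was chosen, which is the trade-off described above.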
The TTFB constraint
The client's UX team had a hard requirement: first visible characters within 400ms. Early sequential API calls - classify, then retrieve, then generate - were producing a time to first byte (TTFB) of 600–800ms. Parallelising the retrieval steps (order data and catalogue lookup run simultaneously while classification completes) and switching to streaming delivery over server-sent events (SSE) brought median TTFB to 280ms. CloudFront edge caching of catalogue data handled tail latency during peak periods.
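The parallelisation can be sketched with `asyncio.gather`, which runs the three lookups concurrently so the total wait is roughly the slowest call rather than the sum of all three. The three coroutines are stand-in stubs with artificial delays and made-up return values; real versions would call the classifier, order system, and catalogue.

```python
import asyncio


async def classify(ticket: str) -> str:
    await asyncio.sleep(0.05)  # stand-in for a classifier call
    return "order_status"


async def fetch_order(ticket: str) -> dict:
    await asyncio.sleep(0.05)  # stand-in for an order-system lookup
    return {"order_id": "A123", "status": "in transit"}


async def fetch_catalogue(ticket: str) -> dict:
    await asyncio.sleep(0.05)  # stand-in for a catalogue lookup
    return {"sku": "TBL-EXT-180"}


async def gather_context(ticket: str) -> dict:
    """Run classification and both retrievals concurrently."""
    intent, order, catalogue = await asyncio.gather(
        classify(ticket),
        fetch_order(ticket),
        fetch_catalogue(ticket),
    )
    return {"intent": intent, "order": order, "catalogue": catalogue}


context = asyncio.run(gather_context("Where is my order?"))
print(context)
```

Generation would then start from `context` and stream its output, so the customer sees the first characters before the full reply is complete.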
The results, nine months in
The system went live in Q3 2024. Numbers below cover the nine months ending Q2 2025.
What happened to the 60-person support team
This comes up every time we present this project. It deserves a direct answer.
The client did not make 42 agents redundant. They had been running at capacity and turning down inbound volume during product launches and sale events because the team couldn't process it fast enough. With the volume constraint removed, the commercial team invested more aggressively in customer acquisition. Ticket volume grew 35% in the nine months after launch. The 18-agent team handling escalations is dealing with more tickets than the old team - but the tickets are genuinely complex: complaints, high-value order issues, edge cases in returns policy that the AI correctly identified as requiring judgment.
Attrition in the remaining team dropped significantly. The agents who stayed are spending their time on the work that needs them.
Three things we'd do differently
Start fine-tune data curation in week one. We spent three weeks debating whether fine-tuning was necessary or whether prompt engineering alone would suffice. It wasn't a close call in hindsight - the brand voice problem was real, and only the fine-tune solved it. Those three weeks were wasted. The data filtering, anonymisation, and format work should have started immediately.
Build the agent escalation view before launch, not after. We delivered escalation as a thin Zendesk integration. Agents found the pre-populated context cluttered and hard to scan. We rebuilt the escalation card in week eight - a clean view showing order history, the AI's draft, and the confidence reason in one viewport. It should have been designed properly from the start. It directly affected how quickly agents trusted the handoff.
Build a formal 30-day calibration phase into the contract. The first month of live operation is calibration, not performance. Confidence thresholds need tuning against real production data. Edge cases that didn't appear in testing start surfacing. The client's team needs time to trust the handoff flow. We now write an explicit calibration phase with defined metric targets into every project of this type - the go-live date is not the performance measurement date.