Autonomous Agent Handles 94% of Tier-1 Ops Tasks

The Situation

This global e-commerce platform processes over 40,000 orders daily across 12 markets. At that scale, exceptions aren't edge cases — they're a constant stream. Mis-shipments, inventory mismatches, failed payments, supplier delays: the operational noise was relentless.

Three senior engineers were spending the majority of their time on these tasks. Not debugging them — actually doing them. Manually reviewing exception queues, emailing suppliers, reconciling inventory counts, drafting customer comms. Work that required judgment but not creativity. Work that was making smart people miserable.

The VP of Engineering put it plainly in our first call: "I'm paying senior engineer salaries for work a well-trained agent should be doing."

The Architecture

We designed a three-agent system, with each agent responsible for a distinct class of operational task. The agents share a common context layer (order data, inventory state, supplier SLAs) but operate independently, with human-in-the-loop escalation for any case that falls below their confidence threshold.

Agent 1: Exception Resolver

The exception resolver monitors the order management system in real time. For each exception, it classifies the root cause, checks resolution history for similar cases, and executes the appropriate action: rerouting a shipment, issuing a replacement, or escalating to a human with a pre-drafted resolution plan.

Tools used: order management API, shipping carrier APIs, customer communication system, resolution history database.

Confidence threshold: exceptions where the agent is less than 92% confident in the resolution are escalated to a human with full context pre-loaded.
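
The escalation rule above can be sketched in a few lines. This is a minimal illustration, not the production logic: only the 92% threshold comes from the text. The `Resolution` type, the `route` function, and the action names are hypothetical, and the sketch assumes the agent already produces a confidence estimate between 0 and 1.

```python
from dataclasses import dataclass

# Threshold taken from the text; everything else is a hypothetical illustration.
CONFIDENCE_THRESHOLD = 0.92


@dataclass
class Resolution:
    exception_id: str
    root_cause: str    # e.g. "mis_shipment" (assumed label)
    action: str        # e.g. "reroute_shipment" (assumed label)
    confidence: float  # the agent's own estimate, 0.0 to 1.0


def route(resolution: Resolution, threshold: float = CONFIDENCE_THRESHOLD) -> dict:
    """Auto-execute at or above the threshold; otherwise escalate with context."""
    if not 0.0 <= resolution.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if resolution.confidence >= threshold:
        return {"route": "auto_execute", "action": resolution.action}
    # Below threshold: hand off to a human with the drafted plan attached.
    return {
        "route": "escalate_to_human",
        "proposed_action": resolution.action,
        "context": {
            "exception_id": resolution.exception_id,
            "root_cause": resolution.root_cause,
            "confidence": resolution.confidence,
        },
    }
```

Keeping the threshold as a parameter means it can be tuned per exception class without touching the routing logic.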

Agent 2: Inventory Reconciler

Inventory discrepancies between the platform's system of record and warehouse management systems were a daily occurrence — and a daily manual task. The reconciler runs every 4 hours, identifies mismatches, traces them to their source (receiving error, system lag, theft adjustment), and either auto-corrects or flags for physical investigation.

What previously took an engineer 90 minutes of manual reconciliation per day now runs unattended. The engineer reviews a 3-minute summary report.
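
As a rough sketch of the mismatch-detection step, assume both systems can be reduced to a mapping of SKU to unit count. The function names and the `auto_correct_limit` rule are assumptions made for illustration. The text does not say how the reconciler chooses between auto-correcting and flagging, and this sketch does not attempt the source-tracing step.

```python
def find_mismatches(
    system_of_record: dict[str, int],
    warehouse: dict[str, int],
) -> dict[str, tuple[int, int]]:
    """Return {sku: (expected, counted)} for every SKU whose counts differ."""
    mismatches = {}
    # Union of keys so SKUs missing from either side are still compared.
    for sku in sorted(system_of_record.keys() | warehouse.keys()):
        expected = system_of_record.get(sku, 0)
        counted = warehouse.get(sku, 0)
        if expected != counted:
            mismatches[sku] = (expected, counted)
    return mismatches


def triage(
    mismatches: dict[str, tuple[int, int]],
    auto_correct_limit: int = 2,  # hypothetical tolerance, in units
) -> tuple[list[str], list[str]]:
    """Split mismatches into (auto_correct, flag_for_investigation)."""
    auto_correct, flagged = [], []
    for sku, (expected, counted) in mismatches.items():
        if abs(expected - counted) <= auto_correct_limit:
            auto_correct.append(sku)
        else:
            flagged.append(sku)
    return auto_correct, flagged
```

In practice the auto-correct decision would likely depend on the traced source of the discrepancy rather than on size alone.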

Agent 3: Supplier Communication

Supplier communication — chasing delayed shipments, requesting updated ETAs, flagging quality issues — was eating 2–3 hours per day across the ops team. The supplier agent monitors purchase order status, identifies at-risk orders, and drafts and sends supplier communications in the appropriate tone and format for each supplier relationship.

The agent has full context on each supplier's communication preferences, SLA history, and relationship sensitivity. It escalates contentious conversations to a human with a full brief.
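
One simple way to picture the "at-risk order" check is a date comparison against the supplier's promised ship date. The sketch below is an assumption-laden illustration: the `PurchaseOrder` fields, the `at_risk` function, and the two-day warning window are all hypothetical, and the real agent presumably weighs SLA history as well.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class PurchaseOrder:
    po_number: str
    supplier: str
    promised_ship_date: date
    shipped: bool


def at_risk(
    orders: list[PurchaseOrder],
    today: date,
    warning_days: int = 2,  # hypothetical look-ahead window
) -> list[PurchaseOrder]:
    """Unshipped orders that are overdue or due within the warning window."""
    cutoff = today + timedelta(days=warning_days)
    return [
        po for po in orders
        if not po.shipped and po.promised_ship_date <= cutoff
    ]
```

Each order returned here would then feed the drafting step, or go to a human if the supplier relationship is marked as sensitive.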

The Build Process

The build took 8 weeks from discovery to production. The first two weeks were spent entirely on data access and context design: getting the agents accurate, real-time access to the systems they needed to act on. Bad context produces bad agent decisions, and there is no shortcut for this step.

Weeks 3–5 were parallel development of the three agents, with daily reviews against a test dataset of 500 real historical exceptions. We iterated on tool definitions, confidence thresholds, and escalation logic until the agents were resolving test cases at the same rate as the human team.

Weeks 6–8 covered shadow mode and go-live. In shadow mode the agents ran in parallel with the human team, with their decisions logged but not executed, and the human team validated those decisions daily. The agreement rate reached 94% by the end of week 7, and we went live in week 8.
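
The agreement rate tracked during shadow mode can be computed as the share of cases where the logged agent decision matched what the human team actually did. This is a minimal sketch under assumptions: decisions are represented as simple string labels keyed by case ID, and the `agreement_rate` function name is hypothetical.

```python
def agreement_rate(
    agent_decisions: dict[str, str],
    human_decisions: dict[str, str],
) -> float:
    """Fraction of shared cases where agent and human chose the same action."""
    # Only compare cases that both the agent and the human team handled.
    shared = agent_decisions.keys() & human_decisions.keys()
    if not shared:
        raise ValueError("no overlapping cases to compare")
    matches = sum(
        1 for case_id in shared
        if agent_decisions[case_id] == human_decisions[case_id]
    )
    return matches / len(shared)
```

Reviewing the disagreements case by case is usually more informative than the headline rate, since it shows whether the agent or the human made the better call.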

The Results

Six months post-launch, the agents handle 94% of Tier-1 operational tasks autonomously. The 6% that escalate to humans are genuinely complex cases that benefit from human judgment — not noise that slipped through.

The three engineers who were doing this work are now building product. One of them shipped a feature in their first sprint back that directly increased checkout conversion by 1.8%.

The operational cost saving — salary, tooling, and error-related costs — is estimated at £1.2M annually.

Key Takeaways

Agentic systems need accurate, real-time context before anything else — invest in the data layer first
Shadow mode deployment (run alongside humans before replacing them) dramatically reduces go-live risk
Confidence thresholds and escalation design matter as much as the agent logic itself
The ROI isn't just cost savings — it's what your best people do with their time back