Gradient Labs shares their experience building and deploying AI agents for customer support automation in production. While prototyping with LLMs is relatively straightforward, deploying agents to production introduces complex challenges around state management, knowledge integration, tool usage, and handling race conditions. The company developed a state machine-based architecture with durable execution engines to manage these challenges, successfully handling hundreds of conversations per day with high customer satisfaction.
Gradient Labs is a young startup based in London, UK, founded by Neil (former Director of Machine Learning at Monzo Bank) and his co-founders. The company is building an AI agent that automates complex customer support work for enterprises end-to-end. The talk, titled “The Rest of the Owl” (referencing the meme about drawing an owl by first drawing two circles and then “drawing the rest”), addresses the significant gap between prototyping with LLMs and actually running AI agents in production.
Neil emphasizes that while prototyping with LLMs is now easier than ever before, the age-old MLOps challenge of getting prototypes into production remains hard—and with AI agents, it’s a “new kind of beast.” The company has launched their AI agent and is running it live with several enterprise customers who care about precise, high-quality outcomes.
Neil presents a tiered view of production ML systems and their evolving complexity:
This framing is valuable because it acknowledges that AI agents represent a qualitative shift in operational complexity, not just an incremental step up from traditional ML systems.
The talk breaks down the AI agents production problem space into four major areas:
Unlike traditional ML systems that passively wait for requests and return results, AI agents must handle a more complex interaction pattern. In customer support, the agent must:
The key insight here is handling “fast and slow race conditions” between the AI agent and the customer (or system) on the other side. Even a basic agentic workflow making two or three LLM calls is far slower than a typical database or API call, and this latency creates numerous race condition scenarios that must be managed explicitly.
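To make the fast/slow race concrete, here is a minimal, hypothetical sketch (not Gradient Labs' code) of one common scenario: the customer sends a new message while the agent's slow LLM call is still in flight. The `Conversation` class and its version counter are illustrative inventions; the guard is a compare-and-swap-style check before delivering the reply.

```python
import threading

class Conversation:
    """Tracks messages and a version counter so the agent can detect
    that the conversation moved on while it was thinking."""
    def __init__(self):
        self._lock = threading.Lock()
        self._messages = []   # (sender, text) pairs
        self._version = 0     # bumps on every new customer message

    def add_customer_message(self, text):
        with self._lock:
            self._messages.append(("customer", text))
            self._version += 1

    def snapshot(self):
        with self._lock:
            return self._version, list(self._messages)

    def try_send_agent_reply(self, text, seen_version):
        """Only deliver the reply if no new customer message arrived
        during the (slow) LLM call; otherwise signal a stale reply."""
        with self._lock:
            if self._version != seen_version:
                return False  # stale: regenerate with the new context
            self._messages.append(("agent", text))
            return True

# Usage: the agent snapshots, "thinks" slowly, and the customer
# interleaves a new message before the reply is ready.
conv = Conversation()
conv.add_customer_message("Where is my refund?")
seen, _ = conv.snapshot()
# ... slow LLM call happens here; meanwhile the customer types again ...
conv.add_customer_message("Actually, cancel my order instead.")
ok = conv.try_send_agent_reply("Your refund is on its way.", seen)
print(ok)  # False: the drafted reply is stale and must be discarded
```

The same check-before-commit pattern applies whether the state lives in memory (as here) or, more realistically, in a database row or a durable workflow engine.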
While it might seem like the LLM layer is “just another API call,” production systems require deeper consideration:
Neil emphasizes that simply throwing documentation into a vector database is “almost a guaranteed way to reach a bad outcome.” The challenges include:
AI agent demos often show impressive tool use (Google searches, code amendments, opening issues), but production enterprise scenarios introduce new challenges:
Gradient Labs architects their AI agents as state machines. Neil presents a simplified example of a conversation as a state machine: one party talks while the other enters a listening state, they finish, the roles swap, and this continues until the conversation completes. In practice, their production agent has many more states, some deterministic and some agentic.
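The simplified conversation Neil describes can be sketched as a deterministic transition table. The state and event names below are illustrative, not taken from Gradient Labs' codebase; a production agent would add many more states, some of whose transitions are decided by an LLM rather than a lookup.

```python
from enum import Enum, auto

class State(Enum):
    CUSTOMER_TALKING = auto()
    AGENT_TALKING = auto()
    DONE = auto()

def step(state: State, event: str) -> State:
    """Deterministic transitions for the two-party conversation:
    roles swap when a party finishes, until the conversation closes."""
    transitions = {
        (State.CUSTOMER_TALKING, "customer_finished"): State.AGENT_TALKING,
        (State.AGENT_TALKING, "agent_finished"): State.CUSTOMER_TALKING,
        (State.CUSTOMER_TALKING, "conversation_closed"): State.DONE,
        (State.AGENT_TALKING, "conversation_closed"): State.DONE,
    }
    # Unknown events leave the state unchanged rather than crashing.
    return transitions.get((state, event), state)
```

Modeling the conversation this way makes every legal transition explicit and auditable, which is what lets a durable execution engine resume an in-flight conversation from its last recorded state.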
The state machine approach allows them to:
Neil shares a screenshot from their production codebase showing a simple example where the AI agent classifies a customer conversation for clarity and, if needed, responds by asking for clarification. Key decisions in their approach:
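The clarity-classification step described above can be sketched as follows. This is a hypothetical reconstruction, not the code from the screenshot: `classify_clarity` stands in for an LLM classification call (here replaced by a trivial heuristic), and `"PROCEED"` is a placeholder for handing off to the rest of the agent's state machine.

```python
def classify_clarity(message: str) -> str:
    """Stand-in for the agent's LLM classification call.
    Hypothetical heuristic: very short messages count as unclear."""
    return "unclear" if len(message.split()) < 3 else "clear"

def handle_customer_message(message: str) -> str:
    """Ask for clarification before routing an unclear conversation
    into the rest of the agent's states."""
    if classify_clarity(message) == "unclear":
        return "Could you tell me a bit more about what you need help with?"
    return "PROCEED"  # placeholder: continue to the next agent state
```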
When asked about testing agents before deployment, Neil explains that approaches like A/B rollouts and shadow deployments work well for conventional ML, but AI agent testing is “multifaceted.” Their approach involves:
Different enterprise customers have varying requirements that must be explicitly designed for:
The framework must support control at both the functional level (setting timers, configuring values) and the behavioral level (topic restrictions).
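A hypothetical per-customer configuration object illustrates the two levels of control: functional settings (timers, configurable values) sit alongside behavioral restrictions (topics the agent must not engage with). All field names here are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class CustomerConfig:
    """Per-enterprise-customer controls (illustrative field names)."""
    # Functional-level controls: timers and configurable values.
    inactivity_timer_seconds: int = 300
    max_clarification_attempts: int = 2
    # Behavioral-level controls: topics the agent must hand off.
    restricted_topics: frozenset = field(default_factory=frozenset)

def is_topic_allowed(config: CustomerConfig, topic: str) -> bool:
    """Behavioral gate: restricted topics go to a human instead."""
    return topic not in config.restricted_topics

# A stricter (hypothetical) bank tenant overrides the defaults.
bank = CustomerConfig(
    inactivity_timer_seconds=120,
    restricted_topics=frozenset({"investment advice", "account closure"}),
)
```

Keeping both levels in one declarative config means each enterprise's requirements are data, not code branches scattered through the agent.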
Gradient Labs is now live with several partners, processing hundreds of conversations per day. One measure of success they cite: customers find the experience positive enough that they thank the AI agent for its service. While specific quantitative metrics aren’t shared, Neil notes that for the largest enterprises they work with, a simple RAG-based agent would address only about 10% or less of what their human agents actually handle—highlighting the complexity of real enterprise customer support automation.
It’s worth noting that this talk comes from the CTO of a startup actively selling their AI agent product, so claims should be considered in that context. The lack of specific quantitative metrics (resolution rates, accuracy percentages, cost savings) is notable. However, the technical depth of the discussion about production challenges appears genuine and aligns with broader industry experience. The admission that RAG alone only addresses a small fraction of enterprise needs is refreshingly honest for a company that could easily oversell their solution.
The architectural insights about state machines, durable execution, and the importance of point-in-time knowledge references represent practical wisdom that would be valuable to others building similar systems, regardless of whether one uses Gradient Labs’ specific product.
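The "point-in-time knowledge references" idea mentioned above can be sketched as a versioned knowledge base where every lookup is made "as of" a timestamp, so an audit can reproduce exactly what the agent saw even after the documentation changes. The class and method names are illustrative assumptions, not Gradient Labs' API.

```python
import bisect
from dataclasses import dataclass

@dataclass
class Revision:
    valid_from: float   # unix timestamp when this version went live
    text: str

class PointInTimeKB:
    """Versioned knowledge base supporting 'as of' lookups."""
    def __init__(self):
        self._articles = {}  # article_id -> [Revision], sorted by valid_from

    def publish(self, article_id: str, valid_from: float, text: str):
        revs = self._articles.setdefault(article_id, [])
        revs.append(Revision(valid_from, text))
        revs.sort(key=lambda r: r.valid_from)

    def as_of(self, article_id: str, timestamp: float):
        """Return the revision that was live at `timestamp`, if any."""
        revs = self._articles.get(article_id, [])
        times = [r.valid_from for r in revs]
        i = bisect.bisect_right(times, timestamp) - 1
        return revs[i].text if i >= 0 else None

# Usage: the agent records *when* it consulted the article, so later
# review retrieves the same text even after an update at t=5.0.
kb = PointInTimeKB()
kb.publish("refunds", 1.0, "Refunds take 5 days.")
kb.publish("refunds", 5.0, "Refunds take 3 days.")
print(kb.as_of("refunds", 3.0))  # -> Refunds take 5 days.
```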
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variation across frontier models (from single-digit to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge—particularly from OpenAI models, which hallucinated non-existent insurance products 15-45% of the time.
Stripe, which processes approximately 1.3% of global GDP, has evolved from traditional ML-based fraud detection to transformer-based payments foundation models that score every transaction in under 100ms. The company built a domain-specific foundation model that treats charges as tokens and behavior sequences as context windows, ingesting tens of billions of transactions to power fraud detection and improving card-testing detection from 59% to 97% accuracy for large merchants. Stripe also launched the Agentic Commerce Protocol (ACP) jointly with OpenAI to standardize how agents discover and purchase from merchant catalogs. Internally, AI adoption has reached 8,500 employees using LLM tools daily, with 65-70% of engineers using AI coding assistants and notable productivity gains, such as reducing payment method integrations from 2 months to 2 weeks.
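The "charges as tokens" framing can be illustrated with a toy sketch: each charge is discretized into a small categorical vocabulary, and a card's recent charges form the sequence a model would score. This is an assumption-laden illustration of the idea only; Stripe's actual tokenization and features are not public in this talk.

```python
# Hypothetical sketch: bucket each charge into a coarse token
# (amount band x country), building the vocabulary on the fly.

def charge_to_token(charge: dict, vocab: dict) -> int:
    """Map a charge to a categorical token id (illustrative features)."""
    band = "low" if charge["amount_cents"] < 1000 else "high"
    key = (band, charge["country"])
    if key not in vocab:
        vocab[key] = len(vocab)   # assign the next unused id
    return vocab[key]

vocab = {}
history = [
    {"amount_cents": 500, "country": "US"},
    {"amount_cents": 120, "country": "US"},
    {"amount_cents": 9900, "country": "RU"},
]
# The card's recent behavior becomes a token sequence — the "context
# window" a sequence model could score for fraud risk.
sequence = [charge_to_token(c, vocab) for c in history]
print(sequence)  # [0, 0, 1]
```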
Codeium's journey in building their AI-powered development tools showcases how investing early in enterprise-ready infrastructure, including containerization, security, and comprehensive deployment options, enabled them to scale from individual developers to large enterprise customers. Their "go slow to go fast" approach in building proprietary infrastructure for code completion, retrieval, and agent-based development culminated in Windsurf IDE, demonstrating how thoughtful early architectural decisions can create a more robust foundation for AI tools in production.