Intercom developed Finn, an autonomous AI customer support agent, evolving it from early prototypes with GPT-3.5 to a production system using GPT-4 and custom architecture. Initially hampered by hallucinations and safety concerns, the system now successfully resolves 58-59% of customer support conversations, up from 25% at launch. The solution combines multiple AI processes including disambiguation, ranking, and summarization, with careful attention to brand voice control and escalation handling.
Intercom, a company that pioneered modern in-app chat, has been at the forefront of deploying autonomous AI agents for customer support at scale. This case study, presented by Des Traynor (co-founder and Chief Strategy Officer), provides extensive insights into the journey from early prototypes to a production system serving approximately 30,000 customers. The conversation reveals the significant differences between building traditional software and building AI-powered agents, with particular emphasis on evaluation, testing, architecture decisions, and the operational challenges of running LLM-based systems in production.
The timeline of Finn’s development illustrates a common pattern in LLMOps: early prototypes that look impressive but cannot handle real-world conditions. When ChatGPT launched on November 30, 2022, Intercom had AI features in their inbox by December and first prototypes of Finn by January 2023. However, these early versions running on GPT-3.5 could not be safeguarded against hallucinations. Des described these demos as only impressive “as long as we could control what you said” - a situation many AI companies still face today.
The turning point came with GPT-4’s release in March 2023, which provided the reliability needed to actually deploy a fully autonomous agent in front of real customer support teams. This underscores a critical LLMOps lesson: the underlying model capabilities fundamentally determine what is possible in production, and teams must be prepared to wait for or switch to models that meet their reliability requirements.
One of the most valuable LLMOps insights from this case study is Intercom’s approach to evaluation through what they call a “torture test.” This comprehensive evaluation framework covers several categories of failure modes that are unacceptable in customer support:
Des emphasized that meaningful evaluation must focus on edge cases, not easy questions. He used the analogy of comparing himself to Einstein by asking “what’s 2 plus 2” - both would get it right, but that tells you nothing about actual capability. The real differentiation happens in hard cases, which is where Intercom “earns money.”
For organizations building their own torture tests, Des recommends including: the last 20 hard questions that frontline support couldn’t answer, the 10 most common questions, security-conscious tests (prompt injection resistance), and edge cases around sales conversations, documentation gaps, and escalation scenarios.
A crucial technical insight is that Finn is not a single-shot LLM call but rather “a subsystem of about 15 different processes.” These include:
Des noted that when asked about model improvements like DeepSeek releases, the impact on Finn’s resolution rate is relatively small (only 2-3% attributable to model changes). The much larger gains come from agent architecture improvements (they’re on their third major architecture), prompt engineering, and the peripheral components like their custom RAG and ranking systems.
This represents an important LLMOps pattern: sophisticated production systems derive most of their value from the orchestration and supporting infrastructure rather than raw model capabilities. Intercom has invested heavily in perfecting these components specifically for their domain, with advantages from previous AI products that gave them pre-tuned components.
The AI development lifecycle Des describes differs fundamentally from traditional software development. Rather than the linear path of research → wireframe → build → beta → ship, AI product development involves:
Finn’s resolution rate improved from approximately 25% at launch to 58-59% at the time of this conversation - entirely through post-launch improvements based on production observations. This represents the reality of LLMOps: deployment is not an endpoint but a starting point for continuous improvement.
A significant portion of the discussion addressed how to handle customer-specific customization in an agent product. Des described the tension between flexibility and reliability: the tempting approach is to expose a large prompt text box and let customers describe whatever behavior they want, but this leads to problems because “customers are not prompt engineers.”
Instead, Intercom has learned to identify common “vectors of differentiation” across their customer base and build structured settings for these. Specific examples include:
The key insight is that configuration should be applied after the core answer is generated - first generate a good answer, then apply customer-specific guidance to adapt it. This architecture prevents customer configurations from interfering with the fundamental quality of responses.
The system monitors for signals that indicate when human intervention is needed:
Importantly, the default behavior is not to immediately hand over to a human but to ask if the user wants to speak with one, as sometimes apparent dissatisfaction may simply be the user reading and clicking through the provided answer.
Different customers have different thresholds for escalation, different handover methods (live agent vs. asynchronous follow-up vs. callback request), and different wait times before handover. This creates what Des calls “a sea of preferences” that must be navigated in the product design.
As AI handles more routine inquiries (now 58-59% resolution), the human support work becomes more specialized and complex. Des described the emerging topology of AI-augmented support teams:
The question of compound reliability in multi-step tasks came up - if each step is only 90% reliable and you chain three together, performance degrades significantly. Des compared this to management: “Have you ever been a manager? That’s literally what it is. Johnny’s hung over, Debbie didn’t show up. We still need the system to work.”
A key LLMOps theme is that building an agent requires building extensive surrounding tooling for customers to manage that agent. This includes:
Des emphasized that discussions about “thin wrappers” miss the point: “you don’t realize how much software you have to build to handle 50-60-70% of the support of a large company.”
Beyond Finn, Intercom uses AI coding tools extensively across their ~500 person R&D team:
The most impactful change Des identified is in design and prototyping: using AI tools to explore visualization options (tree maps, bubble diagrams, etc.) in two hours rather than weeks of manual wireframing. The ability to “explore the solution space” through rapid prompting, then “exploit” by deep refinement of the chosen direction, dramatically compresses the design iteration cycle.
For effective AI-assisted development, Des stressed the importance of well-maintained codebases and rules files. Legacy codebases with mixed libraries and architectural generations may not benefit as much from AI coding assistance, creating a potential competitive disadvantage against new entrants who build AI-native from the start.
Intercom has approximately 5,000 Finn customers generating revenue in the tens of millions, growing rapidly. Des described customer reactions ranging from relief (“a glass of ice water to somebody in hell” - attributed to Anthropic as a Finn customer) for overwhelmed support teams to amusing anecdotes of end users preferring the bot over human agents because they don’t want to bother a real person with their questions.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
This case study presents the deployment of Dust.tt's AI platform across multiple companies including Payfit and Alan, focusing on enterprise-wide productivity improvements through LLM-powered assistants. The companies implemented a comprehensive AI strategy involving both top-down leadership support and bottom-up adoption, creating custom assistants for various workflows including sales processes, customer support, performance reviews, and content generation. The implementation achieved significant productivity gains of approximately 20% across teams, with some specific use cases reaching 50% improvements, while addressing challenges around security, model selection, and user adoption through structured rollout processes and continuous iteration.
Stripe, processing approximately 1.3% of global GDP, has evolved from traditional ML-based fraud detection to deploying transformer-based foundation models for payments that process every transaction in under 100ms. The company built a domain-specific foundation model treating charges as tokens and behavior sequences as context windows, ingesting tens of billions of transactions to power fraud detection, improving card-testing detection from 59% to 97% accuracy for large merchants. Stripe also launched the Agentic Commerce Protocol (ACP) jointly with OpenAI to standardize how agents discover and purchase from merchant catalogs, complemented by internal AI adoption reaching 8,500 employees daily using LLM tools, with 65-70% of engineers using AI coding assistants and achieving significant productivity gains like reducing payment method integrations from 2 months to 2 weeks.