Company: Intercom
Title: Scaling an Autonomous AI Customer Support Agent from Demo to Production
Industry: Tech
Year: 2023
Summary (short):
Intercom developed Fin, an autonomous AI customer support agent, evolving it from early prototypes built on GPT-3.5 to a production system using GPT-4 and a custom multi-component architecture. Initially hampered by hallucinations and safety concerns, the system now resolves 58-59% of customer support conversations, up from roughly 25% at launch. The solution combines multiple AI processes, including disambiguation, ranking, and summarization, with careful attention to brand voice control and escalation handling.
## Overview

Intercom, a company that pioneered modern in-app chat, has been at the forefront of deploying autonomous AI agents for customer support at scale. This case study, presented by Des Traynor (co-founder and Chief Strategy Officer), provides extensive insight into the journey from early prototypes to a production system serving approximately 30,000 customers. The conversation reveals the significant differences between building traditional software and building AI-powered agents, with particular emphasis on evaluation, testing, architecture decisions, and the operational challenges of running LLM-based systems in production.

## The Journey from Demo to Production

The timeline of Fin's development illustrates a common pattern in LLMOps: early prototypes that look impressive but cannot handle real-world conditions. When ChatGPT launched on November 30, 2022, Intercom had AI features in their inbox by December and first prototypes of Fin by January 2023. However, these early versions, running on GPT-3.5, could not be safeguarded against hallucinations. Des described the demos as impressive only "as long as we could control what you said" - a situation many AI companies still face today.

The turning point came with GPT-4's release in March 2023, which provided the reliability needed to deploy a fully autonomous agent in front of real customer support teams. This underscores a critical LLMOps lesson: the underlying model's capabilities fundamentally determine what is possible in production, and teams must be prepared to wait for, or switch to, models that meet their reliability requirements.

## Evaluation and Testing: The Torture Test

One of the most valuable LLMOps insights from this case study is Intercom's approach to evaluation through what they call a "torture test." This evaluation framework covers several categories of failure modes that are unacceptable in customer support:

- Giving wrong answers
- Making up answers (hallucinations)
- Taking inappropriate opinions (political stances, commentary on world events)
- Recommending competitors
- Brand voice violations and tone issues

Des emphasized that meaningful evaluation must focus on edge cases, not easy questions. He used the analogy of comparing himself to Einstein by asking "what's 2 plus 2" - both would get it right, but that tells you nothing about actual capability. The real differentiation happens in the hard cases, which is where Intercom "earns money."

For organizations building their own torture tests, Des recommends including: the last 20 hard questions that frontline support couldn't answer, the 10 most common questions, security-conscious tests (prompt injection resistance), and edge cases around sales conversations, documentation gaps, and escalation scenarios.
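That composition maps naturally onto a small evaluation harness. The following is a minimal sketch, not Intercom's implementation: the `TortureCase` structure, the category names, and the `run_agent` and `judge` callables are all assumptions. It illustrates the core idea of scoring each failure category separately, so that easy questions cannot mask regressions on the hard ones.

```python
from dataclasses import dataclass
from collections import defaultdict
from typing import Callable

# Illustrative failure categories drawn from the "torture test" description above.
CATEGORIES = [
    "wrong_answer", "hallucination", "opinion", "competitor_mention",
    "tone_violation", "prompt_injection", "escalation_handling",
]

@dataclass
class TortureCase:
    question: str   # e.g. one of the last 20 questions frontline support couldn't answer
    category: str   # which failure mode this case probes
    reference: str  # expected behaviour, used by the judge

def run_torture_test(
    cases: list[TortureCase],
    run_agent: Callable[[str], str],    # hypothetical: sends the question to the support agent
    judge: Callable[[str, str], bool],  # hypothetical: LLM-as-judge or human review
) -> dict[str, float]:
    """Return the pass rate per failure category, not a single blended score."""
    passed, total = defaultdict(int), defaultdict(int)
    for case in cases:
        assert case.category in CATEGORIES, f"unknown category: {case.category}"
        answer = run_agent(case.question)
        total[case.category] += 1
        if judge(answer, case.reference):
            passed[case.category] += 1
    return {cat: passed[cat] / total[cat] for cat in total}
```

Running a suite like this in batch before every prompt or architecture change is the kind of internal tooling the case study returns to later.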
## Agent Architecture and Multi-Component Systems

A crucial technical insight is that Fin is not a single-shot LLM call but rather "a subsystem of about 15 different processes." These include:

- Disambiguation processes
- Chunking
- Reranking (using a bespoke ranker)
- Summarization
- RAG retrieval (using a bespoke RAG engine)
- The core answer generation component

Des noted that when asked about model improvements such as the DeepSeek releases, the impact on Fin's resolution rate is relatively small (only 2-3% attributable to model changes). The much larger gains come from agent architecture improvements (they are on their third major architecture), prompt engineering, and peripheral components like their custom RAG and ranking systems.

This represents an important LLMOps pattern: sophisticated production systems derive most of their value from the orchestration and supporting infrastructure rather than from raw model capabilities. Intercom has invested heavily in perfecting these components specifically for their domain, drawing on previous AI products that had already given them pre-tuned components.

## Continuous Improvement and Production Learning

The AI development lifecycle Des describes differs fundamentally from traditional software development. Rather than the linear path of research → wireframe → build → beta → ship, AI product development involves:

- Brainstorming possibilities
- Internal piloting to see if concepts work
- Committing to building
- Hoping it works in the wild (an honest acknowledgment of uncertainty)
- Extensive testing even after launch, because 30,000 customers will find edge cases
- Constant iteration based on production observations

Fin's resolution rate improved from approximately 25% at launch to 58-59% at the time of this conversation - entirely through post-launch improvements based on production observations. This reflects the reality of LLMOps: deployment is not an endpoint but a starting point for continuous improvement.

## Configuration and Customization Challenges

A significant portion of the discussion addressed how to handle customer-specific customization in an agent product. Des described the tension between flexibility and reliability: the tempting approach is to expose a large prompt text box and let customers describe whatever behavior they want, but this leads to problems because "customers are not prompt engineers."

Instead, Intercom has learned to identify common "vectors of differentiation" across their customer base and build structured settings for them. Specific examples include:

- Tone of voice controls (with guidance on how to properly instruct the model)
- Answer length settings (where simply saying "be really short" can produce rude responses)
- Escalation settings for specific keywords or situations
- Brand-specific terminology preferences

The key insight is that configuration should be applied after the core answer is generated: first generate a good answer, then apply customer-specific guidance to adapt it. This architecture prevents customer configurations from interfering with the fundamental quality of responses.
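A minimal sketch of that two-stage pattern follows. It assumes a generic `llm` completion callable and invented setting names (`tone`, `max_sentences`, `escalation_keywords`); it is not Intercom's actual configuration schema, but it shows how styling and escalation rules can be layered around an answer rather than mixed into its generation.

```python
from typing import Callable

# Hypothetical customer-level settings, structured rather than a free-form prompt box.
DEFAULT_SETTINGS = {
    "tone": "friendly and professional",
    "max_sentences": 4,
    "escalation_keywords": ["scam", "hacked", "compromised"],
}

def answer_with_settings(
    question: str,
    context: str,               # retrieved documentation from the RAG step
    settings: dict,
    llm: Callable[[str], str],  # assumed: prompt in, completion out
) -> str:
    # Configured escalation keywords are checked before any generation happens.
    if any(k in question.lower() for k in settings["escalation_keywords"]):
        return "ESCALATE_TO_HUMAN"  # placeholder sentinel for a handover path

    # Stage 1: produce the best possible answer, with no customer styling involved.
    core_answer = llm(
        "Answer the customer's question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # Stage 2: apply brand voice and length settings to an already-correct answer,
    # so customer configuration cannot degrade factual quality.
    return llm(
        f"Rewrite the answer in a {settings['tone']} tone, "
        f"in at most {settings['max_sentences']} sentences, "
        "without adding or removing any factual content.\n"
        f"Answer: {core_answer}"
    )
```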
## Human-Agent Interaction and Escalation

The system monitors for signals that indicate when human intervention is needed:

- Direct feedback asking "did this answer your question"
- Frustration signals (including caps lock usage)
- Configured escalation keywords (scam, hacked, compromised)
- Industry-specific triggers (e.g., mental health concerns in gaming)
- Negative behavioral signals, such as users immediately starting a new conversation

Importantly, the default behavior is not to hand over to a human immediately but to ask whether the user wants to speak with one, since apparent dissatisfaction may simply be the user reading and clicking through the provided answer. Different customers have different thresholds for escalation, different handover methods (live agent vs. asynchronous follow-up vs. callback request), and different wait times before handover. This creates what Des calls "a sea of preferences" that must be navigated in the product design.

## Impact on Support Team Structure

As AI handles more routine inquiries (now 58-59% resolution), the human support work becomes more specialized and complex. Des described the emerging topology of AI-augmented support teams:

- Fewer total roles relative to support volume
- More specialization, with roles like "principal product specialist"
- Higher-paid positions requiring deep product expertise
- Career paths emerging for support staff
- "Human in the loop" approval workflows for actions like refunds
- Dedicated roles for managing the AI agent and its knowledge base
- Documentation becoming "mission critical infrastructure"

The question of compound reliability in multi-step tasks came up: if each step is only 90% reliable and you chain three together, end-to-end success drops to roughly 73% (0.9³). Des compared this to management: "Have you ever been a manager? That's literally what it is. Johnny's hung over, Debbie didn't show up. We still need the system to work."

## Tooling and Supporting Infrastructure

A key LLMOps theme is that building an agent requires building extensive surrounding tooling for customers to manage that agent. This includes:

- Test suites for batch testing and scenario testing
- Analytics dashboards for understanding what's happening
- Conversation analysis tools
- Configuration interfaces for behavior customization
- Monitoring for agent performance metrics

Des emphasized that discussions about "thin wrappers" miss the point: "you don't realize how much software you have to build to handle 50-60-70% of the support of a large company."

## AI in Intercom's Own Development Process

Beyond Fin, Intercom uses AI coding tools extensively across its ~500-person R&D team:

- Windsurf and Cursor are default tools for all engineers
- Claude Code is common, especially in the AI group
- Tools like Lovable are used as alternatives to Figma for rapid prototyping

The most impactful change Des identified is in design and prototyping: using AI tools to explore visualization options (tree maps, bubble diagrams, etc.) in two hours rather than weeks of manual wireframing. The ability to "explore the solution space" through rapid prompting, then "exploit" by deeply refining the chosen direction, dramatically compresses the design iteration cycle.

For effective AI-assisted development, Des stressed the importance of well-maintained codebases and rules files. Legacy codebases with mixed libraries and architectural generations may not benefit as much from AI coding assistance, creating a potential competitive disadvantage against new entrants who build AI-native from the start.

## Customer Reception

Intercom has approximately 5,000 Fin customers generating revenue in the tens of millions, and growing rapidly. Des described customer reactions ranging from relief for overwhelmed support teams ("a glass of ice water to somebody in hell" - attributed to Anthropic as a Fin customer) to amusing anecdotes of end users preferring the bot over human agents because they don't want to bother a real person with their questions.
