ZenML

Four Critical Lessons from Building 50+ Global Chatbots: A Practitioner's Guide to Real-World Implementation

Campfire AI 2024

Drawing on experience building over 50 chatbots across five continents, this case study outlines four crucial lessons for successful chatbot implementation. The overarching insight is to treat chatbot projects as AI initiatives rather than traditional IT projects; the four lessons cover anticipating out-of-scope queries through "99-intents", organizing intents hierarchically for more natural interactions, planning for unusual user expressions, and eliminating unhelpful "I don't understand" responses. The study emphasizes that successful chatbots require continuous optimization, aiming for 90-95% recognition rates for in-scope questions while maintaining effective fallback mechanisms for edge cases.

Industry

Tech

Overview

This case study comes from Campfire AI, a conversational AI company that has built over 50 chatbots for enterprises across five continents spanning Europe, Africa, Asia, and the Americas. The article, published by Sinch (a communications platform-as-a-service provider), shares practical lessons learned from deploying production chatbots across multiple industries. While the piece has some promotional elements given its publication context, the technical insights it provides are grounded in real-world operational experience and offer valuable guidance for teams deploying conversational AI systems in production.

The central thesis of the case study is that while conversational AI technology has matured significantly, most chatbot implementations still fail to deliver good user experiences. The author argues this is primarily an implementation problem rather than a technology limitation. The lessons shared focus on the operational and design challenges that emerge when moving from chatbot prototypes to production systems that must handle the unpredictable nature of real user interactions.

The Core Challenge: AI Projects vs Traditional IT Projects

One of the most important conceptual frameworks introduced in this case study is the distinction between building a chatbot as an AI project versus a traditional IT project. The author uses website development as a contrasting example. When building a website, the scope is predefined and finite—teams design user journeys, decide button placements, and define what happens when users click them. Testing involves verifying that buttons work correctly and layouts render properly across devices.

Chatbot development fundamentally differs because providing users with a keyboard input makes their possible actions infinite rather than finite. Users can type literally anything, which means even with a well-defined initial scope, the bot must handle expressions and queries that fall outside its training. This philosophical understanding is crucial for LLMOps practitioners because it shapes how teams should approach testing, monitoring, and iteration cycles.

Handling Out-of-Scope Queries: False Positives and True Negatives

A significant portion of the case study addresses the challenge of out-of-scope queries, which the author breaks down into two distinct failure modes:

False Positives: These occur when a user asks something outside the bot’s scope, but the NLP model incorrectly matches it to an in-scope intent with high confidence. Ironically, the author notes that a more robust and well-trained in-scope NLP model can actually increase the likelihood of false positives because it becomes more aggressive at matching inputs to known intents. This is a counterintuitive but important insight for production systems—sometimes better training on in-scope intents can degrade the overall user experience if out-of-scope handling isn’t also addressed.

True Negatives: These happen when the bot correctly fails to understand an out-of-scope query but then asks the user to rephrase their question. The problem is that rephrasing won’t help because the question was genuinely outside the bot’s knowledge domain. Users and bots get stuck in frustrating “not understood” loops.

The solution proposed is to create what the author calls “99-intents”—these are deliberately trained intents that capture categories and topics that are explicitly out of scope. Their purpose is to attract out-of-scope questions and then define appropriate next steps for users, such as directing them to human agents or alternative resources. This is an important production pattern because it means teams should actively model what their bot cannot do, not just what it can do.
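The 99-intent pattern can be sketched in a few lines. This is a minimal illustration, not Campfire AI's implementation: the intent names, responses, and confidence threshold are all invented for the example.

```python
# Sketch of the "99-intent" pattern: out-of-scope topics are modeled as
# real intents (here prefixed "99.") so the bot can respond with a useful
# next step instead of looping on "please rephrase". All names and the
# threshold are illustrative assumptions.

CONFIDENCE_THRESHOLD = 0.7

RESPONSES = {
    # In-scope intents.
    "card.lost": "Let's block your card. Was it a debit or a credit card?",
    "account.balance": "You can check your balance under 'Accounts'.",
    # 99-intents: deliberately trained to attract known out-of-scope topics.
    "99.mortgage_advice": (
        "I can't give mortgage advice, but an advisor can. "
        "Shall I book a call-back?"
    ),
    "99.crypto": "We don't offer crypto services, but our FAQ explains why.",
}

def route(intent: str, confidence: float) -> str:
    """Pick a reply given the NLP model's top intent and its confidence."""
    if confidence < CONFIDENCE_THRESHOLD:
        # True negative: asking the user to rephrase won't help, so offer
        # an escape hatch instead of a "not understood" loop.
        return "I couldn't find an answer. Would you like to talk to an agent?"
    return RESPONSES.get(intent, "Let me connect you to an agent.")
```

The key design point is that `99.*` intents are first-class citizens of the model: they receive training phrases and confident matches, which is what keeps them from being mistaken for in-scope questions (the false-positive case) or falling through to the low-confidence branch (the true-negative case).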

Hierarchical Intent Organization

The case study provides a detailed example from a banking context to illustrate the importance of organizing intents into a knowledge hierarchy from broad to specific. When a user reports a lost card, the bot might need to ask whether it was a debit or credit card because the processes differ. However, if the user has already specified “credit card” in their initial message, asking again makes the bot appear to have not understood them.

Even more problematic is when a user provides very specific information (like “Amex Gold”) that the bot cannot map correctly. If the bot only expects “debit or credit” as valid responses and doesn’t recognize that Amex Gold is a credit card, users get stuck in loops. The solution involves grouping similar intents together, organizing them from broad to specific, and using a combination of intents and entity extraction to determine how specific the response should be.

This hierarchical approach leverages what makes AI valuable—the ability to immediately narrow in on specific user needs rather than forcing users through predetermined navigation flows. The author points out that on a website, accessing Amex Gold information might require clicking through 5 or 6 levels of navigation, whereas a well-designed chatbot can go directly to the specific answer.

Training for Linguistic Diversity

A particularly practical section of the case study addresses the challenge of training NLP models to handle the wide variety of ways users express the same concept. Using a Belgian banking example, the author describes how users might refer to a card reader device using terms like “card reader,” “bank device,” “pin machine,” “digipass,” or regional slang terms like “bakje” or “kaske” (which translate to “small tray” or “container”). Some users don’t even use specific nouns and instead describe the device functionally: “the thing I use to pay online.”

This linguistic diversity applies to both individual words and complete situational descriptions. A user might say “while getting out of my car, I dropped my card in the door and closed it” to express that their card is broken. The author raises an important question: how far should teams go in training for these unusual expressions? Adding words like “car” and “door” to the broken card intent could cause false positives when those words appear in completely unrelated contexts.

The author recommends targeting 90% recognition rates for in-scope questions, with the possibility of reaching 95% depending on the bot and scope. Training beyond 95% typically involves edge cases and exceptions that are better handled by human agents. This provides a concrete operational target for production systems.

Iterative Model Optimization

The case study describes the typical trajectory of a production chatbot launch. When first deployed, an NLP model typically recognizes approximately 70% of incoming in-scope questions. Teams must then review incoming questions, update the NLP model with new expressions, deploy the updated model, and repeat this cycle until reaching the 90% recognition target.

This iterative optimization approach is fundamental to LLMOps. Unlike traditional software where bugs can be definitively fixed, conversational AI uses confidence scores and probabilistic matching. Nothing is ever “black and white,” as the author notes, which means continuous monitoring and incremental improvement are essential parts of operating these systems in production.
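The review-and-retrain loop implies a simple metric computed over labeled production logs. A minimal sketch, assuming each reviewed conversation is labeled with whether it was in scope and whether the model matched it correctly (the log format is an assumption):

```python
# Compute the in-scope recognition rate from reviewed production logs and
# decide whether another train-deploy cycle is needed. The ~70% launch
# figure and 90% target come from the article; the log schema does not.

def recognition_rate(logs: list[dict]) -> float:
    """Share of in-scope questions the NLP model matched correctly."""
    in_scope = [log for log in logs if log["in_scope"]]
    if not in_scope:
        return 0.0
    recognized = sum(1 for log in in_scope if log["recognized"])
    return recognized / len(in_scope)

def needs_retraining(logs: list[dict], target: float = 0.90) -> bool:
    # Out-of-scope traffic is excluded: it is handled by 99-intents and
    # safety nets, not by adding more in-scope training phrases.
    return recognition_rate(logs) < target
```

Note that out-of-scope questions are excluded from the denominator; counting them would make the metric punish the bot for traffic it was never meant to handle.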

Graceful Failure and Safety Nets

Perhaps the most user-experience-focused lesson is about eliminating dead-end error messages. When a bot responds with “Sorry, I don’t know the answer,” it creates a negative experience because it doesn’t help the user progress toward their goal. The author uses a Wi-Fi outage example: if a user contacts their provider’s chatbot about connectivity issues and the bot responds with an unhelpful error message, the user now has two problems instead of one.

The recommended approach is to design elaborate “safety net” flows that provide alternative paths to solutions even when the bot cannot directly answer a question. These alternatives might include directing users to FAQ pages, providing links to relevant videos, offering customer support phone numbers, or connecting users to live agents. The goal is to exhaust every possible way to help the user before concluding the interaction.
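One way to sketch such a safety net is as an ordered list of escalating alternatives. The channels, URLs, and phone number below are placeholders, not Campfire AI's actual flow:

```python
# Sketch of a "safety net" flow: when the bot can't answer, it walks an
# ordered list of alternatives (FAQ, video, phone, live agent) instead of
# ending with "Sorry, I don't know". All contact details are placeholders.

SAFETY_NET = [
    ("faq", "You might find this in our FAQ: https://example.com/faq"),
    ("video", "This video covers common fixes: https://example.com/videos"),
    ("phone", "You can call support on 0800 000 000 (8am-8pm)."),
    ("agent", "I'm connecting you to a live agent now."),
]

def next_fallback(attempt: int, agents_online: bool) -> str:
    """Offer the next untried alternative; skip live agents when offline."""
    options = [msg for channel, msg in SAFETY_NET
               if channel != "agent" or agents_online]
    if attempt < len(options):
        return options[attempt]
    return "Please try again later, or email support@example.com."
```

Each failed attempt advances to the next alternative, so the conversation only dead-ends after every available channel has been offered — the "exhaust every possible way to help" principle in code.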

At Campfire AI, the team develops these safety net flows as a standard part of their chatbot implementations. This represents a shift in thinking from “what can the bot do?” to “how can the bot always add value, even when it can’t solve the problem directly?”

Production Considerations and Limitations

While this case study provides valuable operational insights, it’s worth noting some limitations in its scope. The article focuses primarily on traditional NLP-based chatbots using intent classification and entity extraction rather than large language models (LLMs) in the modern sense. The techniques described—intent hierarchies, entity recognition, confidence scoring—are more characteristic of classical conversational AI systems than the generative AI approaches that have become prevalent since 2022-2023.

That said, many of the lessons remain applicable to LLM-based systems. The distinction between in-scope and out-of-scope queries, the importance of graceful fallback handling, and the need for iterative optimization based on production traffic are all relevant to deploying any conversational AI system at scale.

The case study also doesn’t provide specific metrics on business outcomes or user satisfaction improvements, which would strengthen the evidence for these recommendations. The author’s claim of building over 50 chatbots suggests significant experience, but without case-specific data, readers must take the general principles on faith.

Overall, this case study provides practical, field-tested guidance for teams deploying conversational AI systems in production environments, with particular emphasis on the operational and design patterns that distinguish successful implementations from the frustrating chatbot experiences that have given the technology a mixed reputation among consumers.
