Fiddler AI developed a documentation chatbot using OpenAI's GPT-3.5 and Retrieval-Augmented Generation (RAG) to help users find answers in their documentation. The project showcases practical implementation of LLMOps principles including continuous evaluation, monitoring of chatbot responses and user prompts, and iterative improvement of the knowledge base. Through this implementation, they identified and documented key lessons in areas like efficient tool selection, query processing, document management, and hallucination reduction.
Fiddler, an AI observability company, developed their own RAG-based documentation chatbot to help users easily find answers from their product documentation. This case study, presented as a guide sharing “10 Lessons” from their development experience, provides practical insights into deploying LLMs in production for conversational AI applications. The chatbot was built using OpenAI’s GPT-3.5 augmented with Retrieval-Augmented Generation (RAG), and was monitored using Fiddler’s own LLM Observability solutions. While this is a vendor-produced guide that naturally promotes their observability tools, it contains substantive technical lessons applicable to anyone building RAG-based chatbot systems.
The chatbot’s architecture combines the generative capabilities of GPT-3.5 with the precision of information retrieval through RAG. This approach allows the chatbot to access and integrate external knowledge sources (Fiddler’s documentation) to provide accurate and contextually relevant responses rather than relying solely on the LLM’s training data.
A significant portion of the development leveraged LangChain as the foundational framework. The case study describes LangChain as a “Swiss Army knife” for chatbot developers, providing essential functionality that would otherwise require extensive custom development.
The use of LangChain allowed developers to focus on refining chatbot capabilities rather than building infrastructure from scratch. The case study notes that single lines of LangChain code can replace significant amounts of manual coding for common functions.
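To make the "single line replaces manual coding" point concrete, the sketch below hand-rolls the retrieve-then-prompt plumbing that a one-line LangChain retrieval chain would otherwise handle. Everything here (the keyword retriever, the prompt wording, the sample docs) is illustrative, not Fiddler's actual code.

```python
# Hand-rolled retrieve -> stuff -> prompt loop; a LangChain retrieval
# chain collapses this into one call against a real vector store.

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Naive keyword-overlap retriever standing in for a vector store."""
    scored = sorted(
        docs,
        key=lambda d: -len(set(query.lower().split()) & set(d.lower().split())),
    )
    return scored[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Stuff the retrieved documentation chunks into the LLM prompt."""
    joined = "\n---\n".join(context)
    return f"Answer using only this documentation:\n{joined}\n\nQuestion: {query}"

docs = [
    "Fiddler dashboards visualize model metrics.",
    "Alerts notify you when data drift is detected.",
]
prompt = build_prompt("How do dashboards work?", retrieve("How do dashboards work?", docs))
```

In a real deployment the retriever would be backed by an embedding index and the prompt sent to GPT-3.5; the framework's value is that none of this glue has to be written or maintained by hand.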
One of the most substantial technical challenges discussed involves processing user queries effectively. Natural language presents significant complexity because users can ask about the same topic in myriad ways using diverse structures, synonyms, and colloquial expressions.
A particularly challenging issue arises from pronouns and referential expressions like “this,” “that,” or “the aforementioned point.” These work naturally in human conversations where both parties share contextual understanding, but require sophisticated processing for chatbots. The solution involves multi-layered query processing strategies.
LangChain’s question generator template enables processing questions with another LLM to resolve ambiguities before retrieval, though the case study notes that sometimes the original unprocessed query retrieves more relevant documents than processed versions.
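The "condense question" step can be sketched as follows. In practice an LLM rewrites the query given the chat history; the rule-based rewrite below is a deterministic stand-in for that LLM call, and the pronoun list and output format are assumptions for illustration.

```python
# Stand-in for an LLM-based question condenser: detect referential
# expressions and rewrite the query into a standalone question using
# the conversation history.

PRONOUNS = ("it", "them", "this", "that")

def condense_question(history: list[str], query: str) -> str:
    tokens = query.lower().strip("?!.").split()
    if history and any(p in tokens for p in PRONOUNS):
        last_topic = history[-1]  # a real LLM would infer the referent
        return f"{query} (referring to: {last_topic})"
    return query

standalone = condense_question(["Dashboards in Fiddler"], "How can I use them?")
```

Because the case study notes the original query sometimes retrieves better documents than the rewritten one, a reasonable design is to retrieve with both versions and merge the results.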
The case study dedicates significant attention to document management strategies, particularly addressing the context window limitations inherent to LLMs. Since models can only process limited amounts of text at once, strategic filtering of documents based on query relevance becomes essential.
The concept of “chunking” involves breaking large documents into smaller, manageable parts that fit within context window constraints. However, effective chunking requires more than simple division.
The case study provides a concrete example of a long document called “Customer Churn Prediction” that was chunked with “slugs” (presumably identifiers or connection markers) serving as connectors between chunks.
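A minimal chunker along these lines might look like the following. The chunk size, overlap, and slug format are illustrative assumptions; the idea is that overlapping text plus a per-chunk slug lets neighboring chunks be reconnected at retrieval time.

```python
# Split a long document into overlapping chunks, attaching a slug to
# each chunk so neighbors can be identified and re-joined later.

def chunk(doc_slug: str, text: str, size: int = 40, overlap: int = 10) -> list[dict]:
    chunks, start, idx = [], 0, 0
    while start < len(text):
        chunks.append({"slug": f"{doc_slug}#{idx}", "text": text[start:start + size]})
        start += size - overlap
        idx += 1
    return chunks

parts = chunk(
    "customer-churn-prediction",
    "Churn prediction models estimate which customers are likely to leave.",
)
```

Production systems typically split on semantic boundaries (headings, paragraphs) rather than fixed character counts, but the overlap-plus-identifier pattern is the same.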
An important note acknowledges that OpenAI’s developer day updates significantly increased context window length, potentially reducing the need for chunking in some scenarios. This observation highlights the rapidly evolving nature of LLMOps practices.
Rather than relying on a single search query, the chatbot employs multiple retrievals to find comprehensive and relevant information. This approach helps in scenarios where a single retrieval would miss relevant material.
Synthesizing information from multiple retrievals then requires additional strategies for merging what comes back.
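One simple synthesis strategy is to run several query variants, then merge and deduplicate the results before building the prompt. This sketch is an assumption about how such a step could work, not Fiddler's implementation; the query variants and document names are made up.

```python
# Run several query variants against a retriever, merging results and
# deduplicating while preserving first-seen order.

def multi_retrieve(queries, retriever, k: int = 3) -> list[str]:
    seen, merged = set(), []
    for q in queries:
        for doc in retriever(q):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged[:k]

# Toy retriever: a lookup table from query variant to document IDs.
corpus = {
    "set up alerts": ["alerts-doc", "monitoring-doc"],
    "configure notifications": ["notifications-doc", "alerts-doc"],
}
docs = multi_retrieve(corpus.keys(), lambda q: corpus[q])
```

More sophisticated variants re-rank the merged set by relevance score instead of first-seen order, but the dedup-and-cap structure is the core of the technique.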
The case study emphasizes that prompt engineering is not a one-time activity but an ongoing, iterative process. With new prompt building techniques emerging regularly, an approach tailored to domain-specific use cases becomes essential.
The case study includes examples showing how their prompts evolved over time, with additional instructions layered in to generate more accurate and helpful responses.
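That iterative layering can be captured by versioning the system prompt, as in the sketch below. The instruction texts are illustrative, not Fiddler's actual prompts; the point is that each revision appends guardrails discovered through evaluation.

```python
# Versioned system prompts: each revision layers new instructions onto
# the previous one as failure modes are discovered in production.

PROMPT_VERSIONS = {
    1: ["Answer questions using the provided documentation."],
    2: ["Answer questions using the provided documentation.",
        "If the documentation does not cover the question, say so instead of guessing."],
    3: ["Answer questions using the provided documentation.",
        "If the documentation does not cover the question, say so instead of guessing.",
        "Cite the section of the documentation your answer comes from."],
}

def system_prompt(version: int) -> str:
    return "\n".join(PROMPT_VERSIONS[version])
```

Keeping old versions around makes it possible to A/B compare revisions and roll back if a new instruction degrades answer quality.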
User feedback emerges as a critical component for continuous improvement. The case study makes an interesting observation about feedback patterns: users tend to provide more detailed feedback when dissatisfied, creating potential bias toward negative experiences while overlooking areas of good performance.
To address this challenge, the implementation includes multiple feedback mechanisms.
Making the feedback process intuitive and unobtrusive is emphasized as critical for encouraging participation.
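A lightweight feedback store along these lines is sketched below: a thumbs-up/down signal plus an optional free-text comment per response, with an aggregate negative rate to watch the bias toward detailed negative feedback the case study describes. The class and field names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class FeedbackLog:
    """Per-response feedback store (illustrative sketch)."""
    entries: list = field(default_factory=list)

    def record(self, response_id: str, thumbs_up: bool, comment: str = "") -> None:
        self.entries.append({"id": response_id, "up": thumbs_up, "comment": comment})

    def negative_rate(self) -> float:
        """Share of responses rated thumbs-down."""
        if not self.entries:
            return 0.0
        return sum(1 for e in self.entries if not e["up"]) / len(self.entries)

log = FeedbackLog()
log.record("r1", True)
log.record("r2", False, "answer cited the wrong doc")
```

Keeping the thumbs click one-tap and the comment optional is what makes the mechanism unobtrusive enough for users to actually use.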
Effective data management extends beyond merely storing queries and responses. The case study emphasizes storing embeddings—not just of queries and responses, but also of the documents used in generating responses.
This stored repository of embeddings enables innovative features like nudging users toward frequently asked questions or related topics, keeping them engaged and providing additional value.
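The FAQ-nudging idea can be sketched as a nearest-neighbor lookup over stored query embeddings. The bag-of-words "embedding" below is a deterministic stand-in for a real embedding model, and the FAQ entries are made up for illustration.

```python
import math

def embed(text: str) -> dict:
    """Toy bag-of-words vector standing in for a real embedding model."""
    vec: dict = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

faq = ["How do I create a dashboard?", "How do I set up alerts?"]
stored = [(q, embed(q)) for q in faq]  # embeddings persisted at index time

query_vec = embed("create a dashboard")
suggestion = max(stored, key=lambda p: cosine(query_vec, p[1]))[0]
```

With real embeddings the same structure surfaces "related questions" even when the user's wording shares no tokens with the stored FAQ.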
Dealing with hallucinations—instances where the chatbot generates incorrect or misleading information confidently—represents one of the most challenging aspects of the project. The case study describes a manual, iterative approach to mitigation:
A notable example involved the chatbot interpreting “LLM” as “local linear model” instead of “large language model” when asked about LLMs. This highlighted gaps in the knowledge base and context understanding. The solution involved adding “caveats”—additional clarifying information—to the knowledge repository. Enriching the documentation with detailed information about LLMs and LLMOps significantly improved response accuracy.
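The caveat mechanism can be sketched as disambiguating context injected ahead of the retrieved documents whenever an ambiguous term appears in the query. The caveat text and trigger logic below are illustrative assumptions, not Fiddler's actual repository entries.

```python
# Prepend clarifying "caveats" to the retrieval context so ambiguous
# acronyms resolve the way the knowledge base intends.

CAVEATS = {
    "llm": "In Fiddler documentation, LLM means 'large language model', "
           "not 'local linear model'.",
}

def with_caveats(query: str, context: list[str]) -> list[str]:
    extra = [text for term, text in CAVEATS.items() if term in query.lower()]
    return extra + context

ctx = with_caveats("What is an LLM?", ["Fiddler monitors LLM applications."])
```

The same effect can be achieved by adding the caveats as retrievable documents, which is closer to what the case study describes: enriching the knowledge repository itself rather than the serving code.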
The case study makes an important point that often gets overlooked in technical implementations: the user interface and experience play a pivotal role in building user trust, regardless of backend sophistication.
A key lesson involved shifting from static, block-format responses to streaming responses delivered in real-time. This seemingly small change—making it appear as if the chatbot is typing responses—created a more dynamic and engaging experience that significantly enhanced user perception of natural conversation and increased trust.
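Architecturally, the switch amounts to replacing a blocking call with a generator the UI can consume token by token. In the sketch below a canned response stands in for a real streaming LLM API; the function name and pacing parameter are illustrative.

```python
import time

def stream_response(text: str, delay: float = 0.0):
    """Yield a response one token at a time, as a streaming API would."""
    for token in text.split(" "):
        time.sleep(delay)  # with a real API, tokens arrive as generated
        yield token + " "

# The UI would render each yielded token immediately ("typing" effect);
# joining them reproduces the full response.
rendered = "".join(stream_response("Dashboards visualize model metrics."))
```

The total latency is unchanged, but time-to-first-token drops sharply, which is what drives the perceived responsiveness gain the case study reports.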
Additional UI/UX principles emphasized include simplicity and clarity of interface design, responsive design across devices, and personalization based on user preferences or past interactions.
Implementing conversational memory allows the chatbot to remember and summarize previous parts of conversations. This capability makes interactions more natural and adds personalization and context awareness crucial for user engagement.
A concrete example illustrates this: when a user asks about “Dashboards in Fiddler” and follows up with “how can I use them?”, the chatbot must understand that “them” refers to the previously mentioned dashboards. Without conversational memory, the chatbot would struggle with this reference.
The summarization capability extends this further, providing quick recaps in multi-turn conversations, especially valuable when conversations are resumed after breaks or when dealing with complex subjects.
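Buffer-plus-summary memory of this kind can be sketched as follows: recent turns are kept verbatim while older turns collapse into a running summary that is fed back into the prompt. The truncation-based "summarizer" is a deterministic stand-in for an LLM summarization call, and the class shape is an assumption.

```python
class Memory:
    """Keep recent turns verbatim; collapse older turns into a summary."""

    def __init__(self, keep_last: int = 2):
        self.turns: list[str] = []
        self.keep_last = keep_last

    def add(self, turn: str) -> None:
        self.turns.append(turn)

    def context(self) -> str:
        old = self.turns[:-self.keep_last]
        recent = self.turns[-self.keep_last:]
        # Truncation stands in for an LLM summarizing the older turns.
        summary = "Summary: " + "; ".join(t[:20] for t in old) if old else ""
        return "\n".join(filter(None, [summary] + recent))

mem = Memory()
for turn in ["User: What are Dashboards in Fiddler?",
             "Bot: Dashboards visualize model metrics.",
             "User: How can I use them?"]:
    mem.add(turn)
ctx = mem.context()
```

Feeding `ctx` into the prompt is what lets the model resolve "them" to the dashboards mentioned earlier, and the summary line is what keeps long or resumed conversations within the context window.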
While this case study provides substantial practical value and technical depth, it should be noted that it originates from Fiddler, a company selling LLM observability solutions. The case study naturally positions their observability tools as essential for this type of development without providing comparative analysis with alternative approaches. The lessons shared appear technically sound and practically applicable, but readers should be aware of the promotional context while extracting value from the specific implementation details and patterns described.
The case study also acknowledges the rapidly evolving nature of the field, particularly noting how OpenAI’s context window expansions may have changed the relevance of certain strategies like aggressive document chunking since the original development work was completed.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Notion AI, serving over 100 million users with multiple AI features including meeting notes, enterprise search, and deep research tools, demonstrates how rigorous evaluation and observability practices are essential for scaling AI product development. The company uses Braintrust as their evaluation platform to manage the complexity of supporting multilingual workspaces, rapid model switching, and maintaining product polish while building at the pace of AI industry innovation. Their approach emphasizes that 90% of AI development time should be spent on evaluation and observability rather than prompting, with specialized data specialists creating targeted datasets and custom LLM-as-a-judge scoring functions to ensure consistent quality across their diverse AI product suite.
Toyota Motor North America (TMNA) and Toyota Connected built a generative AI platform to help dealership sales staff and customers access accurate vehicle information in real-time. The problem was that customers often arrived at dealerships highly informed from internet research, while sales staff lacked quick access to detailed vehicle specifications, trim options, and pricing. The solution evolved from a custom RAG-based system (v1) using Amazon Bedrock, SageMaker, and OpenSearch to retrieve information from official Toyota data sources, to a planned agentic platform (v2) using Amazon Bedrock AgentCore with Strands agents and MCP servers. The v1 system achieved over 7,000 interactions per month across Toyota's dealer network, with citation-backed responses and legal compliance built in, while v2 aims to enable more dynamic actions like checking local vehicle availability.