ZenML

Tuning RAG Search for Production Customer Support Chatbot

Elastic 2024

Elastic's Field Engineering team developed and improved a customer support chatbot using RAG and LLMs. They faced challenges with search relevance, particularly around CVE and version-specific queries, and implemented solutions including hybrid search strategies, AI-generated summaries, and query optimization techniques. Their improvements resulted in a 78% increase in search relevance for top-3 results and generated over 300,000 AI summaries for future applications.

Industry

Tech

Overview

This case study from Elastic details how their Field Engineering team iteratively improved the relevance of a RAG (Retrieval-Augmented Generation) powered customer support chatbot called the “Support AI Assistant.” The article is part of a multi-part blog series and focuses specifically on the search tuning aspects of their RAG implementation, demonstrating the critical importance of retrieval quality in production LLM applications. While the content originates from Elastic’s own engineering blog and naturally showcases their Elasticsearch product, the technical details provide genuine insights into real-world RAG optimization challenges.

The team started with a functional but imperfect system and, through systematic analysis of user feedback and query patterns, identified specific failure modes that they then addressed through targeted technical solutions. Their approach emphasizes that RAG tuning is fundamentally a search problem requiring both high-quality data and accurate retrieval; without both elements working together, LLMs risk hallucinating and generating misleading responses that erode user trust.

Initial Architecture and Data Foundation

The system was built on a foundation of over 300,000 documents including Technical Support Knowledge Articles, Product Documentation, and blog content crawled from Elastic’s website. All of this data was stored and indexed in Elasticsearch, leveraging the platform’s native capabilities for both traditional keyword search and semantic search.

For querying, the team implemented a hybrid search strategy combining BM25 (keyword-based search) with ELSER (Elastic Learned Sparse EncodeR) for semantic search. The semantic component used text_expansion queries against both title and summary embeddings, while the keyword component employed cross_fields matching with minimum_should_match parameters tuned for longer queries. Phrase matches received higher boosts as they typically signal greater relevance.
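The article describes this hybrid strategy but does not reproduce the query itself. It can be sketched as an Elasticsearch query body built in Python; the embedding field names, the ELSER model id, and the minimum_should_match value below are assumptions, not Elastic's actual configuration:

```python
def build_hybrid_query(user_query: str) -> dict:
    """Sketch of a hybrid BM25 + ELSER query body (illustrative field names)."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Semantic: ELSER text_expansion against title and summary embeddings
                    {"text_expansion": {
                        "ml.inference.title_expanded": {
                            "model_id": ".elser_model_1",
                            "model_text": user_query}}},
                    {"text_expansion": {
                        "ml.inference.summary_expanded": {
                            "model_id": ".elser_model_1",
                            "model_text": user_query}}},
                    # Keyword: BM25 across fields, requiring most terms on longer queries
                    {"multi_match": {
                        "query": user_query,
                        "type": "cross_fields",
                        "fields": ["title", "summary", "body"],
                        "minimum_should_match": "3<75%"}},
                    # Phrase matches usually signal higher relevance, so boost them
                    {"multi_match": {
                        "query": user_query,
                        "type": "phrase",
                        "fields": ["title^2", "summary"],
                        "boost": 2}},
                ]
            }
        },
        "size": 3,  # only the top 3 hits are passed to the LLM
    }
```

The `"3<75%"` setting means short queries must match all terms while longer ones may drop a quarter of them, which is one conventional way to tune minimum_should_match for query length.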

The generation pipeline was straightforward: after search, the system built a system prompt with instructions and the top 3 search results as context, then fed the conversation alongside the context into GPT-4 via Azure OpenAI (using Provisioned Throughput Units). The limitation to only 3 results was driven by token constraints in their dedicated deployment combined with a large user base.
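A minimal sketch of that assembly step, assuming hypothetical shapes for search hits and chat messages (the instruction text is illustrative, not Elastic's actual prompt):

```python
def build_prompt(results: list[dict], conversation: list[dict]) -> list[dict]:
    """Combine the top 3 search hits and the conversation into chat messages."""
    # Only the top 3 hits fit within the dedicated deployment's token budget
    context = "\n\n".join(
        f"[{hit['title']}]\n{hit['summary']}" for hit in results[:3]
    )
    system = (
        "You are the Support AI Assistant. Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n" + context
    )
    return [{"role": "system", "content": system}, *conversation]
```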

For feedback collection, the team used a third-party tool to capture client-side events, storing data in BigQuery for analysis. This was supplemented by direct qualitative feedback from internal users, which proved instrumental in identifying relevance issues.

Challenge 1: CVE (Common Vulnerabilities and Exposures) Queries

One of the first significant issues identified was poor performance on CVE-related queries. Customers frequently encounter security alerts about CVEs and naturally ask questions like “Does CVE-2016-1837 or CVE-2019-11756 affect Elastic products?” Despite having dedicated, well-maintained CVE Knowledge Articles, the search results were returning only one relevant hit out of three, leaving the LLM without sufficient context to generate proper answers.

The solution leveraged a key insight: users often include near-exact CVE codes in their queries, and these codes typically appear in the article titles. The team implemented conditional boosting of title field matches for such queries. By detecting CVE patterns and significantly increasing the weight of title matches when they occur, they improved precision dramatically for this query type.
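The detection side of this can be sketched with a regex check; the boost values below are illustrative assumptions, not the figures Elastic used:

```python
import re

# CVE identifiers: "CVE-", a 4-digit year, then a 4-to-7-digit sequence number
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,7}\b", re.IGNORECASE)

def title_boost_for(query: str, base_boost: float = 1.0,
                    cve_boost: float = 10.0) -> float:
    """Boost title-field matches heavily when the query contains a CVE code,
    since CVE Knowledge Articles carry the code in their titles."""
    return cve_boost if CVE_PATTERN.search(query) else base_boost
```

The returned value would then be applied to the title clause of the hybrid query, leaving non-CVE queries untouched.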

This solution demonstrates an important LLMOps principle: while semantic search is powerful, understanding and leveraging user intent patterns can provide simple but highly effective optimizations for specific query types.

Challenge 2: Product Version Queries

A more complex challenge emerged with queries involving specific product versions. Users frequently asked about features, migration guides, or version comparisons, but the search results were returning irrelevant content. A query about Elasticsearch 8.14.2 would return results about Apache Hadoop version 8.14.1, APM version 8.14, and other unrelated content.

Investigation revealed three distinct problems:

Inaccurate Semantic Matching: The root cause was that the summary field for Product Documentation articles was simply the first few characters of the document body, generated during the crawling process. Because the summary merely duplicated the opening of the body, the resulting vector embeddings poorly represented the actual document content. To solve this, the team developed a custom AI Enrichment Service using GPT-4 to generate four new fields: a proper summary, category classification, tags, and topics. This process created over 300,000 AI-generated summaries that enriched the underlying indices and enabled much better semantic matching.

Multiple Versions of Same Articles: The top 3 results were often cluttered with different versions of the same article (e.g., documentation for versions 8.14, 8.13, 8.12), reducing diversity in the context provided to the LLM. The team addressed this using Elasticsearch’s collapse parameter with a computed slug field that identifies different versions of the same article, ensuring only the top-scored version appears in results.

Wrong Versions Being Returned: Even when the right article was retrieved, it wasn’t always the correct version. The team implemented version extraction from user queries using regex patterns, then added conditional boosting of the version field to prioritize results matching the requested version.

Data Enrichment Service

The AI Enrichment Service deserves special attention as it represents a significant investment in data quality. The team built this internal tool for several reasons: they had unused PTU (Provisioned Throughput Units) resources available in Azure OpenAI, the data quality gap was their greatest relevance detractor, and they wanted full customization for experimentation.

The service generated four fields for each document: a proper summary, a category classification, tags, and topics.

These fields were ingested into a new index and made available to the target indices through Enrich Processors. The AI-generated embeddings were stored under ai_fields.ml.inference, providing dramatically improved semantic matching capabilities. Beyond the immediate relevance improvements, this created a valuable data asset of 300,000+ AI-generated summaries for potential future applications.
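As a rough sketch of how such an enrich setup could look, expressed as the request bodies for an enrich policy and an ingest pipeline (the policy, index, and match-field names here are illustrative assumptions, not Elastic's actual configuration):

```python
# Enrich policy: exposes the AI-generated fields to the target indices by
# matching enrichment docs to source docs on a shared key.
enrich_policy = {
    "match": {
        "indices": "ai-enriched-fields",      # index holding the GPT-4 output
        "match_field": "slug",                # join key between the two indices
        "enrich_fields": ["ai_fields"],       # summary, category, tags, topics
    }
}

# Ingest pipeline applying the policy; enriched data lands under ai_fields,
# with the ELSER embeddings stored at ai_fields.ml.inference.
ingest_pipeline = {
    "processors": [
        {"enrich": {
            "policy_name": "ai-fields-policy",
            "field": "slug",
            "target_field": "ai_fields",
        }}
    ]
}
```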

Evaluation Framework

The team implemented a systematic evaluation approach using Elasticsearch’s Ranking Evaluation API with Precision@K (P@K) metrics, where K=3 to match the number of results fed to the LLM. They compiled a test suite of 12 queries spanning diverse topics (CVE queries, version comparisons, feature configuration, API usage) with curated lists of relevant results for each.

A TypeScript/Node.js script automated the evaluation process, running both the “before” and “after” query versions against their development Elasticsearch instance and outputting P@K values for each query. This enabled quantitative measurement of improvements across their test suite.
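Precision@K itself is simple to compute; a minimal sketch follows (the team's actual script drove Elasticsearch's Ranking Evaluation API rather than hand-rolling the metric):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 3) -> float:
    """Fraction of the top-k retrieved documents found in the curated relevant set."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)

def average_improvement(before: list[float], after: list[float]) -> float:
    """Mean relative P@K improvement across the test suite, in percent."""
    gains = [(a - b) / b * 100 for b, a in zip(before, after) if b > 0]
    return sum(gains) / len(gains)
```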

Results and Metrics

The evaluation showed significant improvements across multiple query types.

The overall average P@K improvement across the test suite was 78.41%, the source of the headline figure of roughly 78%. Importantly, no queries regressed: some remained stable while others improved significantly.

The team acknowledged that some queries (Proxy Certificates Rotation, Enrich Processor Setup) showed no improvement, highlighting that relevance optimization is an ongoing effort requiring continuous iteration and expansion of test cases.

Future Directions

The article outlines three areas for continued improvement:

Data Curation: The team observed that semantically close search candidates often led to less effective responses, and some crawled pages generate noise rather than value. Future work involves curating and enhancing the knowledge base to make it leaner and more effective.

Conversational Context in RAG: The current system doesn’t handle follow-up questions well. A query like “What about Windows?” following a Linux configuration question should be reformulated to include the full conversational context before searching.
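One common approach here is to have the LLM rewrite the follow-up into a standalone query before retrieval. A sketch of building such a rewrite prompt (the template text is an illustrative assumption, not Elastic's):

```python
def build_rewrite_prompt(history: list[dict], follow_up: str) -> str:
    """Construct a prompt asking an LLM to make a follow-up question standalone."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    return (
        "Rewrite the user's follow-up question as a standalone search query, "
        "carrying over any product names, topics, or versions from the "
        "conversation.\n\n"
        f"Conversation:\n{transcript}\n\n"
        f"Follow-up: {follow_up}\n\n"
        "Standalone query:"
    )
```

The LLM's completion (e.g. a query mentioning both the original topic and the new platform) would then be sent to search in place of the raw follow-up.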

Conditional Context Inclusion: By extracting semantic meaning from user questions, the system could conditionally include only relevant pieces of data as context, saving token limits and potentially reducing round trips to external services.

Critical Assessment

While this case study provides valuable technical insights, it’s worth noting some limitations. The evaluation was performed on a relatively small test suite of 12 queries, which may not be representative of the full breadth of user queries. The metrics focus exclusively on retrieval precision and don’t directly measure end-user satisfaction or LLM response quality. Additionally, as a vendor blog post, there’s inherent bias toward showcasing Elasticsearch’s capabilities.

That said, the case study demonstrates solid LLMOps practices: systematic identification of failure modes through user feedback analysis, targeted technical solutions for specific query patterns, investment in data quality as a foundation for retrieval quality, and quantitative evaluation of improvements. The approach of treating RAG optimization as fundamentally a search problem, rather than focusing solely on the LLM component, provides a useful framework for practitioners building similar systems.

More Like This

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.


Large-Scale Personalization and Product Knowledge Graph Enhancement Through LLM Integration

DoorDash 2025

DoorDash faced challenges in scaling personalization and maintaining product catalogs as they expanded beyond restaurants into new verticals like grocery, retail, and convenience stores, dealing with millions of SKUs and cold-start scenarios for new customers and products. They implemented a layered approach combining traditional machine learning with fine-tuned LLMs, RAG systems, and LLM agents to automate product knowledge graph construction, enable contextual personalization, and provide recommendations even without historical user interaction data. The solution resulted in faster, more cost-effective catalog processing, improved personalization for cold-start scenarios, and the foundation for future agentic shopping experiences that can adapt to real-time contexts like emergency situations.


Building a Production RAG-based Customer Support Assistant with Elasticsearch

Elastic 2024

Elastic's Field Engineering team developed a customer support chatbot using RAG instead of fine-tuning, leveraging Elasticsearch for document storage and retrieval. They created a knowledge library of over 300,000 documents from technical support articles, product documentation, and blogs, enriched with AI-generated summaries and embeddings using ELSER. The system uses hybrid search combining semantic and BM25 approaches to provide relevant context to the LLM, resulting in more accurate and trustworthy responses.
