AWS GenAIIC shares comprehensive lessons learned from implementing Retrieval-Augmented Generation (RAG) systems across multiple industries. The case study covers key challenges in RAG implementation and provides detailed solutions for improving retrieval accuracy, managing context, and ensuring response reliability. Solutions include hybrid search techniques, metadata filtering, query rewriting, and advanced prompting strategies to reduce hallucinations.
This case study comes from the AWS Generative AI Innovation Center (GenAIIC), a team of AWS science and strategy experts who help customers build proofs of concept using generative AI. Since its inception in May 2023, the team has observed significant demand for chatbots capable of extracting information and generating insights from large, heterogeneous knowledge bases. This document represents a distillation of lessons learned from building RAG solutions across diverse industries, making it particularly valuable for understanding real-world LLMOps challenges.
The case study is notably comprehensive and technically detailed, covering the entire RAG pipeline from document ingestion to answer generation, with particular emphasis on production-ready optimization techniques. While the content originates from AWS and naturally promotes AWS services like Amazon Bedrock, the techniques and insights are broadly applicable and represent genuine field experience rather than pure marketing material.
The document provides a clear breakdown of the RAG (Retrieval-Augmented Generation) architecture into three core components: retrieval, augmentation, and generation. The retrieval component takes a user question and fetches relevant information from a knowledge base (typically an OpenSearch index). The augmentation phase adds this retrieved information to the FM (Foundation Model) prompt alongside the user query. Finally, the generation step produces an answer using the augmented context.
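The three stages can be sketched in a few lines of Python. This is a minimal illustration, not AWS's implementation: `retrieve` stands in for an OpenSearch query (keyword overlap is used as a placeholder scoring function) and `generate` stands in for an Amazon Bedrock model invocation.

```python
def retrieve(question: str, knowledge_base: list[str], top_k: int = 2) -> list[str]:
    """Retrieval: fetch the chunks most relevant to the question.
    Placeholder scoring: count of shared words with the question."""
    scored = sorted(
        knowledge_base,
        key=lambda chunk: len(set(question.lower().split()) & set(chunk.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def augment(question: str, chunks: list[str]) -> str:
    """Augmentation: add the retrieved chunks to the FM prompt alongside the user query."""
    context = "\n".join(f"<chunk>{c}</chunk>" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

def generate(prompt: str) -> str:
    """Generation: in production this would be an FM call (e.g. via Amazon Bedrock); stubbed here."""
    return f"[FM answer based on prompt of {len(prompt)} chars]"
```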
A key insight emphasized throughout is that “a RAG is only as good as its retriever.” The team found that when RAG systems perform poorly, the retrieval component is almost always the culprit. This observation drives much of the optimization focus in the document.
The document describes two main retrieval failure modes encountered in production:
The ingestion pipeline involves chunking documents into manageable pieces and transforming each chunk into a high-dimensional vector using an embedding model such as Amazon Titan. These embeddings have the property that semantically similar text chunks have vectors that are close in cosine or Euclidean distance.
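The distance property can be demonstrated with cosine similarity on toy vectors. Real embeddings have hundreds or thousands of dimensions; the 3-dimensional vectors below are illustrative values only.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: near 1.0 for semantically similar chunks, lower for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for embeddings of three text chunks.
v_viscosity = [0.9, 0.1, 0.2]   # "viscosity of product XYZ"
v_thickness = [0.85, 0.15, 0.25]  # semantically close -> high similarity
v_weather = [0.1, 0.9, 0.1]     # unrelated topic -> low similarity

assert cosine_similarity(v_viscosity, v_thickness) > cosine_similarity(v_viscosity, v_weather)
```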
The implementation uses OpenSearch Serverless as the vector store, with custom chunking and ingestion functions. The document emphasizes that vectors must be stored alongside their corresponding text chunks so that when relevant vectors are identified during search, the actual text can be retrieved and passed to the FM prompt.
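A hypothetical index mapping makes the "vector stored alongside its text" point concrete. The field names (`vector`, `chunk_text`, `doc_title`, `chunk_number`) and the 1,024-dimension setting are assumptions for illustration, not the actual schema from the case study.

```python
# Sketch of an OpenSearch index body: the knn_vector field sits next to the
# chunk text and metadata, so a k-NN hit directly yields the text to place
# in the FM prompt.
index_body = {
    "settings": {"index.knn": True},
    "mappings": {
        "properties": {
            "vector": {"type": "knn_vector", "dimension": 1024},
            "chunk_text": {"type": "text"},
            "doc_title": {"type": "keyword"},
            "chunk_number": {"type": "integer"},
        }
    },
}
# In production this would be passed to the client, e.g.:
# OpenSearch(...).indices.create(index="rag-chunks", body=index_body)
```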
An important operational consideration is consistency between ingestion and retrieval: the same embedding model must be used at both ingestion time and search time for semantic search to work correctly.
The document acknowledges that evaluating RAG systems is “still an open problem,” which reflects a genuine challenge in the LLMOps space. The evaluation framework described includes:
Retrieval Metrics:
The document notes that if documents are chunked, metrics must be computed at the chunk level, requiring ground truth in the form of question and relevant chunk pairs.
Generation Evaluation: The team describes two main approaches for evaluating generated responses:
The recommendation is to use FM-based evaluation for rapid iteration but rely on human evaluation for final assessment before deployment. This balanced approach acknowledges both the practical need for automated evaluation and the limitations of current LLM-based judging.
Several evaluation frameworks are mentioned: Ragas, LlamaIndex, and RefChecker (an Amazon Science library for fine-grained hallucination detection).
The document advocates for combining vector search with keyword search to handle domain-specific terms, abbreviations, and product names that embedding models may struggle with. The implementation combines k-nearest neighbors (k-NN) queries with keyword matching in OpenSearch, with adjustable weights for semantic versus keyword components. The example code shows how to structure these queries with function_score clauses.
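A hybrid query along these lines can be sketched as a plain query body. The field names and the exact clause layout are assumptions (the document only states that `function_score` clauses carry adjustable weights); this is one common way to combine a k-NN clause with a keyword clause in a `bool`/`should`.

```python
def hybrid_query(query_text: str, query_vector: list[float],
                 semantic_weight: float = 0.7, keyword_weight: float = 0.3,
                 k: int = 10) -> dict:
    """Build a hybrid OpenSearch query body: a weighted k-NN (semantic)
    clause plus a weighted match (keyword) clause."""
    return {
        "size": k,
        "query": {
            "bool": {
                "should": [
                    {"function_score": {
                        "query": {"knn": {"vector": {"vector": query_vector, "k": k}}},
                        "weight": semantic_weight,
                    }},
                    {"function_score": {
                        "query": {"match": {"chunk_text": query_text}},
                        "weight": keyword_weight,
                    }},
                ]
            }
        },
    }
```

Raising `keyword_weight` favors exact terms like product names; raising `semantic_weight` favors paraphrases of the question.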
A concrete use case described involves a manufacturer’s product specification chatbot, where queries like “What is the viscosity of product XYZ?” need to match both the product name (via keywords) and the semantic concept (via embeddings).
When product specifications span multiple pages and the product name appears only in the header, chunks without the product name become difficult to retrieve. The solution is to prepend document metadata (like title or product name) to each chunk during ingestion, improving both keyword and semantic matching.
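The metadata-prepending step is simple to sketch. The fixed-size word chunker below is a stand-in for whatever chunking strategy is in use; the point is only that every chunk begins with the document title.

```python
def chunk_with_metadata(doc_title: str, document_text: str, chunk_size: int = 400) -> list[str]:
    """Split a document into fixed-size word chunks, prepending the title
    (e.g. the product name from the page header) to each chunk so chunks
    from later pages remain matchable by keyword and by embedding."""
    words = document_text.split()
    chunks = []
    for i in range(0, len(words), chunk_size):
        body = " ".join(words[i:i + chunk_size])
        chunks.append(f"{doc_title}\n{body}")
    return chunks
```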
For more structured filtering, the document describes using OpenSearch metadata fields with match_phrase clauses to ensure exact product name matching. This requires extracting metadata at ingestion time, potentially using an FM to extract structured information from unstructured documents.
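A filtered query of this kind might look as follows; the `product_name` metadata field is a hypothetical name, and the `filter` clause keeps the `match_phrase` condition from affecting relevance scores.

```python
def filtered_semantic_query(query_vector: list[float], product_name: str, k: int = 5) -> dict:
    """Sketch: restrict k-NN search to chunks whose product_name metadata
    field matches the name extracted from the user query exactly."""
    return {
        "size": k,
        "query": {
            "bool": {
                "must": [{"knn": {"vector": {"vector": query_vector, "k": k}}}],
                "filter": [{"match_phrase": {"product_name": product_name}}],
            }
        },
    }
```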
Query rewriting uses an FM (the document recommends smaller models such as Claude Haiku for latency reasons) to transform user queries into better search queries. The technique can:
The example shows a prompt that outputs structured JSON with rewritten_query, keywords, and product_name fields.
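Consuming that structured output needs defensive parsing, since FMs occasionally emit malformed JSON. A minimal sketch, with a hypothetical FM response hard-coded in place of a real model call:

```python
import json

# Hypothetical output from the query-rewriting prompt described above.
fm_output = '{"rewritten_query": "viscosity specification", "keywords": ["viscosity", "XYZ"], "product_name": "XYZ"}'

def parse_rewrite(raw: str) -> dict:
    """Parse the rewriter's JSON; fall back to using the raw text as the
    search query if the FM output is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"rewritten_query": raw.strip(), "keywords": [], "product_name": None}
```

The parsed `keywords` and `product_name` fields can then feed the keyword clause and metadata filter of the hybrid query, while `rewritten_query` is embedded for the semantic clause.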
This technique addresses the context fragmentation problem where relevant information spans multiple chunks. After retrieving the most relevant chunks through semantic or hybrid search, adjacent chunks are also retrieved based on chunk number metadata and merged before being passed to the FM.
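The neighbor-expansion step can be sketched directly from the chunk-number metadata. Here `all_chunks` is a hypothetical mapping from chunk number to chunk text for one document:

```python
def expand_with_neighbors(hit_chunk_numbers: list[int],
                          all_chunks: dict[int, str],
                          window: int = 1) -> str:
    """Given the chunk numbers returned by search, also pull each hit's
    neighbors (chunk_number +/- window) and merge everything in document
    order, deduplicating overlaps."""
    wanted = set()
    for n in hit_chunk_numbers:
        for offset in range(-window, window + 1):
            if n + offset in all_chunks:
                wanted.add(n + offset)
    return "\n".join(all_chunks[n] for n in sorted(wanted))
```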
A more sophisticated variant is hierarchical chunking, where child chunks are linked to parent chunks, and retrieval returns the parent chunks even when child chunks match. Amazon Bedrock Knowledge Bases supports this capability.
For structured documents, using section delimiters (from HTML or Markdown) to determine chunk boundaries creates more coherent chunks. This is particularly useful for how-to guides, maintenance manuals, and documents requiring broad context for answers.
The document notes an important practical consideration: section-based chunking creates variable-sized chunks that may exceed the context window of some embedding models (Cohere Embed is limited to 512 tokens), making Amazon Titan Text Embeddings (8,192-token context) more suitable.
As a last resort when other optimizations fail, the document describes fine-tuning custom embeddings using the FlagEmbedding library. This requires gathering positive question-document pairs, generating hard negatives (documents that seem relevant but aren’t), and fine-tuning on these pairs. The fine-tuned model should be combined with a pre-trained model to avoid overfitting.
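Preparing the training data is the bulk of the work here. The sketch below writes question/positive/hard-negative records to JSONL; the `query`/`pos`/`neg` field names follow the pair format commonly used with FlagEmbedding's fine-tuning scripts (an assumption worth checking against the library's current documentation), and the record contents are invented examples.

```python
import json
import os
import tempfile

# One record per question: relevant chunks ("pos") and hard negatives
# ("neg") -- chunks that look relevant but are not.
pairs = [
    {
        "query": "What is the viscosity of product XYZ?",
        "pos": ["Product XYZ specification: viscosity 350 cP at 25 C."],
        "neg": ["Product ABC specification: viscosity 120 cP at 25 C."],  # hard negative
    }
]

path = os.path.join(tempfile.gettempdir(), "finetune_pairs.jsonl")
with open(path, "w") as f:
    for record in pairs:
        f.write(json.dumps(record) + "\n")
```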
The document provides practical prompt engineering techniques:
An example from a sports scouting chatbot illustrates how without guardrails, the FM might fill gaps using its training knowledge or fabricate information about well-known players.
A more sophisticated approach involves prompting the FM to output supporting quotations, either in structured form (with <scratchpad> and <answer> tags) or as inline citations with quotation marks. This has dual benefits: it guides the FM to ground its responses in source material, and it enables programmatic verification of citations.
The document provides detailed prompts for both approaches and notes tradeoffs: structured quotes work well for simple queries but may not capture enough information for complex questions; inline quotes integrate better with comprehensive summaries but are harder to verify.
The case study describes implementing Python scripts to verify that quotations actually appear in referenced documents. However, it acknowledges limitations: FM-generated quotes may have minor formatting differences (removed punctuation, corrected spelling) that cause false negatives in verification. The recommended UX approach displays verification status to users and suggests manual checking when quotes aren’t found verbatim.
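A verification script of this kind typically normalizes both sides before matching, precisely so that removed punctuation or corrected casing does not trigger false negatives. A minimal sketch (not the team's actual script):

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace so minor FM edits
    (dropped commas, corrected casing) don't break the match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def verify_quote(quote: str, source_document: str) -> bool:
    """True if the normalized quote appears verbatim in the normalized source."""
    return normalize(quote) in normalize(source_document)
```

When `verify_quote` returns False, the UX described in the case study would flag the citation as unverified and suggest a manual check rather than silently discarding it.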
The document describes a fully custom RAG implementation using:
While Amazon Bedrock Knowledge Bases provides a managed alternative (with the retrieve_and_generate and retrieve APIs), the custom approach enables more control over the pipeline and allows implementation of advanced techniques like query rewriting and metadata filtering.
The document describes several production use cases across industries:
This case study provides substantial practical value for LLMOps practitioners, offering specific code examples and detailed implementation guidance. The techniques described address real production challenges based on actual customer deployments. However, as AWS content, it naturally emphasizes AWS services and may not adequately cover alternative implementations or potential limitations of the AWS-centric approach. The evaluation section acknowledges the immaturity of RAG evaluation methodologies, which is honest but also highlights an area where practitioners need to develop their own robust testing frameworks rather than relying solely on the mentioned libraries.
Swisscom, Switzerland's leading telecommunications provider, developed a Network Assistant using Amazon Bedrock to address the challenge of network engineers spending over 10% of their time manually gathering and analyzing data from multiple sources. The solution implements a multi-agent RAG architecture with specialized agents for documentation management and calculations, combined with an ETL pipeline using AWS services. The system is projected to reduce routine data retrieval and analysis time by 10%, saving approximately 200 hours per engineer annually while maintaining strict data security and sovereignty requirements for the telecommunications sector.
Toyota Motor North America (TMNA) and Toyota Connected built a generative AI platform to help dealership sales staff and customers access accurate vehicle information in real-time. The problem was that customers often arrived at dealerships highly informed from internet research, while sales staff lacked quick access to detailed vehicle specifications, trim options, and pricing. The solution evolved from a custom RAG-based system (v1) using Amazon Bedrock, SageMaker, and OpenSearch to retrieve information from official Toyota data sources, to a planned agentic platform (v2) using Amazon Bedrock AgentCore with Strands agents and MCP servers. The v1 system achieved over 7,000 interactions per month across Toyota's dealer network, with citation-backed responses and legal compliance built in, while v2 aims to enable more dynamic actions like checking local vehicle availability.
Octus, a leading provider of credit market data and analytics, migrated their flagship generative AI product Credit AI from a multi-cloud architecture (OpenAI on Azure and other services on AWS) to a unified AWS architecture using Amazon Bedrock. The migration addressed challenges in scalability, cost, latency, and operational complexity associated with running a production RAG application across multiple clouds. By leveraging Amazon Bedrock's managed services for embeddings, knowledge bases, and LLM inference, along with supporting AWS services like Lambda, S3, OpenSearch, and Textract, Octus achieved a 78% reduction in infrastructure costs, 87% decrease in cost per question, improved document sync times from hours to minutes, and better development velocity while maintaining SOC2 compliance and serving thousands of concurrent users across financial services clients.