Manulife implemented a Retrieval Augmented Generation (RAG) system in their call center to help customer service representatives quickly access and utilize information from both structured and unstructured data sources. They developed an innovative approach combining document chunks and structured data embeddings, achieving an optimized response time of 7.33 seconds in production. The system successfully handles both policy documents and database information, using GPT-3.5 for answer generation with additional validation from Llama 3 or GPT-4.
This case study presents a comprehensive examination of how Manulife, a major financial institution, implemented a production-grade RAG system to enhance their call center operations. The implementation showcases a sophisticated approach to handling both structured and unstructured data sources while maintaining high performance and accuracy standards in a production environment.
The core business challenge was that customer service representatives (CSRs) had to navigate multiple data sources, including unstructured policy documents and structured database information, which often contained overlapping information. This complexity made it difficult for CSRs to respond promptly to customer queries. The solution needed to efficiently handle decades' worth of business data spread across various sources.
Technical Implementation:
The system architecture consists of four major components: ingestion and indexing, inference, monitoring, and user interface. Here's a detailed breakdown of the technical implementation:
Data Ingestion and Vectorization:
* Unstructured data (PDF documents) is chunked into 400-token segments with overlap between adjacent chunks (see the chunk-and-embed sketch after this list)
* The text-embedding-ada-002 model converts each chunk into a 1536-dimensional vector
* Vectors are stored in Azure AI Search via its SDK
* Structured data undergoes an innovative preprocessing approach (sketched after this list):
  * Database tables are de-normalized and aggregated by business concept
  * Processing reduces the tables from 50 million rows to 4.5 million rows
  * Each row is converted to a JSON string with column headers as keys and cell contents as values
  * These JSON strings are then vectorized and stored in the vector database
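To make the unstructured path concrete, here is a minimal sketch of the chunk-and-embed step under stated assumptions: the 50-token overlap, the endpoint placeholders, and the helper names are illustrative, since the case study specifies only 400-token chunks with overlap, the text-embedding-ada-002 model, and Azure AI Search as the store.

```python
import tiktoken
from openai import AzureOpenAI

# cl100k_base is the tokenizer used by text-embedding-ada-002.
encoding = tiktoken.get_encoding("cl100k_base")

def chunk_text(text: str, chunk_tokens: int = 400, overlap_tokens: int = 50) -> list[str]:
    """Split a document into ~400-token chunks; the 50-token overlap is an
    illustrative choice (the case study says only that chunks overlap)."""
    tokens = encoding.encode(text)
    chunks, step = [], chunk_tokens - overlap_tokens
    for start in range(0, len(tokens), step):
        chunks.append(encoding.decode(tokens[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks

# Hypothetical Azure OpenAI configuration; endpoint and key are placeholders.
openai_client = AzureOpenAI(azure_endpoint="https://<resource>.openai.azure.com",
                            api_key="<api-key>", api_version="2024-02-01")

def embed(texts: list[str]) -> list[list[float]]:
    """Embed text into 1536-dimensional vectors with text-embedding-ada-002."""
    response = openai_client.embeddings.create(model="text-embedding-ada-002",
                                               input=texts)
    return [item.embedding for item in response.data]
```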
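The structured path can be sketched the same way: each de-normalized row becomes a JSON string with column headers as keys, which is embedded and uploaded alongside its vector. The index and field names are assumptions, and `embed` is the helper from the sketch above.

```python
import json
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Hypothetical index name, endpoint, and key.
structured_client = SearchClient(endpoint="https://<search-service>.search.windows.net",
                                 index_name="structured-policy-data",
                                 credential=AzureKeyCredential("<admin-key>"))

def index_rows(headers: list[str], rows: list[list[str]]) -> None:
    """Serialize each row to JSON (headers as keys, cells as values),
    embed the string, and store it alongside the vector."""
    docs = []
    for i, row in enumerate(rows):
        content = json.dumps(dict(zip(headers, row)))
        docs.append({
            "id": str(i),                      # assumed document key
            "content": content,                # the JSON string itself
            "embedding": embed([content])[0],  # batched in practice
        })
    structured_client.upload_documents(documents=docs)
```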
Inference Pipeline:
* The system retrieves top-k relevant chunks from both structured and unstructured indexes based on query similarity
* Retrieved chunks are combined into a prompt
* GPT-3.5 generates the initial answer
* A separate validation step uses either Llama 3 or GPT-4 to assign a confidence rating to the generated answer (see the sketch below)
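Below is a minimal sketch of this generate-then-validate flow, reusing the `embed` helper and clients from the ingestion sketches. The prompt wording, k=5, the 1-5 rating rubric, and the Azure deployment names are all assumptions; the case study states only that top-k chunks from both indexes are combined into a prompt, GPT-3.5 answers, and Llama 3 or GPT-4 rates the answer.

```python
from azure.search.documents.models import VectorizedQuery

def retrieve(query: str, search_client, k: int = 5) -> list[str]:
    """Vector search for the top-k most similar chunks in one index."""
    vq = VectorizedQuery(vector=embed([query])[0],
                         k_nearest_neighbors=k, fields="embedding")
    return [doc["content"]
            for doc in search_client.search(search_text=None, vector_queries=[vq])]

def answer(query: str) -> tuple[str, str]:
    # Top-k chunks from both indexes (unstructured_client is a SearchClient
    # over the document-chunk index, analogous to structured_client above).
    context = retrieve(query, unstructured_client) + retrieve(query, structured_client)
    prompt = ("Answer the question using only the context below.\n\n"
              "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}")
    generated = openai_client.chat.completions.create(
        model="gpt-35-turbo",  # deployment name is an assumption
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    # Separate validation call: a second model rates confidence in the answer.
    rating = openai_client.chat.completions.create(
        model="gpt-4",  # the team used Llama 3 or GPT-4 for this step
        messages=[{"role": "user", "content":
                   "Rate from 1-5 how well the answer is grounded in the context.\n"
                   f"Context:\n{context}\nQuestion: {query}\nAnswer: {generated}"}],
    ).choices[0].message.content
    return generated, rating
```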
Production Optimizations:
* The initial response time of 21.91 seconds was optimized down to 7.33 seconds
* The team implemented continuous monitoring of confidence ratings and human feedback (a minimal sketch of such a loop follows this list)
* Data pipeline updates and prompt revisions were made to address occasional errors from missing data or ambiguous contexts
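The monitoring loop could be as simple as logging every validator rating and flagging low-confidence answers for human review; the threshold and log fields below are illustrative assumptions rather than details from the case study.

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger("rag_monitor")
CONFIDENCE_THRESHOLD = 3  # assumed cut-off on the validator's 1-5 scale

def record_interaction(query: str, generated: str, rating: int) -> None:
    """Log each answer with its confidence rating; flag weak ones for review."""
    logger.info("query=%r rating=%d ts=%s", query, rating,
                datetime.now(timezone.utc).isoformat())
    if rating < CONFIDENCE_THRESHOLD:
        # Low-confidence answers feed the human-review queue that drives
        # the prompt revisions and data-pipeline fixes described above.
        logger.warning("low-confidence answer flagged: %r", generated[:200])
```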
What makes this implementation particularly noteworthy is their innovative approach to handling structured data. Instead of using traditional text-to-SQL or raw SQL queries, which can increase latency and potentially introduce errors, they developed a method to directly embed structured data. This approach:
* Reduces the number of LLM calls needed during inference
* Improves overall system latency
* Enhances accuracy by treating structured data in a way that preserves its semantic meaning
The system has been running successfully in production since May 2024, consistently generating accurate, grounded answers without hallucinations. The team has implemented robust monitoring and feedback mechanisms to ensure continued quality and performance.
Technical Challenges and Solutions:
The team faced several challenges in implementing RAG at scale:
* Handling large volumes of structured data efficiently
* Maintaining low latency while ensuring accuracy
* Dealing with data quality issues and missing information
* Managing the complexity of multiple data sources
Their solutions included:
* Innovative data preprocessing and aggregation techniques
* Optimization of vector storage and retrieval
* Implementation of a dual-model approach for generation and validation
* Continuous monitoring and feedback loops for system improvement
Architecture and Infrastructure:
* The system utilizes Azure cloud services extensively
* Microsoft SharePoint for document storage
* Azure Data Lake Storage for data ingestion
* Azure Synapse Analytics for big data analysis
* Azure AI Search for vector storage and retrieval (a minimal index-definition sketch follows this list)
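For readers unfamiliar with Azure AI Search, a vector index along these lines would back the sketches above; the index, field, and profile names are assumptions, while the 1536 dimensions match text-embedding-ada-002's output.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration, SearchField, SearchFieldDataType, SearchIndex,
    SearchableField, SimpleField, VectorSearch, VectorSearchProfile,
)

index_client = SearchIndexClient(endpoint="https://<search-service>.search.windows.net",
                                 credential=AzureKeyCredential("<admin-key>"))

index = SearchIndex(
    name="policy-chunks",  # hypothetical index name
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="content", type=SearchFieldDataType.String),
        SearchField(name="embedding",
                    type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                    searchable=True,
                    vector_search_dimensions=1536,  # ada-002 output size
                    vector_search_profile_name="default-profile"),
    ],
    vector_search=VectorSearch(
        algorithms=[HnswAlgorithmConfiguration(name="hnsw")],
        profiles=[VectorSearchProfile(name="default-profile",
                                      algorithm_configuration_name="hnsw")],
    ),
)
index_client.create_or_update_index(index)
```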
The case study also provides valuable insights into model selection in industrial applications, comparing various open-source and closed-source models for answer generation. This comparison helps other organizations make informed decisions about model selection for similar use cases.
Impact and Results:
* Significant reduction in response time from 21.91 to 7.33 seconds
* Consistent accuracy in answer generation
* Successful handling of both structured and unstructured data sources
* Improved efficiency for customer service representatives
The implementation demonstrates how careful attention to system architecture, data processing, and performance optimization can result in a successful production deployment of LLM technology in a regulated industry like finance. The team's approach to combining structured and unstructured data processing in a RAG system provides a valuable blueprint for similar implementations in other organizations.