Company
NewDay
Title
Generative AI Customer Service Agent Assist with RAG Implementation
Industry
Finance
Year
2025
Summary (short)
NewDay, a UK financial services company handling 2.5 million customer calls annually, developed NewAssist, a real-time generative AI assistant to help customer service agents quickly find answers from nearly 200 knowledge articles. Starting as a hackathon project, the solution evolved from a voice assistant concept to a chatbot implementation using Amazon Bedrock and Claude 3 Haiku. Through iterative experimentation and custom data processing, the team achieved over 90% accuracy, reducing answer retrieval time from 90 seconds to 4 seconds while maintaining costs under $400 per month using a serverless AWS architecture.
NewDay is a UK-based financial services company that provides credit solutions to approximately 4 million customers. With its contact center handling 2.5 million calls annually and customer service agents needing to navigate nearly 200 knowledge articles to answer customer questions, the company identified a significant operational challenge that could benefit from generative AI. The journey began with a hackathon in early 2024, where the team explored how generative AI could improve both speed to resolution and the overall customer and agent experience. The initial concept, called NewAssist, was ambitious: a real-time generative AI assistant with speech-to-text capabilities that would provide context-aware support during live customer interactions.

However, the team quickly encountered several significant challenges that required a strategic pivot. Managing costs while competing with other strategic initiatives proved difficult, particularly when securing executive buy-in. The existing legacy infrastructure was not conducive to rapid experimentation, and the technology itself was still relatively unproven in terms of delivering concrete business value. Recognizing these constraints, the team made a pragmatic decision to scale back their ambitions and focus on a more achievable proof of concept. They shifted from a voice assistant to a chatbot solution, concentrating on validating whether their existing knowledge management system could work effectively with generative AI. This pivot proved crucial, as it allowed them to establish a solid foundation for long-term generative AI initiatives while managing risk and resource constraints.

The technical implementation follows a Retrieval Augmented Generation (RAG) architecture built on AWS serverless infrastructure. The solution consists of five main components, each serving a specific purpose in the overall system. The user interface was developed with the Streamlit framework and provides a simple assistant experience where agents can log in, ask questions, and give feedback through thumbs up/down ratings with optional comments. Authentication is handled by Amazon Cognito with Microsoft Entra ID integration, enabling single sign-on for customer service agents, and the interface is hosted on AWS Fargate for scalability and cost-effectiveness.

The knowledge base processing component became the most critical element for achieving high accuracy. It retrieves articles from third-party knowledge bases via APIs, processes them through a carefully designed chunking strategy, converts the chunks to vector embeddings, and stores them in a vector database implemented on OpenSearch Serverless. The team found that this component alone was responsible for a 40% increase in accuracy, highlighting the importance of data processing in RAG implementations.

For suggestion generation, the system forwards questions from the UI to retrieve the most relevant chunks from the vector database, then passes those chunks to the large language model to generate a contextual response. The team chose Anthropic's Claude 3 Haiku, accessed through Amazon Bedrock, driven by two key factors: cost-effectiveness and a response-time requirement of at most 5 seconds. Although more performant models became available during development, the team kept this choice because of its balance of cost, performance, and latency.
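The case study does not publish NewAssist's code, so the following is only a minimal sketch of the suggestion-generation path it describes: embed the agent's question (assuming an Amazon Titan embedding model, which the write-up does not name), retrieve the closest chunks from an OpenSearch Serverless index, and ask Claude 3 Haiku on Amazon Bedrock to answer from those chunks. The region, collection endpoint, index name, and field names below are placeholders, not NewDay's actual configuration.

```python
import json
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

REGION = "eu-west-2"                                        # placeholder region
OPENSEARCH_HOST = "example.eu-west-2.aoss.amazonaws.com"    # placeholder collection endpoint

bedrock = boto3.client("bedrock-runtime", region_name=REGION)
search = OpenSearch(
    hosts=[{"host": OPENSEARCH_HOST, "port": 443}],
    http_auth=AWSV4SignerAuth(boto3.Session().get_credentials(), REGION, "aoss"),
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

def embed(text: str) -> list[float]:
    # Assumption: a Titan text-embedding model; the case study only says "vector embeddings".
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

def answer(question: str, k: int = 4) -> str:
    # Retrieve the k most relevant knowledge-article chunks from the vector index.
    hits = search.search(
        index="knowledge-chunks",   # placeholder index name
        body={"size": k, "query": {"knn": {"embedding": {"vector": embed(question), "k": k}}}},
    )["hits"]["hits"]
    context = "\n\n".join(h["_source"]["text"] for h in hits)

    # Generate a grounded suggestion with Claude 3 Haiku via the Bedrock Converse API.
    resp = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{
            "role": "user",
            "content": [{"text": "Answer the agent's question using only these extracts:\n\n"
                                 f"{context}\n\nQuestion: {question}"}],
        }],
        inferenceConfig={"maxTokens": 512, "temperature": 0},
    )
    return resp["output"]["message"]["content"][0]["text"]
```

This kind of thin, serverless-friendly call path is consistent with the sub-5-second latency and sub-$400-per-month cost figures quoted in the case study, though the exact wiring NewDay uses is not disclosed.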
The observability component logs all questions, answers, and feedback into Snowflake, with dashboards created to track various metrics including accuracy. This creates a continuous feedback loop where business experts review answers with negative feedback weekly, and AI engineers translate these insights into experiments to improve performance. Additionally, Amazon CloudWatch logs all requests processed by the AWS services in the architecture, providing comprehensive monitoring capabilities. The offline evaluation component ensures quality control by testing new versions of NewAssist against an evaluation dataset in pre-production environments. Only versions that surpass specified accuracy thresholds are deployed to production, maintaining quality standards while enabling rapid iteration.

One of the most significant technical breakthroughs came from understanding and optimizing data processing. Initially, the team used a simple approach: manually extracting articles from the data source, saving them as PDFs, and using PyPDF for parsing. This approach yielded only around 60% accuracy because it didn't account for the specific structure and format of NewDay's knowledge articles. The breakthrough came when the team invested time in understanding their data structure. NewDay's knowledge articles are created by contact center experts using a specific methodology and stored in a third-party content management system that supports various widgets like lists, banners, and tables. The system provides APIs that return articles as JSON objects, with each object containing a widget following a specific schema. By studying each widget schema and creating bespoke parsing logic that extracts relevant content and formats it appropriately, the team achieved a dramatic improvement. This custom data processing approach increased accuracy from 60% to 73% without touching the AI component, demonstrating the critical importance of data quality in generative AI applications.

The team followed a culture of experimentation, implementing the Improvement Kata methodology with rapid Build-Measure-Learn cycles. Over 10 weeks and 8 experiment loops, they systematically improved the solution. A small cross-functional team of three experts created a golden dataset of over 100 questions with correct answers to evaluate the generative AI application's response accuracy.

User testing revealed unexpected insights about how agents actually interact with the system. When the solution was expanded to 10 experienced customer service agents, the team discovered that agents frequently used internal acronyms and abbreviations in their questions. For example, they would ask "How do I set a dd for a cst?" instead of "How do I set a direct debit for a customer?" This usage pattern caused accuracy to drop to 70% when agents started using the system. The solution involved statically injecting common acronyms and abbreviations into the LLM prompt, helping it better understand agent questions. This approach gradually restored accuracy to over 80%. The team also discovered that agents began using NewAssist in unexpected ways, such as asking how to explain technical processes to customers in simple terms, which proved to be a valuable use case. Through continued experimentation and refinement, including filling gaps in the knowledge base and training the system on internal language and acronyms, the team achieved over 90% accuracy.
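NewDay's actual widget schemas are not documented in the case study, so the parser below is only an illustrative sketch of the bespoke-parsing idea described above: the CMS API returns each article as JSON, and each widget type gets its own extraction rule instead of flattening everything through a PDF parser. The "list", "banner", and "table" shapes and field names here are assumptions.

```python
def parse_widget(widget: dict) -> str:
    """Convert one knowledge-article widget (hypothetical schemas) to plain text."""
    kind = widget.get("type")
    if kind == "list":
        return "\n".join(f"- {item}" for item in widget.get("items", []))
    if kind == "banner":
        return f"Note: {widget.get('text', '')}"
    if kind == "table":
        header = " | ".join(widget.get("columns", []))
        rows = [" | ".join(str(cell) for cell in row) for row in widget.get("rows", [])]
        return "\n".join([header, *rows])
    # Fall back to any free-text field so unknown widgets are not silently dropped.
    return str(widget.get("text", ""))

def article_to_chunks(article: dict) -> list[str]:
    """One chunk per widget, each prefixed with the article title for retrieval context."""
    title = article.get("title", "")
    return [f"{title}\n{parse_widget(w)}" for w in article.get("widgets", [])]
```

The point of the sketch is the shift in responsibility: instead of asking the model to cope with lossy PDF text, the ingestion step preserves the structure the contact-center experts already put into each widget.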
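The fix for agent shorthand was to statically inject a glossary of internal acronyms into the prompt. A minimal sketch of that idea follows; only "dd" and "cst" come from the case study's example question, and the surrounding prompt wording is invented for illustration.

```python
# Glossary injected into every prompt. "dd" and "cst" are taken from the case
# study's example; any further entries would be NewDay-internal shorthand.
ACRONYMS = {
    "dd": "direct debit",
    "cst": "customer",
}

def build_prompt(question: str, context: str) -> str:
    glossary = "\n".join(f"- {short} means {full}" for short, full in ACRONYMS.items())
    return (
        "You are NewAssist, an assistant for contact-centre agents.\n"
        "Agents often use these internal abbreviations:\n"
        f"{glossary}\n\n"
        "Answer the question using only the knowledge-base extracts below.\n\n"
        f"Extracts:\n{context}\n\n"
        f"Question: {question}"
    )

# Example: build_prompt("How do I set a dd for a cst?", retrieved_chunks_text)
```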
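The offline evaluation step described earlier, in which a new build must clear an accuracy bar on the golden dataset before reaching production, can be expressed as a simple gate like the sketch below. The CSV layout, the callables, and the 90% threshold are assumptions based on the figures quoted in the case study, not NewDay's actual pipeline.

```python
import csv
from typing import Callable

def evaluate(ask: Callable[[str], str],
             is_correct: Callable[[str, str], bool],
             golden_path: str) -> float:
    """Return the fraction of golden-dataset questions answered correctly.

    `ask` calls the assistant; `is_correct` compares its answer with the
    expected one (exact match, an LLM judge, or a human review step).
    """
    with open(golden_path, newline="") as f:
        rows = list(csv.DictReader(f))  # expected columns: question, expected_answer
    hits = sum(1 for r in rows if is_correct(ask(r["question"]), r["expected_answer"]))
    return hits / len(rows)

def deployment_gate(accuracy: float, threshold: float = 0.90) -> None:
    # Hypothetical pre-production gate: only promote builds that clear the bar.
    if accuracy < threshold:
        raise SystemExit(f"Accuracy {accuracy:.0%} is below {threshold:.0%}; not deploying")
```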
Reaching this level of accuracy enabled the team to secure executive approval for productionization and expansion to over 150 agents, with plans to extend the solution to all departments within Customer Operations. The business impact has been substantial: the average time to retrieve an answer fell from 90 seconds to 4 seconds, a dramatic improvement in operational efficiency, while the serverless architecture keeps running costs under $400 per month.

The case study offers several lessons about LLMOps in production environments. First, an experimentation culture built on agile, iterative approaches such as Build-Measure-Learn cycles matters: starting small with a focused proof of concept and gradually scaling helps validate effectiveness before full deployment. Second, data quality is a critical factor that can yield substantial accuracy gains; investing time in understanding and properly processing data specific to the organization's context and structure is essential. Third, understanding how users actually interact with the product is crucial. Real-world testing with actual users uncovers unexpected usage patterns that require the system to adapt, and the team learned to accommodate internal jargon and abbreviations while remaining open to unforeseen use cases that emerge during user testing.

The monitoring and feedback mechanisms implemented in the solution demonstrate mature LLMOps practices. The combination of real-time logging, dashboard-based metrics tracking, and systematic review processes creates a continuous improvement loop that enables ongoing optimization. The choice of technologies and architecture reflects thoughtful consideration of production requirements: the serverless approach on AWS provides scalability while managing costs, the selection of Claude 3 Haiku balances performance, cost, and latency, and the integration of authentication and the user interface makes the solution production-ready end to end.

Looking forward, NewDay plans to continue optimizing the solution through its established feedback mechanisms and to explore deeper integrations with AWS AI services. The robust foundation the team has built enables further refinement of the balance between human and machine intelligence in customer interactions. This case study demonstrates how organizations can successfully implement generative AI solutions in production by focusing on iterative experimentation, data quality, user feedback, and appropriate technology choices. The transformation from hackathon idea to production system shows what rapid innovation can look like when creativity meets robust cloud infrastructure and systematic LLMOps practices.
