AI Assistant for Financial Data Discovery and Business Intelligence

Amazon Finance 2025

Amazon Finance developed an AI-powered assistant to address analysts' challenges with data discovery across vast, disparate financial datasets and systems. The solution combines Amazon Bedrock (using Anthropic's Claude 3 Sonnet) with Amazon Kendra Enterprise Edition to create a Retrieval Augmented Generation (RAG) system that enables natural language queries for finding financial data and documentation. The implementation achieved a 30% reduction in search time, 80% improvement in search result accuracy, and demonstrated 83% precision and 88% faithfulness in knowledge search tasks, while reducing information discovery time from 45-60 minutes to 5-10 minutes.

Industry: Finance

Amazon Finance deployed an AI-powered assistant in production to solve data discovery and business intelligence challenges faced by financial analysts across the organization. This case study demonstrates an end-to-end LLMOps implementation that combines multiple AWS services into an enterprise-grade solution for natural language interaction with financial data and documentation.

Business Problem and Context

Amazon Finance analysts were struggling with mounting complexity in financial planning and analysis processes when working with vast datasets spanning multiple systems, data lakes, and business units. The primary challenges included time-intensive manual browsing of data catalogs, difficulty reconciling data from disparate sources, and the inability to leverage historical data and previous business decisions that resided in various documents and legacy systems. Traditional keyword-based searches failed to capture contextual relationships in financial data, and rigid query structures limited dynamic data exploration. The lack of institutional knowledge preservation resulted in valuable insights and decision rationales becoming siloed or lost over time, leading to redundant analysis and inconsistent planning assumptions across teams.

Technical Architecture and LLMOps Implementation

The production solution implements a Retrieval Augmented Generation (RAG) architecture that demonstrates several key LLMOps principles and practices. At the core of the system is Amazon Bedrock, which provides the LLM serving infrastructure using Anthropic's Claude 3 Sonnet. Claude 3 Sonnet was chosen for its language generation quality and its ability to understand and reason about complex financial topics, both of which matter for production deployment in a high-stakes domain like finance.

The retrieval component leverages Amazon Kendra Enterprise Edition Index rather than Amazon OpenSearch Service or Amazon Q Business. This architectural decision reflects important LLMOps considerations around accuracy, maintainability, and operational overhead. Amazon Kendra provides out-of-the-box natural language understanding, automatic document processing for over 40 file formats, pre-built enterprise connectors, and intelligent query handling including synonym recognition and refinement suggestions. The service automatically combines keyword, semantic, and vector search approaches, whereas alternatives would require manual implementation and ongoing maintenance of these features.

The system architecture follows a multi-tier approach typical of production LLM deployments. User queries are processed through a Streamlit frontend application, which sends queries to the Amazon Kendra retriever for relevant document retrieval. Amazon Kendra returns relevant paragraphs and document references to the RAG solution, which then uses Anthropic’s Claude through Amazon Bedrock along with carefully crafted prompt templates to generate contextual responses. The responses are returned to the Streamlit UI along with feedback mechanisms and session history management.
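The retrieve-then-generate loop described above can be sketched with boto3. The index ID, model ID, prompt wording, and helper names below are illustrative assumptions, not details taken from the case study:

```python
# Sketch of the described RAG loop, assuming boto3 access to Amazon Kendra
# and Amazon Bedrock. IDs, prompt text, and function names are illustrative.
import json


def build_context(result_items, max_passages=5):
    """Concatenate retrieved Kendra passages into one context string."""
    passages = [item["Content"] for item in result_items[:max_passages]]
    return "\n\n".join(passages)


def answer(query, index_id, model_id="anthropic.claude-3-sonnet-20240229-v1:0"):
    import boto3

    kendra = boto3.client("kendra")
    bedrock = boto3.client("bedrock-runtime")

    # Step 1: retrieve relevant passages from the Kendra index.
    retrieved = kendra.retrieve(IndexId=index_id, QueryText=query)
    context = build_context(retrieved["ResultItems"])

    # Step 2: grounded generation with Claude via Bedrock.
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": f"Use only this context to answer.\n\n{context}\n\nQuestion: {query}",
        }],
    }
    response = bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
    return json.loads(response["body"].read())["content"][0]["text"]
```

In a real deployment the Streamlit frontend would call `answer` per user query and render the returned text alongside the document references Kendra supplies.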

Prompt Engineering and Template Design

The case study highlights the importance of prompt engineering in production LLM systems. The team implemented structured prompt templates that format user queries, integrate retrieved knowledge, and provide specific instructions and constraints for response generation. The example prompt template demonstrates best practices for production systems by explicitly instructing the model to acknowledge when it doesn’t know an answer rather than hallucinating information, which is particularly crucial in financial applications where accuracy is paramount.

The prompt template structure follows the pattern of providing context from retrieved documents, followed by the user question, and explicit instructions about how to handle uncertain information. This approach helps ensure that the LLM’s responses are grounded in the retrieved knowledge base rather than relying solely on the model’s training data, which is essential for maintaining accuracy and reliability in production financial applications.
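A minimal template in the spirit described might look as follows; the exact wording used by the team is not given in the case study, so this is an illustrative reconstruction:

```python
# Illustrative prompt template following the structure described:
# retrieved context first, then the question, then explicit instructions
# for handling uncertainty. Not the team's actual template.

PROMPT_TEMPLATE = """You are an assistant for Amazon Finance analysts.

Context from retrieved documents:
{context}

Question: {question}

Instructions:
- Answer using only the context above.
- If the context does not contain the answer, say "I don't know" rather
  than guessing.
- Cite the source document for each claim where possible."""


def format_prompt(context: str, question: str) -> str:
    """Fill the template with retrieved context and the user's question."""
    return PROMPT_TEMPLATE.format(context=context, question=question)
```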

Deployment Architecture and Scalability

The frontend deployment architecture demonstrates production-ready LLMOps practices with emphasis on scalability, security, and performance. The system uses Amazon CloudFront for global content delivery with automatic geographic routing to minimize latency. Authentication is handled through AWS Lambda functions that verify user credentials before allowing access to the application, ensuring enterprise security standards are maintained.

The backend is deployed using AWS Fargate for containerized execution without infrastructure management overhead, combined with Amazon Elastic Container Service (ECS) configured with automatic scaling based on Application Load Balancer requests per target. This serverless approach allows the system to scale dynamically based on demand while minimizing operational overhead, which is crucial for production LLM applications that may experience variable usage patterns.
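Scaling an ECS service on Application Load Balancer requests per target is typically done with a target-tracking policy through the Application Auto Scaling API. The cluster and service names, resource label, capacity bounds, and target value below are illustrative assumptions:

```python
# Sketch of ECS target-tracking autoscaling on ALB requests per target,
# via the Application Auto Scaling API. All names and numbers here are
# illustrative assumptions, not values from the case study.

def alb_target_tracking_config(resource_label: str, requests_per_target: float) -> dict:
    """Build the target-tracking policy configuration dict."""
    return {
        "TargetValue": requests_per_target,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            "ResourceLabel": resource_label,
        },
        "ScaleInCooldown": 120,
        "ScaleOutCooldown": 60,
    }


def attach_policy(cluster: str, service: str, resource_label: str):
    import boto3

    aas = boto3.client("application-autoscaling")
    resource_id = f"service/{cluster}/{service}"
    aas.register_scalable_target(
        ServiceNamespace="ecs",
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        MinCapacity=2,
        MaxCapacity=20,
    )
    aas.put_scaling_policy(
        PolicyName="alb-requests-per-target",
        ServiceNamespace="ecs",
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration=alb_target_tracking_config(
            resource_label, requests_per_target=1000.0
        ),
    )
```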

Evaluation Framework and Production Monitoring

The implementation includes a comprehensive evaluation framework that demonstrates mature LLMOps practices around testing and monitoring. The team implemented both quantitative and qualitative assessment methodologies to ensure the system meets the high standards required for financial applications. The quantitative assessment focused on precision and recall testing using a diverse test set of over 50 business queries representing typical analyst use cases, with human-labeled answers serving as ground truth.

The evaluation framework distinguished between two main use cases: data discovery and knowledge search. Initial results showed data discovery achieving 65% precision and 60% recall, while knowledge search demonstrated 83% precision and 74% recall. These metrics provided a baseline for ongoing system improvement and represented significant improvements over previous manual processes that had only 35% success rates and required multiple iterations.
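Such precision and recall figures can be computed per query against the human-labeled ground truth and then macro-averaged over the test set. The scoring scheme below (set overlap of retrieved vs. relevant document IDs) is an assumption about how the metrics were defined, not the team's documented procedure:

```python
# Illustrative precision/recall over a labeled test set: each query's
# retrieved document IDs are scored against human-labeled relevant IDs.
# The exact scoring scheme is an assumption, not the team's procedure.

def precision_recall(retrieved: list, relevant: list) -> tuple:
    """Per-query precision and recall from set overlap."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def macro_average(test_set):
    """test_set: list of (retrieved_ids, relevant_ids) pairs."""
    scores = [precision_recall(r, g) for r, g in test_set]
    n = len(scores)
    return (sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n)
```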

The qualitative evaluation centered on faithfulness metrics, using an "LLM-as-a-judge" methodology to evaluate how well the AI assistant's responses aligned with source documentation and avoided hallucinations. This approach is particularly relevant for production LLM systems, where output quality and reliability must be continuously monitored. The faithfulness scores of 70% for data discovery and 88% for knowledge search provided concrete reliability metrics that could be tracked over time.
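An LLM-as-a-judge faithfulness check typically sends the source passages and the assistant's answer to a second model and parses a verdict. The judge prompt, grading scale, and parser below are illustrative assumptions:

```python
# Sketch of an "LLM-as-a-judge" faithfulness check: a second model grades
# whether the answer's claims are supported by the source passages.
# Prompt wording, grading scale, and parser are illustrative assumptions.

JUDGE_TEMPLATE = """You are grading an answer for faithfulness.

Source passages:
{sources}

Answer to grade:
{answer}

Reply with a single word: FAITHFUL if every claim in the answer is
supported by the sources, UNFAITHFUL otherwise."""


def build_judge_prompt(sources: str, answer: str) -> str:
    return JUDGE_TEMPLATE.format(sources=sources, answer=answer)


def parse_verdict(judge_reply: str) -> bool:
    """Map the judge model's reply to True (faithful) / False (not)."""
    return judge_reply.strip().upper().startswith("FAITHFUL")


def faithfulness_rate(verdicts: list) -> float:
    """Fraction of responses judged faithful across the test set."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```

Tracking `faithfulness_rate` per use case over time is what lets a team notice, for example, that knowledge search (88%) stays well ahead of data discovery (70%).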

User Feedback Integration and Continuous Improvement

The production system includes built-in feedback mechanisms that enable continuous improvement, a critical aspect of LLMOps. User feedback on responses is stored in Amazon S3, creating a data pipeline for analyzing system performance and identifying areas for improvement. This feedback loop allows the team to understand where the system succeeds and fails in real-world usage, enabling iterative improvements to both the retrieval system and the generation components.
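A feedback pipeline of this shape usually serializes each rating event as a JSON record and writes it to S3 for downstream analysis. The bucket name, key scheme, and record fields below are illustrative assumptions:

```python
# Sketch of the feedback loop: each rating event becomes a JSON record
# written to S3. Bucket name, key scheme, and fields are illustrative
# assumptions, not details from the case study.
import json
from datetime import datetime, timezone


def feedback_record(query: str, response: str, rating: str, session_id: str) -> dict:
    """Build one analyzable feedback event."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "session_id": session_id,
        "query": query,
        "response": response,
        "rating": rating,  # e.g. "up" or "down"
    }


def store_feedback(record: dict, bucket: str = "finance-assistant-feedback"):
    import boto3

    s3 = boto3.client("s3")
    key = f"feedback/{record['session_id']}/{record['timestamp']}.json"
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(record).encode())
```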

The user satisfaction metrics (92% preference over traditional search methods) and efficiency improvements (85% reduction in information discovery time) demonstrate the business impact of the LLM deployment. These metrics serve as key performance indicators for the production system and help justify continued investment in the LLMOps infrastructure.

Operational Considerations and Challenges

The case study reveals several important operational considerations for production LLM systems. The team identified that the lack of rich metadata about data sources was a primary factor limiting system performance, particularly in data discovery scenarios. This insight led to organizational changes around metadata collection practices, demonstrating how LLM deployments can drive broader data governance improvements.

The system’s performance varied significantly between use cases, with knowledge search substantially outperforming data discovery tasks. This variation highlights the importance of understanding how different types of queries interact with RAG systems and the need for potentially different optimization strategies for different use cases within the same production system.

Security and Compliance

The implementation demonstrates enterprise-grade security practices essential for production LLM systems handling sensitive financial data. The system maintains enterprise security standards through Amazon Kendra’s built-in data protection and compliance features, authentication mechanisms through AWS Lambda functions, and secure document storage in Amazon S3 buckets with appropriate access controls.

The choice of AWS-managed services for the core LLM infrastructure (Amazon Bedrock) rather than self-hosted models reflects important production considerations around security, compliance, and operational overhead. Using managed services allows the team to focus on application-level concerns while relying on AWS for infrastructure security, model updates, and compliance certifications.

This case study represents a comprehensive example of production LLMOps implementation that addresses the full lifecycle from problem identification through deployment, monitoring, and continuous improvement, while maintaining the security and reliability standards required for enterprise financial applications.
