Verisk developed a generative AI companion for their Mozart platform to automate insurance policy document comparison and change detection. Using Amazon Bedrock, OpenSearch, and Anthropic's Claude 3 Sonnet model, they built a system that reduces policy review time from days to minutes. The solution combines embedding-based retrieval, sophisticated prompt engineering, and document chunking strategies to achieve over 90% accuracy in change summaries while maintaining cost efficiency and security compliance.
Verisk, a leading data analytics and technology partner to the global insurance industry (Nasdaq: VRSK), developed Mozart companion—a generative AI-powered feature for their Mozart platform. Mozart is Verisk’s leading platform for creating and updating insurance forms. The companion feature addresses a significant pain point in the insurance industry: the time-consuming process of reviewing changes between policy document versions. Insurance professionals previously spent days or even weeks reviewing policy form changes; the Mozart companion reduces this to minutes by using LLMs to compare documents and generate structured summaries of material changes.
This case study is notable because it demonstrates a production-grade LLMOps implementation that emphasizes iterative improvement, hybrid architectures (combining FM and non-FM solutions), rigorous human evaluation, and strong governance practices.
The solution architecture follows a classic RAG (Retrieval Augmented Generation) pattern with several AWS services forming the backbone:
Data Ingestion and Embedding Pipeline: Policy documents are stored in Amazon S3. An AWS Batch job handles the periodic processing of documents—reading them, chunking them into smaller slices, and creating embeddings using Amazon Titan Text Embeddings through Amazon Bedrock. The embeddings are stored in an Amazon OpenSearch Service vector database. Importantly, metadata about each document (type, jurisdiction, version number, effective dates) is also stored via an internal Metadata API. This periodic job architecture ensures the vector database remains synchronized with new documents as they are added to the system.
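The ingestion steps above can be sketched as follows. The helper names and payload shapes are assumptions based on the public Bedrock and opensearch-py APIs, not Verisk's actual code; the clients are passed in so the pure payload-building logic stays separate from AWS calls:

```python
import json

TITAN_MODEL_ID = "amazon.titan-embed-text-v1"

def titan_request_body(text: str) -> str:
    # Titan Text Embeddings expects a JSON payload with a single inputText field.
    return json.dumps({"inputText": text})

def embed_chunk(bedrock_client, text: str) -> list:
    # bedrock_client is a boto3 "bedrock-runtime" client.
    resp = bedrock_client.invoke_model(
        modelId=TITAN_MODEL_ID,
        body=titan_request_body(text),
        accept="application/json",
        contentType="application/json",
    )
    return json.loads(resp["body"].read())["embedding"]

def index_chunk(os_client, index: str, doc_id: str, vector: list, metadata: dict) -> None:
    # os_client is an opensearch-py client; metadata carries the document type,
    # jurisdiction, version number, and effective dates mentioned in the case study,
    # enabling filtered retrieval later.
    os_client.index(index=index, id=doc_id, body={"embedding": vector, **metadata})
```

An AWS Batch job would loop these two calls over each new chunk to keep the vector database synchronized.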
Document Splitting Strategy: Verisk tested multiple document splitting strategies before settling on a recursive character text splitter with a 500-character chunk size and 15% overlap. This splitter comes from the LangChain framework and is described as a semantic splitter because it tries to keep related pieces of text together. They also evaluated the NLTK splitter but found the recursive character approach provided better results for their use case.
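A minimal stand-in for this splitter is sketched below. LangChain's RecursiveCharacterTextSplitter has richer separator logic; this simplified version only illustrates the reported parameters: a 500-character window with 15% (75-character) overlap, cutting at the best available separator:

```python
def split_text(text: str, chunk_size: int = 500, overlap: int = 75,
               separators=("\n\n", "\n", ". ", " ")) -> list[str]:
    """Greedy splitter: cut near chunk_size at the highest-priority separator,
    then back up by `overlap` characters before starting the next chunk."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer paragraph breaks, then lines, sentences, and words.
            for sep in separators:
                cut = text.rfind(sep, start, end)
                if cut > start:
                    end = cut + len(sep)
                    break
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = max(end - overlap, start + 1)
    return chunks
```

Overlapping chunks reduce the chance that a provision is severed mid-sentence at a chunk boundary, at the cost of some duplicated embedding work.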
Inference Pipeline: When a user selects two documents for comparison, an AWS Lambda function retrieves the relevant document embeddings from OpenSearch Service and presents them to Anthropic’s Claude 3 Sonnet (accessed through Amazon Bedrock). The results are structured as JSON and delivered to the frontend via an API service. The output includes change summaries, locations, excerpts from both document versions, and a tracked-change (redline) format.
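The inference call might look like the following sketch. The prompt wording and response handling are illustrative assumptions; only the Bedrock Messages API payload shape and the Claude 3 Sonnet model ID reflect the public API:

```python
import json

CLAUDE_MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"

def claude_request_body(prompt: str, max_tokens: int = 2048) -> str:
    # Bedrock Messages API payload shape for Anthropic models.
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    })

def summarize_changes(bedrock_client, old_text: str, new_text: str) -> dict:
    # Hypothetical Lambda-side helper: build the comparison prompt from the
    # retrieved sections, invoke Claude, and parse the structured JSON reply.
    prompt = (
        "Compare the two policy form versions below and return a JSON object "
        "describing each material change, its location, and excerpts from "
        f"both versions.\n\nOLD:\n{old_text}\n\nNEW:\n{new_text}"
    )
    resp = bedrock_client.invoke_model(
        modelId=CLAUDE_MODEL_ID,
        body=claude_request_body(prompt),
        accept="application/json",
        contentType="application/json",
    )
    completion = json.loads(resp["body"].read())["content"][0]["text"]
    return json.loads(completion)  # structured JSON handed to the frontend API
```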
Hybrid FM/Non-FM Architecture: A key LLMOps insight from this case study is that Verisk deliberately reduced FM load by identifying sections with differences first (using non-FM methods), then passing only those sections to the FM for summary generation. The tracked-difference format with redlines uses a completely non-FM-based solution. This architectural choice improved both accuracy and cost-efficiency.
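The diff-first pattern can be illustrated with Python's standard-library difflib. This is a generic sketch of the technique, not Verisk's implementation: filter to sections that actually changed before any FM call, and generate the redline entirely without an FM:

```python
import difflib

def changed_sections(old_sections: list[str], new_sections: list[str]) -> list[tuple]:
    """Keep only section pairs whose text differs, so the FM is asked to
    summarize real changes instead of re-reading identical text."""
    pairs = []
    for i, (old, new) in enumerate(zip(old_sections, new_sections)):
        if difflib.SequenceMatcher(None, old, new).ratio() < 1.0:
            pairs.append((i, old, new))
    return pairs

def redline(old: str, new: str) -> str:
    """Non-FM tracked changes: deletions as [-...-], insertions as {+...+}."""
    old_w, new_w = old.split(), new.split()
    out = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, old_w, new_w).get_opcodes():
        if op == "equal":
            out.extend(old_w[i1:i2])
            continue
        if op in ("delete", "replace"):
            out.append("[-" + " ".join(old_w[i1:i2]) + "-]")
        if op in ("insert", "replace"):
            out.append("{+" + " ".join(new_w[j1:j2]) + "+}")
    return " ".join(out)
```

Deterministic diffing is cheap, exact, and auditable, which is why reserving the FM for the interpretive summarization step improves both cost and accuracy.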
Verisk’s approach to evaluation is particularly instructive for LLMOps practitioners. They designed human evaluation metrics with in-house insurance domain experts and assessed results against four key criteria.
The evaluation process was explicitly iterative—domain experts graded results on a manual 1-10 scale, and feedback from each round was incorporated into subsequent development cycles. The case study honestly acknowledges that initial results “were good but not close to the desired level of accuracy and consistency,” highlighting the realistic challenges of productionizing LLM solutions.
Notably, Verisk recognized that “FM solutions are improving rapidly, but to achieve the desired level of accuracy, Verisk’s generative AI software solution needed to contain more components than just FMs.” This reflects a mature understanding that production LLM systems require complementary engineering beyond simply calling a foundation model.
The case study emphasizes that meaningful change summaries differ from simple text diffs: the system must describe material changes while ignoring non-meaningful textual differences. Verisk created prompts leveraging their in-house domain expertise and refined them iteratively across testing rounds, applying several prompt engineering techniques along the way.
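A hypothetical prompt in this spirit might combine role framing, an explicit materiality instruction, and a constrained JSON output schema. The wording below is illustrative only; Verisk's actual prompts are not published:

```python
# Illustrative prompt template; Verisk's production prompts are not disclosed.
COMPARISON_PROMPT = """You are an insurance policy analyst comparing two versions of a policy form.

<old_version>
{old}
</old_version>

<new_version>
{new}
</new_version>

Describe only material changes to coverage, conditions, or exclusions.
Ignore formatting, pagination, and other non-substantive textual differences.
Respond with JSON only, in this shape:
{{"changes": [{{"summary": "...", "location": "...",
               "old_excerpt": "...", "new_excerpt": "..."}}]}}"""

def build_prompt(old: str, new: str) -> str:
    return COMPARISON_PROMPT.format(old=old, new=new)
```

Pinning the output to a fixed JSON schema is what lets the frontend render summaries, locations, and excerpts without fragile post-parsing.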
The case study highlights Verisk’s governance framework, which is particularly relevant for enterprise LLMOps deployments in regulated industries like insurance. Verisk has a governance council that reviews generative AI solutions against standards for security, compliance, and data use. Legal review covers IP protection and contractual compliance. Key security concerns addressed include ensuring data is transmitted securely, confirming the FM doesn’t retain Verisk’s data, and verifying the FM doesn’t use their data for its own training. These considerations reportedly influenced Verisk’s choice of Amazon Bedrock and Claude Sonnet.
Verisk employed several strategies to manage costs in their production LLM system. They regularly evaluated various FM options and switched models as new options with better price-performance became available. By redesigning the solution to reduce FM calls and using non-FM solutions where possible (such as for redline generation), they improved both cost efficiency and accuracy. The hybrid approach—identifying changed sections with non-FM methods before sending to the FM—is a practical cost optimization pattern applicable to many document processing use cases.
The case study emphasizes that “good software development practices apply to the development of generative AI solutions too.” Verisk built a decoupled architecture with reusable components: the Mozart companion is exposed as an API (decoupled from frontend development), and the API itself consists of reusable components including common prompts, common definitions, retrieval service, embedding creation, and persistence service. This modularity enables both maintainability and potential reuse across other applications.
According to Verisk’s evaluation, the solution generates over 90% “good or acceptable” summaries based on expert grading. The business impact is described as reducing policy change adoption time from days or weeks to minutes. While these are vendor-reported metrics and should be interpreted with appropriate caution, the iterative evaluation methodology with domain experts provides some credibility to the accuracy claims.
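Aggregating such expert grades into a headline rate is straightforward; the sketch below assumes a cutoff on the 1–10 scale, since the source reports the scale and the "good or acceptable" bucket but not the actual threshold used:

```python
def acceptable_rate(grades: list[int], threshold: int = 7) -> float:
    """Share of expert grades at or above the 'good or acceptable' cutoff.
    The threshold of 7 is an assumption, not Verisk's disclosed cutoff."""
    if not grades:
        raise ValueError("no grades to aggregate")
    return sum(g >= threshold for g in grades) / len(grades)
```

Tracking this rate per evaluation round is what makes the iterative loop described above measurable rather than anecdotal.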
It’s worth noting that this case study originates from an AWS blog post and naturally emphasizes AWS services. The 90%+ success rate is based on internal evaluation criteria that are not fully disclosed, and “good or acceptable” represents a relatively broad quality threshold. The case study also doesn’t discuss failure modes, edge cases, or how the system handles documents outside its training distribution. Additionally, while Verisk mentions planning to use Amazon Titan Embeddings V2 in the future, the current implementation details may already reflect older model choices. Future users should evaluate whether newer embedding and foundation models might provide improved performance for similar use cases.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Toyota Motor North America (TMNA) and Toyota Connected built a generative AI platform to help dealership sales staff and customers access accurate vehicle information in real-time. The problem was that customers often arrived at dealerships highly informed from internet research, while sales staff lacked quick access to detailed vehicle specifications, trim options, and pricing. The solution evolved from a custom RAG-based system (v1) using Amazon Bedrock, SageMaker, and OpenSearch to retrieve information from official Toyota data sources, to a planned agentic platform (v2) using Amazon Bedrock AgentCore with Strands agents and MCP servers. The v1 system achieved over 7,000 interactions per month across Toyota's dealer network, with citation-backed responses and legal compliance built in, while v2 aims to enable more dynamic actions like checking local vehicle availability.
Moody's Analytics, a century-old financial institution serving over 1,500 customers across 165 countries, transformed their approach to serving high-stakes financial decision-making by evolving from a basic RAG chatbot to a sophisticated multi-agent AI system on AWS. Facing challenges with unstructured financial data (PDFs with complex tables, charts, and regulatory documents), context window limitations, and the need for 100% accuracy in billion-dollar decisions, they architected a serverless multi-agent orchestration system using Amazon Bedrock, specialized task agents, custom workflows supporting up to 400 steps, and intelligent document processing pipelines. The solution processes over 1 million tokens daily in production, achieving 60% faster insights and 30% reduction in task completion times while maintaining the precision required for credit ratings, risk intelligence, and regulatory compliance across credit, climate, economics, and compliance domains.