Company
Snorkel
Title
Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration
Industry
Insurance
Year
2025
Summary (short)
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging its expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools (databases and free-text underwriting guidelines), and engaging in multi-turn conversations. The evaluation revealed large performance variations across frontier models (from single-digit accuracy to roughly 80%), with notable error modes including tool-use failures (36% of conversations) and hallucinations drawn from pretrained domain knowledge; OpenAI models in particular hallucinated non-existent insurance products 15-45% of the time.
This case study details Snorkel's development of a comprehensive benchmark for evaluating AI agents in the specialized domain of commercial property and casualty insurance underwriting. The work represents a significant LLMOps initiative focused on understanding how large language models perform in production-like scenarios that require domain expertise, multi-tool integration, and complex reasoning over proprietary knowledge.

**Company and Use Case Overview**

Snorkel AI, a company specializing in data-centric AI solutions, created this benchmark as part of its ongoing research into enterprise AI applications. The use case centers on an AI copilot that assists junior underwriters in making complex insurance decisions. The system was designed to simulate realistic enterprise scenarios in which AI agents must navigate specialized knowledge bases, interact with users through multi-turn conversations, and use multiple tools to solve complex business problems.

The benchmark was developed in collaboration with Snorkel's Expert Data-as-a-Service network, specifically leveraging Chartered Property and Casualty Underwriters (CPCUs) to ensure the realism and validity of the scenarios. This collaboration highlights a key aspect of successful LLMOps implementations: the critical importance of domain expertise in developing, evaluating, and deploying AI systems in specialized fields.

**Technical Architecture and Implementation**

The AI copilot was built using LangGraph with the Model Context Protocol (MCP), reflecting a modern approach to agentic AI development. The architecture wrapped various AI models as ReAct agents, providing a standardized interface for testing different foundation models, both open-source and proprietary. This model-agnostic design follows LLMOps best practice, allowing straightforward comparison and potential model switching.

The architecture included several components representative of production AI systems. The agents had access to multiple tools, including a database with several tables of business information, free-text underwriting guidelines, and metadata about the available resources. They were required to perform complex multi-step reasoning tasks that involved querying the database, interpreting the guidelines, and engaging in multi-turn conversations with users to gather necessary information.

**Complex Multi-Tool Integration and Reasoning**

One of the most technically challenging aspects of the system was the requirement for agents to chain tool calls correctly. For example, to determine whether a business qualified as a small business, the copilot first had to find the proper NAICS classification code in the 2012 version of the schema, then use that code to query a table from the US Small Business Administration to determine both which attribute of the business to use for qualification and the applicable threshold. Because the agents had primary access only to 2022 NAICS codes, they had to use mapping tables to reach the required 2012 versions, navigating between database tables and handling the complexity of an evolving classification system. This type of multi-step reasoning with tool chaining represents a significant challenge in production AI systems.
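To make the architecture concrete, the sketch below shows how a chat model might be wrapped as a LangGraph ReAct agent with a few tools of the kind described above. It is an illustrative approximation, not Snorkel's benchmark code: the tool names, stubbed return values, and the choice of model are hypothetical, and the real system exposed its tools through MCP rather than in-process Python functions.

```python
# Illustrative sketch: hypothetical tool names and stubbed data, not Snorkel's benchmark code.
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent


@tool
def map_naics_2022_to_2012(naics_2022: str) -> str:
    """Map a 2022 NAICS code to its 2012 equivalent via a crosswalk table (stubbed)."""
    crosswalk = {"541511": "541511"}  # hypothetical crosswalk entry
    return crosswalk.get(naics_2022, "unknown")


@tool
def sba_size_standard(naics_2012: str) -> dict:
    """Return the SBA size-standard metric and threshold for a 2012 NAICS code (stubbed)."""
    return {"metric": "annual_revenue_usd", "threshold": 30_000_000}


@tool
def search_underwriting_guidelines(query: str) -> str:
    """Search the free-text underwriting guidelines for relevant passages (stubbed)."""
    return "No matching guideline found."


# Any LangChain-compatible chat model can be wrapped the same way,
# which is what makes comparing foundation models straightforward.
model = ChatOpenAI(model="gpt-4o")
agent = create_react_agent(
    model,
    [map_naics_2022_to_2012, sba_size_standard, search_underwriting_guidelines],
)

result = agent.invoke(
    {"messages": [("user", "Does a software consultancy with $12M revenue qualify as a small business?")]}
)
print(result["messages"][-1].content)
```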
In practice, the agents had to maintain context across multiple tool calls, handle errors in tool usage, and adapt their approach based on the results of previous queries. The benchmark revealed that even frontier models struggled with this level of complexity, with tool-use errors occurring in 36% of conversations across all models tested.

**Evaluation Framework and Performance Metrics**

Snorkel implemented a comprehensive evaluation framework using its evaluation suite, measuring multiple dimensions of performance: task solution correctness, task solution conciseness, tool-use correctness, and tool-use efficiency. This multi-faceted approach reflects LLMOps best practice, where simple accuracy metrics are insufficient for understanding system performance in production environments.

The evaluation revealed significant performance variations across frontier models, with accuracies ranging from single digits to approximately 80%. The study also found a clear tradeoff between test-time compute and accuracy, with the highest-performing models consuming significantly more output tokens. This has important implications for production deployments, where computational cost must be balanced against performance requirements.

**Performance Analysis and Error Modes**

The benchmark uncovered several error modes that are highly relevant for LLMOps practitioners. Tool-use errors were surprisingly common, appearing even among top-performing models. Despite having access to the metadata required to use tools properly, models frequently made basic mistakes such as writing SQL queries without first checking table schemas. This suggests that current foundation models may need additional engineering support, or fine-tuning, to handle complex tool ecosystems reliably.

Perhaps more concerning from a production perspective were hallucinations rooted in pretrained domain knowledge. The highest-performing models from OpenAI hallucinated insurance products not contained in the provided guidelines 15-45% of the time, depending on the specific model. These hallucinations were particularly insidious because they led to misleading questions to users and could produce catastrophic factual inaccuracies in production systems.

**Task-Specific Performance Patterns**

The evaluation revealed clear patterns across task types. Business classification with 2022 NAICS codes was among the easiest tasks, at 77.2% accuracy averaged across models. Policy limits and deductibles also fared relatively well, at 76.2% and 78.4% respectively, largely because the underwriting guidelines contained default values applicable to many scenarios.

The hardest tasks were appetite checks (61.5% accuracy) and product recommendations (37.7% accuracy), which required agents to use multiple tools in sequence and compose the results correctly. These tasks forced models to probe users for information while navigating complex tool chains, the kind of sophisticated reasoning required in many production AI applications.

**Multi-Turn Conversation and User Interaction**

The benchmark explicitly included multi-turn conversation, recognizing that production AI systems must gather information from users iteratively. The study found significant correlations between accuracy and both the number of turns agents had with users and the number of tools used.
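One way to measure such correlations is to compute simple per-conversation statistics from the agent's message history, as sketched below. This is a generic illustration assuming LangChain-style message objects, not Snorkel's evaluation suite; the error check relies on the `status` field of `ToolMessage`, which may not be populated in every setup.

```python
# Illustrative trace analysis over a LangChain-style message history.
from langchain_core.messages import AIMessage, BaseMessage, HumanMessage, ToolMessage


def conversation_stats(messages: list[BaseMessage]) -> dict:
    """Summarize one conversation: user turns, tool calls, and tool errors."""
    user_turns = sum(isinstance(m, HumanMessage) for m in messages)
    tool_calls = sum(
        len(m.tool_calls) for m in messages if isinstance(m, AIMessage) and m.tool_calls
    )
    tool_errors = sum(
        1 for m in messages
        if isinstance(m, ToolMessage) and getattr(m, "status", "success") == "error"
    )
    return {"user_turns": user_turns, "tool_calls": tool_calls, "tool_errors": tool_errors}


# Pair these per-conversation statistics with correctness labels to study how
# accuracy relates to the number of turns and tool calls, e.g.:
# stats = conversation_stats(result["messages"])
```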
However, there were notable exceptions in which models took many turns yet still achieved poor accuracy, suggesting that the ability to ask the right questions is as important as the ability to use tools correctly. This highlights a key challenge in LLMOps: building systems that can engage in meaningful dialogue with users while staying focused on the specific business problem. Some models struggled to ask appropriate questions even when they could use tools correctly, indicating that conversational ability and tool-use capability may require different kinds of training or fine-tuning.

**Domain Expertise and Proprietary Knowledge**

One of the most significant insights from this case study is the critical importance of domain expertise and proprietary knowledge in production AI systems. The benchmark was specifically designed to test models on information they had never seen before, simulating the reality of enterprise deployments where AI systems must work with company-specific data and processes.

The expert network of CPCUs was essential for ensuring the realism and validity of the benchmark scenarios. Experts provided feedback on individual data samples, the overall guidelines, business rules, and appropriate responses. This collaboration between AI practitioners and domain experts represents a best practice for LLMOps implementations in specialized fields.

**Implications for Production AI Systems**

The findings have several implications for organizations deploying AI systems in production. First, the high rate of tool-use errors suggests that current foundation models may require additional engineering support, such as more sophisticated prompt engineering or custom fine-tuning, to handle complex tool ecosystems reliably. Second, the discovery of domain-specific hallucinations indicates that organizations need evaluation frameworks that can detect when models draw on inappropriate training data rather than the provided context; this is particularly critical in regulated industries like insurance, where factual accuracy is paramount. Third, the performance variation across tasks suggests that organizations should expect uneven results when deploying AI across different business processes, even within the same domain, which may call for different deployment strategies or additional training per task type.

**Technical Lessons for LLMOps**

This case study offers several technical lessons for LLMOps practitioners. The use of LangGraph with the Model Context Protocol shows how modern agent frameworks can provide flexibility for working with different models while maintaining consistent interfaces. The evaluation framework demonstrates the importance of measuring multiple dimensions of performance rather than relying on simple accuracy metrics. The fact that even frontier models struggle with complex multi-tool scenarios suggests that current approaches to agent development may need more sophisticated planning and error-recovery mechanisms, and the finding that models can use tools correctly yet still struggle to ask appropriate questions indicates that conversational and technical capabilities may need to be developed separately.
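One concrete form the "additional engineering support" mentioned above could take is a guard that refuses to run SQL until the agent has inspected the schemas of the tables it references. The sketch below is hypothetical (the function names, the explicit `tables` argument, and the module-level inspection registry are invented for illustration) and is not something described in the case study.

```python
# Hypothetical guard: require a schema inspection before any SQL execution.
# Function names, the explicit `tables` argument, and the registry are invented for illustration.
import sqlite3

_inspected_tables: set[str] = set()


def describe_table(conn: sqlite3.Connection, table: str) -> str:
    """Return column names and types for a table and record that its schema was inspected."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    _inspected_tables.add(table)
    return ", ".join(f"{name} {ctype}" for _, name, ctype, *_ in rows)


def run_sql(conn: sqlite3.Connection, query: str, tables: list[str]) -> list[tuple]:
    """Run a read-only query, but only if every referenced table's schema was inspected first."""
    missing = [t for t in tables if t not in _inspected_tables]
    if missing:
        # An instructive error message nudges the agent to call describe_table first.
        raise ValueError(f"Inspect table schemas before querying: {', '.join(missing)}")
    return conn.execute(query).fetchall()
```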
**Future Directions and Ongoing Challenges**

The case study closes with observations about the ongoing challenges of developing AI systems for specialized domains. While the benchmark's tasks are complex, requiring multiple tools and nuanced reasoning, they are still considerably less complex than many academic benchmarks. This suggests that the challenges in production AI deployment are often less about raw reasoning capability than about handling specialized knowledge and user interaction.

The authors note that developing AI systems for specialized domains requires careful evaluation and development against benchmark data that exercises the skills relevant to that domain, reinforcing the importance of domain expertise and custom evaluation frameworks in successful LLMOps implementations.

Overall, this case study provides valuable insight into the challenges and opportunities of deploying AI agents in production, particularly in specialized domains that require complex reasoning over proprietary knowledge, with clear implications for how organizations approach AI deployment, evaluation, and ongoing maintenance in enterprise settings.
