Vouch Insurance implemented a production machine learning system using Metaflow to handle risk classification and document processing for their technology-focused insurance business. The system combines traditional data warehousing with LLM-powered predictions, processing structured and unstructured data through hourly pipelines. They built a comprehensive stack that includes data transformation, LLM integration via OpenAI, and a FastAPI service layer with an SDK for easy integration by product engineers.
Vouch Insurance is a company that provides business insurance specifically tailored for technology startups and other innovator-focused businesses. This case study, presented by Emily (Senior Machine Learning Engineer at Vouch) during a Metaflow Office Hours session, describes how the company has implemented LLM-powered solutions in production for two primary use cases: risk classification in underwriting and document AI processing. The presentation offers an honest look at their architecture, implementation choices, and lessons learned from running LLMs in production.
Insurance is fundamentally a document-intensive industry with significant potential for AI and machine learning applications. Vouch identified two key areas where LLMs could provide value:
The first use case involves risk classification, which is central to the underwriting business. Insurers need to assess risks and understand whether potential customers fall within their appetite to insure. Traditional approaches to risk classification can be labor-intensive and may not fully leverage the available data.
The second use case revolves around document AI. Insurance companies deal with numerous documents containing valuable information—both business transaction documents and publicly available information on the web that can help better understand customers. Extracting structured information from these documents (typically PDFs, though not exclusively) is a natural fit for LLM-based solutions.
Vouch describes themselves as a “modern data stack company,” and their LLM infrastructure reflects this philosophy. The architecture integrates several components in a thoughtful pipeline design:
The team started with the AWS Batch Terraform template provided by Metaflow and extended it for their specific needs. One notable extension was integrating AWS Cognito for user authentication at the Application Load Balancer (ALB) level, allowing Vouch users to sign in via Gmail. Connor, one of the team members who contributed to this work, mentioned that they forked the Terraform module to add this capability, which required various backend changes to support the authentication flow.
The overall flow follows this pattern:
Data preparation begins with Metaflow orchestrating data transformations through dbt. Once the data is prepared, it is sent to an LLM provider (OpenAI being the first they tried, though the presentation notes they are not exclusively tied to OpenAI), and predictions are generated from it.
Post-processing is a critical step that the team emphasizes. When LLM responses come back, they “often still need a fair bit of work.” The Metaflow pipelines handle additional transformations to enforce structure when output parsers fail or don’t work entirely as expected. This is an honest acknowledgment of the reality of working with LLMs in production—they don’t always return perfectly structured responses.
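Vouch's actual post-processing code wasn't shown in the session. As an illustration of the pattern described — enforcing structure when an output parser fails — a minimal fallback parser might look like the following sketch (the regex strategy and the error-flag fields are assumptions, not Vouch's implementation):

```python
import json
import re

def parse_llm_response(raw: str) -> dict:
    """Parse an LLM response that should be JSON, with a fallback for
    when the output parser fails (e.g. prose wrapped around the JSON)."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Fallback: pull the first {...} block out of surrounding prose.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    # Last resort: flag the record for downstream review rather than fail.
    return {"parse_error": True, "raw_response": raw}
```

With this approach, a response like `'Sure! {"risk_class": "low"}'` still yields a structured record, while genuinely unparseable output is flagged instead of crashing the pipeline.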
The final predictions are written to a PostgreSQL database, served through a FastAPI instance, and also reverse-ETL’d back to Snowflake for reporting on prediction quality and performance.
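No pipeline code was shown in the presentation; purely as a rough illustration, the hourly risk-classification flow described above might be structured along these lines in Metaflow (step names, the import guard, and all step bodies are invented for this sketch, not Vouch's code):

```python
# Illustrative sketch only; the stubs below let it run without Metaflow installed.
try:
    from metaflow import FlowSpec, step, schedule
except ImportError:
    def schedule(hourly=False):      # no-op stand-in for @schedule
        return lambda cls: cls
    def step(fn):                    # no-op stand-in for @step
        return fn
    class FlowSpec:                  # minimal stand-in for FlowSpec
        pass

@schedule(hourly=True)
class RiskClassificationFlow(FlowSpec):
    """Hourly pipeline: prepare data, call the LLM, post-process, publish."""

    @step
    def start(self):
        # Pull rows prepared by dbt that don't yet have a prediction.
        self.records = []  # e.g. query the warehouse here
        self.next(self.predict)

    @step
    def predict(self):
        # Send prepared data to the LLM provider (e.g. OpenAI).
        self.raw_responses = []  # provider calls would go here
        self.next(self.post_process)

    @step
    def post_process(self):
        # Enforce structure on responses when output parsers fall short.
        self.predictions = []
        self.next(self.end)

    @step
    def end(self):
        # Write predictions to Postgres (served via FastAPI) and
        # reverse-ETL them to Snowflake for quality reporting.
        pass
```

The `@schedule(hourly=True)` decorator matches the polling cadence described, with each run checking for new work.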
A particularly thoughtful aspect of the architecture is the investment in developer experience. The team built a custom SDK that allows product engineers to retrieve predictions with just a couple lines of code, abstracting away the complexity of the underlying LLM infrastructure.
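The SDK itself wasn't shown; conceptually, the "couple lines of code" experience for a product engineer might look something like this sketch (the client name, endpoint shape, and fields are invented for illustration):

```python
import json
import urllib.request

class PredictionsClient:
    """Hypothetical thin client over the FastAPI prediction service."""

    def __init__(self, base_url: str, token: str):
        self.base_url = base_url.rstrip("/")
        self.token = token

    def _url(self, model: str, entity_id: str) -> str:
        return f"{self.base_url}/predictions/{model}/{entity_id}"

    def get_prediction(self, model: str, entity_id: str) -> dict:
        req = urllib.request.Request(
            self._url(model, entity_id),
            headers={"Authorization": f"Bearer {self.token}"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

# The product-engineer experience is then roughly two lines:
# client = PredictionsClient("https://ml.internal.example", token="...")
# risk = client.get_prediction("risk_classification", "account_123")
```

The value of this pattern is that the LLM provider, caching, and post-processing all stay behind the service boundary; consumers only ever see a stable prediction schema.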
For developers working on the LLM pipelines themselves, the team uses Docker Compose to spin up the entire service locally, including pipelines, API, and databases. This containerized approach was adopted specifically to address cross-platform development challenges, particularly issues with ARM-based (Apple Silicon) Macs across different developers' machines.
The team runs different execution patterns for different use cases. Risk classification pipelines run every hour, checking for new data that needs processing. Document AI workflows run on an as-needed basis, triggered when documents hit their services.
Being a startup, Vouch works at a scale of “terabytes or hundreds of terabytes” for the tables involved in feature engineering. The data is a mix of structured and semi-structured numeric and text data, plus documents (primarily PDFs).
The team implements several strategies to manage LLM API costs:
Prediction caching is used to avoid redundant API calls. Before making an LLM call, the system checks whether a prediction already exists, which helps narrow down the amount of work required. This is explicitly described as important because “all those calls are expensive.”
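The exact caching mechanism wasn't described; a common way to implement this check is to key the cache on a hash of the model and prompt, as in this sketch (an in-memory dict stands in for what would be a database table in production):

```python
import hashlib

class PredictionCache:
    """Cache LLM predictions keyed by (model, prompt) so the same
    expensive API call is never paid for twice. In production this
    would be backed by a database table, not a dict."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get_or_predict(self, model: str, prompt: str, predict_fn):
        key = self._key(model, prompt)
        if key not in self._store:   # only call the API on a cache miss
            self._store[key] = predict_fn(prompt)
        return self._store[key]
```

A second identical request returns the stored result without touching the provider, which also narrows the work each hourly run needs to do.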
Token management is implemented, though the presenter notes this is a common pattern covered in educational content.
The hourly cadence of the risk classification pipeline works well for their scale—they haven’t encountered overwhelming volumes at this frequency.
The team uses LangChain to make calls to OpenAI APIs. Longer timeouts are configured on steps that involve LLM calls, acknowledging the inherent latency variability of external API calls.
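As a hedged sketch of that configuration (the model name and timeout values are assumptions, not Vouch's settings; the import is guarded so the snippet loads even without the `langchain-openai` package installed):

```python
try:
    from langchain_openai import ChatOpenAI
except ImportError:  # let the sketch import cleanly without langchain installed
    ChatOpenAI = None

def build_llm(timeout_seconds: float = 120.0):
    """Chat model configured with a generous per-request timeout,
    since external LLM API latency varies widely."""
    if ChatOpenAI is None:
        raise RuntimeError("install langchain-openai to use this sketch")
    return ChatOpenAI(
        model="gpt-4o-mini",        # assumed model, not Vouch's choice
        timeout=timeout_seconds,    # per-request timeout for the API call
        max_retries=2,              # retry transient provider failures
    )
```

Pairing a client-side timeout like this with longer step-level timeouts in the orchestrator keeps a slow provider response from failing an entire pipeline run.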
The presentation includes several candid observations that are valuable for others building similar systems:
The AWS Batch Terraform template was praised as “really great for getting us up and running and into production” with the statement that “nothing beats that.” However, as the project matured, the team realized they probably need event-driven pipelines. While examples exist in the Metaflow documentation, the team expressed a desire for more comprehensive examples that don’t have gaps.
The local development experience across different machines and architectures proved challenging enough that they moved to a fully containerized development environment. While this approach has “quirks,” it helps insulate the team from platform-specific issues. The presenter specifically called out interest in hearing how others have handled these problems.
One team member (Sam) mentioned learning extensively about Micromamba and the Netflix Metaflow extension, noting that recent Metaflow releases have improved the developer experience.
The presentation occurred in a community setting (Metaflow Office Hours), and several Vouch team members participated. This suggests an organization that values community engagement and knowledge sharing. The presenter mentioned taking the Outerbounds (the company behind Metaflow) course and finding the transition from the class to the community “smooth and welcoming.”
The Q&A portion of the presentation provided additional context about potential future improvements, including the new @pypi decorator that could simplify package management and Kubernetes-based event-driven triggering options that could replace or supplement the current polling-based approach.
This case study represents a practical, production-ready approach to LLMOps in the insurance industry. The architecture shows thoughtful consideration of developer experience, cost management, and operational reliability.
The honest acknowledgment of challenges—particularly around local development, event-driven architectures, and LLM output parsing—adds credibility to the case study. This is not a polished marketing piece but rather a practitioner’s view of what it actually takes to run LLMs in production for a real business use case.