Anzen, a small insurance company with fewer than 20 employees, leveraged LLMs to compete with larger insurers by automating their underwriting process. They implemented a document classification system using BERT and AWS Textract for information extraction, achieving 95% accuracy in document classification. They also developed a compliance document review system using sentence embeddings and question-answering models to provide immediate feedback on legal documents like offer letters.
Anzen is a small insurtech startup with fewer than 20 employees that tackles the $23 billion annual problem of employment-related lawsuits (employees suing employers). Their solution combines traditional insurance coverage with a software product aimed at helping companies avoid lawsuits in the first place. The presenter, Cam Feenstra, shared insights from approximately 18 months of experience at the company, discussing how a small team can leverage LLMs to “punch above their weight” and compete with much larger insurance incumbents.
The talk was framed around the inherent challenges small companies face when competing with larger organizations: limited capital, smaller teams, less data, and fewer network effects. However, Anzen has sought to capitalize on advantages in agility and faster adoption of new technologies like LLMs, which larger companies may be slower to deploy due to bureaucracy or “if it’s not broke don’t fix it” attitudes.
Insurance underwriting involves evaluating companies seeking coverage—reviewing financials, litigation history, and various risk factors—to determine whether to write a policy and at what price. The traditional workflow involves:
Making this process as frictionless as possible for brokers is critical since brokers work with many clients and many insurance carriers simultaneously—if the process is difficult, they simply take their business elsewhere.
Anzen implemented an automated pipeline with two main components:
Document Classification: They built a classifier based on Google’s BERT model to determine whether incoming attachments are insurance applications. Starting from essentially random baseline performance, they:
The presenter emphasized the remarkable efficiency gain here: what would previously have required training a model from scratch (at least an order of magnitude more effort and data) was accomplished in under a week end-to-end. The 300 labeled samples were sufficient for production-quality performance.
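The fine-tuning step can be sketched with the Hugging Face Transformers API. Anzen's actual training setup is not public; to stay self-contained, this sketch uses a tiny randomly initialised BERT configuration and dummy token IDs rather than the pretrained `bert-base-uncased` weights and real documents:

```python
# Illustrative fine-tuning loop shape for a binary "is this an insurance
# application?" classifier. Tiny random-init config avoids downloads;
# the real system would start from pretrained BERT weights.
import torch
from transformers import BertConfig, BertForSequenceClassification

config = BertConfig(
    vocab_size=1000, hidden_size=32, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=64,
    num_labels=2,  # application vs. not
)
model = BertForSequenceClassification(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One training step over a dummy batch of (token_ids, label) pairs.
input_ids = torch.randint(0, 1000, (4, 16))  # 4 docs, 16 tokens each
labels = torch.tensor([1, 0, 1, 0])
out = model(input_ids=input_ids, labels=labels)
out.loss.backward()
optimizer.step()

logits = model(input_ids=input_ids).logits
print(logits.shape)  # torch.Size([4, 2]): one score per class per document
```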
Information Extraction: For extracting specific fields from applications (company name, financial details, etc.), they used AWS Textract’s query-based extraction (Queries) feature. This API accepts natural-language queries like “What is the company applying for insurance?” and extracts the answers directly from PDF documents. While AWS doesn’t disclose implementation details, the performance suggests LLM technology under the hood.
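A sketch of what consuming Textract's query responses looks like. The query text and alias below are illustrative, not Anzen's actual queries; the parser follows the documented `QUERY`/`QUERY_RESULT` block shape that `analyze_document` returns when called with a `QueriesConfig`:

```python
# Parse Textract Queries output: map each query alias to the extracted
# answer text and its confidence score.
def parse_query_answers(response: dict) -> dict:
    blocks = {b["Id"]: b for b in response["Blocks"]}
    answers = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "QUERY":
            continue
        alias = block["Query"].get("Alias", block["Query"]["Text"])
        for rel in block.get("Relationships", []):
            if rel["Type"] == "ANSWER":
                for answer_id in rel["Ids"]:
                    result = blocks[answer_id]
                    answers[alias] = (result["Text"], result["Confidence"])
    return answers

# Minimal response of the documented shape (illustrative values).
sample_response = {
    "Blocks": [
        {"Id": "q1", "BlockType": "QUERY",
         "Query": {"Text": "What is the name of the applying company?",
                   "Alias": "company_name"},
         "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
        {"Id": "a1", "BlockType": "QUERY_RESULT",
         "Text": "Acme Corp", "Confidence": 98.2},
    ]
}

print(parse_query_answers(sample_response))
# {'company_name': ('Acme Corp', 98.2)}
```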
The automated pipeline works as follows:
This replaced what would traditionally require a dedicated team just for intake and preprocessing.
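The two components can be wired together roughly as follows. The stub functions and the confidence threshold are assumptions standing in for the BERT classifier and Textract extraction described above, not Anzen's actual routing logic:

```python
# Illustrative intake pipeline: classify each attachment, extract fields
# from recognized applications, and route everything else to a human.
CONFIDENCE_THRESHOLD = 0.9  # assumed cutoff for automatic handling

def classify_attachment(pdf_bytes: bytes) -> tuple[str, float]:
    """Stand-in for the BERT classifier: returns (label, confidence)."""
    return ("insurance_application", 0.97)

def extract_fields(pdf_bytes: bytes) -> dict:
    """Stand-in for Textract query-based extraction."""
    return {"company_name": "Acme Corp", "annual_revenue": "$4.2M"}

def process_attachment(pdf_bytes: bytes) -> dict:
    label, confidence = classify_attachment(pdf_bytes)
    if label != "insurance_application" or confidence < CONFIDENCE_THRESHOLD:
        return {"route": "human_review", "label": label}
    return {"route": "underwriting", "fields": extract_fields(pdf_bytes)}

print(process_attachment(b"%PDF-..."))
```

Low-confidence classifications fall back to human review, which is the usual way such pipelines avoid silently mishandling documents.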
Anzen hypothesized that entrepreneurs and startups could benefit from automated compliance review of HR documents (offer letters, separation agreements) rather than relying solely on expensive legal review. The key requirement was high accuracy—they explicitly wanted to avoid “bogus outputs” that might mislead users on compliance matters.
The solution uses a multi-step pipeline for each document type:
Feature Extraction Architecture:
Notably, they found that MiniLM (not technically a large language model) outperformed larger models for the semantic search component—a reminder that “larger” isn’t always better for specific tasks.
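The semantic-search component reduces to cosine similarity between embedding vectors. The sketch below uses toy 3-dimensional vectors in place of MiniLM sentence embeddings (e.g. 384 dimensions for `all-MiniLM-L6-v2`); the requirement names and threshold are invented for illustration:

```python
# Match a document clause against known compliance requirements by
# cosine similarity of their embeddings.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Precomputed embeddings for known requirements (toy values).
requirement_embeddings = {
    "at_will_clause": [0.9, 0.1, 0.0],
    "non_compete": [0.1, 0.9, 0.2],
}

def best_match(clause_embedding, min_score=0.7):
    """Return the most similar requirement, or None if nothing is close."""
    name, score = max(
        ((n, cosine(clause_embedding, e)) for n, e in requirement_embeddings.items()),
        key=lambda pair: pair[1],
    )
    return (name, score) if score >= min_score else (None, score)

name, score = best_match([0.85, 0.2, 0.05])
print(name)  # at_will_clause
```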
The system was prototyped with just a few dozen examples of each document type, with manually written prompts for feature extraction. While the presenter noted it was too early to confirm the hypothesis, the prototype was built in approximately one week and initial feedback has been positive.
The presenter emphasized that deploying LLMs in production still requires strong infrastructure expertise:
The presenter described evaluation metrics as “very important” for understanding whether models perform as well in production as they did in testing, and whether performance degrades over time. He was candid, however, that at their current scale, evaluation is largely manual:
This represents an honest assessment that startups often must balance ideal practices against resource constraints, with more automated evaluation planned as they scale.
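Even manual evaluation benefits from a little structure. One lightweight approach (not necessarily Anzen's) is to record reviewer verdicts on a sample of predictions and track accuracy per period, so degradation over time becomes visible:

```python
# Aggregate manual spot-checks into per-period accuracy so drift shows up.
from collections import defaultdict

reviews = [
    # (period, model_prediction, reviewer_label) -- illustrative data
    ("2023-01", "application", "application"),
    ("2023-01", "application", "other"),
    ("2023-02", "other", "other"),
    ("2023-02", "application", "application"),
]

def accuracy_by_period(reviews):
    hits, totals = defaultdict(int), defaultdict(int)
    for period, predicted, actual in reviews:
        totals[period] += 1
        hits[period] += predicted == actual
    return {p: hits[p] / totals[p] for p in totals}

print(accuracy_by_period(reviews))
# {'2023-01': 0.5, '2023-02': 1.0}
```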
An important operational insight: when using external APIs, iteration costs can be significant or even prohibitive. Since business logic on top of model outputs requires frequent changes, and testing different prompts or configurations requires repeated API calls, these costs must be factored into planning. This is a practical consideration often overlooked in proof-of-concept work.
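A back-of-envelope budget makes the point concrete. All numbers below are hypothetical placeholders, not actual API prices or Anzen's figures:

```python
# Rough weekly cost of iterating on prompts/configs against a paid API:
# every change re-runs the whole regression set.
PRICE_PER_1K_TOKENS = 0.03   # hypothetical blended $/1K tokens
TOKENS_PER_CALL = 2_000      # prompt + completion
TEST_DOCS = 50               # regression set re-run on every change
ITERATIONS_PER_WEEK = 40     # prompt tweaks, threshold changes, etc.

weekly_cost = (PRICE_PER_1K_TOKENS / 1000) * TOKENS_PER_CALL * TEST_DOCS * ITERATIONS_PER_WEEK
print(f"${weekly_cost:,.2f} per week")  # $120.00 per week
```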
During the Q&A, the presenter discussed the unique challenges of debugging LLM-based systems:
Anzen has been cautious about deploying generative models like GPT-4 in production, focusing instead on proven classification and extraction approaches. However, they’re exploring:
The presenter offered balanced perspectives on when LLMs are and aren’t appropriate:
The discussion also touched on the fundamental challenge of output reliability: no amount of prompt engineering can force models to produce exact desired formats, and experiments with AI agent loops (popular at the time) often failed to produce consistent results.
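A common mitigation, since prompting alone cannot guarantee an exact output format, is to validate the model's output and retry with corrective feedback. The model call below is a stub that fails once and then complies; real code would call an API:

```python
# Validate-and-retry loop for JSON output. The stub "model" first replies
# with prose-wrapped JSON, then returns valid JSON on the retry.
import json

_responses = iter(['Here is the JSON: {"status": "ok"}', '{"status": "ok"}'])

def call_model(prompt: str) -> str:  # stand-in for a real API call
    return next(_responses)

def get_json(prompt: str, max_retries: int = 3) -> dict:
    for _ in range(max_retries):
        raw = call_model(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Feed the failure back so the model can correct itself.
            prompt += "\nRespond with valid JSON only, no prose."
    raise ValueError("no valid JSON after retries")

result = get_json("Classify this offer letter clause.")
print(result)  # {'status': 'ok'}
```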
The case study demonstrates that small teams can effectively deploy LLMs by:
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Octus, a leading provider of credit market data and analytics, migrated their flagship generative AI product Credit AI from a multi-cloud architecture (OpenAI on Azure and other services on AWS) to a unified AWS architecture using Amazon Bedrock. The migration addressed challenges in scalability, cost, latency, and operational complexity associated with running a production RAG application across multiple clouds. By leveraging Amazon Bedrock's managed services for embeddings, knowledge bases, and LLM inference, along with supporting AWS services like Lambda, S3, OpenSearch, and Textract, Octus achieved a 78% reduction in infrastructure costs, 87% decrease in cost per question, improved document sync times from hours to minutes, and better development velocity while maintaining SOC2 compliance and serving thousands of concurrent users across financial services clients.
Coinbase developed CB-GPT, an enterprise GenAI platform, to address the challenges of deploying LLMs at scale across their organization. Initially focused on optimizing cost versus accuracy, they discovered that enterprise-grade LLM deployment requires solving for latency, availability, trust and safety, and adaptability to the rapidly evolving LLM landscape. Their solution was a multi-cloud, multi-LLM platform that provides unified access to models across AWS Bedrock, GCP VertexAI, and Azure, with built-in RAG capabilities, guardrails, semantic caching, and both API and no-code interfaces. The platform now serves dozens of internal use cases and powers customer-facing applications including a conversational chatbot launched in June 2024 serving all US consumers.
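Semantic caching, one of the CB-GPT platform features mentioned, can be illustrated in miniature: reuse a previous response when a new query's embedding is close enough to a cached one. The bag-of-words embedder and threshold here are toy stand-ins for a real sentence-embedding model:

```python
# Minimal semantic cache: look up cached responses by embedding similarity
# instead of exact string match.
import math

def embed(text: str) -> dict:
    words = text.lower().split()
    return {w: words.count(w) for w in set(words)}

def similarity(a: dict, b: dict) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # (embedding, response) pairs

    def get(self, query: str):
        q = embed(query)
        for emb, response in self.entries:
            if similarity(q, emb) >= self.threshold:
                return response
        return None

    def put(self, query: str, response: str):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("what is the BTC price today", "fetch_price(BTC)")
hit = cache.get("what is the BTC price now")  # paraphrase still hits
print(hit)  # fetch_price(BTC)
```

In production the win is skipping an expensive LLM call for near-duplicate queries; the threshold trades cache hit rate against the risk of serving a stale or mismatched answer.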