ZenML

Building Robust Legal Document Processing Applications with LLMs

Anzen 2023

The case study explores how Anzen builds robust LLM applications for processing insurance documents in environments where accuracy is critical. They employ a multi-model approach combining specialized models like LayoutLM for document structure analysis with LLMs for content understanding, implement comprehensive monitoring and feedback systems, and use fine-tuned classification models for initial document sorting. Their approach demonstrates how to effectively handle LLM hallucinations and build production-grade systems with high accuracy (99.9% for document classification).

Industry

Insurance

Overview

This presentation by Cam Featstra from Anzen provides a comprehensive practitioner’s guide to building robust production applications using generative AI models, particularly in domains where correctness is paramount. Anzen operates in the insurance industry, processing insurance applications at scale, making accuracy a non-negotiable requirement. The talk acknowledges both the promise and limitations of current LLM technology, offering practical strategies for deploying these models in production while mitigating the well-known hallucination problem.

The presentation opens with a humorous reference to lawyers who were fined for using ChatGPT-generated legal citations that turned out to be fabricated—a cautionary tale that underscores the central challenge the talk addresses. While creative writing applications can tolerate some degree of model “creativity,” insurance document processing and similar high-stakes domains require a much more rigorous approach.

Understanding the Hallucination Problem

The speaker emphasizes that hallucination is not a new problem, referencing a 2018 paper titled “Actively Avoiding Nonsense in Generative Models.” Much of what we call hallucination stems from out-of-distribution queries—asking models for information that wasn’t in their training data. This is fundamentally similar to challenges with any predictive model, including simple linear regression, though quantifying what’s “out of distribution” for language models is considerably more complex since the same semantic content can be expressed in countless ways.

Beyond pure hallucination (making things up entirely), the presentation notes that LLMs can be “wrong” in other ways due to how they process tokens. The example given is ChatGPT incorrectly stating that “mustard” contains two letter “n”s when it contains none—a quirk of tokenization rather than lack of knowledge.

An important operational consideration highlighted is that third-party models like those from OpenAI are constantly changing under the hood. The speaker references research showing GPT-4’s performance on certain math questions dropped dramatically over a few months while GPT-3.5 improved significantly during the same period. This model drift creates significant challenges for production systems, as software that performs well at deployment may suddenly degrade without any changes to the application code.
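One common mitigation for this kind of drift is to pin dated model snapshots rather than a moving alias, and to record the pinned version with every request so results remain attributable. A minimal sketch (the snapshot names and `client_call` wrapper are illustrative, not Anzen's actual configuration):

```python
# Pin dated snapshots (a naming convention OpenAI has used, e.g.
# "gpt-4-0613") instead of a moving alias like "gpt-4", so behavior
# stays stable until you deliberately upgrade.
MODEL_PINS = {
    "extraction": "gpt-4-0613",            # illustrative snapshot names
    "classification": "gpt-3.5-turbo-0613",
}

def call_with_pin(task, prompt, client_call):
    """Invoke the model pinned for `task` and record which pin was used."""
    model = MODEL_PINS[task]
    response = client_call(model=model, prompt=prompt)  # your API wrapper
    return {"model": model, "prompt": prompt, "response": response}
```

Logging the pin alongside each response makes it possible to correlate quality regressions with a specific model version rather than guessing.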

Practical Strategies for Production LLM Systems

Structured Outputs with Function Calls

One of the most actionable recommendations is the use of function calls (or similar structured output mechanisms) to constrain model responses. The presentation demonstrates this with a recipe generation example. When simply asking ChatGPT to output JSON for a recipe, the model complies but uses an arbitrary schema. Adding detailed prompt instructions for specific fields is tedious and unreliable.

Function calls solve this elegantly by defining the exact schema expected. The model then returns data conforming precisely to that structure. This is both more reliable and requires less prompt engineering than trying to specify format requirements in natural language. For production applications that need to parse and process LLM outputs programmatically, this is essential.
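The recipe example above can be sketched as follows. The schema and field names here are hypothetical stand-ins; in a real call you would pass the tool definition to the chat API and read the arguments back from the model's tool call:

```python
import json

# Hypothetical recipe schema, expressed as the JSON Schema that
# OpenAI-style function calling expects in the "parameters" field.
RECIPE_TOOL = {
    "type": "function",
    "function": {
        "name": "record_recipe",
        "description": "Record a recipe in a fixed structure.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "servings": {"type": "integer"},
                "ingredients": {"type": "array", "items": {"type": "string"}},
                "steps": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["title", "ingredients", "steps"],
        },
    },
}

def parse_tool_arguments(raw_json: str) -> dict:
    """Parse tool-call arguments and verify the required fields are present."""
    data = json.loads(raw_json)
    required = RECIPE_TOOL["function"]["parameters"]["required"]
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"model omitted required fields: {missing}")
    return data
```

Even with structured outputs, validating the parsed arguments against the declared schema is worthwhile, since it turns a malformed response into an explicit error instead of a silent downstream failure.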

Retrieval-Augmented Generation and Context Management

The presentation addresses the fundamental issue that many hallucinations occur when models lack the necessary information to answer correctly. The solution is to provide relevant context as part of the prompt. However, this introduces new challenges around context length limits and token costs.

Several strategies for managing context efficiently are discussed, centered on providing the model only the information it actually needs.

Document Preprocessing: The Anzen Example

A concrete example from Anzen’s production system illustrates the importance of data quality. When processing insurance applications (which typically contain complex layouts with checkboxes, tables, and varied formatting), naive OCR produces poor results that lose the structural relationships between questions and answers.

Anzen’s solution employs LayoutLM, a specialized model for understanding document layouts, to first identify which parts of a document are questions versus answers. This semantic understanding is combined with OCR results to reconstruct a text representation that preserves the logical structure of the original document. Only this cleaned, structured representation is then provided to the LLM for further processing.
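The reconstruction step described above can be illustrated with a toy example: given OCR regions that a layout model has labeled as questions or answers, emit a flat transcript that preserves the question-answer pairing. The `(label, y, x, text)` region format is an assumption of this sketch, not Anzen's actual data model:

```python
def reconstruct(regions):
    """Pair each answer with the nearest preceding question in reading order.

    regions: list of (label, y, x, text) tuples, where label is
    "question" or "answer" as assigned by a layout model like LayoutLM.
    """
    ordered = sorted(regions, key=lambda r: (r[1], r[2]))  # rough reading order
    lines, current_q = [], None
    for label, _, _, text in ordered:
        if label == "question":
            current_q = text
        elif label == "answer" and current_q is not None:
            lines.append(f"Q: {current_q}\nA: {text}")
            current_q = None
    return "\n\n".join(lines)
```

The resulting "Q:/A:" transcript is the kind of cleaned, structured representation that can then be handed to the LLM without losing the relationships a naive OCR dump destroys.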

This exemplifies the broader principle of “garbage in, garbage out”—investing significant effort in data preparation before engaging the LLM pays dividends in output quality. The speaker emphasizes this is not a new concept but remains critically important.

Strategic Use of Fine-Tuned Classification Models

A key architectural recommendation is to use generative models only for tasks that truly require generative capabilities. For classification tasks, fine-tuned smaller models trained on curated datasets often outperform general-purpose LLMs while being faster and cheaper.

At Anzen, rather than asking an LLM to both identify document types and extract information, a dedicated classification model first determines whether a document is an insurance application. Only confirmed insurance applications proceed to the more expensive LLM processing step. This separation of concerns constrains the problem space for the LLM and has achieved 99.9% accuracy across multiple document types, according to the speaker.
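The gating pattern is simple to express. In this sketch, `classify` and `extract` are placeholders for a fine-tuned classifier and an LLM extraction step respectively; the label name and confidence threshold are illustrative:

```python
def process_document(text, classify, extract, threshold=0.9):
    """Run the cheap classifier first; only confirmed matches reach the LLM."""
    label, confidence = classify(text)
    if label != "insurance_application" or confidence < threshold:
        # Skip the expensive LLM call entirely for non-matching documents.
        return {"status": "skipped", "label": label, "confidence": confidence}
    return {"status": "processed", "fields": extract(text)}
```

Besides saving cost, this keeps the LLM's problem space narrow: by the time it sees a document, the document's type is already known.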

The presentation notes that effective fine-tuned classifiers don’t require massive datasets—a few hundred examples can yield good performance, and a few thousand can produce excellent results.

System Architecture for Production Reliability

Feedback Mechanisms

The speaker stresses that any production AI system must incorporate feedback mechanisms. Given that third-party models can change behavior without notice and input data distributions evolve over time, detecting performance degradation quickly is essential.

The ideal implementation is a first-class product feature allowing users to report incorrect results. This feeds into dashboards and alerting systems that enable engineering teams to identify and respond to problems rapidly. Usage metrics can serve as a rough proxy, but explicit feedback about correctness is far more valuable.

Beyond immediate issue detection, feedback data becomes a valuable asset for continuous improvement. It can be used to fine-tune models, identify specific failure modes, and create a positive feedback loop where the system improves over time.
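A minimal version of such a feedback loop tracks a rolling error rate over explicit user correctness reports and raises an alert when it crosses a threshold. This sketch omits persistence and dashboards, which a real system would add:

```python
from collections import deque

class FeedbackMonitor:
    """Rolling error rate over explicit user correctness reports."""

    def __init__(self, window=200, alert_threshold=0.05):
        self.events = deque(maxlen=window)   # True = user confirmed correct
        self.alert_threshold = alert_threshold

    def record(self, correct: bool):
        self.events.append(bool(correct))

    def error_rate(self) -> float:
        if not self.events:
            return 0.0
        return 1.0 - sum(self.events) / len(self.events)

    def should_alert(self) -> bool:
        # Only alert once the window is full, to avoid noisy early readings.
        return (len(self.events) == self.events.maxlen
                and self.error_rate() > self.alert_threshold)
```

The same recorded events double as labeled data for later fine-tuning, which is what turns incident response into the positive feedback loop the talk describes.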

Comprehensive Monitoring

Beyond user feedback, robust production systems require comprehensive monitoring infrastructure covering the behavior of the system over time.

This monitoring enables rapid response when issues arise, including the ability to roll back changes if necessary.

Honest Assessment of Current Limitations

The presentation maintains a realistic perspective on the current state of LLM technology. The speaker explicitly notes that open-source models at the time of the presentation didn’t meaningfully compete with OpenAI’s offerings for production use cases. While acknowledging that models continue to improve rapidly, the core message is that “throwing a bunch of information at ChatGPT” isn’t sufficient for production applications, especially in domains requiring correctness.

The speaker acknowledges the rapidly evolving landscape, noting that even in the weeks between preparing the presentation and delivering it, content needed updating. This underscores the importance of robust architectures and monitoring—techniques and capabilities that work today may need adjustment as the underlying technology evolves.

Key Takeaways

The fundamental principle articulated is that while you cannot solve hallucination at the model level, you can build reliable applications around it: constrain outputs with structured schemas, ground prompts in relevant context, invest heavily in data preparation, reserve generative models for genuinely generative tasks, and close the loop with feedback and monitoring.

The Anzen case study demonstrates these principles in action within the insurance industry, where the combination of specialized document processing (LayoutLM, OCR), fine-tuned classifiers, and carefully architected LLM pipelines achieves production-grade reliability on complex document understanding tasks.
