## Overview
This case study comes from a conference talk by Ben Lee Wilds, who leads engineering at Harvey, a legal AI company. Harvey provides AI-powered tools for lawyers and legal professionals, offering products ranging from general-purpose assistants for document drafting and summarization to large-scale document extraction tools and domain-specific agents and workflows. The company serves nearly 400 customers globally, including one-third of the largest 100 US law firms and eight of the top ten largest firms.
The talk focuses on the unique challenges of building and evaluating AI products in the legal domain and presents Harvey's comprehensive approach to LLMOps, particularly around evaluation strategies that combine human judgment with automated assessment methods.
## The Legal AI Challenge
The legal domain presents particularly difficult challenges for LLM-based applications that go beyond typical enterprise use cases. Lawyers work with extraordinarily complex documents—often hundreds or thousands of pages long—that contain extensive cross-references to other documents, case law, and legislation. The documents themselves can be challenging from a document understanding perspective, featuring handwriting, scanned notes, multi-column layouts, embedded tables, and multiple mini-pages on single pages.
The outputs required are equally complex: long-form text, detailed tables, and sometimes diagrams or charts for reports, all written in the specialized language legal professionals expect. Critically, mistakes in this domain can be career-impacting, making verification essential. This isn't just about preventing outright hallucinations, but about catching subtle misinterpretations or misconstrued statements that are "just not quite factually correct."
A particularly important insight from the talk is that quality in legal AI is highly nuanced and subjective. The speaker presented an example of two responses to the same document understanding question—both factually correct with no hallucinations—yet one was strongly preferred by in-house lawyers due to additional nuance and detail in the definitions. This subjectivity makes automated evaluation extremely challenging.
## Product Development Philosophy
Harvey's approach to building legal AI products rests on three core principles that directly shape their LLMOps practices:
**Applied AI Focus**: The company emphasizes that success requires combining state-of-the-art AI with best-in-class UI. It's not enough to have the best model; the AI must be packaged in a way that meets customers where they are and solves real-world problems.
**Lawyer-in-the-Loop**: This is perhaps the most distinctive aspect of Harvey's approach. Lawyers are embedded at every stage of the product development process, working side-by-side with engineers, designers, and product managers. They contribute to identifying use cases, collecting datasets, creating evaluation rubrics, iterating on UI, and conducting end-to-end testing. This approach acknowledges that the complexity and nuance in legal work require genuine domain expertise that engineers alone cannot replicate.
**Prototype Over PRD**: Rather than relying on detailed product requirement documents, Harvey emphasizes rapid prototyping and iteration. They've invested heavily in their own AI prototyping stack to iterate on prompts, algorithms, and UI simultaneously. This approach makes work tangible and accelerates learning cycles.
The workflow for building a new feature follows a collaborative pattern: lawyers provide initial context about what a document type is, how it's used, and when it comes up in daily work. They then collaborate with engineers to build the algorithm and evaluation dataset. Engineers build a prototype, and the team iterates through multiple cycles of reviewing outputs until the results meet expert standards.
## Three-Layered Evaluation Framework
Harvey's evaluation strategy operates at three distinct levels, reflecting the complexity of assessing legal AI quality:
### Human Preference Judgments
Human evaluation remains Harvey's highest-quality signal, and significant effort goes into improving the throughput and efficiency of collecting human preference data. The primary tool is the classic side-by-side evaluation, where human raters (often trained attorneys, given the domain expertise required) compare two responses to the same query.
For each comparison, raters are asked to provide relative preferences, numerical ratings on a scale (typically 1-7, from "very bad" to "very good"), and qualitative feedback. These evaluations drive launch decisions for new models, prompts, or algorithms. Harvey has built custom tooling to scale these evaluations, allowing them to be run routinely across many different tasks.
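To make the shape of this data concrete, here is a minimal sketch of how such side-by-side judgments could be captured and rolled up into launch signals. The field names and aggregation are illustrative assumptions, not Harvey's actual schema or tooling.

```python
from dataclasses import dataclass
from statistics import mean


# Illustrative record for a single side-by-side comparison; field names
# are hypothetical, not Harvey's actual schema.
@dataclass
class SideBySideJudgment:
    query_id: str
    rater_id: str
    preferred: str               # "A" (baseline), "B" (candidate), or "tie"
    rating_a: int                # 1-7, from "very bad" to "very good"
    rating_b: int                # 1-7
    qualitative_feedback: str = ""


def summarize(judgments: list[SideBySideJudgment]) -> dict:
    """Aggregate raw judgments into the kinds of signals that inform launch decisions."""
    decisive = [j for j in judgments if j.preferred != "tie"]
    win_rate_candidate = sum(j.preferred == "B" for j in decisive) / max(len(decisive), 1)
    return {
        "n_judgments": len(judgments),
        "win_rate_candidate": win_rate_candidate,
        "mean_rating_baseline": mean(j.rating_a for j in judgments),
        "mean_rating_candidate": mean(j.rating_b for j in judgments),
    }
```

The qualitative feedback field matters as much as the numbers: Harvey repeatedly emphasizes that rater comments feed directly back into prompt and product iteration.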
### Automated Model-Based Evaluations (LLM-as-Judge)
Given the cost and time requirements of human evaluation—especially when using domain experts like trained attorneys—Harvey seeks to leverage automated evaluations wherever possible. However, they acknowledge significant challenges with applying existing academic benchmarks to real-world legal work.
The speaker specifically critiques benchmarks like LegalBench, which tend to feature simple yes/no questions with no reference to external materials. Real legal work involves complex, open-ended questions with subjective answers that require analyzing extensive external documents.
To address this gap, Harvey built their own benchmark called **BigLawBench**, which contains complex open-ended tasks with subjective answers that more closely mirror actual legal work. An example question might ask to "analyze these trial documents, draft an analysis of conflicts, gaps, contradictions, etc."—with expected outputs spanning multiple paragraphs.
For automated evaluation of these complex outputs, Harvey develops detailed rubrics broken into categories:
- **Structure**: Is the response formatted appropriately (e.g., as a table with specific columns)?
- **Style**: Does the response emphasize actionable advice?
- **Substance**: Does the response correctly state specific facts from the source documents?
- **Accuracy**: Does the response contain hallucinations or misconstrued information?
Critically, all evaluation criteria are crafted by in-house domain experts and are distinct for each question-answer pair, representing substantial investment in evaluation infrastructure.
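To illustrate how rubric-driven LLM-as-judge scoring can work in practice, the sketch below grades a single response against per-question criteria mirroring the four categories above. The rubric text, judge prompt, and `llm` callable are hypothetical stand-ins under stated assumptions, not Harvey's implementation.

```python
import json

# Hypothetical per-question rubric mirroring the categories above. In Harvey's
# approach, each question-answer pair gets its own expert-written criteria;
# these are illustrative stand-ins.
RUBRIC = {
    "structure": "Is the response formatted as a table with the requested columns?",
    "style": "Does the response emphasize actionable advice?",
    "substance": "Does the response correctly state the key facts from the source documents?",
    "accuracy": "Is the response free of hallucinations or misconstrued statements?",
}

JUDGE_PROMPT = """You are grading a legal AI response against expert-written criteria.
Source documents:
{sources}

Question:
{question}

Response under evaluation:
{response}

For each criterion, give a score from 0 to 1 and a one-sentence justification.
Criteria:
{criteria}

Return JSON: {{"scores": {{criterion_name: score}}, "justifications": {{criterion_name: text}}}}"""


def judge_response(llm, question: str, response: str, sources: str, rubric: dict) -> dict:
    """Score one response with an LLM judge. `llm` is any callable that takes a
    prompt string and returns the model's text completion (an assumption here)."""
    criteria = "\n".join(f"- {name}: {text}" for name, text in rubric.items())
    prompt = JUDGE_PROMPT.format(
        sources=sources, question=question, response=response, criteria=criteria
    )
    raw = llm(prompt)
    parsed = json.loads(raw)  # assumes the judge model returns valid JSON
    scores = parsed["scores"]
    parsed["overall"] = sum(scores.values()) / len(scores)
    return parsed
```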
### Step-wise Workflow Evaluation
For complex multi-step workflows and agents, Harvey breaks problems into components to evaluate each step separately, making the overall evaluation problem more tractable. The speaker uses RAG as a canonical example, where a typical pipeline includes query rewriting, chunk/document retrieval, answer generation, and citation creation. Each step can be evaluated independently.
This decomposition approach allows greater use of automated evaluations and is applied to complex workflows, citation accuracy, and agentic processes. The principle extends to any multi-step process where intermediate outputs can be assessed against clear criteria.
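As a rough illustration of this decomposition for a RAG-style pipeline, the sketch below scores the retrieval and citation steps independently; metric choices and function names are illustrative assumptions rather than Harvey's actual checks.

```python
# Minimal sketch of step-wise RAG evaluation: each pipeline stage (query
# rewriting, retrieval, answer generation, citation creation) is scored
# against its own reference data. Metrics shown here are illustrative.

def eval_retrieval(retrieved_ids: list[str], relevant_ids: set[str], k: int = 10) -> dict:
    """Recall@k and precision@k for the retrieval step."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return {
        "recall_at_k": hits / max(len(relevant_ids), 1),
        "precision_at_k": hits / max(len(top_k), 1),
    }


def eval_citations(cited_quotes: list[str], source_text: str) -> dict:
    """Fraction of cited quotes that appear verbatim in the source documents."""
    grounded = sum(1 for quote in cited_quotes if quote in source_text)
    return {"citation_precision": grounded / max(len(cited_quotes), 1)}


def eval_pipeline(example: dict, pipeline_output: dict) -> dict:
    """Combine per-step scores into one report for a single evaluation example."""
    report = {}
    report.update(eval_retrieval(pipeline_output["retrieved_ids"], set(example["relevant_ids"])))
    report.update(eval_citations(pipeline_output["cited_quotes"], example["source_text"]))
    # Answer quality itself would go to an LLM judge or a human rater.
    return report
```

Because each step has an unambiguous reference (the relevant documents, the source text), these checks can run automatically and frequently, reserving expensive human review for the end-to-end output.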
## Real-World Launch Case Study: GPT-4.1
The speaker provided a concrete example of their evaluation process in action when OpenAI released GPT-4.1 (April 2025). Harvey received early access to the model and followed a structured evaluation process:
First, they ran BigLawBench to get a rough quality assessment. GPT-4.1 performed better than other foundation models in the context of Harvey's AI systems, showing promising results.
Next, they conducted human rater evaluations. Comparing their baseline system against the new 4.1-based system, they found the new system's responses skewed significantly toward higher ratings on the 1-7 scale.
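As a sketch of how such a rating shift might be quantified before a launch decision, the function below compares two sets of 1-7 ratings and puts a simple bootstrap interval on the mean difference. This is an assumption about methodology for illustration, not Harvey's actual analysis.

```python
import random
from collections import Counter
from statistics import mean


def rating_shift(baseline: list[int], candidate: list[int],
                 n_boot: int = 2000, seed: int = 0) -> dict:
    """Summarize how candidate ratings (1-7) compare with the baseline,
    with a bootstrap 95% interval on the difference in means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        b = [rng.choice(baseline) for _ in baseline]
        c = [rng.choice(candidate) for _ in candidate]
        diffs.append(mean(c) - mean(b))
    diffs.sort()
    return {
        "baseline_hist": dict(sorted(Counter(baseline).items())),
        "candidate_hist": dict(sorted(Counter(candidate).items())),
        "mean_shift": mean(candidate) - mean(baseline),
        "ci95": (diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]),
    }
```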
Despite the positive results, they ran additional tests on product-specific datasets to understand where the model worked well and where it had shortcomings. They also conducted extensive internal dogfooding to collect qualitative feedback.
This additional testing uncovered regressions that automated metrics wouldn't catch—for example, GPT-4.1 was much more likely to start every response with "Certainly!" which was off-brand for Harvey's product. These issues had to be addressed before rollout.
## Tools and Infrastructure
Harvey uses a combination of commercial and custom tooling for their evaluation infrastructure. They leverage LangSmith extensively for a subset of their evaluations, particularly routine evaluations related to step-wise task breakdowns. However, they've also built their own tools for human rater-focused evaluations, suggesting that no single platform meets all their needs. The speaker encourages finding what works best rather than committing to a single solution.
## Key Learnings
**Invest in Tooling ("Sharpen Your Axe")**: Evaluation is fundamentally an engineering problem. Investment in strong tooling, processes, and documentation pays back quickly—the speaker suggests 10-fold returns. Better tooling made it easier to run evaluations, which led to more teams using them more frequently, improving iteration speed and product quality while increasing confidence in launches.
**Taste Matters Alongside Metrics**: While rigorous, repeatable evaluations are critical, human judgment, qualitative feedback, and taste remain essential. The team learns substantially from qualitative feedback from raters, internal dogfooding, and customers. Many product improvements don't impact evaluation metrics meaningfully but clearly make the product better—by making it faster, more consistent, or easier to use.
**Process Data is the Future**: Looking forward, the speaker argues that building domain-specific agentic workflows requires "process data"—information about how things actually get done within organizations. For example, an M&A transaction involves months of work across hundreds of subtasks, but there's no written playbook. This knowledge often exists only in hallway conversations or handwritten notes. Extracting and applying this kind of process data to models could drive the next breakthroughs in agentic systems.
## Citation and Verification Features
Harvey includes citation features to ground all AI statements in verifiable sources, allowing users to verify that summaries and generated content are correct and acceptable. This addresses the career-impacting nature of mistakes in legal work and reflects the domain's emphasis on accuracy and verifiability.
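One simple way to support this kind of verification is to surface, for each AI-generated statement, the closest passage in the source documents so a reviewer can check it directly. The sketch below uses basic fuzzy matching purely as an illustration; the talk does not describe Harvey's citation system at this level of detail.

```python
import difflib
import re


def link_statement_to_source(statement: str, source_text: str, cutoff: float = 0.6) -> dict:
    """Find the source sentence most similar to an AI-generated statement so a
    reviewer can verify it against the underlying document."""
    # Naive sentence splitter; real legal documents need more careful segmentation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", source_text) if s.strip()]
    matches = difflib.get_close_matches(statement, sentences, n=1, cutoff=cutoff)
    if not matches:
        return {"statement": statement, "supported": False, "closest_source": None}
    best = matches[0]
    score = difflib.SequenceMatcher(None, statement, best).ratio()
    return {
        "statement": statement,
        "supported": True,
        "closest_source": best,
        "similarity": round(score, 2),
    }
```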
## Product Capabilities
The talk mentions several product capabilities that represent production LLM applications:
- Document summarization and drafting
- Firm-specific customization using internal knowledge bases and templates
- Large-scale document analysis (analyzing hundreds or thousands of documents simultaneously)
- Red-line analysis workflows
- Multi-step agentic search
- Personalization and memory features
- Long-running task execution
These capabilities reflect a mature LLMOps practice spanning RAG implementations, agent orchestration, and domain-specific adaptation through prompt engineering and customization layers.