## Overview
Harvey builds and evaluates AI systems for the legal industry, where accuracy and trustworthiness are paramount. This case study examines how the company has developed comprehensive evaluation methodologies to ensure its production systems meet the rigorous standards required for professional legal work. Harvey provides AI-powered assistance to legal professionals across multiple domains, including document analysis, legal research, contract review, and regulatory compliance.
The company's approach to LLMOps is particularly noteworthy because it addresses the unique challenges of deploying AI in high-stakes professional environments where errors can have significant consequences. Its evaluation strategy serves as a model for how organizations can systematically assess and improve AI performance in specialized domains through a combination of expert knowledge, automated testing, and robust data management practices.
## The Challenge of Evaluating Legal AI
Harvey's evaluation challenge stems from the complexity and high-stakes nature of legal work. When a tax attorney queries the system about multinational tax implications, the response must not only be accurate but also properly sourced with relevant citations to tax codes and court interpretations. Unlike general-purpose AI applications, legal AI systems must meet professional standards where incorrect information can lead to serious consequences for clients and practitioners.
The company identified several key evaluation challenges that are common in specialized AI applications. Traditional automated metrics often fail to capture the nuanced requirements of professional work, while purely manual evaluation approaches cannot scale to cover the breadth of use cases and continuous iteration cycles required in production environments. Additionally, the specialized nature of legal work requires domain expertise that is both expensive and limited in availability.
## Three-Pillar Evaluation Strategy
Harvey's solution centers on a three-pillar evaluation framework that balances depth of expertise with scalability and operational efficiency, combining complementary evaluation methodologies into a single comprehensive assessment system.
### Expert-Led Reviews and Domain Collaboration
The first pillar involves deep collaboration with legal professionals who provide domain-specific insights and uphold professional standards. What distinguishes Harvey's approach is the directness of this collaboration - rather than working through layers of abstraction such as consultants or account managers, Harvey engineers regularly interact directly with partners from prestigious law firms. This creates an unusually tight feedback loop where engineers can get firsthand insights from professionals whose time is typically reserved for high-stakes legal work.
This direct collaboration extends to building expert-curated retrieval datasets, which are essential for evaluating the document retrieval components of their RAG (Retrieval-Augmented Generation) systems. Domain experts develop "golden" query sets that range from common user questions to highly nuanced legal challenges requiring deep expertise. For each query, experts identify the most relevant supporting documents, creating ground truth datasets that can be used to evaluate retrieval system performance.
The retrieval evaluation uses standard information retrieval metrics: precision (the proportion of retrieved documents that are relevant), recall (the proportion of relevant documents that are retrieved), and NDCG (Normalized Discounted Cumulative Gain), which rewards placing the most important documents at the top of the results. These metrics have proven highly predictive of real-world user satisfaction, providing a reliable signal for system improvements.
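As a minimal sketch of how these metrics can be computed against an expert-curated golden set (the document identifiers and graded relevance labels below are illustrative, not Harvey's data):

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are in the golden relevant set."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of the golden relevant documents that appear in the top-k results."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def ndcg_at_k(retrieved, relevance_grades, k):
    """NDCG@k: rewards placing highly relevant documents near the top.

    relevance_grades maps doc_id -> graded relevance (e.g. 0-3) assigned by experts.
    """
    dcg = sum(
        relevance_grades.get(doc, 0) / math.log2(rank + 2)
        for rank, doc in enumerate(retrieved[:k])
    )
    ideal = sorted(relevance_grades.values(), reverse=True)[:k]
    idcg = sum(grade / math.log2(rank + 2) for rank, grade in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Example: one golden query with expert-labeled relevant documents.
retrieved = ["doc_17", "doc_03", "doc_42", "doc_08", "doc_11"]
golden_relevant = {"doc_03", "doc_42", "doc_99"}
grades = {"doc_03": 3, "doc_42": 2, "doc_99": 3}   # expert graded relevance

print(precision_at_k(retrieved, golden_relevant, k=5))  # 0.4
print(recall_at_k(retrieved, golden_relevant, k=5))     # ~0.67
print(ndcg_at_k(retrieved, grades, k=5))
```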
Harvey also tests system performance under varying conditions of retrieval power and model context utilization, helping them understand tradeoffs between quality, speed, and cost. This is particularly important in agentic systems where retrieval is not a one-time step but an iterative process involving search, reflection, and context refinement.
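A hypothetical sketch of such a tradeoff sweep is shown below; `run_rag_pipeline` and the scoring callback are stand-ins for whatever pipeline and rubric are actually in use, not a published Harvey interface.

```python
import itertools
import time
from dataclasses import dataclass

@dataclass
class PipelineOutput:
    answer: str
    total_tokens: int

# Placeholder for one retrieval + generation pass; swap in the real pipeline here.
def run_rag_pipeline(query: str, top_k: int, context_token_budget: int) -> PipelineOutput:
    return PipelineOutput(answer="(stubbed answer)",
                          total_tokens=min(top_k * 400, context_token_budget))

def sweep_tradeoffs(golden_queries, score_answer,
                    top_k_values=(5, 10, 25), budgets=(4_000, 16_000, 64_000)):
    """Measure answer quality, latency, and token cost across retrieval settings."""
    results = []
    for top_k, budget in itertools.product(top_k_values, budgets):
        scores, latencies, tokens = [], [], []
        for query in golden_queries:
            start = time.perf_counter()
            out = run_rag_pipeline(query, top_k=top_k, context_token_budget=budget)
            latencies.append(time.perf_counter() - start)
            scores.append(score_answer(query, out.answer))   # rubric built from expert labels
            tokens.append(out.total_tokens)
        results.append({
            "top_k": top_k,
            "context_budget": budget,
            "mean_quality": sum(scores) / len(scores),
            "p50_latency_s": sorted(latencies)[len(latencies) // 2],
            "mean_tokens": sum(tokens) / len(tokens),
        })
    return results
```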
### Structured Answer Quality Assessment
Beyond document retrieval, Harvey has developed sophisticated methods for evaluating the quality of generated responses. They built an internal tool for side-by-side LLM comparisons that enables domain experts to assess responses in a structured, unbiased manner. The system supports two complementary evaluation protocols: A/B preference tests where experts choose between anonymized answers with randomized ordering, and Likert-scale ratings where experts independently rate answers on dimensions like accuracy, helpfulness, and clarity.
These evaluation protocols incorporate important methodological controls to reduce bias, including randomized ordering, standardized prompts, and anonymized content. This allows Harvey to detect statistically significant improvements when modifying prompts, pipelines, or underlying models.
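A minimal sketch of the blinding and randomization such a protocol requires might look like the following (the data structures are illustrative, not Harvey's internal tool):

```python
import random
import uuid
from dataclasses import dataclass, field

@dataclass
class BlindComparison:
    """One A/B task shown to an expert: two anonymized answers in random order."""
    task_id: str
    question: str
    answers: list                    # shuffled answer texts, no model identifiers
    key: dict = field(repr=False)    # position -> model, revealed only at analysis time

def build_blind_comparison(question, answer_by_model, rng=random):
    """answer_by_model: e.g. {"baseline": "...", "candidate": "..."}."""
    sides = list(answer_by_model.items())
    rng.shuffle(sides)               # randomized ordering to avoid position bias
    return BlindComparison(
        task_id=str(uuid.uuid4()),
        question=question,
        answers=[text for _, text in sides],
        key={position: model for position, (model, _) in enumerate(sides)},
    )
```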
A concrete example of this approach in action involved comparing GPT-4.1 with GPT-4o for complex legal questions. The evaluation revealed that GPT-4.1 significantly outperformed GPT-4o, with mean ratings improving by over 10% and median scores rising from 5 to 6 on a 7-point scale. This level of statistical rigor provides strong confidence for making infrastructure decisions about model deployment.
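As an illustration of the statistical check implied here, paired Likert ratings can be compared with a simple sign-flip permutation test; the scores below are synthetic placeholders, not Harvey's evaluation data.

```python
import random
import statistics

# Synthetic paired 7-point Likert ratings: each expert scored both models on the same question.
baseline  = [5, 4, 5, 5, 5, 4, 6, 5, 6, 4, 5, 5]
candidate = [6, 5, 7, 6, 6, 5, 6, 6, 7, 5, 6, 6]

diffs = [c - b for b, c in zip(baseline, candidate)]
observed = statistics.mean(diffs)

# Paired permutation (sign-flip) test: under the null, each difference's sign is arbitrary.
rng = random.Random(0)
n_permutations = 20_000
extreme = sum(
    statistics.mean(d * rng.choice((-1, 1)) for d in diffs) >= observed
    for _ in range(n_permutations)
)
p_value = extreme / n_permutations
print(f"mean improvement={observed:.2f}, one-sided p~{p_value:.4f}")
```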
### Automated Evaluation Pipelines
The third pillar addresses the limitations of expert-only evaluation through automated systems that enable continuous monitoring and rapid iteration. Harvey has developed automated evaluation pipelines that extend human feedback with data-driven methods, providing broader coverage without sacrificing depth or rigor.
Their automated systems integrate legal domain knowledge to go beyond generic benchmarks and capture the specific demands of professional legal workflows. The evaluation process considers multiple elements: the model's output, the original user request, relevant domain documentation, and expert-provided prior knowledge. The system produces both a grade reflecting quality standards and a confidence score indicating the reliability of that assessment.
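A schematic sketch of such a grader is shown below; the prompt wording, grading scale, and `llm_complete` callable are assumptions for illustration rather than Harvey's actual pipeline.

```python
import json

JUDGE_PROMPT = """You are grading a legal AI answer against professional standards.

User request:
{request}

Relevant source documents:
{documents}

Expert guidance for this task:
{expert_notes}

Answer under review:
{answer}

Return JSON with two fields:
  "grade": integer 1-7 for accuracy, sourcing, and helpfulness
  "confidence": float 0-1 for how certain you are in this grade
"""

def grade_answer(llm_complete, request, documents, expert_notes, answer):
    """Automated grader combining the user request, source documents, and expert priors.

    `llm_complete` is any callable taking a prompt string and returning the model's
    text completion; it stands in for whichever client is actually in use.
    """
    prompt = JUDGE_PROMPT.format(
        request=request,
        documents="\n---\n".join(documents),
        expert_notes=expert_notes,
        answer=answer,
    )
    result = json.loads(llm_complete(prompt))
    return int(result["grade"]), float(result["confidence"])
```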
These automated evaluations serve three core operational purposes. They run nightly canary evaluations to validate code changes before production deployment, catching regressions in sourcing accuracy, answer quality, and legal precision. They monitor anonymized production data to track performance trends while maintaining client confidentiality. And they evaluate newly released foundation models to identify performance gains and guide integration decisions.
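The nightly canary run can be thought of as a regression gate over a fixed evaluation set; a simplified sketch with invented dimension names and thresholds follows.

```python
from statistics import mean

# Baseline scores recorded from the last known-good build, per evaluation dimension.
BASELINE = {"sourcing_accuracy": 0.93, "answer_quality": 5.8, "citation_precision": 0.95}
TOLERANCE = 0.03  # allowed relative drop before the canary fails

def canary_check(eval_results):
    """eval_results: dict mapping dimension -> list of per-example scores from tonight's run."""
    failures = []
    for dimension, baseline in BASELINE.items():
        current = mean(eval_results[dimension])
        if current < baseline * (1 - TOLERANCE):
            failures.append(f"{dimension}: {current:.3f} vs baseline {baseline:.3f}")
    return failures

# Example: block the deploy if any dimension regressed beyond tolerance.
failures = canary_check({
    "sourcing_accuracy": [0.94, 0.91, 0.95],
    "answer_quality": [5.9, 6.1, 5.7],
    "citation_precision": [0.96, 0.93, 0.97],
})
if failures:
    raise SystemExit("Canary failed:\n" + "\n".join(failures))
```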
## Specialized Technical Approaches
### Knowledge Source Identification System
Harvey has developed specialized automated evaluation techniques for specific tasks, such as their Knowledge Source Identification system for verifying legal citations. This system addresses unique engineering challenges including high-volume fuzzy matching against millions of documents and proper weighting of metadata fields when citations are partial or ambiguous.
The solution employs a custom embedding pipeline that prioritizes document title similarity and accounts for source context. The process begins with structured metadata extraction from citations, parsing details like title, source collection, volume, page range, author, and publication date. When reliable publication data exists, the system queries an internal database for candidate documents. For partial metadata, it uses embedding-based retrieval with date filters.
Finally, an LLM performs binary document-matching evaluation to confirm whether retrieved candidates match the original citations. This multi-stage approach has achieved over 95% accuracy on attorney-validated benchmark datasets, demonstrating how specialized automated evaluation can achieve high precision for domain-specific tasks.
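The staged logic can be summarized in a sketch like the one below, where every component (metadata parser, document index, embedding retriever, and binary LLM matcher) is an injected callable standing in for the real system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CitationMetadata:
    title: str
    source_collection: Optional[str] = None
    volume: Optional[str] = None
    page_range: Optional[str] = None
    author: Optional[str] = None
    publication_date: Optional[str] = None

def verify_citation(citation_text, extract_metadata, lookup_by_metadata,
                    embedding_search, llm_is_match):
    """Staged citation verification with injected components."""
    meta = extract_metadata(citation_text)            # structured parse of the citation

    # Stage 1: exact lookup when reliable publication data exists.
    if meta.publication_date and meta.source_collection:
        candidates = lookup_by_metadata(meta)
    else:
        # Stage 2: embedding retrieval weighted toward title similarity, date-filtered.
        candidates = embedding_search(meta.title, date_filter=meta.publication_date)

    # Stage 3: binary LLM check that a candidate actually matches the citation.
    for candidate in candidates:
        if llm_is_match(citation_text, candidate):
            return candidate                          # verified source document
    return None                                       # citation could not be verified
```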
### Data Management and Infrastructure
Harvey's evaluation infrastructure includes a dedicated data service that addresses the operational challenges of organizing, labeling, and versioning evaluation data. This service is isolated from Harvey's primary application to prevent data leakage while providing complete control over access, updates, and versioning.
The system standardizes how inputs, outputs, and annotations are stored, ensuring consistency across legal experts, engineers, and automated evaluators. Fine-grained role-based access control enforces privacy policies at the row level, enabling data segmentation between public, confidential, and restricted tiers. This allows sensitive legal documents to remain under tight restrictions while enabling broader sharing of aggregate statistics and higher-level metrics.
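A minimal illustration of row-level, tier-based filtering, under assumed role names and tiers:

```python
from enum import Enum

class Tier(Enum):
    PUBLIC = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3

# Each role is cleared up to a maximum sensitivity tier (roles here are hypothetical).
ROLE_CLEARANCE = {
    "external_annotator": Tier.PUBLIC,
    "engineer": Tier.CONFIDENTIAL,
    "eval_admin": Tier.RESTRICTED,
}

def visible_rows(rows, role):
    """Row-level filter: a role only sees rows at or below its clearance tier."""
    clearance = ROLE_CLEARANCE[role]
    return [row for row in rows if row["tier"].value <= clearance.value]

rows = [
    {"id": 1, "tier": Tier.PUBLIC, "query": "statute of limitations overview"},
    {"id": 2, "tier": Tier.RESTRICTED, "query": "client-specific contract clause"},
]
print([r["id"] for r in visible_rows(rows, "engineer")])  # [1]
```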
Dataset versioning is implemented as a core principle, with published evaluation collections becoming immutable to ensure reproducible comparisons across experiments. This approach enhances reproducibility and helps teams confirm that quality improvements result from deliberate changes rather than shifting datasets or annotation drift.
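Immutable published collections can be modeled as frozen, content-addressed snapshots; the sketch below shows the general idea with a hypothetical dataset name.

```python
import hashlib
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class PublishedDataset:
    """Once published, a collection is frozen and addressed by a content hash."""
    name: str
    version: int
    examples: tuple     # immutable snapshot of the evaluation examples
    fingerprint: str

def publish(name, version, examples):
    snapshot = tuple(json.dumps(e, sort_keys=True) for e in examples)
    fingerprint = hashlib.sha256("\n".join(snapshot).encode()).hexdigest()
    return PublishedDataset(name=name, version=version, examples=snapshot,
                            fingerprint=fingerprint)

# Two experiments referencing the same (name, version, fingerprint) are guaranteed
# to have been scored on exactly the same examples.
ds_v1 = publish("contract-review-golden", 1,
                [{"query": "indemnity clause scope", "doc": "doc_42"}])
print(ds_v1.fingerprint[:12])
```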
## Production Operations and Continuous Improvement
Harvey's evaluation system operates continuously in production, providing ongoing monitoring and feedback for system improvements. The integration of automated pipelines with expert feedback creates a comprehensive quality assurance process that can catch both obvious errors and subtle quality degradations.
The system's ability to run lightweight canary evaluations nightly reflects operational maturity, catching regressions before they impact users. The combination of production monitoring with model vetting ensures that Harvey can rapidly adopt new foundation models while maintaining quality standards.
The evaluation infrastructure also supports rapid iteration cycles essential for AI system development. By automating much of the evaluation process while maintaining expert oversight for critical decisions, Harvey can test hypotheses and deploy improvements more quickly than would be possible with purely manual evaluation approaches.
## Challenges and Considerations
While Harvey's approach demonstrates sophisticated evaluation practices, the case study also highlights ongoing challenges in AI evaluation for specialized domains. The reliance on expert feedback, while valuable, introduces potential bottlenecks and scaling challenges as the system grows. The company acknowledges limitations including data scarcity for comprehensive evaluation, feedback latency from manual reviews, fragmented expertise across different legal domains, and regression risks without systematic large-scale metrics.
The automated evaluation systems, while impressive, still require careful calibration and validation against expert judgment to ensure they capture the nuances of legal work accurately. The complexity of legal reasoning and the high stakes of professional applications mean that purely automated approaches are insufficient, requiring the hybrid human-AI evaluation approach Harvey has developed.
## Technical Architecture Insights
The case study reveals several important architectural decisions that support Harvey's evaluation approach. The separation of evaluation data services from primary application infrastructure demonstrates good security and operational practices. The use of embedding-based retrieval for citation verification shows how modern NLP techniques can be applied to traditional legal research problems.
The implementation of both real-time production monitoring and batch evaluation processes shows how different evaluation modes can serve different operational needs. The integration of structured metadata extraction with LLM-based matching demonstrates how hybrid approaches can achieve higher accuracy than purely automated or manual methods.
## Industry Impact and Best Practices
Harvey's evaluation methodology provides a template for other organizations deploying AI in high-stakes professional environments. The emphasis on direct expert collaboration, rather than working through intermediaries, creates more effective feedback loops and better alignment between system capabilities and user needs.
The combination of rigorous statistical methods with domain expertise shows how traditional evaluation approaches can be adapted for AI systems. The focus on reproducibility through dataset versioning and systematic experimental design demonstrates maturity in AI operations practices.
The case study illustrates how specialized AI applications require evaluation approaches that go beyond general-purpose benchmarks and metrics. The development of domain-specific automated evaluation tools, validated against expert judgment, provides a path for scaling quality assurance in specialized AI applications.
## Future Directions and Implications
Harvey's work points toward several important directions for AI evaluation in specialized domains. The challenge of evaluating multi-step reasoning and agentic workflows represents a frontier area where current evaluation methods may need extension. The question of how to automate domain expert reviews more effectively remains an active area of development.
The case study demonstrates that successful AI evaluation in professional contexts requires investment in specialized infrastructure, close collaboration with domain experts, and a deep understanding of both AI capabilities and domain requirements. This level of investment may be necessary for AI applications where accuracy and reliability are critical success factors.