ZenML

Enterprise-Grade RAG Systems for Legal AI Platform

Harvey 2025

Harvey, a legal AI platform serving professional services firms, addresses the complex challenge of building enterprise-grade Retrieval-Augmented Generation (RAG) systems that can handle sensitive legal documents while maintaining high performance, accuracy, and security. The company leverages specialized vector databases like LanceDB Enterprise and Postgres with PGVector to power their RAG systems across three key data sources: user-uploaded files, long-term vault projects, and third-party legal databases. Through careful evaluation of vector database options and collaboration with domain experts, Harvey has built a system that achieves 91% preference over ChatGPT in tax law applications while serving users in 45 countries with strict privacy and compliance requirements.

Industry

Legal

Overview

Harvey represents a comprehensive case study in enterprise-grade LLMOps for the legal sector, demonstrating how specialized AI platforms must navigate complex technical and regulatory requirements when deploying large language models in production. As a legal AI platform serving professional services firms across 45 countries, Harvey has built sophisticated infrastructure to support multiple LLM-powered products including an AI Assistant, document Vault, Knowledge research tools, and custom Workflows. The company’s approach highlights the critical importance of Retrieval-Augmented Generation (RAG) systems in making LLMs effective for domain-specific applications while maintaining the security and compliance standards required by enterprise legal clients.

The case study illustrates several key LLMOps challenges unique to enterprise deployments: handling sensitive data with strict privacy requirements, achieving high performance at scale across multilingual document corpora, and building evaluation frameworks for complex domain-specific use cases where traditional ground-truth datasets don’t exist. Harvey’s technical architecture decisions around vector database selection, data isolation strategies, and domain expert collaboration provide valuable insights into operationalizing LLMs for highly regulated industries.

Technical Architecture and Data Complexity

Harvey’s RAG implementation handles three distinct types of data sources, each presenting unique technical challenges for LLMOps. The first category involves user-uploaded files in their Assistant product, typically 1 to 50 documents that persist for the duration of a conversation thread. These require immediate availability for querying, placing a premium on ingestion speed and real-time indexing. The second category encompasses user-stored documents in Vault projects, which can scale to 1,000 to 10,000 documents for long-term projects and require persistent storage with strong isolation guarantees between different clients’ data.

The third and most complex category involves private and public third-party data sources containing legal regulations, statutes, and case law. This represents a comprehensive corpus of legal knowledge spanning multiple jurisdictions and languages, requiring sophisticated understanding of legal document structures, cross-references, and temporal evolution of laws. The multilingual nature of this data, covering 45 countries, adds significant complexity to the embedding and retrieval pipeline, as the system must handle different legal systems, document formats, and linguistic nuances while maintaining consistent accuracy across regions.

Vector Database Evaluation and Selection

The case study provides detailed insight into Harvey’s systematic approach to vector database selection, which represents a critical infrastructure decision for enterprise LLMOps. Their evaluation framework focused on four key dimensions: scalability and performance, accuracy of retrieval, flexibility and developer tooling, and enterprise readiness including privacy and security features. This comprehensive evaluation approach demonstrates the complexity of infrastructure decisions in production LLM systems, where technical performance must be balanced against compliance and security requirements.

Harvey’s comparison between Postgres with PGVector and LanceDB Enterprise reveals important tradeoffs in vector database selection. Postgres with PGVector offers strong performance for smaller-scale applications, with sub-2-second latency for 500K embeddings and high accuracy through brute-force KNN or HNSW indexing. However, it faces limitations in horizontal scalability and data privacy isolation, as all data within a Postgres instance is centralized. LanceDB Enterprise, which Harvey ultimately selected for production use, provides superior scalability with unlimited horizontal scaling, better data privacy through decentralized storage in customer-controlled cloud buckets, and strong performance with sub-2-second latency for 15 million rows with metadata filtering.
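The tradeoff described here — exact brute-force KNN that stays fast at small scale versus approximate indexes like HNSW at large scale — can be made concrete with a minimal pure-Python sketch. This is illustrative only: Harvey’s production systems use PGVector and LanceDB, not code like this, and the document IDs are invented.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def brute_force_knn(query, embeddings, k=3):
    """Exact k-nearest-neighbor search: score every stored embedding.

    This is the brute-force KNN regime the case study describes for
    smaller corpora: perfectly accurate, but O(n) per query, which is
    why approximate indexes such as HNSW take over as corpora grow to
    millions of rows.
    """
    scored = [(doc_id, cosine_similarity(query, vec))
              for doc_id, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The brute-force scan recomputes every similarity per query; an HNSW index trades a small amount of recall for sub-linear query time, which is the crossover the latency figures above reflect.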

The selection criteria highlight enterprise-specific requirements that differ significantly from typical vector database use cases. Harvey requires the ability to deploy vector databases within private cloud environments and store both embeddings and source data in customer-controlled storage systems. This ensures that sensitive legal documents never leave the client’s domain, addressing stringent security and privacy requirements that are non-negotiable in legal applications.

Domain Expertise Integration and Evaluation Challenges

One of the most significant LLMOps challenges highlighted in Harvey’s case study is the development of evaluation frameworks for complex, domain-specific applications where traditional benchmarks are inadequate. The company addresses this through close collaboration with domain experts, exemplified by their partnership with PwC’s tax professionals to develop a Tax AI Assistant that achieves 91% preference over ChatGPT. This collaboration involves multiple phases: initial guidance on authoritative data sources and domain nuances, ongoing usage and detailed feedback collection, and iterative model refinement based on expert evaluation.

The evaluation challenge is particularly acute in legal applications because legal questions often have complex, multilayered intents requiring deep domain knowledge to assess answer quality. Traditional metrics like exact match or BLEU scores are insufficient for evaluating legal AI systems, where the correctness of an answer depends on understanding jurisdictional differences, temporal changes in law, and complex precedent relationships. Harvey’s approach of working directly with domain experts to develop evaluation criteria and continuously refine models based on real-world usage represents a best practice for LLMOps in specialized domains.
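A headline number like the 91% preference figure is typically produced by aggregating pairwise expert judgments. The sketch below shows one common convention; the case study does not disclose Harvey’s actual aggregation method, so the tie-handling choice (excluding ties from the denominator) and the label names are assumptions.

```python
from collections import Counter

def preference_rate(judgments, system="harvey"):
    """Roll up pairwise expert picks into a preference rate.

    `judgments` is one expert verdict per question: the name of the
    preferred system, or "tie". This sketch reports the share of
    decided (non-tie) comparisons won -- one common convention, not
    necessarily the one behind the 91% figure in the case study.
    """
    counts = Counter(judgments)
    wins = counts[system]
    losses = sum(n for label, n in counts.items()
                 if label not in (system, "tie"))
    decided = wins + losses
    return wins / decided if decided else 0.0
```

In practice each judgment would come from a blinded side-by-side comparison scored by a domain expert, with disagreements between experts adjudicated before aggregation.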

Security and Privacy in Enterprise LLMOps

Harvey’s implementation demonstrates how security and privacy considerations fundamentally shape LLMOps architecture decisions in enterprise environments. The company’s requirement for customer data to remain within client domains drives their choice of vector database architecture, deployment strategies, and data handling procedures. Their approach ensures that sensitive legal documents and their corresponding embeddings never leave customer-controlled environments, addressing regulatory compliance requirements that are common in legal and financial services.

The privacy-first architecture impacts multiple aspects of their LLMOps pipeline. Data ingestion processes must support encryption at rest and in transit, with strict isolation between different clients’ data. The vector database deployment must support private cloud environments rather than shared multi-tenant systems. Even the embeddings generated from customer documents must be stored within customer-controlled storage systems, requiring careful coordination between the AI processing pipeline and storage infrastructure.
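The isolation property described above can be illustrated with a toy sketch. The bucket-naming scheme and the `TenantStore` class are hypothetical stand-ins for customer-controlled cloud storage, not Harvey’s actual storage layer; the point is only that a write for one client can never be read back through another client’s handle.

```python
def bucket_for_tenant(tenant_id, region="us"):
    """Derive a per-tenant storage location (illustrative naming only).

    Assumption: each client gets a dedicated, customer-controlled
    bucket, and embeddings and source documents are written nowhere
    else.
    """
    return f"{region}-{tenant_id}-vectors"

class TenantStore:
    """Toy in-memory stand-in for per-tenant object storage."""

    def __init__(self):
        self._buckets = {}

    def put(self, tenant_id, key, value):
        # Every write is routed to the tenant's own bucket.
        bucket = self._buckets.setdefault(bucket_for_tenant(tenant_id), {})
        bucket[key] = value

    def get(self, tenant_id, key):
        # Reads only ever see the caller's bucket; cross-tenant
        # access is structurally impossible, not just forbidden.
        return self._buckets.get(bucket_for_tenant(tenant_id), {}).get(key)
```

A real deployment would layer encryption at rest and in transit and per-tenant credentials on top of this routing, but the structural isolation is the foundation.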

Retrieval Complexity and Performance Optimization

Harvey’s case study illuminates the technical complexities involved in building high-performance RAG systems for enterprise applications. The company identifies six key challenges in enterprise retrieval: balancing sparse versus dense representations, achieving performance at scale, maintaining accuracy at scale, handling complex query structures, processing complex domain-specific data, and developing appropriate evaluation frameworks.

The sparse versus dense representation challenge is particularly relevant for legal applications, where traditional keyword-based search might miss semantic nuances but purely dense embeddings could struggle with specific legal identifiers, case citations, and named entities that are crucial for legal research. Harvey’s approach likely involves hybrid retrieval strategies that combine both approaches, though the specific implementation details aren’t fully disclosed in the case study.
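One widely used way to combine a sparse (keyword) ranking with a dense (embedding) ranking is Reciprocal Rank Fusion (RRF). This is a sketch of the generic technique, not a disclosed detail of Harvey’s system, and the document IDs are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists with Reciprocal Rank Fusion.

    Each ranking is an ordered list of doc ids (best first), e.g. one
    from a BM25 keyword search and one from a dense embedding search.
    RRF scores each document by sum(1 / (k + rank)), so documents that
    rank well in *either* list rise to the top -- combining keyword
    precision on citations and party names with semantic recall.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

The constant `k` (60 is the conventional default) damps the influence of top ranks so no single retriever dominates the fused ordering.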

Performance optimization for large-scale legal corpora requires careful consideration of indexing strategies, particularly given the need for real-time ingestion of new documents while maintaining query performance across millions of existing documents. The system must support continuous updates as new legal documents, regulations, and case law are published, while ensuring that retrieval accuracy doesn’t degrade as the corpus grows.
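A common pattern for serving real-time ingestion alongside a large indexed corpus is to route new documents into a small "delta" segment that is scanned exactly, merge it with the main index at query time, and fold it into the main index in a background compaction job. The sketch below is generic; nothing here is a disclosed detail of Harvey’s pipeline.

```python
class SegmentedIndex:
    """Toy two-segment index: a large 'main' segment plus a fresh 'delta'."""

    def __init__(self):
        self.main = {}   # doc_id -> vector, the bulk indexed segment
        self.delta = {}  # doc_id -> vector, freshly ingested documents

    def ingest(self, doc_id, vector):
        # New documents become queryable immediately via the delta
        # segment, without waiting for an index rebuild.
        self.delta[doc_id] = vector

    def compact(self):
        # Background job: fold the delta into the main segment, which
        # in a real system would trigger re-indexing.
        self.main.update(self.delta)
        self.delta.clear()

    def search(self, query, k=3):
        # Query both segments and merge by score, so fresh and old
        # documents compete in a single ranking.
        def score(vec):
            return sum(a * b for a, b in zip(query, vec))
        candidates = {**self.main, **self.delta}
        ranked = sorted(candidates.items(),
                        key=lambda kv: score(kv[1]), reverse=True)
        return [doc_id for doc_id, _ in ranked[:k]]
```

Keeping the delta small bounds the cost of the exact scan, while periodic compaction keeps the main index’s approximate structures up to date.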

Multi-Modal and Multi-Lingual Considerations

Operating across 45 countries introduces significant complexity to Harvey’s LLMOps implementation, requiring the system to handle multiple languages, legal systems, and document formats. This multilingual capability extends beyond simple translation to understanding different legal frameworks, citation formats, and jurisdictional differences. The embedding models must be capable of capturing semantic meaning across languages while maintaining the ability to retrieve relevant documents regardless of the query language or source document language.

The case study suggests that Harvey has developed sophisticated approaches to handle these multilingual and multi-jurisdictional challenges, though specific implementation details are limited. This likely involves specialized embedding models trained on multilingual legal corpora, careful preprocessing of documents to normalize different citation formats and legal structures, and retrieval strategies that can bridge language barriers while maintaining legal accuracy.

Product Integration and Workflow Design

Harvey’s product suite demonstrates how LLMOps systems must integrate seamlessly with existing professional workflows to achieve adoption in enterprise environments. Their Word Add-In for contract drafting and review represents a particularly interesting example of embedding AI capabilities directly into lawyers’ existing tools. This integration requires careful consideration of user experience, performance requirements, and the balance between AI assistance and human oversight.

The Workflow Builder product allows firms to create custom workflows tailored to their specific practices, suggesting that Harvey has built flexible LLMOps infrastructure that can support diverse use cases without requiring complete system redesign. This flexibility is crucial for enterprise LLMOps, where different clients may have significantly different requirements and workflows.

Lessons and Best Practices

Harvey’s case study provides several important lessons for enterprise LLMOps implementations. The systematic approach to vector database evaluation demonstrates the importance of comprehensive technical evaluation that considers not just performance metrics but also security, compliance, and operational requirements. The emphasis on domain expert collaboration highlights how successful LLMOps in specialized fields requires deep integration with subject matter experts throughout the development and evaluation process.

The privacy-first architecture approach shows how security and compliance requirements can drive technical architecture decisions in ways that significantly impact system design and implementation complexity. However, the case study also demonstrates that these constraints can be successfully addressed without sacrificing performance or functionality, provided they are considered from the beginning of the design process rather than added as afterthoughts.

The multilingual and multi-jurisdictional capabilities suggest that successful enterprise LLMOps often requires significant customization for specific markets and regulatory environments, rather than assuming that general-purpose models will be sufficient for specialized applications. Harvey’s approach of building specialized infrastructure for legal applications represents a significant investment but appears to have yielded superior results compared to general-purpose alternatives like ChatGPT.

More Like This

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.


Multi-Agent AI System for Financial Intelligence and Risk Analysis

Moody’s 2025

Moody's Analytics, a century-old financial institution serving over 1,500 customers across 165 countries, transformed their approach to serving high-stakes financial decision-making by evolving from a basic RAG chatbot to a sophisticated multi-agent AI system on AWS. Facing challenges with unstructured financial data (PDFs with complex tables, charts, and regulatory documents), context window limitations, and the need for 100% accuracy in billion-dollar decisions, they architected a serverless multi-agent orchestration system using Amazon Bedrock, specialized task agents, custom workflows supporting up to 400 steps, and intelligent document processing pipelines. The solution processes over 1 million tokens daily in production, achieving 60% faster insights and 30% reduction in task completion times while maintaining the precision required for credit ratings, risk intelligence, and regulatory compliance across credit, climate, economics, and compliance domains.


Large-Scale Legal RAG Implementation with Multimodal Data Infrastructure

Harvey / Lance 2025

Harvey, a legal AI assistant company, partnered with LanceDB to address complex retrieval-augmented generation (RAG) challenges across massive datasets of legal documents. The case study demonstrates how they built a scalable system to handle diverse legal queries ranging from small on-demand uploads to large data corpuses containing millions of documents from various jurisdictions. Their solution combines advanced vector search capabilities with a multimodal lakehouse architecture, emphasizing evaluation-driven development and flexible infrastructure to support the complex, domain-specific nature of legal AI applications.
