ZenML

Enterprise-Grade RAG Systems for Legal AI Platform

Harvey 2025

Harvey, a legal AI platform serving professional services firms, addresses the complex challenge of building enterprise-grade Retrieval-Augmented Generation (RAG) systems that can handle sensitive legal documents while maintaining high performance, accuracy, and security. The company leverages specialized vector databases like LanceDB Enterprise and Postgres with PGVector to power their RAG systems across three key data sources: user-uploaded files, long-term vault projects, and third-party legal databases. Through careful evaluation of vector database options and collaboration with domain experts, Harvey has built a system that achieves 91% preference over ChatGPT in tax law applications while serving users in 45 countries with strict privacy and compliance requirements.

Industry

Legal

Overview

Harvey represents a comprehensive case study in enterprise-grade LLMOps for the legal sector, demonstrating how specialized AI platforms must navigate complex technical and regulatory requirements when deploying large language models in production. As a legal AI platform serving professional services firms across 45 countries, Harvey has built sophisticated infrastructure to support multiple LLM-powered products including an AI Assistant, document Vault, Knowledge research tools, and custom Workflows. The company’s approach highlights the critical importance of Retrieval-Augmented Generation (RAG) systems in making LLMs effective for domain-specific applications while maintaining the security and compliance standards required by enterprise legal clients.

The case study illustrates several key LLMOps challenges unique to enterprise deployments: handling sensitive data with strict privacy requirements, achieving high performance at scale across multilingual document corpora, and building evaluation frameworks for complex domain-specific use cases where traditional ground-truth datasets don’t exist. Harvey’s technical architecture decisions around vector database selection, data isolation strategies, and domain expert collaboration provide valuable insights into operationalizing LLMs for highly regulated industries.

Technical Architecture and Data Complexity

Harvey’s RAG implementation handles three distinct types of data sources, each presenting unique technical challenges for LLMOps. The first category involves user-uploaded files in their Assistant product, typically 1 to 50 documents that persist for the duration of a conversation thread. These require immediate availability for querying, placing a premium on ingestion speed and real-time indexing. The second category encompasses user-stored documents in Vault projects, which can scale to 1,000 to 10,000 documents for long-term projects and require persistent storage with strong isolation guarantees between different clients’ data.

The third and most complex category involves private and public third-party data sources containing legal regulations, statutes, and case law. This represents a comprehensive corpus of legal knowledge spanning multiple jurisdictions and languages, requiring sophisticated understanding of legal document structures, cross-references, and temporal evolution of laws. The multilingual nature of this data, covering 45 countries, adds significant complexity to the embedding and retrieval pipeline, as the system must handle different legal systems, document formats, and linguistic nuances while maintaining consistent accuracy across regions.

Vector Database Evaluation and Selection

The case study provides detailed insight into Harvey’s systematic approach to vector database selection, which represents a critical infrastructure decision for enterprise LLMOps. Their evaluation framework focused on four key dimensions: scalability and performance, accuracy of retrieval, flexibility and developer tooling, and enterprise readiness including privacy and security features. This comprehensive evaluation approach demonstrates the complexity of infrastructure decisions in production LLM systems, where technical performance must be balanced against compliance and security requirements.

Harvey’s comparison between Postgres with PGVector and LanceDB Enterprise reveals important tradeoffs in vector database selection. Postgres with PGVector offers strong performance for smaller-scale applications, with sub-2-second latency for 500K embeddings and high accuracy through brute-force KNN or HNSW indexing. However, it faces limitations in horizontal scalability and data privacy isolation, as all data within a Postgres instance is centralized. LanceDB Enterprise, which Harvey ultimately selected for production use, provides superior scalability with unlimited horizontal scaling, better data privacy through decentralized storage in customer-controlled cloud buckets, and strong performance with sub-2-second latency for 15 million rows with metadata filtering.
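The tradeoff described here — exact brute-force KNN that stays fast at small scale versus approximate indexes like HNSW at large scale — can be made concrete with a minimal pure-Python sketch. This is illustrative only: Harvey’s production systems use PGVector and LanceDB, not code like this, and the document IDs are invented.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def brute_force_knn(query, embeddings, k=3):
    """Exact k-nearest-neighbor search: score every stored embedding.

    This is the brute-force KNN regime the case study describes for
    smaller corpora: perfectly accurate, but O(n) per query, which is
    why approximate indexes such as HNSW take over as corpora grow to
    millions of rows.
    """
    scored = [(doc_id, cosine_similarity(query, vec))
              for doc_id, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The brute-force scan recomputes every similarity per query; an HNSW index trades a small amount of recall for sub-linear query time, which is the crossover the latency figures above reflect.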

The selection criteria highlight enterprise-specific requirements that differ significantly from typical vector database use cases. Harvey requires the ability to deploy vector databases within private cloud environments and store both embeddings and source data in customer-controlled storage systems. This ensures that sensitive legal documents never leave the client’s domain, addressing stringent security and privacy requirements that are non-negotiable in legal applications.

Domain Expertise Integration and Evaluation Challenges

One of the most significant LLMOps challenges highlighted in Harvey’s case study is the development of evaluation frameworks for complex, domain-specific applications where traditional benchmarks are inadequate. The company addresses this through close collaboration with domain experts, exemplified by their partnership with PwC’s tax professionals to develop a Tax AI Assistant that achieves 91% preference over ChatGPT. This collaboration involves multiple phases: initial guidance on authoritative data sources and domain nuances, ongoing usage and detailed feedback collection, and iterative model refinement based on expert evaluation.

The evaluation challenge is particularly acute in legal applications because legal questions often have complex, multilayered intents requiring deep domain knowledge to assess answer quality. Traditional metrics like exact match or BLEU scores are insufficient for evaluating legal AI systems, where the correctness of an answer depends on understanding jurisdictional differences, temporal changes in law, and complex precedent relationships. Harvey’s approach of working directly with domain experts to develop evaluation criteria and continuously refine models based on real-world usage represents a best practice for LLMOps in specialized domains.
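A headline number like the 91% preference figure is typically produced by aggregating pairwise expert judgments. The sketch below shows one common convention; the case study does not disclose Harvey’s actual aggregation method, so the tie-handling choice (excluding ties from the denominator) and the label names are assumptions.

```python
from collections import Counter

def preference_rate(judgments, system="harvey"):
    """Roll up pairwise expert picks into a preference rate.

    `judgments` is one expert verdict per question: the name of the
    preferred system, or "tie". This sketch reports the share of
    decided (non-tie) comparisons won -- one common convention, not
    necessarily the one behind the 91% figure in the case study.
    """
    counts = Counter(judgments)
    wins = counts[system]
    losses = sum(n for label, n in counts.items()
                 if label not in (system, "tie"))
    decided = wins + losses
    return wins / decided if decided else 0.0
```

In practice each judgment would come from a blinded side-by-side comparison scored by a domain expert, with disagreements between experts adjudicated before aggregation.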

Security and Privacy in Enterprise LLMOps

Harvey’s implementation demonstrates how security and privacy considerations fundamentally shape LLMOps architecture decisions in enterprise environments. The company’s requirement for customer data to remain within client domains drives their choice of vector database architecture, deployment strategies, and data handling procedures. Their approach ensures that sensitive legal documents and their corresponding embeddings never leave customer-controlled environments, addressing regulatory compliance requirements that are common in legal and financial services.

The privacy-first architecture impacts multiple aspects of their LLMOps pipeline. Data ingestion processes must support encryption at rest and in transit, with strict isolation between different clients’ data. The vector database deployment must support private cloud environments rather than shared multi-tenant systems. Even the embeddings generated from customer documents must be stored within customer-controlled storage systems, requiring careful coordination between the AI processing pipeline and storage infrastructure.
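The isolation property described above can be illustrated with a toy sketch. The bucket-naming scheme and the `TenantStore` class are hypothetical stand-ins for customer-controlled cloud storage, not Harvey’s actual storage layer; the point is only that a write for one client can never be read back through another client’s handle.

```python
def bucket_for_tenant(tenant_id, region="us"):
    """Derive a per-tenant storage location (illustrative naming only).

    Assumption: each client gets a dedicated, customer-controlled
    bucket, and embeddings and source documents are written nowhere
    else.
    """
    return f"{region}-{tenant_id}-vectors"

class TenantStore:
    """Toy in-memory stand-in for per-tenant object storage."""

    def __init__(self):
        self._buckets = {}

    def put(self, tenant_id, key, value):
        # Every write is routed to the tenant's own bucket.
        bucket = self._buckets.setdefault(bucket_for_tenant(tenant_id), {})
        bucket[key] = value

    def get(self, tenant_id, key):
        # Reads only ever see the caller's bucket; cross-tenant
        # access is structurally impossible, not just forbidden.
        return self._buckets.get(bucket_for_tenant(tenant_id), {}).get(key)
```

A real deployment would layer encryption at rest and in transit and per-tenant credentials on top of this routing, but the structural isolation is the foundation.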

Retrieval Complexity and Performance Optimization

Harvey’s case study illuminates the technical complexities involved in building high-performance RAG systems for enterprise applications. The company identifies six key challenges in enterprise retrieval: balancing sparse versus dense representations, achieving performance at scale, maintaining accuracy at scale, handling complex query structures, processing complex domain-specific data, and developing appropriate evaluation frameworks.

The sparse versus dense representation challenge is particularly relevant for legal applications, where traditional keyword-based search might miss semantic nuances but purely dense embeddings could struggle with specific legal identifiers, case citations, and named entities that are crucial for legal research. Harvey’s approach likely involves hybrid retrieval strategies that combine both approaches, though the specific implementation details aren’t fully disclosed in the case study.
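One widely used way to combine a sparse (keyword) ranking with a dense (embedding) ranking is Reciprocal Rank Fusion (RRF). This is a sketch of the generic technique, not a disclosed detail of Harvey’s system, and the document IDs are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists with Reciprocal Rank Fusion.

    Each ranking is an ordered list of doc ids (best first), e.g. one
    from a BM25 keyword search and one from a dense embedding search.
    RRF scores each document by sum(1 / (k + rank)), so documents that
    rank well in *either* list rise to the top -- combining keyword
    precision on citations and party names with semantic recall.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

The constant `k` (60 is the conventional default) damps the influence of top ranks so no single retriever dominates the fused ordering.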

Performance optimization for large-scale legal corpora requires careful consideration of indexing strategies, particularly given the need for real-time ingestion of new documents while maintaining query performance across millions of existing documents. The system must support continuous updates as new legal documents, regulations, and case law are published, while ensuring that retrieval accuracy doesn’t degrade as the corpus grows.
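A common pattern for serving real-time ingestion alongside a large indexed corpus is to route new documents into a small "delta" segment that is scanned exactly, merge it with the main index at query time, and fold it into the main index in a background compaction job. The sketch below is generic; nothing here is a disclosed detail of Harvey’s pipeline.

```python
class SegmentedIndex:
    """Toy two-segment index: a large 'main' segment plus a fresh 'delta'."""

    def __init__(self):
        self.main = {}   # doc_id -> vector, the bulk indexed segment
        self.delta = {}  # doc_id -> vector, freshly ingested documents

    def ingest(self, doc_id, vector):
        # New documents become queryable immediately via the delta
        # segment, without waiting for an index rebuild.
        self.delta[doc_id] = vector

    def compact(self):
        # Background job: fold the delta into the main segment, which
        # in a real system would trigger re-indexing.
        self.main.update(self.delta)
        self.delta.clear()

    def search(self, query, k=3):
        # Query both segments and merge by score, so fresh and old
        # documents compete in a single ranking.
        def score(vec):
            return sum(a * b for a, b in zip(query, vec))
        candidates = {**self.main, **self.delta}
        ranked = sorted(candidates.items(),
                        key=lambda kv: score(kv[1]), reverse=True)
        return [doc_id for doc_id, _ in ranked[:k]]
```

Keeping the delta small bounds the cost of the exact scan, while periodic compaction keeps the main index’s approximate structures up to date.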

Multi-Modal and Multi-Lingual Considerations

Operating across 45 countries introduces significant complexity to Harvey’s LLMOps implementation, requiring the system to handle multiple languages, legal systems, and document formats. This multilingual capability extends beyond simple translation to understanding different legal frameworks, citation formats, and jurisdictional differences. The embedding models must be capable of capturing semantic meaning across languages while maintaining the ability to retrieve relevant documents regardless of the query language or source document language.

The case study suggests that Harvey has developed sophisticated approaches to handle these multilingual and multi-jurisdictional challenges, though specific implementation details are limited. This likely involves specialized embedding models trained on multilingual legal corpora, careful preprocessing of documents to normalize different citation formats and legal structures, and retrieval strategies that can bridge language barriers while maintaining legal accuracy.

Product Integration and Workflow Design

Harvey’s product suite demonstrates how LLMOps systems must integrate seamlessly with existing professional workflows to achieve adoption in enterprise environments. Their Word Add-In for contract drafting and review represents a particularly interesting example of embedding AI capabilities directly into lawyers’ existing tools. This integration requires careful consideration of user experience, performance requirements, and the balance between AI assistance and human oversight.

The Workflow Builder product allows firms to create custom workflows tailored to their specific practices, suggesting that Harvey has built flexible LLMOps infrastructure that can support diverse use cases without requiring complete system redesign. This flexibility is crucial for enterprise LLMOps, where different clients may have significantly different requirements and workflows.

Lessons and Best Practices

Harvey’s case study provides several important lessons for enterprise LLMOps implementations. The systematic approach to vector database evaluation demonstrates the importance of comprehensive technical evaluation that considers not just performance metrics but also security, compliance, and operational requirements. The emphasis on domain expert collaboration highlights how successful LLMOps in specialized fields requires deep integration with subject matter experts throughout the development and evaluation process.

The privacy-first architecture approach shows how security and compliance requirements can drive technical architecture decisions in ways that significantly impact system design and implementation complexity. However, the case study also demonstrates that these constraints can be successfully addressed without sacrificing performance or functionality, provided they are considered from the beginning of the design process rather than added as afterthoughts.

The multilingual and multi-jurisdictional capabilities suggest that successful enterprise LLMOps often requires significant customization for specific markets and regulatory environments, rather than assuming that general-purpose models will be sufficient for specialized applications. Harvey’s approach of building specialized infrastructure for legal applications represents a significant investment but appears to have yielded superior results compared to general-purpose alternatives like ChatGPT.

More Like This

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.


Multi-Agent AI System for Financial Intelligence and Risk Analysis

Moody’s 2025

Moody's Analytics, a century-old financial institution serving over 1,500 customers across 165 countries, transformed their approach to serving high-stakes financial decision-making by evolving from a basic RAG chatbot to a sophisticated multi-agent AI system on AWS. Facing challenges with unstructured financial data (PDFs with complex tables, charts, and regulatory documents), context window limitations, and the need for 100% accuracy in billion-dollar decisions, they architected a serverless multi-agent orchestration system using Amazon Bedrock, specialized task agents, custom workflows supporting up to 400 steps, and intelligent document processing pipelines. The solution processes over 1 million tokens daily in production, achieving 60% faster insights and 30% reduction in task completion times while maintaining the precision required for credit ratings, risk intelligence, and regulatory compliance across credit, climate, economics, and compliance domains.


Large-Scale Legal RAG Implementation with Multimodal Data Infrastructure

Harvey / Lance 2025

Harvey, a legal AI assistant company, partnered with LanceDB to address complex retrieval-augmented generation (RAG) challenges across massive datasets of legal documents. The case study demonstrates how they built a scalable system to handle diverse legal queries ranging from small on-demand uploads to large data corpuses containing millions of documents from various jurisdictions. Their solution combines advanced vector search capabilities with a multimodal lakehouse architecture, emphasizing evaluation-driven development and flexible infrastructure to support the complex, domain-specific nature of legal AI applications.
