## Overview
Thomson Reuters, a major information services company, has developed CoCounsel, described as a leading AI legal assistant designed to automate document-centric legal tasks. This case study focuses on their approach to benchmarking and evaluating LLMs specifically for long-context performance, a critical capability when working with extensive legal documents such as deposition transcripts, merger and acquisition agreements, and contracts that can easily run to hundreds of pages.
The article, authored by Joel Hron (Chief Technology Officer at Thomson Reuters), provides insight into how the organization approaches LLMOps challenges around model selection, evaluation, and deployment in a high-stakes professional domain. While the piece does contain promotional elements about Thomson Reuters' position as a "trusted early tester" for AI labs, it offers substantive technical details about their evaluation methodology that are valuable for understanding production LLM deployment in legal contexts.
## The Long Context Challenge
Legal work inherently involves processing lengthy documents. The case study highlights that when GPT-4 was first released in 2023, it featured only an 8K token context window (approximately 6,000 words or 20 pages). To handle longer documents, practitioners had to split them into chunks, process each individually, and synthesize answers. While modern LLMs now advertise context windows ranging from 128K to over 1M tokens, Thomson Reuters makes an important observation: the ability to accept 1M tokens in the input window does not guarantee effective performance with that much text.
This distinction between advertised context window and effective context window is a key insight from their production experience. They found that the more complex and challenging a task, the smaller the LLM's effective context window becomes for that task. For straightforward skills involving searching for one or a few data elements, most LLMs can perform accurately up to several hundred thousand tokens. However, for complex tasks requiring tracking and returning many different data elements, recall degrades significantly. This finding has direct implications for system architecture—even with long-context models, they still split documents into smaller chunks to ensure important information isn't missed.
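To make the architectural implication concrete, here is a minimal sketch (not Thomson Reuters' implementation) of chunking a document against a conservative "effective" token budget rather than the model's advertised window. The budget value, the rough characters-per-token estimate, and the paragraph-based splitting are all assumptions for illustration.

```python
# Minimal sketch: chunk a long document to a conservative "effective" token
# budget instead of the model's advertised context window. The budget, the
# ~4-characters-per-token estimate, and the paragraph-based splitting are
# illustrative assumptions, not CoCounsel's actual implementation.

EFFECTIVE_BUDGET_TOKENS = 60_000       # assumed safe budget for a complex task
ADVERTISED_WINDOW_TOKENS = 1_000_000   # what the model card claims
OVERLAP_PARAGRAPHS = 2                 # carry context across chunk boundaries


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token for English text)."""
    return max(1, len(text) // 4)


def chunk_document(text: str, budget: int = EFFECTIVE_BUDGET_TOKENS) -> list[str]:
    """Split on paragraph boundaries so each chunk stays under the budget."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks: list[list[str]] = []
    current: list[str] = []
    current_tokens = 0

    for para in paragraphs:
        para_tokens = estimate_tokens(para)
        if current and current_tokens + para_tokens > budget:
            chunks.append(current)
            # overlap the tail of the previous chunk to preserve continuity
            current = current[-OVERLAP_PARAGRAPHS:]
            current_tokens = sum(estimate_tokens(p) for p in current)
        current.append(para)
        current_tokens += para_tokens

    if current:
        chunks.append(current)
    return ["\n\n".join(c) for c in chunks]


if __name__ == "__main__":
    doc = ("Section heading.\n\n" + "Lorem ipsum dolor sit amet. " * 200 + "\n\n") * 300
    chunks = chunk_document(doc)
    print(f"{estimate_tokens(doc):,} estimated tokens -> {len(chunks)} chunks")
```

The key design choice the article implies is that the budget comes from observed task performance, not from the model card.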
## Multi-LLM Strategy
Thomson Reuters explicitly rejects a single-model approach, stating that the idea one LLM can outperform across every task is "a myth." Instead, they have built a multi-LLM strategy into their AI infrastructure, viewing it as a competitive advantage rather than a fallback. This approach recognizes that different models excel at different capabilities: some reason better, others handle long documents more reliably, and some follow instructions more precisely.
This multi-model orchestration approach is notable from an LLMOps perspective because it requires robust infrastructure for model comparison, selection, and routing. The organization must maintain the capability to rapidly test and integrate new models as they become available—the article notes that new models are launching and improving weekly. Thomson Reuters positions itself as an early tester for major AI labs, claiming that when providers want to understand how their newest models perform in high-stakes scenarios, they collaborate with Thomson Reuters. While this claim has promotional overtones, it does suggest a mature model evaluation pipeline that can quickly assess new releases.
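As a rough illustration of what routing across several models could look like, the sketch below keeps a registry of per-capability scores and picks a model per task. The model names, scores, and routing rule are invented for this example and do not describe CoCounsel's actual infrastructure.

```python
# Minimal sketch of a capability-based model registry for multi-LLM routing.
# Model names, capability scores, and the routing rule are illustrative
# assumptions, not a description of CoCounsel's production system.

from dataclasses import dataclass, field


@dataclass
class ModelProfile:
    name: str
    # benchmark-derived scores per capability, e.g. from an initial benchmark stage
    scores: dict[str, float] = field(default_factory=dict)
    max_effective_tokens: int = 100_000


REGISTRY = [
    ModelProfile("model-a", {"reasoning": 0.91, "long_context": 0.78, "instructions": 0.88},
                 max_effective_tokens=200_000),
    ModelProfile("model-b", {"reasoning": 0.84, "long_context": 0.93, "instructions": 0.81},
                 max_effective_tokens=600_000),
    ModelProfile("model-c", {"reasoning": 0.88, "long_context": 0.80, "instructions": 0.95},
                 max_effective_tokens=120_000),
]


def route(task_capability: str, input_tokens: int) -> ModelProfile:
    """Pick the highest-scoring model for the capability that can handle the input size."""
    candidates = [m for m in REGISTRY if m.max_effective_tokens >= input_tokens]
    if not candidates:
        # fall back to the best long-context model and rely on chunking upstream
        candidates = sorted(REGISTRY, key=lambda m: m.max_effective_tokens, reverse=True)[:1]
    return max(candidates, key=lambda m: m.scores.get(task_capability, 0.0))


if __name__ == "__main__":
    print(route("long_context", 450_000).name)   # -> model-b
    print(route("instructions", 50_000).name)    # -> model-c
```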
## RAG vs. Long Context Decision
A significant architectural decision documented in this case study is the choice between retrieval-augmented generation (RAG) and long-context approaches. Thomson Reuters describes a common RAG pattern of splitting documents into passages, storing them in a search index, and retrieving top results to ground responses. They acknowledge RAG is effective for searching vast document collections or answering simple factoid questions.
However, their internal testing revealed that inputting full document text into the LLM's context window (with chunking for extremely long documents) generally outperformed RAG for most document-based skills. They cite external research supporting this finding. The key limitation of RAG they identify is that complex queries requiring sophisticated discovery processes don't translate well to semantic retrieval. Their example of "Did the defendant contradict himself in his testimony?" illustrates how such a query requires comparing each statement against all others—a semantic search using that query would likely only return passages explicitly discussing contradictory testimony, missing the actual contradictions.
As a result, CoCounsel 2.0 leverages long-context LLMs to the greatest extent possible while reserving RAG for skills that require searching through large content repositories. This is a pragmatic hybrid approach that recognizes the strengths and limitations of each technique.
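A minimal sketch of that hybrid decision follows, under the assumption that skills are tagged by whether they search a large repository or operate on a bounded set of supplied documents; the threshold and labels are illustrative, not CoCounsel's actual logic.

```python
# Minimal sketch of the hybrid decision described above: keep full document
# text in context (chunked if needed) for document-grounded skills, and fall
# back to retrieval only when the skill must search a large repository.
# The threshold and labels are assumptions for illustration.

FULL_CONTEXT_LIMIT_TOKENS = 800_000  # assumed practical ceiling per request


def choose_strategy(total_input_tokens: int, searches_repository: bool) -> str:
    """Return 'rag' only for repository-scale search; otherwise keep full text in context."""
    if searches_repository:
        return "rag"
    if total_input_tokens <= FULL_CONTEXT_LIMIT_TOKENS:
        return "full_context"
    return "full_context_chunked"  # split the documents and synthesize across chunks


if __name__ == "__main__":
    # "Did the defendant contradict himself?" over one transcript: full context
    print(choose_strategy(total_input_tokens=300_000, searches_repository=False))
    # Finding relevant material across a firm-wide document store: RAG
    print(choose_strategy(total_input_tokens=50_000_000, searches_repository=True))
```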
## Multi-Stage Testing Framework
The case study outlines a rigorous multi-stage testing protocol that represents a mature LLMOps evaluation practice:
**Initial Benchmarks**: The first stage uses over 20,000 test samples from open and private benchmarks covering legal reasoning, contract understanding, hallucinations, instruction following, and long context capability. These tests use easily gradable answers (such as multiple-choice questions) enabling full automation for rapid evaluation of new LLM releases. For long-context specifically, they use tests from LOFT (measuring ability to answer questions from Wikipedia passages) and NovelQA (assessing ability to answer questions from English novels), both accommodating up to 1M input tokens. These benchmarks specifically measure multihop reasoning (synthesizing information from multiple locations) and multitarget reasoning (locating and returning multiple pieces of information)—capabilities essential for interpreting contracts or regulations where definitions in one part determine interpretation of another.
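A hypothetical harness for this fully automated first stage might look like the sketch below; the sample format, suite names, and the call_model stub are assumptions standing in for real benchmark data and LLM API calls.

```python
# Minimal sketch of a fully automated benchmark harness for easily gradable
# (multiple-choice) samples. The sample format, suite names, and call_model
# stub are assumptions; in practice this would call each candidate LLM's API.

import re
from collections import defaultdict

SAMPLES = [
    {"id": "loft-001", "suite": "long_context", "prompt": "<question text omitted>", "answer": "B"},
    {"id": "novelqa-042", "suite": "long_context", "prompt": "<question text omitted>", "answer": "D"},
    {"id": "contract-017", "suite": "contract_understanding", "prompt": "<question text omitted>", "answer": "A"},
]


def call_model(model_name: str, prompt: str) -> str:
    """Stub standing in for an LLM API call; returns a canned answer here."""
    return "Answer: B"


def extract_choice(response: str) -> str | None:
    """Pull the first standalone A-E letter out of the model's response."""
    match = re.search(r"\b([A-E])\b", response.upper())
    return match.group(1) if match else None


def run_benchmark(model_name: str) -> dict[str, float]:
    """Return accuracy per benchmark suite for one candidate model."""
    correct, total = defaultdict(int), defaultdict(int)
    for sample in SAMPLES:
        predicted = extract_choice(call_model(model_name, sample["prompt"]))
        total[sample["suite"]] += 1
        correct[sample["suite"]] += int(predicted == sample["answer"])
    return {suite: correct[suite] / total[suite] for suite in total}


if __name__ == "__main__":
    print(run_benchmark("candidate-model"))
```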
**Skill-Specific Benchmarks**: Top-performing LLMs from the initial benchmarks advance to testing on actual skills. This stage involves developing skill-specific prompt flows, sometimes quite complex ones, to ensure consistent generation of accurate and comprehensive responses for legal work. Evaluation uses an LLM-as-a-judge approach against attorney-authored criteria. Attorney subject matter experts (SMEs) have generated hundreds of tests per skill representing real use cases. Each test includes a user query, source documents, and an ideal minimum viable answer capturing the key data elements the response must contain to be useful in legal work.
The LLM-as-a-judge scoring is developed iteratively: scores are manually reviewed, grading prompts are adjusted, and ideal answers are refined until LLM-as-a-judge scores align with SME scores. This calibration process between automated and human evaluation is a sophisticated practice that balances scalability with accuracy.
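The sketch below illustrates the general shape of such a setup, with an LLM-as-a-judge prompt graded against an attorney-authored ideal answer and a simple agreement check against SME scores. The prompt wording, the judge_llm stub, and the tolerance are assumptions, not Thomson Reuters' actual grading criteria.

```python
# Minimal sketch of LLM-as-a-judge grading against expert-authored criteria,
# plus a crude calibration check against SME scores. The prompt template,
# judge_llm stub, and agreement tolerance are illustrative assumptions.

import json

JUDGE_PROMPT = """You are grading a legal AI assistant's answer.

User query:
{query}

Ideal minimum viable answer (authored by an attorney SME):
{ideal_answer}

Candidate answer:
{candidate_answer}

Score the candidate from 1 (unusable) to 5 (meets or exceeds the ideal answer),
checking that every key data element in the ideal answer is present and correct.
Respond as JSON: {{"score": <int>, "missing_elements": [<str>, ...]}}"""


def judge_llm(prompt: str) -> str:
    """Stub for the judging model's API call; returns a canned grade here."""
    return json.dumps({"score": 4, "missing_elements": ["effective date of amendment"]})


def grade(query: str, ideal_answer: str, candidate_answer: str) -> dict:
    """Grade one candidate answer against the attorney-authored ideal answer."""
    prompt = JUDGE_PROMPT.format(
        query=query, ideal_answer=ideal_answer, candidate_answer=candidate_answer
    )
    return json.loads(judge_llm(prompt))


def agreement_rate(judge_scores: list[int], sme_scores: list[int], tolerance: int = 1) -> float:
    """Fraction of tests where the judge lands within `tolerance` of the SME score."""
    pairs = zip(judge_scores, sme_scores)
    return sum(abs(j - s) <= tolerance for j, s in pairs) / len(judge_scores)


if __name__ == "__main__":
    result = grade("Summarize indemnification obligations.", "<ideal answer>", "<model answer>")
    print(result["score"], result["missing_elements"])
    # If agreement with SME grading is too low, refine the grading prompt and rerun.
    print(agreement_rate([4, 5, 2, 3], [5, 5, 3, 1]))
```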
Test samples are curated to be representative of actual user use cases, including context length considerations. They maintain specialized long context test sets where all samples use source documents totaling 100K–1M tokens specifically to stress-test performance at these lengths.
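A minimal sketch of how such a long-context test set might be curated, assuming token counts are precomputed per sample; the bucket boundaries are illustrative.

```python
# Minimal sketch of curating a dedicated long-context test set: keep only
# samples whose combined source documents fall in the 100K-1M token range,
# and report coverage per length bucket. Token counts are assumed to be
# precomputed; the bucket edges are illustrative.

from collections import Counter

LONG_CONTEXT_MIN, LONG_CONTEXT_MAX = 100_000, 1_000_000
BUCKETS = [(100_000, 250_000), (250_000, 500_000), (500_000, 1_000_000)]


def bucket_label(tokens: int) -> str | None:
    """Map a token count to its length bucket, e.g. '100K-250K'."""
    for low, high in BUCKETS:
        if low <= tokens < high:
            return f"{low // 1000}K-{high // 1000}K"
    return None


def build_long_context_set(samples: list[dict]) -> tuple[list[dict], Counter]:
    """Filter samples into the long-context range and count coverage per bucket."""
    selected = [s for s in samples
                if LONG_CONTEXT_MIN <= s["source_tokens"] <= LONG_CONTEXT_MAX]
    coverage = Counter(bucket_label(s["source_tokens"]) for s in selected)
    return selected, coverage


if __name__ == "__main__":
    samples = [
        {"id": "dep-01", "source_tokens": 180_000},
        {"id": "ma-02", "source_tokens": 620_000},
        {"id": "memo-03", "source_tokens": 40_000},   # too short for this set
    ]
    long_set, coverage = build_long_context_set(samples)
    print([s["id"] for s in long_set], dict(coverage))
```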
**Final Manual Review**: All new LLMs undergo rigorous manual review by attorney SMEs before deployment. This human-in-the-loop stage captures nuanced details missed by automated graders, provides feedback for engineering improvements, and verifies that new LLM flows perform better than previously deployed solutions while meeting reliability and accuracy standards for legal use.
## Practical Findings and Implications
Several practical insights emerge from Thomson Reuters' production experience that are valuable for LLMOps practitioners:
The "effective context window" concept is particularly important. Advertised context window sizes should not be taken at face value—real-world performance, especially for complex reasoning tasks, may degrade well before reaching advertised limits. This finding argues for empirical testing on actual use cases rather than relying on benchmark numbers.
Their challenge to model builders to "keep stretching and stress-testing that boundary" suggests that even current long-context models have meaningful room for improvement on complex, reasoning-heavy real-world problems.
The importance of domain-specific evaluation is emphasized throughout. General benchmarks may not capture the nuances required for specialized professional domains. The use of attorney SMEs to author test cases, ideal answers, and grading criteria reflects recognition that legal work has exacting standards that generic evaluations may miss.
## Forward-Looking Elements
The article concludes with discussion of building toward "agentic AI"—intelligent assistants that can plan, reason, adapt, and act across complex legal workflows. Thomson Reuters frames their current benchmarking work as laying the foundation for this future, noting it requires long-context capabilities, multi-model orchestration, SME-driven evaluation, and deep integration into professional workflows.
While these forward-looking statements are speculative and promotional, they do indicate strategic direction and suggest that the evaluation frameworks being built today are designed with extensibility in mind for more autonomous AI systems.
## Critical Assessment
It's worth noting that this case study comes from Thomson Reuters' own publication and naturally presents their approach favorably. Claims about being a "trusted early tester" for AI labs and having "gold-standard input data" should be viewed as marketing positioning. The technical details about their evaluation methodology, however, appear substantive and reflect mature LLMOps practices that other organizations could learn from.
The absence of specific quantitative results (accuracy percentages, performance comparisons, latency measurements) is notable. While the methodology is described in detail, concrete outcomes are not shared, making it difficult to independently assess the effectiveness of their approach. This is typical of corporate case studies where specific metrics may be considered proprietary.