## Overview and Business Context
Thomson Reuters, a leading provider of business information and content technology serving the legal, tax, and media sectors, implemented a production RAG (Retrieval-Augmented Generation) system to address critical challenges in its customer support operations. The company's flagship products—including Westlaw, Practical Law, Checkpoint, and Reuters News—serve highly specialized customers such as attorneys, executives, government agencies, and media organizations who require expert-level support.
The business case for this LLMOps initiative was driven by several converging factors. First, customer expectations for support quality continue to rise, with survey data cited in the article indicating that 78% of customers say support experiences determine whether they continue buying, and 58% report their expectations are higher year-over-year. Second, the company needed to maintain its competitive edge in serving customers who navigate complex regulatory environments and require highly accurate, domain-specific information. The initiative was led by Thomson Reuters Labs working in coordination with customer success functions.
## The Problem Space
The customer support challenges at Thomson Reuters were multi-faceted and stemmed from the depth of domain expertise the company's products demand. Support agents had to quickly process and synthesize information across multiple systems, including CRM platforms, hundreds of thousands of knowledge base articles, and ticketing systems, all while serving customers who are themselves experts in highly specialized legal and tax fields. This placed a heavy cognitive load on agents.
A particularly critical gap existed in knowledge transfer and accessibility. When one agent discovered a resolution to a problem, that solution often remained unavailable to other agents in a structured, retrievable format. This led to dependence on person-to-person knowledge transfer rather than scalable, systematic knowledge management. The article frames this as a "finding the signal in the noise" problem where valuable information existed but couldn't be efficiently accessed when needed.
The domain-specific nature of legal and tax support added another layer of complexity. Unlike general customer service, agents needed to navigate constantly changing regulations, product updates, and highly technical subject matter. Traditional search and knowledge base systems apparently weren't providing the level of semantic understanding and contextual retrieval needed to keep pace with both the information volume and the speed required for quality support.
## Solution Architecture: RAG Implementation
Thomson Reuters chose to implement a Retrieval-Augmented Generation architecture, which the article describes as "a recipe or pattern for ensuring factual generation of responses in large pre-trained language models." The team explicitly chose RAG to address known LLM limitations including factual inaccuracies (hallucinations) and inability to provide provenance—the ability to cite sources of information. The architecture introduces what they call a "non-parametric component" that allows for flexible, updateable knowledge without retraining the base model.
The implementation consists of two primary flows: the processing and indexing flow, and the retrieval flow. In the processing and indexing flow, data from knowledge base articles and CRM tools is extracted and processed into chunks suitable for embedding generation. These text chunks are then converted into dense vector embeddings using pre-trained language models. The article mentions that embeddings can be generated through various approaches including BERT, RoBERTa, T5, or API-based solutions like OpenAI's text-embedding-ada-002. For their specific implementation, Thomson Reuters used an open-source sentence transformer from Hugging Face called all-MiniLM-L6-v2.
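The article confirms the embedding model (all-MiniLM-L6-v2, loaded from Hugging Face via sentence-transformers) but not the chunking strategy. The sketch below shows one plausible shape for the processing and indexing flow; the chunk size, overlap, and sample article text are illustrative assumptions.

```python
# Minimal sketch of the processing/indexing flow. Chunking parameters and the
# sample article text are assumptions; only the embedding model is named in the source.
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split a knowledge base article into overlapping word-based chunks."""
    words = text.split()
    step = chunk_size - overlap
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), step)
        if words[start:start + chunk_size]
    ]

# The article names this model explicitly; it produces 384-dimensional embeddings.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

kb_article = "If 'ThirdPartyDesigneeInd' is set to 'Yes', the 'ThirdPartyDesigneePIN' field must ..."
chunks = chunk_text(kb_article)

# normalize_embeddings=True makes inner-product search equivalent to cosine similarity.
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)  # (num_chunks, 384)
```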
These embeddings are stored in what the article describes as "dense retrieval systems also known as Vector databases." Thomson Reuters selected Milvus, an open-source vector database, for this component. The article provides context about the rapid growth of the vector database market, noting that competitors like Pinecone, OpenSearch, pgvector, and Weaviate have emerged, with Pinecone specifically raising $100 million at a $750 million valuation—highlighting the "hot" nature of this space for LLM applications.
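The article identifies Milvus but does not publish schema or index configuration. Continuing the sketch above, one way the chunks and their embeddings might be stored with the pymilvus client is shown below; the collection name, field schema, and index parameters are assumptions.

```python
# Illustrative storage step: the collection layout and index settings are not from
# the source and would depend on the actual deployment.
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema(name="chunk_id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=4096),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384),  # all-MiniLM-L6-v2 output size
]
schema = CollectionSchema(fields, description="Support knowledge base chunks")
collection = Collection(name="support_kb", schema=schema)

# Insert chunk texts alongside their embeddings (chunk_id is auto-generated).
collection.insert([chunks, embeddings.tolist()])

# Build an approximate-nearest-neighbour index; inner product (IP) over
# normalized vectors behaves like cosine similarity.
collection.create_index(
    field_name="embedding",
    index_params={"index_type": "IVF_FLAT", "metric_type": "IP", "params": {"nlist": 128}},
)
collection.flush()
```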
The retrieval flow represents the core operational component of the RAG system. When a support agent submits a query through the conversational interface, it undergoes two main processing stages. First, the dense retrieval system encodes the query into a dense vector and computes similarity against the stored document embeddings using distance metrics such as cosine similarity or Euclidean distance. The system retrieves the documents or passages with the highest similarity scores, which represent the most semantically relevant content from the knowledge base.
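Continuing the same sketch, the query-time retrieval step might look like the following: the agent's question is encoded with the same model and Milvus returns the top-k most similar chunks. The example query, top-k value, and nprobe setting are illustrative assumptions; inner-product search over normalized embeddings stands in for cosine similarity here.

```python
# Sketch of dense retrieval at query time, reusing `model` and `collection` from above.
collection.load()

query = ("1040 e-file error IND-041: 'ThirdPartyDesigneePIN' must have a value "
         "when 'ThirdPartyDesigneeInd' is 'Yes'")
query_vec = model.encode([query], normalize_embeddings=True)

results = collection.search(
    data=query_vec.tolist(),
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"nprobe": 16}},
    limit=3,                       # top-k passages to pass to the generator
    output_fields=["text"],
)

retrieved_chunks = [hit.entity.get("text") for hit in results[0]]
```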
The retrieved context is then passed to the sequence-to-sequence model stage, which in this implementation uses OpenAI's GPT-4 API. The article notes that while numerous open-source LLMs such as LLaMA, MPT, and Falcon perform well, GPT-4 was chosen for its demonstrated state-of-the-art performance. The most relevant context retrieved from the vector database is concatenated with carefully crafted prompts and sent to the GPT-4 API to generate appropriate responses for the support agent.
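The article does not publish its prompts, so the generation step is sketched here with a generic grounding prompt and the OpenAI Python client; the prompt wording, temperature setting, and client usage are assumptions rather than Thomson Reuters' actual implementation.

```python
# Sketch of the generation step: retrieved passages are concatenated with a prompt
# and sent to GPT-4. The prompt text below is a hypothetical example.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

context = "\n\n".join(retrieved_chunks)
prompt = (
    "You are assisting a customer support agent. Answer the question using ONLY "
    "the context below, and say so if the context is insufficient.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # favour deterministic, grounded answers
)
print(response.choices[0].message.content)
```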
## LLMOps Considerations and Technical Decisions
The case study reveals several important LLMOps considerations that influenced Thomson Reuters' architectural decisions. The choice of RAG over fine-tuning or relying solely on base LLM knowledge reflects a pragmatic approach to maintaining accuracy and currency of information. The article emphasizes that RAG provides "a more economical means to introduce new/update knowledge over retraining LLMs," which is crucial for a domain where regulations, product features, and best practices evolve continuously.
The distinction between parametric and non-parametric knowledge is central to their rationale. Parametric knowledge—what's learned during LLM training—becomes fixed and less adaptable to new information once training is complete. Non-parametric methods, by contrast, allow for flexible maintenance and updating of knowledge post-training. This adaptability was deemed "crucial in real-world applications where the data may evolve over time," which precisely describes the legal and tax support environment.
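To make this distinction concrete, a hypothetical update path (continuing the earlier sketches) is shown below: when a regulation or product behaviour changes, only the new or revised article needs to be embedded and inserted into the existing collection, and the underlying LLM is never retrained.

```python
# Hypothetical non-parametric knowledge update: index the new article and it becomes
# retrievable immediately; no model weights change.
new_article = "Updated guidance: the Third Party Designee fields are now located under ..."
new_chunks = chunk_text(new_article)
new_embeddings = model.encode(new_chunks, normalize_embeddings=True)

collection.insert([new_chunks, new_embeddings.tolist()])
collection.flush()  # the updated knowledge is searchable without any retraining
```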
The selection of specific technologies reveals operational priorities. By choosing Milvus, an open-source vector database, rather than a proprietary solution, Thomson Reuters maintained flexibility and potentially reduced vendor lock-in. The use of the all-MiniLM-L6-v2 sentence transformer suggests a focus on efficient, lightweight embeddings that balance performance with computational cost—this model is known for being relatively small while maintaining good performance for semantic similarity tasks.
The decision to use GPT-4 via API rather than self-hosting an open-source alternative indicates a prioritization of output quality and time-to-production over full control of the model infrastructure. This represents a common LLMOps tradeoff: leveraging best-in-class commercial APIs versus managing open-source models internally. The article doesn't discuss latency considerations, cost management, or API rate limiting strategies, which would typically be important concerns in production deployments.
## Production Deployment and Real-World Performance
The article provides a concrete example comparing GPT-4 responses with and without RAG for a specific technical support scenario. The query involved a tax software error: "1040 e-file error: IND-041 Error: If 'ThirdPartyDesigneeInd' in the return has a choice of 'Yes' indicated, then 'ThirdPartyDesigneePIN' must have a value."
Without RAG, GPT-4 provided a plausible-sounding but generic response explaining the error conceptually and offering general troubleshooting steps. The response demonstrated the model's general knowledge about tax forms and IRS procedures but lacked specificity to Thomson Reuters' products. The article notes that "though the above response does look like it makes sense this is not the most accurate response to solve the issue."
With RAG enabled, the system provided specific, actionable steps tied to Thomson Reuters' product interface: navigating to the Organizer, clicking on General Information, opening Basic Return Information, accessing Paid Preparer Information, and correcting fields in the Third Party Designee section. The article claims this response "is most accurate as this match with the resolution to solve the issue within our products with the most recent information."
This comparison illustrates the core value proposition of RAG in production: grounding LLM outputs in verified, company-specific documentation to ensure accuracy and actionability. However, it's worth noting that the article doesn't provide comprehensive evaluation metrics, success rates across different query types, or information about edge cases and failure modes—details that would normally be important in assessing production LLM system performance.
## Critical Assessment and Balanced Perspective
While the case study presents a compelling success story, several aspects warrant balanced consideration. The article claims the solution "reduce[d] resolution times" and delivered "better, faster customer service," but provides no quantitative metrics to support these claims. There are no statistics on percentage reduction in average handling time, customer satisfaction scores, or agent productivity improvements. The single before-and-after example, while illustrative, doesn't constitute a comprehensive evaluation.
The article also doesn't address several typical LLMOps challenges that would be relevant to this deployment. There's no discussion of how the system handles ambiguous queries, conflicts between retrieved documents, or situations where the knowledge base doesn't contain relevant information. The prompt engineering approach—how instructions are crafted to guide GPT-4's use of retrieved context—is mentioned but not detailed, yet this is often a critical factor in RAG system performance.
Cost considerations are entirely absent from the discussion. Running GPT-4 API calls for customer support at scale can be expensive, and there's no mention of cost-benefit analysis, optimization strategies, or whether usage patterns align with business economics. Similarly, there's no information about system monitoring, performance tracking, or continuous improvement processes—all essential components of production LLMOps.
The article presents RAG's advantages as reducing hallucinations, providing provenance, and enabling economical knowledge updates, which are legitimate benefits. However, RAG systems also introduce new failure modes: retrieval failures (relevant information not retrieved), noise (irrelevant information retrieved), and brittleness to query phrasing. The article doesn't acknowledge these tradeoffs or explain how Thomson Reuters addresses them.
## Broader LLMOps Insights
Despite these limitations in quantitative detail, the case study does illustrate several important LLMOps patterns and principles. First, it demonstrates the value of RAG for domain-specific applications where accuracy and verifiability are paramount—legal and tax support being exemplary domains where hallucinations could cause serious problems. The emphasis on provenance aligns with professional standards in legal services where being able to cite sources is often required.
Second, the hybrid approach of combining open-source components (Milvus, sentence transformers) with commercial APIs (GPT-4) represents a pragmatic middle path that many organizations adopt. This allows teams to maintain control and flexibility over some components while leveraging state-of-the-art capabilities where they matter most—in this case, the final generation quality.
Third, the focus on augmenting rather than replacing human agents reflects a realistic approach to AI deployment in high-stakes domains. The system provides agents with better information more quickly, but agents remain in control of customer interactions and final decisions. This human-in-the-loop approach is often more practical and safer than fully automated systems, particularly in specialized professional services.
The article contextualizes this work within Thomson Reuters' 30-year history of applying AI to information access, noting they "delivered a large-scale natural language search system to market before Google even existed." This historical perspective suggests organizational capability in deploying AI systems at scale, though it doesn't provide details about how lessons from previous systems informed this RAG deployment.
## Technical Evolution and Industry Context
The case study positions this work within the rapid evolution of LLM capabilities and the emergence of the vector database ecosystem. The article notes that RAG as a concept originated in a 2021 paper but "has become a lot more prevalent with LLMs in the past few months," reflecting the acceleration of practical implementations following the release of ChatGPT and GPT-4.
The mention of various vector database options and the significant venture capital flowing into this space (Pinecone's $100M raise) illustrates how quickly the LLMOps tooling ecosystem was maturing. This proliferation of specialized infrastructure components makes it increasingly feasible for organizations to build production RAG systems without building everything from scratch.
The collaboration between Thomson Reuters Labs (the research function) and customer success teams exemplifies cross-functional cooperation required for successful LLMOps initiatives. Research teams bring technical expertise in ML/AI, while business teams provide domain knowledge, use case definition, and evaluation criteria. This coordination is often cited as a critical success factor in moving from research prototypes to production systems.
In summary, this case study documents an early production deployment of RAG technology in a high-stakes, domain-specific context. While the article lacks comprehensive metrics and doesn't fully address typical LLMOps challenges, it illustrates important architectural patterns, technology choices, and the practical value of grounding LLM outputs in verified knowledge bases. The emphasis on accuracy, provenance, and augmenting expert human agents reflects appropriate priorities for legal and tax support applications.