Company
Thomson Reuters
Title
Enterprise LLM Playground Development for Internal AI Experimentation
Industry
Media & Entertainment
Year
2023
Summary (short)
Thomson Reuters developed Open Arena, an enterprise-wide LLM playground, in under 6 weeks using AWS services. The platform enables non-technical employees to experiment with various LLMs in a secure environment, combining open-source and in-house models with company data. The solution saw rapid adoption with over 1,000 monthly users and helped drive innovation across the organization by allowing safe experimentation with generative AI capabilities.
## Overview

Thomson Reuters, a global content and technology company with a history in AI and natural language processing dating back to its 1992 Westlaw Is Natural (WIN) system, developed an enterprise-wide LLM experimentation platform called "Open Arena." The initiative emerged from an internal AI/ML hackathon and was built in collaboration with AWS in under six weeks. The platform represents a significant effort to democratize access to generative AI across the organization, allowing employees without coding backgrounds to experiment with LLMs and identify potential business use cases.

The primary objective was a safe, secure, and user-friendly "playground" where internal teams could explore both in-house and open-source LLMs, and discover novel applications by combining LLM capabilities with Thomson Reuters's extensive proprietary data. The case study offers useful insight into how a large enterprise can rapidly stand up LLM experimentation infrastructure on managed cloud services.

## Architecture and Infrastructure

Open Arena was built entirely on AWS managed services, prioritizing scalability, cost-effectiveness, and rapid deployment. The serverless architecture allows for modular expansion as new AI trends and models emerge.

### Core Infrastructure Components

Amazon SageMaker serves as the backbone of the platform, hosting models as SageMaker endpoints and providing a robust environment for fine-tuning. The team used the Hugging Face Deep Learning Containers (DLCs) offered through the AWS and Hugging Face partnership, which significantly accelerated deployment. The SageMaker Hugging Face Inference Toolkit, combined with the Accelerate library, was instrumental in handling the computational demands of large, resource-intensive models.

AWS Lambda functions, triggered by Amazon API Gateway, implement the API layer and handle preprocessing and postprocessing of data. The front end is deployed as a static site on Amazon S3, with Amazon CloudFront providing content delivery and integrating with the company's single sign-on mechanism for user authentication. Amazon DynamoDB serves as the NoSQL store for operational data, including user queries, responses, response times, and user metadata. AWS CodeBuild and AWS CodePipeline provide the CI/CD workflow, and Amazon CloudWatch supplies monitoring through custom dashboards and comprehensive logging.

### Security Considerations

Security was a primary concern from the platform's inception. The architecture ensures that all data used for fine-tuning LLMs remains encrypted and never leaves the Virtual Private Cloud (VPC), preserving privacy and confidentiality. This matters particularly for an enterprise like Thomson Reuters that handles sensitive legal, financial, and news content.

## Model Development and Integration

### Model Selection and Experimentation

Open Arena integrates with multiple LLMs through REST APIs, providing the flexibility to quickly incorporate new state-of-the-art models as they are released, a design decision that anticipates the rapidly evolving generative AI landscape. The team experimented with several open-source models, including Flan-T5-XL, Open Assistant, MPT, and Falcon. They also fine-tuned Flan-T5-XL on available open-source datasets using parameter-efficient fine-tuning (PEFT) techniques, which adapt a model by training only a small number of additional parameters and therefore require far less compute than full fine-tuning. For optimization, the team used the bitsandbytes integration from Hugging Face to experiment with quantization techniques; quantization reduces model size and inference latency by using lower-precision numerical representations, which is critical for production deployments where cost and latency matter.
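To make this concrete, the sketch below loads Flan-T5-XL in 8-bit precision via the bitsandbytes integration and attaches a LoRA adapter with the Hugging Face `peft` library. This is a minimal illustration of the general technique; the hyperparameters are assumptions, not details from the case study.

```python
# Minimal sketch: parameter-efficient fine-tuning (LoRA) of Flan-T5-XL,
# with the base model loaded in 8-bit via bitsandbytes. Hyperparameters
# are illustrative assumptions, not values from the case study.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

model_id = "google/flan-t5-xl"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 8-bit loading requires bitsandbytes, accelerate, and a CUDA device
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id, load_in_8bit=True, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                       # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "v"],  # T5 attention query/value projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

From here, the wrapped model can be trained with the standard Hugging Face `Seq2SeqTrainer`, and only the small adapter weights need to be stored per use case.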
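Once a model is selected (or adapted as above), the hosting pattern described under Core Infrastructure Components might look like the following. The framework versions, instance type, and model ID are illustrative assumptions:

```python
# Minimal sketch: hosting an open-source LLM behind a SageMaker endpoint
# using the Hugging Face Deep Learning Containers. Versions, instance
# type, and model ID are illustrative assumptions.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

model = HuggingFaceModel(
    env={
        "HF_MODEL_ID": "google/flan-t5-xl",  # pulled from the Hugging Face Hub
        "HF_TASK": "text2text-generation",
    },
    role=role,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # single-GPU inference instance
)

# In Open Arena, calls like this would be fronted by API Gateway and Lambda
print(predictor.predict({"inputs": "Summarize: The court held that ..."}))
```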
### Model Evaluation Criteria

The team developed a structured approach to model selection that weighs both performance and engineering considerations. Key evaluation criteria included:

- Performance on the NLP tasks relevant to Thomson Reuters use cases
- Cost-effectiveness, comparing larger models against smaller ones to determine whether the performance gains justify the added cost
- Ability to handle long documents, essential for legal and news content
- Ease of integrating and deploying models into applications running on AWS
- Secure customization, ensuring data remains encrypted during fine-tuning
- Flexibility to choose from a wide selection of models for varied use cases

Models were evaluated on both open-source legal datasets and Thomson Reuters internal datasets to assess their suitability for specific use cases.

## Retrieval Augmented Generation (RAG) Pipeline

For content-based experiences that require answers drawn from a specific corpus, the team implemented a RAG pipeline. This approach grounds LLM responses in authoritative company data rather than relying solely on the model's parametric knowledge.

### RAG Implementation Details

The pipeline follows a standard but well-executed design. Documents are first split into chunks; an embedding is then created for each chunk and stored in Amazon OpenSearch Service (AWS's managed search service, derived from Elasticsearch), producing a searchable vector index of company content.

To retrieve the most relevant chunks for a given query, the team implemented a retriever/re-ranker approach based on bi-encoder and cross-encoder models, sketched below. Bi-encoders encode queries and documents into dense vectors independently, enabling fast similarity search, while cross-encoders score relevance more accurately by jointly encoding each query-document pair. This two-stage design balances efficiency with accuracy.

The best-matching retrieved content is passed as context to the LLM along with the user's query, producing responses grounded in Thomson Reuters's proprietary content. This integration of internal content with LLM capabilities has enabled users to extract relevant, insightful results while sparking ideas for AI-enabled solutions across business workflows.
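A minimal sketch of that two-stage retrieval step using the open-source sentence-transformers library. The small in-memory corpus stands in for the OpenSearch-backed index, and the model names are common public choices rather than details from the case study:

```python
# Minimal sketch: bi-encoder retrieval followed by cross-encoder re-ranking.
# The in-memory corpus stands in for the OpenSearch vector index described
# above; model names are common open-source choices, not from the case study.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

corpus = [
    "The court granted the motion for summary judgment.",
    "Quarterly revenue rose 4% on strong subscription growth.",
    "The new regulation takes effect in January.",
]

# Stage 1: bi-encoder retrieves candidates via dense-vector similarity
bi_encoder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor=True)

query = "What did the court decide?"
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]

# Stage 2: cross-encoder re-scores each (query, candidate) pair jointly
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
candidates = [corpus[h["corpus_id"]] for h in hits]
scores = cross_encoder.predict([(query, c) for c in candidates])

best = candidates[scores.argmax()]
print(best)  # passed to the LLM as grounding context
```

Re-ranking only a short candidate list keeps the expensive cross-encoder pass cheap while preserving most of its accuracy benefit.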
## User Experience and Interface Design

Open Arena adopts a tile-based interface with pre-set tiles for different experiences. This design simplifies interaction and makes the platform accessible to employees without technical backgrounds.

### Available Experiences

The platform offers several distinct interaction modes:

- **Experiment with Open Source LLM**: opens a chat-like channel for general experimentation with open-source LLMs
- **Ask your Document**: lets users upload a document and ask questions about its content, leveraging the RAG pipeline
- **Experiment with Summarization**: distills large volumes of text into concise summaries

These pre-set tiles cater to specific user needs while simplifying navigation. Offering task-specific interfaces rather than a single generic chat window guides users toward productive experimentation and accelerates use case discovery.

## Production Metrics and Impact

The platform achieved significant adoption within its first month, drawing more than 1,000 monthly internal users from Thomson Reuters's global operations, with interaction sessions averaging roughly 5 minutes, a sign of meaningful engagement rather than superficial exploration.

User testimonials highlight several benefits:

- Hands-on AI learning for employees across the company, not just technical teams
- A safe environment for experimenting with real company content (such as news stories) without data-leak concerns
- Responsive feature development driven by user feedback
- New ideas for AI applications, such as customer support agent interfaces

Open Arena has served as an effective sandbox for AI experimentation, allowing teams to identify and refine AI applications before incorporating them into production products. Validating concepts this way reduces the risk of significant engineering investment in unproven ideas.

## Future Development and Roadmap

The team reports ongoing work to add features and enhance platform capabilities, notably plans to integrate Amazon Bedrock and Amazon SageMaker JumpStart, which would expand access to additional foundation models, including those from Anthropic, AI21 Labs, Stability AI, and Amazon's own Titan family.

Beyond the platform itself, Thomson Reuters is actively "productionizing the multitude of use cases generated by the platform," suggesting that Open Arena has succeeded as an innovation catalyst: the most promising experiments are being developed into production AI features for customer-facing products.

## Critical Assessment

While this case study presents an impressive rapid-development story, several caveats are worth noting:

The 6-week development timeline is notable but should be understood in context: this is an internal experimentation platform, not a customer-facing production system, and the compliance, testing, and reliability requirements for internal tools are typically less stringent.

The metrics provided (1,000 monthly users, 5-minute average sessions) are useful but limited. There is no information about conversion rates from experimentation to production use cases, nor quantitative measures of the business value generated.

The heavy emphasis on AWS services reflects the collaborative nature of the case study, which was published on an AWS blog. While the architectural choices appear sound, alternatives built on other cloud providers or open-source infrastructure are not discussed.
Nevertheless, a centralized, secure LLM experimentation environment is a practical strategy for enterprises that want to foster AI innovation while maintaining governance and security controls, and the modular, serverless architecture offers a template for organizations building similar capabilities.
