Thomson Reuters developed Open Arena, an enterprise-wide LLM playground, in under 6 weeks using AWS services. The platform enables non-technical employees to experiment with various LLMs in a secure environment, combining open-source and in-house models with company data. The solution saw rapid adoption with over 1,000 monthly users and helped drive innovation across the organization by allowing safe experimentation with generative AI capabilities.
Thomson Reuters, a global content and technology-driven company with a long history in AI and natural language processing dating back to their 1992 Westlaw Is Natural (WIN) system, developed an enterprise-wide LLM experimentation platform called “Open Arena.” This initiative emerged from an internal AI/ML hackathon and was built in collaboration with AWS in under 6 weeks. The platform represents a significant effort to democratize access to generative AI capabilities across the organization, allowing employees without coding backgrounds to experiment with LLMs and identify potential business use cases.
The primary objective was to create a safe, secure, and user-friendly “playground” environment where internal teams could explore both in-house developed and open-source LLMs, while also discovering unique applications by combining LLM capabilities with Thomson Reuters’s extensive proprietary data. This case study provides valuable insights into how a large enterprise can rapidly deploy an LLM experimentation infrastructure using managed cloud services.
The Open Arena platform was built entirely on AWS managed services, prioritizing scalability, cost-effectiveness, and rapid deployment. The architecture leverages a serverless approach that allows for modular expansion as new AI trends and models emerge.
Amazon SageMaker serves as the backbone of the platform, handling model deployment as SageMaker endpoints and providing a robust environment for model fine-tuning. The team utilized Hugging Face Deep Learning Containers (DLCs) offered through the AWS and Hugging Face partnership, which significantly accelerated the deployment process. The SageMaker Hugging Face Inference Toolkit combined with the Accelerate library was instrumental in handling the computational demands of running complex, resource-intensive models.
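The deployment pattern described above can be sketched with the SageMaker Python SDK and a Hugging Face DLC. This is a configuration sketch, not Thomson Reuters's actual code: the model ID, framework versions, and instance type are illustrative, and running it requires an AWS account with an execution role.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# Hub configuration consumed by the Hugging Face DLC
# (model and task are illustrative choices)
hub = {
    "HF_MODEL_ID": "google/flan-t5-xl",
    "HF_TASK": "text2text-generation",
}

role = sagemaker.get_execution_role()  # requires an AWS environment

model = HuggingFaceModel(
    env=hub,
    role=role,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
)

# Deploy as a real-time SageMaker endpoint (instance type is illustrative)
predictor = model.deploy(initial_instance_count=1,
                         instance_type="ml.g5.2xlarge")
predictor.predict({"inputs": "Summarize: ..."})
```

The DLC's inference toolkit handles model loading and request serialization, which is what allows a model to be deployed without writing a custom serving container.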
AWS Lambda functions, triggered by Amazon API Gateway, manage the API layer and handle preprocessing and postprocessing of data. The front end is deployed as a static site on Amazon S3, with Amazon CloudFront providing content delivery and integration with the company’s single sign-on mechanism for user authentication.
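A handler for this Lambda layer might look like the following sketch. The `make_handler` factory and its `invoke_model` parameter are illustrative devices (not from the case study) that keep the endpoint call injectable, so the pre- and postprocessing logic can be exercised locally without AWS.

```python
import json

def make_handler(invoke_model):
    """Build a Lambda-style handler; `invoke_model` abstracts the
    SageMaker endpoint call so the handler can be tested locally."""
    def handler(event, context=None):
        # Preprocessing: parse the API Gateway request body and validate input
        body = json.loads(event.get("body") or "{}")
        prompt = (body.get("prompt") or "").strip()
        if not prompt:
            return {"statusCode": 400,
                    "body": json.dumps({"error": "prompt is required"})}
        raw = invoke_model(prompt)
        # Postprocessing: tidy the raw model output before returning it
        answer = raw.strip()
        return {"statusCode": 200,
                "body": json.dumps({"answer": answer})}
    return handler

# Local usage with a stubbed model in place of a real endpoint call:
handler = make_handler(lambda p: f"echo: {p} ")
resp = handler({"body": json.dumps({"prompt": "Hello"})})
```

In production the stub would be replaced by a `boto3` `sagemaker-runtime` `invoke_endpoint` call, with the handler wired to an API Gateway route.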
Amazon DynamoDB was chosen as the NoSQL database service for storing and managing operational data including user queries, responses, response times, and user metadata. For continuous integration and deployment, the team employed AWS CodeBuild and AWS CodePipeline, establishing a proper CI/CD workflow. Amazon CloudWatch provides monitoring capabilities with custom dashboards and comprehensive logging.
Security was a primary concern from the platform’s inception. The architecture ensures that all data used for fine-tuning LLMs remains encrypted and does not leave the Virtual Private Cloud (VPC), maintaining data privacy and confidentiality. This is particularly important for an enterprise like Thomson Reuters that handles sensitive legal, financial, and news content.
The Open Arena platform has been designed to integrate seamlessly with multiple LLMs through REST APIs, providing flexibility to quickly incorporate new state-of-the-art models as they are released. This architectural decision reflects an understanding of the rapidly evolving generative AI landscape.
The team experimented with several open-source models including Flan-T5-XL, Open Assistant, MPT, and Falcon. They also fine-tuned Flan-T5-XL on available open-source datasets using parameter-efficient fine-tuning (PEFT) techniques, which allow for model adaptation with reduced computational resources compared to full fine-tuning.
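The parameter savings behind PEFT methods such as LoRA can be shown with a toy NumPy sketch (dimensions and rank are illustrative, not the team's actual settings): the pretrained weight `W` stays frozen while two small low-rank factors are trained.

```python
import numpy as np

d_in, d_out, r = 1024, 1024, 8  # hypothetical layer sizes and LoRA rank

# Frozen pretrained weight and its trainable low-rank adapter factors
W = np.random.randn(d_out, d_in).astype(np.float32)
A = np.random.randn(r, d_in).astype(np.float32) * 0.01  # trainable
B = np.zeros((d_out, r), dtype=np.float32)              # trainable, init 0

def lora_forward(x, alpha=16):
    # Effective weight is W + (alpha / r) * B @ A; only A and B are trained,
    # and B's zero init means training starts from the pretrained behavior.
    return x @ (W + (alpha / r) * (B @ A)).T

full_params = W.size
lora_params = A.size + B.size
print(f"trainable params: {lora_params} vs {full_params} "
      f"({100 * lora_params / full_params:.2f}%)")
```

At rank 8 the adapter trains under 2% of the layer's parameters, which is why PEFT fits on far smaller hardware than full fine-tuning.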
For optimization, the team utilized bitsandbytes integration from Hugging Face to experiment with various quantization techniques. Quantization reduces model size and inference latency by using lower-precision numerical representations, which is critical for production deployments where cost and latency are important considerations.
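The core idea of quantization can be illustrated with a toy absmax int8 scheme, independent of bitsandbytes (which applies considerably more sophisticated variants in practice):

```python
import numpy as np

def quantize_int8(x):
    # Absmax quantization: map float32 values into int8 via a per-tensor scale
    scale = np.abs(x).max() / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

weights = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
err = np.abs(weights - recovered).max()
# int8 storage is 4x smaller than float32, at the cost of `err` rounding error
```

The trade-off visible here — a 4x memory reduction against bounded rounding error — is the same one that production deployments weigh when choosing a precision level.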
The team developed a structured approach to model selection, evaluating candidate models against both performance and engineering criteria.
Models were evaluated on both open-source legal datasets and Thomson Reuters internal datasets to assess suitability for specific use cases.
For content-based experiences that require answers from specific corpora, the team implemented a sophisticated RAG pipeline. This approach is essential for grounding LLM responses in authoritative company data rather than relying solely on the model’s parametric knowledge.
The RAG pipeline follows a standard but well-implemented approach. Documents are first split into chunks, then embeddings are created for each chunk and stored in Amazon OpenSearch Service (AWS's managed search service, derived from Elasticsearch). This creates a searchable vector database of company content.
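The chunking step can be sketched as a simple overlapping splitter; the function name, chunk size, and overlap below are illustrative, not Thomson Reuters's actual settings, and the OpenSearch indexing call is shown only as a comment.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split a document into overlapping word-based chunks.
    Overlap keeps sentences that straddle a boundary retrievable
    from either side."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# Each chunk would then be embedded and indexed, e.g. with opensearch-py:
#   client.index(index="corpus",
#                body={"text": chunk, "vector": embed(chunk)})
```

Character- or token-based splitting (often aligned to the embedding model's context window) is a common refinement of this word-based sketch.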
To retrieve the most relevant documents or chunks for a given query, the team implemented a retrieval/re-ranker approach based on bi-encoder and cross-encoder models. Bi-encoders efficiently encode queries and documents into dense vectors for fast similarity search, while cross-encoders provide more accurate relevance scoring by jointly encoding query-document pairs. This two-stage retrieval approach balances efficiency with accuracy.
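The two-stage retrieval flow can be sketched as follows. The encoder and scorer are injected as plain functions (in practice these would be bi-encoder and cross-encoder models, e.g. from sentence-transformers); the function names and the toy similarity logic are illustrative.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u))
           * math.sqrt(sum(b * b for b in v)))
    return num / den if den else 0.0

def retrieve_and_rerank(query, docs, bi_encode, cross_score, k=10, top=3):
    # Stage 1 (bi-encoder): embed query and docs independently,
    # rank all docs by cheap vector similarity, keep top-k candidates.
    q_vec = bi_encode(query)
    candidates = sorted(docs,
                        key=lambda d: cosine(q_vec, bi_encode(d)),
                        reverse=True)[:k]
    # Stage 2 (cross-encoder): jointly score each (query, doc) pair;
    # slower but more accurate, applied only to the shortlist.
    return sorted(candidates,
                  key=lambda d: cross_score(query, d),
                  reverse=True)[:top]
```

In a real deployment, stage 1 would be an approximate nearest-neighbor query against the vector index rather than a full scan over all documents.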
The retrieved best-matching content is then passed as context to the LLM along with the user’s query to generate responses grounded in Thomson Reuters’s proprietary content. This integration of internal content with LLM capabilities has been instrumental in enabling users to extract relevant and insightful results while sparking ideas for AI-enabled solutions across business workflows.
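The final step — packing retrieved content and the user's query into a grounded prompt — might look like this sketch; the template wording and function name are illustrative, not the platform's actual prompt.

```python
def build_rag_prompt(query, retrieved_chunks):
    """Assemble a prompt that grounds the LLM in retrieved content,
    with numbered chunks so answers can cite their sources."""
    context = "\n\n".join(f"[{i + 1}] {c}"
                          for i, c in enumerate(retrieved_chunks))
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

The explicit instruction to refuse when the context is insufficient is a common guard against the model falling back on its parametric knowledge.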
Open Arena adopts a tile-based interface design with pre-set enabling tiles for different experiences. This approach simplifies user interaction and makes the platform accessible to employees without technical backgrounds.
The platform offers several distinct interaction modes, each exposed through its own tile.
These pre-set tiles cater to specific user requirements while simplifying navigation within the platform. The design choice to create task-specific interfaces rather than a generic chat interface helps guide users toward productive experimentation and accelerates use case discovery.
The platform achieved significant adoption within the first month of launch, with over 1,000 monthly internal users from Thomson Reuters’s global operations. Users averaged approximately 5-minute interaction sessions, indicating meaningful engagement rather than superficial exploration.
User testimonials highlight several key benefits of the platform.
The platform has served as an effective sandbox for AI experimentation, allowing teams to identify and refine AI applications before incorporating them into production products. This approach accelerates the development pipeline by validating concepts before significant engineering investment.
The team indicates ongoing development to add features and enhance platform capabilities. Notably, they mention plans to integrate AWS services like Amazon Bedrock and Amazon SageMaker JumpStart, which would expand access to additional foundation models including those from Anthropic, AI21 Labs, Stability AI, and Amazon’s own Titan models.
Beyond platform development, Thomson Reuters is actively working on “productionizing the multitude of use cases generated by the platform,” suggesting that Open Arena has successfully served its purpose as an innovation catalyst. The most promising experiments are being developed into production AI features for customer-facing products.
While this case study presents an impressive rapid development story, it’s important to note several considerations:
The 6-week development timeline is notable but should be understood in context: this is an internal experimentation platform, not a production customer-facing system. The compliance, testing, and reliability requirements for internal tools are typically less stringent than those for external products.
The metrics provided (1,000 monthly users, 5-minute average sessions) are useful but limited. There’s no information about conversion rates from experimentation to actual production use cases, or quantitative measures of business value generated.
The heavy emphasis on AWS services reflects the collaborative nature of this case study between Thomson Reuters and AWS, which was published on an AWS blog. While the architectural choices appear sound, alternative approaches using other cloud providers or open-source infrastructure are not discussed.
Nevertheless, the approach of creating a centralized, secure LLM experimentation environment represents a practical strategy for enterprises looking to foster AI innovation while maintaining governance and security controls. The modular, serverless architecture provides a template for organizations looking to build similar capabilities.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Climate tech startups are leveraging Amazon SageMaker HyperPod to build specialized foundation models that address critical environmental challenges including weather prediction, sustainable material discovery, ecosystem monitoring, and geological modeling. Companies like Orbital Materials and Hum.AI are training custom models from scratch on massive environmental datasets, achieving significant breakthroughs such as tenfold performance improvements in carbon capture materials and the ability to see underwater from satellite imagery. These startups are moving beyond traditional LLM fine-tuning to create domain-specific models with billions of parameters that process multimodal environmental data including satellite imagery, sensor networks, and atmospheric measurements at scale.
DoorDash faced challenges in scaling personalization and maintaining product catalogs as they expanded beyond restaurants into new verticals like grocery, retail, and convenience stores, dealing with millions of SKUs and cold-start scenarios for new customers and products. They implemented a layered approach combining traditional machine learning with fine-tuned LLMs, RAG systems, and LLM agents to automate product knowledge graph construction, enable contextual personalization, and provide recommendations even without historical user interaction data. The solution resulted in faster, more cost-effective catalog processing, improved personalization for cold-start scenarios, and the foundation for future agentic shopping experiences that can adapt to real-time contexts like emergency situations.