ZenML

Automated Sports Commentary Generation using LLMs

WSC Sport 2023

WSC Sport developed an automated system to generate real-time sports commentary and recaps using LLMs. The system takes game event data and creates coherent, engaging narratives that can be automatically translated into multiple languages and delivered with synthesized voice commentary. The solution reduced production time from 3-4 hours to 1-2 minutes while maintaining high quality and accuracy.

Industry

Media & Entertainment

Overview

WSC Sport, a sports technology company with approximately 400 employees operating globally, has developed an automated AI-powered sports commentary system that generates complete game recap videos with synthetic narration. The company works with major sports leagues including the NBA, NFL, NHL, Bundesliga, and Premier League. This case study, presented by Alik, who leads the NLP team at WSC Sport, demonstrates how they built a production system that reduces the time to create narrated sports recaps from 3-4 hours to approximately 1-2 minutes.

The fundamental problem they're solving is meeting the demands of modern sports consumers, particularly younger audiences who don't want to watch entire games but instead prefer quick 5-10 minute summaries that tell the complete story with all the key moments and interesting statistics. Producing such content traditionally requires human commentators who must know all the game information, statistics, player backgrounds (such as returns from injury), and game momentum—a process that takes hours, including data collection, script writing, studio recording, and quality assurance.

System Architecture

The production pipeline consists of several key components working together:

Highlight Generation: WSC Sport’s core technology automatically identifies and extracts key moments from sports broadcasts within minutes. This existing capability provides the foundation—the system already knows what events occurred and can rate their significance.

Script Generation (Roger): The LLM-based component they internally call "Roger" takes all the event data and background knowledge and writes coherent scripts. This includes understanding what happened before and during the game, momentum shifts (like comebacks from deficits), interesting statistics, and notable achievements (such as record-breaking performances like LeBron's historic milestones).

Audio Synthesis (John): The text-to-speech component they call “John” serves as the synthetic commentator. It must stay synchronized with the video content, know when to inject emotion and excitement, and produce natural-sounding narration that matches what’s happening on screen.

Translation Layer: An additional model handles translation to multiple languages including Spanish, French, Portuguese (Brazilian), Turkish, and Polish—each with its own challenges and nuances.
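The components above can be sketched as a simple pipeline. This is a minimal illustration, not WSC Sport's actual API: every class, function, and field name below is an assumption, and the real "Roger" and "John" components are an LLM and a TTS model rather than the stand-ins used here.

```python
from dataclasses import dataclass

@dataclass
class Event:
    """One highlight event with explicit metadata (all field names assumed)."""
    action: str
    player: str
    clock: str
    score: str

def roger_write_script(events: list[Event]) -> list[str]:
    """Stand-in for the LLM script writer: one sentence per event."""
    return [f"{e.clock}: {e.player} with the {e.action}, {e.score}." for e in events]

def translate(lines: list[str], language: str) -> list[str]:
    """Stand-in for the translation model."""
    return [f"[{language}] {line}" for line in lines]

def john_narrate(lines: list[str]) -> str:
    """Stand-in for the TTS component; here it just joins the script."""
    return " ".join(lines)

events = [
    Event("three-pointer", "Curry", "Q1 2:14", "12-9"),
    Event("dunk", "James", "Q2 7:30", "45-41"),
]
script = roger_write_script(events)
narration = john_narrate(translate(script, "es"))
```

The key architectural point is that each stage consumes the previous stage's structured output, so translation and narration never need to re-derive what happened in the game.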

Data Structure and Event Representation

The system works with rich structured data about each game event. For each moment, they have access to:

This structured metadata is crucial because, as they emphasize throughout the presentation, they explicitly provide this information to the model rather than having it guess—a key strategy for reducing hallucinations.

Evolution of Approaches

Naive Approach (Zero-Shot): Their initial attempt was straightforward—feed all events with their details into the model and generate the complete script in one pass. This approach failed for several reasons:

Sequential Approach: They moved to processing each event in isolation, which provided better control but introduced new challenges.
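The contrast between the two approaches can be shown in a few lines. This is a hedged sketch: `call_llm` is a stub standing in for a real model call, and the prompt wording is invented for illustration.

```python
def call_llm(prompt: str) -> str:
    """Stub for a real LLM call: echoes the last prompt line as the 'generation'."""
    return prompt.splitlines()[-1]

def zero_shot(events: list[str]) -> str:
    """Naive approach: all events in one prompt, one generation pass."""
    prompt = "Write a full recap script for:\n" + "\n".join(events)
    return call_llm(prompt)

def sequential(events: list[str]) -> list[str]:
    """Sequential approach: one model call per event, each handled in isolation."""
    return [call_llm(f"Describe this single event:\n{event}") for event in events]

events = ["Curry three-pointer", "James dunk", "Tatum layup"]
lines = sequential(events)
```

The per-event loop gives fine-grained control (each line can be validated or regenerated on its own), at the cost of losing cross-event context—which is exactly where the repetition and logic problems described next come from.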

Key Technical Challenges

Repetition: When generating a 2-minute recap, certain phrases would repeat. For example, every replay might trigger "let's take another look," which becomes grating to viewers. The system needed to vary its language while maintaining appropriate context.

Logical Errors: The model needed to understand sports commentary conventions. For example, it’s nonsensical to say a team is “still leading” less than a minute into a game—there’s no “still” about it. The model needed to understand the logical context of sports timing.
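This class of timing error can also be caught programmatically. The talk doesn't detail how WSC Sport addressed it, so the rule below is purely illustrative: a post-generation check that flags persistence words like "still" when almost no game time has elapsed.

```python
import re

def flag_timing_errors(sentence: str, seconds_elapsed: int) -> list[str]:
    """Flag commentary phrases that presuppose elapsed game time (illustrative rule)."""
    issues = []
    # "still leading/trailing" implies the state has persisted for a while,
    # which is nonsensical in the opening minute of a game.
    if seconds_elapsed < 60 and re.search(r"\bstill (leading|trailing|ahead|behind)\b", sentence):
        issues.append("'still' used under one minute into the game")
    return issues
```

A flagged line can then be regenerated with an added instruction, rather than shipped as-is.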

Hallucinations: This was described as a particularly difficult problem. Examples included:

Three-Pillar Solution

1. System Prompt Engineering

Their system prompt follows three guiding principles:

2. Dynamic Prompt Instructions

To combat repetition and hallucinations, they built a system where the few-shot examples in the prompt are dynamically selected based on the current event being described. The system indexes examples by:

For each incoming event, they retrieve relevant examples and perform random sampling from the matched pool. This significantly reduced repetitiveness in the output.
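The retrieve-then-sample step might look like the following. The index keys and example phrases are assumptions for illustration; the source doesn't disclose WSC Sport's actual schema.

```python
import random

# Few-shot examples indexed by event attributes (attribute names assumed).
EXAMPLES = {
    ("three_pointer", "lead_change"): [
        "And the three drops to put them in front!",
        "From downtown, and just like that the lead flips!",
    ],
    ("dunk", "none"): [
        "A thunderous slam inside!",
        "He rises and throws it down!",
    ],
}

def pick_few_shot(action: str, context: str, k: int = 1, seed: int = 0) -> list[str]:
    """Retrieve examples matching the event, then randomly sample from the pool."""
    pool = EXAMPLES.get((action, context), [])
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))

shots = pick_few_shot("dunk", "none", k=1)
```

Because the sample changes call to call, consecutive events of the same type see different exemplar phrasings, which is what breaks the "let's take another look" repetition loop.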

They are currently developing a more semantic approach to this retrieval, moving beyond simple indexing: as they scale across all sports, the relevant attributes vary widely from sport to sport, making a pure indexing approach unwieldy.

3. Structured Metadata

Rather than describing events in natural language and hoping the model extracts the right information, they explicitly provide structured data: what the action was, what parameters applied, what happened after the action, any statistics—all in an explicit format to minimize model errors.
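Concretely, this means rendering the event as explicit key/value pairs inside the prompt rather than as free-form prose. The field names and prompt wording below are assumptions, not WSC Sport's actual format.

```python
# Structured event metadata (field names illustrative).
event = {
    "action": "three_pointer",
    "player": "Stephen Curry",
    "clock": "Q4 1:32",
    "score_after": "108-105",
    "stat_note": "his 7th three of the night",
}

def event_to_prompt(event: dict) -> str:
    """Render structured metadata explicitly so the model never has to guess."""
    fields = "\n".join(f"{key}: {value}" for key, value in event.items())
    return f"Describe ONLY the event below. Use every field; invent nothing.\n{fields}"

prompt = event_to_prompt(event)
```

Everything the model is allowed to say is on the page; anything else in its output is, by construction, a hallucination the guardrails can look for.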

Hallucination Detection and Guardrails

Even with all the above measures, hallucination remained a significant challenge. They implemented a guardrailing process using Chain of Thought (CoT) prompting for detection.

The example given: An event was described as having a “steal” (ball interception) when no steal actually occurred. Their solution:

The presentation emphasized that while Chain of Thought is well-documented in academic papers, translating it to practical production use required careful prompt engineering to structure the verification process appropriately.
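The shape of that verification step can be sketched deterministically. In the talk the check is done by a Chain-of-Thought prompt to an LLM; the keyword-based checker below conveys the same idea—compare every action the sentence claims against the event's structured metadata—using an invented, illustrative vocabulary.

```python
# Maps surface words in generated commentary to metadata action codes (assumed).
ACTION_KEYWORDS = {"steal": "steal", "block": "block", "dunk": "dunk", "three": "three_pointer"}

def find_hallucinated_actions(sentence: str, event_actions: set[str]) -> list[str]:
    """Return action claims in the sentence that the metadata does not support."""
    lowered = sentence.lower()
    return [
        word for word, action in ACTION_KEYWORDS.items()
        if word in lowered and action not in event_actions
    ]

bad = find_hallucinated_actions(
    "After the steal, Curry buries the three!", {"three_pointer"}
)
```

Here the metadata records only a three-pointer, so the claimed "steal" is flagged—mirroring the example from the presentation—and the line can be regenerated before it reaches the narrator.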

Production Deployment Considerations

The system is currently deployed in production, visible in the NBA app with Spanish, French, and Portuguese commentary. The presenter mentioned they also produce content for other major leagues globally.

Key production insights shared:

Future Directions

The presentation hinted at several expansion areas:

Critical Assessment

While the presentation demonstrates impressive capabilities, it’s worth noting this is primarily a product pitch. The claimed reduction from 3-4 hours to 1-2 minutes is dramatic but likely represents best-case scenarios. The hallucination challenges they describe remain inherent to LLM-based generation, and while their mitigation strategies are sound, the presentation doesn’t provide metrics on error rates or human oversight requirements in production. The system’s effectiveness likely varies across different sports, languages, and edge cases not covered in a conference demo.
