ZenML

Video Content Summarization and Metadata Enrichment for Streaming Platform

Paramount+ 2023

Paramount+ partnered with Google Cloud Consulting to develop two key AI use cases: video summarization and metadata extraction for their streaming platform containing over 50,000 videos. The project used Gen AI jumpstarts to prototype solutions, implementing prompt chaining, embedding generation, and fine-tuning approaches. The system was designed to enhance content discoverability and personalization while reducing manual labor and third-party costs. The implementation included a three-component architecture handling transcription creation, content generation, and personalization integration.

Industry

Media & Entertainment

Overview

This case study documents a partnership between Paramount+ (Paramount Streaming) and Google Cloud Consulting to implement generative AI solutions for enhancing the user experience on the Paramount Plus streaming platform. The presentation was delivered by representatives from both organizations: James McCabe (Consulting Account Lead for Media Entertainment at Google Cloud), Sophian Saputra (Technical Account Manager for Media Entertainment at Google Cloud), Teresa (Lead Product Manager at Paramount), and Adam Ly (VP of ML Engineering and Personalization at Paramount Streaming).

The core philosophy driving this collaboration centers on a “user first” approach—the Paramount AIML team begins by focusing on user experience and then works backward to technical solutions. This customer-centric methodology guided the development of two primary generative AI use cases: video summarization and video metadata extraction, both aimed at improving content personalization and discovery on the streaming platform.

Business Context and Strategic Alignment

Paramount’s strategic objectives are clearly defined: expand subscriber base, retain viewers, boost engagement, and drive profitability. Within the media and entertainment landscape, they position AI not as an added feature but as a critical component at the center of their strategy. The partnership with Google Cloud enables a data-driven streaming service that understands and anticipates user preferences through AI-driven insights.

The practical motivation for these AI projects is substantial. Paramount+ has over 50,000 videos requiring summaries and metadata tagging, which translates into thousands of hours of manual labor. Additionally, metadata procurement from third-party providers is costly and often yields results lacking the detail required for effective personalization. By automating these tasks, Paramount aims to recover substantial time for creative and strategic pursuits while gaining greater control over the nuances of their content.

Metadata serves a dual role in their architecture: powering machine learning algorithms that fuel recommendation systems, and providing viewers with the content information they need for discovery. The “video DJ” concept they describe involves creating personalized collections that feel curated like a story with a beginning, middle, and end—understanding viewer preferences for Christmas tales, Halloween frights, or Westerns even when original content did not come with this metadata.

Google Cloud Consulting Engagement Model

Google Cloud Consulting employed a structured approach called “GenAI Jump Starts” to develop MVP solutions demonstrating the capabilities of Google’s generative AI. Before implementation, they aligned on Paramount’s GenAI objectives through briefing meetings, cadence calls, and brainstorming sessions. Training was provided through on-demand courses, hackathon jams, and virtual lunch-and-learns to build organizational knowledge around generative AI capabilities.

The Jump Start engagement was highly structured to deliver MVPs quickly. Key preparation steps included ensuring data availability in digestible formats—they identified 10 sample video assets representing different durations, genres, and release dates as a good representation of Paramount Plus films. They collaboratively decided on LLM output formats before writing prompts, iterating on both outputs and metadata fields throughout the engagement.

For this project, they specifically used videos from Paramount’s public YouTube channel, which allowed them to focus more on the generation phase rather than dealing with complex content access issues. This pragmatic approach to source material selection is a useful lesson for organizations beginning similar projects.

Technical Implementation and Architecture

The reference architecture consists of three main components: transcription creation, generation phase, and personalization integration.

Transcription Creation

The transcription pipeline begins with source videos in cloud storage buckets. A demuxing process extracts audio, selecting the highest quality available (most videos are in HLS format—HTTP Live Streaming). The speech-to-text (STT) process serves as a fallback when transcripts don't exist or are of insufficient quality. Importantly, Paramount maintains flexibility in their model choices—they can use Google-managed service models or open-source alternatives. They specifically mentioned running a containerized distilled Whisper v3 as its own service, demonstrating a preference for open-source models where appropriate.

Generation Phase

The generation phase is triggered by events when transcripts land in storage buckets. The generation process stores outputs in Firestore as key-value pairs (with content ID as the key), making them available for updates within the personalization system. These updates may connect to backend processes outside the immediate system, such as content management or media asset management systems.
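The key-value write path described above can be sketched like this. A real deployment would use the google-cloud-firestore client; in this hedged sketch, any mutable mapping stands in for the store, and the merge-on-update behavior mimics a Firestore upsert. Names are illustrative assumptions.

```python
from typing import MutableMapping

def store_generation(db: MutableMapping, content_id: str,
                     metadata: dict) -> None:
    """Upsert generated metadata under the content ID (the document key),
    merging with any existing fields so that downstream personalization
    jobs can pick up incremental updates by key."""
    doc = dict(db.get(content_id, {}))
    doc.update(metadata)  # merge: re-runs only overwrite changed fields
    db[content_id] = doc
```

With the real Firestore client, the equivalent would be a `document(content_id).set(data, merge=True)` call; keying by content ID keeps each title's metadata addressable by the systems downstream.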

During the Jump Start, significant effort was invested in developing system prompts and input prompts, which were later templatized for full control over execution in production. The prompt engineering process involved techniques like few-shot prompting (providing specific examples) and function calling to connect to external solutions like IMDb for structured information retrieval.
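A templatized system prompt with few-shot examples might look like the sketch below. The template text, field names, and the single example are invented for illustration—the actual prompts developed during the Jump Start are not public.

```python
from string import Template

# System prompt template; $max_words is filled in at execution time.
SYSTEM = Template(
    "You summarize streaming titles in under $max_words words. "
    "Return JSON with keys: summary, genres, mood."
)

# Few-shot examples showing the model the expected output format.
FEW_SHOT = [
    {
        "transcript": "A detective hunts a killer through 1970s New York...",
        "output": '{"summary": "...", "genres": ["crime"], "mood": "tense"}',
    },
]

def build_prompt(transcript: str, max_words: int = 60) -> str:
    """Render the full prompt: system instructions, few-shot examples,
    then the new transcript awaiting completion."""
    shots = "\n".join(
        f"Transcript: {ex['transcript']}\nOutput: {ex['output']}"
        for ex in FEW_SHOT
    )
    return (f"{SYSTEM.substitute(max_words=max_words)}\n\n"
            f"{shots}\n\nTranscript: {transcript}\nOutput:")
```

Keeping the template separate from the rendered prompt is what gives the "full control over execution" mentioned above: constraints like the word limit become parameters rather than hard-coded prose.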

Personalization Integration

The personalization component joins generated data back to clickstream data (user interactions on the application), adding enhancements and making them available for future updates. Beyond display use cases, summarization directly informs personalization features—a concise overview of each asset is a critical component of the visual presentation that influences how customers select movies or shows.
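The join of generated metadata back onto clickstream events can be sketched minimally as below. In practice this would run in a batch or streaming pipeline; the event and metadata field names here are assumptions.

```python
def enrich_clickstream(events: list[dict], metadata: dict) -> list[dict]:
    """Attach generated metadata (keyed by content_id) to each interaction
    event; events whose content has no metadata pass through unchanged."""
    return [{**e, **metadata.get(e["content_id"], {})} for e in events]
```

The enriched events then carry both the behavioral signal (what the user did) and the content signal (what the title is about), which is what lets the recommendation layer reason about preferences like "Christmas tales" or "Westerns".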

A particularly interesting aspect is the creation of embeddings from transcripts. They employ LoRA (Low-Rank Adaptation) fine-tuning to shrink the embedding space to their required size. The rationale is creating the most information-dense space possible while enabling smaller models to run on single GPUs. For transcripts specifically, they mentioned potentially using Gemma 2B instruct with a grammar to show the system what a transcript should look like.
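The parameter savings behind the low-rank approach can be shown with a toy factorization. Note that LoRA proper learns a low-rank *update* to frozen weights; the sketch below illustrates only the underlying idea—replacing a full D×d projection with two thin factors A (D×r) and B (r×d), r ≪ min(D, d)—using random placeholder values rather than trained weights.

```python
import random

def matmul(x, W):
    """Multiply vector x (len n) by matrix W (n rows x m cols) -> len m."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*W)]

def low_rank_project(x, A, B):
    """Apply the factored projection: (x @ A) @ B."""
    return matmul(matmul(x, A), B)

D, r, d = 8, 2, 4  # full dim, rank, target embedding dim (toy sizes)
random.seed(0)
A = [[random.gauss(0, 1) for _ in range(r)] for _ in range(D)]
B = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]
x = [1.0] * D
y = low_rank_project(x, A, B)  # compact embedding of length d
```

At realistic sizes the savings are what make single-GPU fine-tuning feasible: with D=4096, d=1024, r=16, the factors hold 4096·16 + 16·1024 ≈ 82K parameters versus ~4.2M for the full matrix.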

LLMOps Best Practices and Lessons Learned

Prompt Engineering Strategies

The team emphasized that LLM response effectiveness depends heavily on well-detailed, clear prompts with appropriate constraints. Key techniques included few-shot prompting with concrete examples of the desired output, templatized system and input prompts for consistent execution in production, and function calling to pull structured information from external sources such as IMDb.
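A function-calling declaration for an external title lookup might look like the following. The schema shape follows the common JSON-schema style used by major LLM APIs; the function name, fields, and the IMDb-style use are illustrative assumptions, not the actual tool definition from this engagement.

```python
# Declaration the LLM can "call" when it needs structured facts about a
# title instead of generating them, reducing hallucinated metadata.
LOOKUP_TITLE = {
    "name": "lookup_title",
    "description": "Fetch structured metadata for a film or series by title.",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string",
                      "description": "Exact title to look up."},
            "year": {"type": "integer",
                     "description": "Release year, if known."},
        },
        "required": ["title"],
    },
}
```

The application inspects the model's requested call, executes the real lookup against the external source, and feeds the structured result back into the conversation—keeping facts like cast or release year grounded rather than generated.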

Handling Production Challenges

Several production-oriented challenges were addressed during development, among them handling very long transcripts, summarizing films with mature themes, and deciding when to rely on function calling for external data rather than generation alone.

Iterative Development and Feedback

The engagement required four different iterations to refine the MVP prompt to meet Paramount’s expectations. Daily communication and in-depth working sessions enabled gathering feedback early and often. Getting access to videos and expected metadata output before the engagement began allowed them to “hit the ground running” with a clear north star for the solution.

Experimenting with videos of different lengths, content types, and time periods helped address difficult scenarios such as long transcripts, films with mature themes, and determining when to integrate function calling for external data sources.

Fine-Tuning and Continuous Improvement

The architecture supports the continuous feedback loops essential for adaptive personalization systems. On Google Cloud, reinforcement learning-based fine-tuning is supported for model families such as T5 and Gemma.

The team noted that fine-tuning typically outperforms pure prompt engineering, making it an important roadmap item despite prompt engineering successes.

Future Directions and Production Roadmap

Several improvements were discussed for the post-Jump Start production implementation.

Balanced Assessment

This case study presents a well-structured approach to implementing LLMs in production for a specific media use case. The collaboration between Google Cloud Consulting and Paramount appears to have followed sound practices: starting with clear business objectives, developing MVPs before full implementation, and establishing feedback loops for continuous improvement.

However, as a presentation at what appears to be a Google Cloud event, some claims should be viewed with appropriate skepticism. The specific quantitative results (cost savings, time savings, engagement improvements) are not provided, making it difficult to assess the actual impact beyond validation of technical hypotheses. The “50,000 videos requiring processing” provides context for the scale of the problem, but outcomes are described in terms of validated approaches rather than measured improvements.

The technical architecture described is sound and reflects production-ready thinking, with appropriate attention to fallback mechanisms (STT when transcripts are unavailable), event-driven processing, and integration with existing systems. The emphasis on maintaining flexibility between managed services and open-source models (Whisper) suggests pragmatic decision-making rather than vendor lock-in.
