## Overview
Jockey is an open-source conversational video agent developed by Twelve Labs that demonstrates a sophisticated approach to deploying LLM-powered multi-agent systems for video processing tasks. The project showcases how modern agentic frameworks can be combined with specialized domain APIs to create production-ready intelligent applications. This case study is particularly relevant for LLMOps practitioners interested in multi-agent architectures, workflow orchestration, and the practical challenges of deploying conversational AI systems that interact with external media processing services.
The core value proposition of Jockey is enabling natural language interaction with video content—users can issue conversational commands to search, edit, summarize, and generate text from videos without needing to understand the underlying video processing APIs. This represents a growing pattern in LLMOps where LLMs serve as an intelligent interface layer between users and complex backend services.
## Technical Architecture and Multi-Agent Design
Jockey's architecture represents a significant evolution in how LLM-based agents can be structured for production use. The system migrated from LangChain's legacy AgentExecutor in v1.0 to LangGraph in v1.1, which reflects a broader trend in the LLMOps community toward more controllable and debuggable agent frameworks.
The multi-agent architecture consists of three primary components working in concert:
**The Supervisor** acts as the central coordinator and routing mechanism. It receives user input, determines the appropriate next action, and manages the overall workflow state. The Supervisor handles error recovery and ensures the system follows the current plan or initiates replanning when circumstances change. This centralized coordination pattern is important for production systems because it provides a single point of control for monitoring, debugging, and intervention.
**The Planner** is invoked by the Supervisor for complex requests that require multi-step execution. This component breaks down intricate video processing tasks into manageable steps that can be executed by specialized workers. The separation of planning from execution is a key architectural decision that enables the system to handle complex workflows while maintaining clarity about what steps are being taken and why.
**The Workers** layer contains specialized agents that execute specific tasks. An Instructor component generates precise task instructions for individual workers based on the Planner's strategy. The workers themselves provide specialized capabilities for Video Search, Video Text Generation, and Video Editing, each interfacing with Twelve Labs' APIs to perform its respective function.
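A minimal LangGraph sketch makes this topology concrete. The node names, the `next_worker` routing field, and the hard-coded routing logic below are illustrative assumptions rather than Jockey's actual implementation, which drives these decisions with LLM calls:

```python
from typing import Annotated, TypedDict

from langchain_core.messages import AIMessage, BaseMessage
from langgraph.graph import END, START, StateGraph
from langgraph.graph.message import add_messages


class JockeyState(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]
    next_worker: str  # set by the supervisor to route the next step


def supervisor(state: JockeyState) -> dict:
    # The real system decides the route with an LLM call; here we walk a
    # fixed plan -> search -> end sequence for illustration.
    seen = {getattr(m, "name", None) for m in state["messages"]}
    if "planner" not in seen:
        return {"next_worker": "planner"}
    if "video-search" not in seen:
        return {"next_worker": "video-search"}
    return {"next_worker": "end"}


def planner(state: JockeyState) -> dict:
    return {"messages": [AIMessage("plan: search clips, then edit", name="planner")]}


def video_search(state: JockeyState) -> dict:
    return {"messages": [AIMessage("found 3 matching clips", name="video-search")]}


builder = StateGraph(JockeyState)
builder.add_node("supervisor", supervisor)
builder.add_node("planner", planner)
builder.add_node("video-search", video_search)

builder.add_edge(START, "supervisor")
# The supervisor's routing decision is expressed as a conditional edge,
# which is what makes the control flow explicit and inspectable.
builder.add_conditional_edges(
    "supervisor",
    lambda state: state["next_worker"],
    {"planner": "planner", "video-search": "video-search", "end": END},
)
builder.add_edge("planner", "supervisor")       # workers report back
builder.add_edge("video-search", "supervisor")  # to the supervisor

graph = builder.compile()
graph.invoke({"messages": [("user", "Make a highlight reel of all goals")]})
```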
This hierarchical multi-agent design provides several LLMOps benefits. First, it enables granular control over each step of the workflow, allowing precise management of information flow between nodes. Second, it optimizes token usage by ensuring only relevant context is passed to each component. Third, it facilitates debugging and monitoring since each node has a clear responsibility and can be observed independently.
## LangGraph Framework Advantages
The transition to LangGraph represents a meaningful shift in how the agent's control flow is managed. Unlike more opaque agent frameworks, LangGraph provides a graph-based abstraction where nodes represent processing steps and edges define the flow of information between them. Because the control flow is an explicit graph, it can be visualized and inspected, which is particularly valuable for production systems where understanding agent behavior is critical.
LangGraph's built-in persistence layer enables several important production patterns. Human-in-the-loop approval can be implemented before task execution, which is essential for applications where automated video editing might have significant consequences. The framework also supports "time travel" functionality, allowing operators to edit and resume agent actions from previous states—a powerful debugging and recovery mechanism for production deployments.
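As a rough sketch of how these hooks look in code; the single editing node and the pause-before-editing policy are assumptions for illustration, not Jockey's documented configuration:

```python
from typing import Annotated, TypedDict

from langchain_core.messages import AIMessage, BaseMessage
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import END, START, StateGraph
from langgraph.graph.message import add_messages


class State(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]


def edit_video(state: State) -> dict:
    return {"messages": [AIMessage("applied cut at 00:12")]}


builder = StateGraph(State)
builder.add_node("video-editing", edit_video)
builder.add_edge(START, "video-editing")
builder.add_edge("video-editing", END)

# The checkpointer persists every step; interrupt_before pauses the run
# for human approval before the potentially destructive node executes.
graph = builder.compile(
    checkpointer=MemorySaver(),  # use a database-backed saver in production
    interrupt_before=["video-editing"],
)

config = {"configurable": {"thread_id": "user-42"}}
graph.invoke({"messages": [("user", "Trim the first scene")]}, config)

# The run is now paused. Checkpoint history enables "time travel": inspect
# or rewind past states, then resume by passing None as the input.
print(graph.get_state(config).next)  # ('video-editing',)
graph.invoke(None, config)           # operator approves; editing proceeds
```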
The framework also handles real-world interaction patterns that prototype systems often overlook. Double-texting support governs what happens when a user sends new input while the agent is still processing a previous request. Async background jobs accommodate long-running video processing tasks that may take significant time to complete. These patterns are essential for production-grade conversational systems but are often absent from simpler agent implementations.
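A sketch of both patterns using the `langgraph_sdk` client against a running LangGraph server; the URL and the `"jockey"` assistant name are placeholders:

```python
import asyncio

from langgraph_sdk import get_client


async def main() -> None:
    client = get_client(url="http://localhost:8123")  # placeholder server URL
    thread = await client.threads.create()

    # Launch a long-running video task as a background run.
    run = await client.runs.create(
        thread["thread_id"],
        "jockey",  # assumed assistant/graph name
        input={"messages": [{"role": "user", "content": "Summarize this video"}]},
        # If the user double-texts mid-run, interrupt the old run and start
        # over; other strategies include "enqueue", "rollback", and "reject".
        multitask_strategy="interrupt",
    )

    # Do other work, then block until the background job finishes.
    await client.runs.join(thread["thread_id"], run["run_id"])


asyncio.run(main())
```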
## State Management and Workflow Control
One of the emphasized advantages of the LangGraph-based architecture is the granular control over state management. In production LLM applications, managing what context is available to each component is crucial for both correctness and cost optimization. Jockey's architecture allows precise specification of which information is passed between nodes and how node responses contribute to the overall state.
The data-flow architecture processes queries through a decision-making pipeline that first analyzes query complexity, then routes to either a simple text response path or a more complex chain of video processing steps. This intelligent routing prevents unnecessary API calls and processing for simple queries while still enabling sophisticated multi-step workflows when needed.
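A self-contained sketch of that routing decision as a LangGraph conditional edge; the keyword heuristic is a stand-in for the LLM-based classification the real system would perform:

```python
from typing import Annotated, TypedDict

from langchain_core.messages import AIMessage, BaseMessage
from langgraph.graph import END, START, StateGraph
from langgraph.graph.message import add_messages


class State(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]


def simple_response(state: State) -> dict:
    return {"messages": [AIMessage("Here's a quick answer.")]}


def plan_video_work(state: State) -> dict:
    return {"messages": [AIMessage("plan: search clips, then cut a reel")]}


def route_query(state: State) -> str:
    # Stand-in for an LLM classification: simple questions get a direct
    # reply; video work goes to the multi-step planning pipeline.
    text = state["messages"][-1].content.lower()
    video_words = ("clip", "edit", "search", "highlight", "reel")
    return "plan" if any(w in text for w in video_words) else "respond"


builder = StateGraph(State)
builder.add_node("respond", simple_response)
builder.add_node("plan", plan_video_work)
builder.add_conditional_edges(START, route_query, {"respond": "respond", "plan": "plan"})
builder.add_edge("respond", END)
builder.add_edge("plan", END)
graph = builder.compile()

graph.invoke({"messages": [("user", "Cut a highlight reel of all goals")]})
```

Routing at the entry point like this is what keeps simple queries from paying the token and latency cost of the full planning pipeline.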
State extension capabilities allow developers to add new fields to the state object or modify how existing state information is processed between components. This extensibility is important for integrating Jockey with other systems or handling specialized video metadata that might be specific to particular deployment contexts.
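In LangGraph terms, extending state is a schema change. The `index_id` and `clip_metadata` fields below are hypothetical examples of deployment-specific additions:

```python
from typing import Annotated, TypedDict

from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages


class ExtendedJockeyState(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]
    next_worker: str
    index_id: str                  # which video index to operate on
    clip_metadata: dict[str, str]  # e.g. codec or licensing info for editing


# Any node can now read or update the new fields by returning them:
def tag_clip(state: ExtendedJockeyState) -> dict:
    return {"clip_metadata": {"source": "upload", "codec": "h264"}}
```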
## Integration with Specialized Video APIs
The integration with Twelve Labs' video understanding APIs demonstrates a common LLMOps pattern where LLMs orchestrate calls to specialized domain services. Twelve Labs provides video search, classification, summarization, question answering, and other capabilities powered by video foundation models (VFMs) that work with video natively rather than relying on pre-generated captions or transcripts.
This architecture pattern—LLM as orchestrator of specialized services—is increasingly common in production systems. The LLM provides the natural language understanding and planning capabilities, while specialized models and APIs handle domain-specific tasks that require different modalities or expertise. This separation of concerns allows each component to be optimized for its specific role.
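A sketch of this pattern as a LangChain tool that a worker could expose to the planner. The `search_index` helper is a stand-in for the real Twelve Labs SDK call, whose exact signature the case study does not show:

```python
from langchain_core.tools import tool


def search_index(index_id: str, query_text: str) -> list[dict]:
    # Stand-in for the Twelve Labs search endpoint, which matches clips
    # using native video foundation models rather than captions or
    # transcripts; replace with the real SDK call.
    return [{"video_id": "vid-001", "start": 12.0, "end": 18.5, "score": 0.92}]


@tool
def video_search(query: str, index_id: str) -> list[dict]:
    """Find clips in a video index matching a natural-language query."""
    return search_index(index_id=index_id, query_text=query)


# A planner-produced step like "find goal celebrations in index abc123"
# becomes a single grounded tool call executed by the worker:
print(video_search.invoke({"query": "goal celebrations", "index_id": "abc123"}))
```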
The combination enables use cases like content discovery, video editing automation, interactive video FAQs, and AI-generated highlight reels. These applications require both the conversational understanding that LLMs provide and the video-specific intelligence that specialized video models offer.
## Deployment and Scalability Considerations
The case study discusses deployment options at both development and production scales. For development and testing, Jockey can be deployed locally, enabling rapid iteration and debugging. For production deployments, LangGraph Cloud provides scalable infrastructure purpose-built for LangGraph agents.
LangGraph Cloud handles horizontally scaling servers and task queues to efficiently manage concurrent users and large states. This is particularly important for video processing applications where individual requests may involve substantial data and processing time. The infrastructure manages the complexity of maintaining conversation state across potentially long-running video processing operations.
Integration with LangGraph Studio provides visualization and debugging capabilities for agent trajectories, enabling developers to understand how agents are behaving in production and iterate quickly on improvements. This observability is essential for production LLMOps—without clear visibility into agent behavior, diagnosing and fixing issues becomes extremely difficult.
## Customization and Extensibility
The modular design of Jockey facilitates several types of customization that are relevant for production deployments:
**Prompt-as-a-Feature** leverages the language model's capabilities to introduce new functionality without modifying code. For example, a prompt can instruct the system to identify and extract specific scene types from videos, with no changes to the core system. This approach enables rapid experimentation and feature development.
**Prompt Modification** allows fine-tuning of the decision-making process and output generation by editing prompts used by the Supervisor, Planner, or Workers. This is a common LLMOps pattern where behavior changes can be deployed as configuration rather than code changes.
**Worker Addition and Modification** enables creation of new specialized Workers for tasks like advanced video effects or video generation, modification of existing Workers to enhance capabilities or integrate with new APIs, and implementation of custom logic for handling new types of tasks.
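As a sketch, adding a worker amounts to registering one more node against the `builder` and `JockeyState` from the architecture sketch above; the `video-effects` node and its slow-motion logic are hypothetical:

```python
from langchain_core.messages import AIMessage


def video_effects(state: JockeyState) -> dict:
    # Real logic would call an external rendering or effects service.
    return {"messages": [AIMessage("rendered slow-motion for 00:10-00:14",
                                   name="video-effects")]}


builder.add_node("video-effects", video_effects)  # new specialized worker
builder.add_edge("video-effects", "supervisor")   # workers report back
# The supervisor's conditional-edge routing map must also list the new
# node name so plans can target it; then recompile to pick it up.
graph = builder.compile()
```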
These extensibility mechanisms are important for adapting the system to specific deployment contexts and use cases while maintaining the core architectural benefits of the multi-agent design.
## Critical Assessment
While the case study presents compelling architectural patterns, several considerations are worth noting for practitioners evaluating similar approaches. The system's effectiveness depends heavily on both the LLM's planning capabilities and the quality of the underlying Twelve Labs APIs—failures or limitations in either component will impact the overall user experience.
The multi-agent architecture introduces complexity that may not be necessary for simpler use cases. The overhead of Supervisor coordination, planning, and instruction generation adds latency and token costs that may not be justified for straightforward video queries. Production deployments would need to carefully tune the routing logic that decides when to use simple responses versus complex multi-step workflows.
The reliance on LangGraph Cloud for production scalability means deployments are tied to that specific infrastructure, which may or may not align with organizational requirements for vendor independence or specific deployment environments. However, the open-source nature of Jockey itself provides flexibility for organizations willing to manage their own infrastructure.
Overall, Jockey demonstrates mature patterns for building production-grade conversational agents that orchestrate complex workflows and integrate with specialized domain services—patterns that are broadly applicable beyond video processing use cases.