Company: AI21
Title: Evolution from Task-Specific Models to Multi-Agent Orchestration Platform
Industry: Tech
Year: 2025

Summary (short)
AI21 Labs evolved their production AI systems from task-specific models (2022-2023) to RAG-as-a-Service, and ultimately to Maestro, a multi-agent orchestration platform. The company identified that while general-purpose LLMs demonstrated impressive capabilities, they weren't optimized for specific business use cases that enterprises actually needed, such as contextual question answering and summarization. AI21 developed smaller language models fine-tuned for specific tasks, wrapped them with pre- and post-processing operations (including hallucination filters), and eventually built a comprehensive RAG system when customers struggled to identify relevant context from large document corpora. The Maestro platform emerged to handle complex multi-hop queries by automatically breaking them into subtasks, parallelizing execution, and orchestrating multiple agents and tools, achieving dramatically improved quality with full traceability for enterprise requirements.
## Overview

This case study documents AI21 Labs' journey in productizing LLM-based systems for enterprise use, as described by Guy Becker, Group Product Manager at AI21 Labs. The narrative spans from late 2022 through 2025 and covers three major product evolutions: task-specific models, RAG-as-a-Service, and the Maestro multi-agent orchestration platform. The discussion provides valuable insights into the practical challenges of operationalizing LLMs, the importance of evaluation frameworks, and the evolution from simple model APIs to complex AI systems designed for production enterprise environments.

## The Task-Specific Models Era (Late 2022 - Early 2023)

AI21's first production approach emerged just as ChatGPT launched. The team observed that while large language models demonstrated impressive capabilities across many tasks, they weren't particularly strong at any single task that businesses actually needed. The disconnect between impressive demos (like generating poems where all words start with 'Q') and real business value became apparent through customer interactions.

The company identified key use cases that enterprises genuinely cared about: contextual answering based on organizational documents, summarization of various sources including documents and transcripts, and other focused business applications. Rather than relying on general-purpose LLMs, AI21 developed smaller language models (SLMs) fine-tuned specifically for these tasks. This approach yielded several operational advantages: higher quality on specific use cases, reduced model footprint, lower latency, and lower inference costs.

Critically, these weren't just fine-tuned models but "mini AI systems" that incorporated pre-processing and post-processing operations. A key example was the addition of hallucination filters that checked during generation whether the model's output was grounded in the input context. This systems-level thinking distinguished the approach from simply exposing model endpoints.

The API design philosophy also differed from general-purpose LLMs. Rather than exposing free-form prompts where developers could write arbitrary instructions, AI21 created simpler, more constrained APIs where prompts were incorporated into the system itself. This required "reverse market education": customers accustomed to ChatGPT-style interaction had to learn that these systems were intentionally simpler but more reliable. The trade-off was accepted because it delivered higher accuracy and reliability for specific enterprise use cases.

## The Transition to RAG-as-a-Service (2023)

The best-selling task-specific model was "contextual answers," which took a question and organizational documents as input and produced responses grounded 100% in those documents to eliminate hallucinations. However, a critical operational challenge emerged: while the model performed excellently in controlled environments with a few documents, businesses struggled to identify the right context from their thousands or millions of documents.

This challenge led AI21 to develop one of the first RAG-as-a-Service offerings, at a time (end of 2022/beginning of 2023) when RAG wasn't yet a widely known pattern. The team notes that example notebooks for vector search and dense embeddings were just beginning to appear, but there were no production services like OpenAI's file search or AWS knowledge bases available yet. Everything had to be built from scratch.
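To make the "mini AI system" idea concrete, the following is a minimal sketch of a contextual-answers-style pipeline with a post-generation grounding check. Everything here is an illustrative assumption rather than AI21's actual implementation: the function names are invented, the toy lexical retrieval stands in for dense and hybrid vector search, and the token-overlap heuristic stands in for a trained hallucination filter.

```python
# Illustrative sketch only (not AI21's implementation): a minimal contextual-answers
# pipeline with a crude post-generation grounding check.
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    text: str


def segment(doc_id: str, text: str, max_words: int = 120) -> list[Chunk]:
    """Naive fixed-length chunking; AI21 instead trained a semantic segmentation model."""
    words = text.split()
    return [Chunk(doc_id, " ".join(words[i:i + max_words]))
            for i in range(0, len(words), max_words)]


def retrieve(question: str, chunks: list[Chunk], top_k: int = 3) -> list[Chunk]:
    """Toy lexical-overlap retrieval standing in for dense/hybrid vector search."""
    q_terms = set(question.lower().split())
    ranked = sorted(chunks,
                    key=lambda c: len(q_terms & set(c.text.lower().split())),
                    reverse=True)
    return ranked[:top_k]


def generate_answer(question: str, context: list[Chunk]) -> str:
    """Placeholder: in production this would call a fine-tuned contextual-answers model."""
    return context[0].text if context else ""


def is_grounded(answer: str, context: list[Chunk]) -> bool:
    """Hypothetical hallucination filter: most answer tokens must appear in the context."""
    ctx_terms = set(" ".join(c.text for c in context).lower().split())
    ans_terms = [t for t in answer.lower().split() if t.isalpha()]
    if not ans_terms:
        return False
    return sum(t in ctx_terms for t in ans_terms) / len(ans_terms) >= 0.8


def contextual_answer(question: str, chunks: list[Chunk]) -> str:
    context = retrieve(question, chunks)
    answer = generate_answer(question, context)
    if answer and is_grounded(answer, context):
        return answer
    return "Answer not found in the provided documents."
```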
### Product Development Approach

The product management approach emphasized starting with design partners to identify specific use cases, defining metrics to evaluate system performance, and building datasets to track improvements over time. The team explicitly rejected relying on "hunches" alone. At this early stage, LLM-as-judge evaluation methods weren't yet established, making rigorous evaluation framework development even more critical.

The MVP strategy focused ruthlessly on a narrow scope: only TXT files were initially supported, despite demand for PDFs, DOCx files, and Excel spreadsheets. File format conversion was deliberately excluded from the initial productization, with solution architects helping customers convert files manually when needed. This allowed the team to focus on their core differentiator: the retrieval and generation logic itself.

The initial system supported approximately 1GB of storage per customer, which for text files translated to hundreds of thousands of documents. The use case was limited to contextual question answering, not summarization or information extraction. Even within question answering, the team started with questions whose answers were localized in documents, then graduated to answers scattered within a single document, and only later added multi-document support. This gradual, feedback-driven approach helped prioritize the next capabilities.

### Technical Architecture Decisions

Several technical decisions illustrate the production challenges of RAG systems in this early period:

**Chunking and Segmentation**: Rather than simple length-based chunking, AI21 trained a semantic segmentation model that understood the semantic meaning of text segments. This improved the embedding and retrieval process by ensuring chunks were semantically coherent.

**Embedding and Retrieval**: The team experimented with multiple embedding models and developed a retrieval pipeline that's now standard but was novel at the time. They had to determine optimal chunk sizes, whether to include neighboring chunks or full documents in retrieved context, and how much context language models could effectively process given the token limitations of 2023-era models.

**Single-Turn vs. Multi-Turn**: Initially, the system only supported single-turn question-answer interactions, not conversational follow-ups. This reflected both the technical capabilities of early 2023 models and a deliberate focus on solving specific problems first.

**Configuration and Customization**: A critical lesson emerged early: one size doesn't fit all for RAG systems. Different domains and use cases required different indexing and retrieval methods. This necessitated exposing configuration parameters that solution architects (and later customers directly) could tune, including:

- Chunking strategies for documents
- PDF parser selection
- Chunk overlap amounts
- Context retrieval scope (chunks only, neighboring chunks, or full documents)
- Hybrid search weighting

The team attempted to automatically optimize these parameters based on customer-provided datasets, which eventually influenced the evolution toward Maestro.

**Metadata-Based Retrieval**: For certain document types, particularly financial documents like SEC filings where semantic similarity wasn't discriminative enough (many documents are structurally very similar), the team implemented metadata extraction and indexing. Metadata like company name and reporting period enabled filtering before semantic search, dramatically improving both quality and latency by reducing the context fed to language models.
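The configuration surface described above might look something like the sketch below. The `RagConfig` type, its field names, and its defaults are hypothetical illustrations of the listed parameters, not AI21's actual API.

```python
# Hypothetical illustration of the tunable retrieval parameters described above;
# names and defaults are invented for this sketch and are not AI21's actual API.
from dataclasses import dataclass, field
from enum import Enum


class RetrievalScope(Enum):
    CHUNKS = "chunks"          # return only the matching chunks
    NEIGHBORS = "neighbors"    # also include neighboring chunks
    DOCUMENT = "document"      # return the full source document


@dataclass
class RagConfig:
    chunking_strategy: str = "semantic"   # e.g. "semantic" vs. "fixed_length"
    pdf_parser: str = "default"           # which PDF-to-text parser to apply
    chunk_overlap_tokens: int = 32        # overlap between adjacent chunks
    retrieval_scope: RetrievalScope = RetrievalScope.NEIGHBORS
    hybrid_lexical_weight: float = 0.3    # lexical vs. dense score mix, 0..1
    top_k: int = 8
    metadata_filters: dict[str, str] = field(default_factory=dict)


# Example: an SEC-filings tenant pre-filters on extracted metadata before semantic
# search, so only the relevant company and reporting period are searched semantically.
sec_config = RagConfig(
    retrieval_scope=RetrievalScope.DOCUMENT,
    hybrid_lexical_weight=0.5,
    metadata_filters={"company": "ExampleCorp", "reporting_period": "FY2023"},
)
```

Exposing a small, typed configuration object along these lines is one way to let solution architects tune per-domain behavior without forking the retrieval pipeline itself.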
### Deployment and Service Models

Initial deployment was SaaS-based, but the team quickly realized that many customers with sensitive documents wouldn't send data over the internet for cloud storage. This drove support for VPC deployments and eventually on-premises installations. This deployment flexibility became critical for reaching larger enterprise customers with more interesting and sensitive use cases.

The system also evolved from an integrated RAG endpoint (retrieval + contextual answers) to modular building blocks. Customers wanted use cases beyond question answering—summarization, document generation, and email composition based on retrieved context. AI21 responded by exposing separate endpoints for semantic search, hybrid search, and even semantic segmentation, allowing customers to compose their own RAG chains using high-quality components.

## The Maestro Multi-Agent Platform (2024-2025)

The limitations of the RAG-as-a-Service approach became apparent with more complex queries. Simple RAG worked well for straightforward questions but struggled with multi-hop questions that required gathering information progressively, using intermediate answers to inform subsequent retrieval steps. AI21 observed this wasn't unique to their system—it was a general challenge across the industry.

Additionally, customers using multiple task-specific models found it difficult to route queries to the appropriate model and combine them to solve complete use cases. This fragmentation motivated the development of AI21 Maestro, positioned as a multi-agent collaboration platform with emphasis on planning and orchestration.

### Architecture and Capabilities

Maestro's core innovation is its planning and orchestration layer. The system takes complex queries or tasks and automatically breaks them into multiple subtasks, understanding what should be executed to fulfill the complete use case. It then orchestrates first-party agents (like instruction-following agents and RAG agents), tools, and even multiple LLMs that customers want to use.

A concrete example illustrates the approach: for a question like "What has been the trend in hardware revenue between fiscal years 2022 and 2024 and the reasons for this trend?", Maestro decomposes this into:

- Separate queries for hardware revenue in each fiscal year (2022, 2023, 2024)
- Retrieval of relevant document sections for each year
- Interim answers for each year
- Separate identification of reasons for the trend
- Final synthesis combining all gathered information

Critically, the system automatically identifies which steps can be parallelized (the per-year queries) and which must be sequential, optimizing for latency while maintaining quality. The team reports that quality improves "dramatically" with this decomposition approach compared to sending the full question to an LLM or even a standard RAG system.
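As a rough illustration of the decomposition idea, the revenue-trend question above can be modeled as a small dependency graph in which steps with satisfied dependencies run concurrently. The `Step` structure and executor below are assumptions made for this sketch and do not reflect Maestro's internal design.

```python
# Rough sketch of plan decomposition and parallel execution; the Step structure and
# executor are hypothetical and do not describe Maestro's internals.
import asyncio
from dataclasses import dataclass, field


@dataclass
class Step:
    id: str
    task: str
    depends_on: list[str] = field(default_factory=list)


plan = [
    Step("rev_2022", "Hardware revenue in fiscal year 2022"),
    Step("rev_2023", "Hardware revenue in fiscal year 2023"),
    Step("rev_2024", "Hardware revenue in fiscal year 2024"),
    Step("reasons", "Reasons for the revenue trend",
         depends_on=["rev_2022", "rev_2023", "rev_2024"]),
    Step("final", "Synthesize the trend and its reasons into one answer",
         depends_on=["rev_2022", "rev_2023", "rev_2024", "reasons"]),
]


async def run_step(step: Step, prior: dict[str, str]) -> str:
    # Placeholder for invoking a RAG agent or tool on this subtask,
    # optionally conditioning on the interim answers gathered so far.
    return f"<answer to: {step.task}>"


async def execute(plan: list[Step]) -> dict[str, str]:
    results: dict[str, str] = {}          # doubles as a step-level trace
    pending = {s.id: s for s in plan}
    while pending:
        # Every step whose dependencies are already answered runs concurrently,
        # e.g. the three per-year revenue queries in the first round.
        ready = [s for s in pending.values()
                 if all(d in results for d in s.depends_on)]
        answers = await asyncio.gather(*(run_step(s, results) for s in ready))
        for step, answer in zip(ready, answers):
            results[step.id] = answer
            del pending[step.id]
    return results


final_answer = asyncio.run(execute(plan))["final"]
```

Keeping every interim answer in `results` also yields the kind of step-by-step trace that the enterprise requirements discussed below call for.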
### Advanced RAG Capabilities

Maestro extends RAG capabilities to handle structured querying of unstructured documents. For questions like "What has been the average revenue for NASDAQ biotech companies in 2024?", the system can:

- Extract and structure metadata from unstructured financial documents during indexing
- Store this information in a relational database
- Generate SQL queries at inference time to answer the question
- Effectively treat documents as if they were already in a structured database

This represents an evolution of the metadata extraction approach used in the earlier RAG system, now systematically applied to enable structured querying across document collections.

### Enterprise Requirements and Explainability

A critical production requirement for Maestro is explainability and traceability. Unlike black-box AI systems, Maestro provides full traces of every decision the system makes. This "clear box" approach addresses enterprise needs for visibility and explainability into AI system behavior, which is increasingly important for compliance, debugging, and building trust.

The system also maintains quality guarantees, though the specific mechanisms aren't detailed in the conversation. The combination of better quality, better explainability, and the ability to handle complex multi-step reasoning positions Maestro as what Becker calls "the future of agents."

## LLMOps Lessons and Best Practices

### Evaluation-Centric Development

When asked about the most important thing for an AI product, Becker immediately responds: "Evaluations. So, benchmarks, metrics, and goals." This evaluation-first mindset permeates AI21's approach. The team establishes metrics before building features, creates datasets to track performance over time, and uses evaluation results to guide prioritization.

Importantly, Becker acknowledges that academic benchmarks don't correlate well with real-world performance. The solution is incorporating real-world data into evaluations, though this requires close collaboration with customers and design partners.

### Product Management in the AI Era

The case study offers insights into product management for AI systems. Becker emphasizes that "PM 101 is the ability to say no"—ruthlessly scoping MVPs to ship quickly, gather feedback, and iterate. This is particularly important in AI, where the temptation exists to build complex systems addressing every possible use case.

However, AI product management differs from traditional software PM. The focus shifts from feature lists to quality metrics and benchmarks. The "contract" between product, engineering, and algorithm teams centers on defining which metrics to optimize rather than which features to build. This quality-centric rather than feature-centric approach reflects the nature of AI systems, where improving performance on specific tasks often matters more than adding new capabilities.

Becker also notes that developers working with LLMs increasingly need PM skills. The ease of building something with modern AI tools (what he calls "vibe coding") makes it tempting to code without clear goals. The discipline of defining use cases, establishing benchmarks, and measuring progress becomes crucial even for individual developers.

### Iterative Learning and Market Education

Several examples illustrate the importance of market learning. The initial assumption that one RAG configuration could serve all customers proved incorrect—different domains and use cases required different parameters. The team responded by exposing configuration options and working closely with solution architects and customers to optimize them.
Market education sometimes meant teaching customers that simpler was better. Users accustomed to ChatGPT's flexibility initially expected to instruct task-specific models with arbitrary prompts, requiring "reverse education" about the value of constrained, reliable interfaces.

The evolution from task-specific models to Maestro also reflects learning that customers don't just need individual capabilities but orchestration across capabilities. The fragmentation of having multiple specialized models created friction that the platform approach addresses.

### Data Quality and Human Oversight

Becker emphasizes that "benchmarks and data sets are curated by humans and humans make mistakes. Definitely a lot of mistakes." This acknowledgment underscores the importance of continuously validating evaluation frameworks and not blindly trusting benchmark performance as a proxy for real-world success.

### Deployment Flexibility as a Requirement

The trajectory from SaaS-only to VPC to on-premises deployment illustrates how deployment flexibility can be a make-or-break requirement for enterprise AI. Sensitive use cases often can't tolerate cloud-based document storage, making alternative deployment models essential for market reach.

## Technical Evolution and Industry Timing

The case study provides a valuable chronological perspective on LLMOps evolution. In late 2022/early 2023:

- RAG patterns were just emerging with scattered examples, not established practices
- Vector databases and embedding services were nascent
- LLM-as-judge evaluation methods were experimental
- No major cloud providers offered managed RAG services
- Conversational/chat models weren't the default paradigm

By 2025, AI21 operates in a world with "a million handbooks" for building AI systems, yet still faces novel challenges in multi-agent orchestration where established patterns don't fully apply. The company's approach of shipping incremental capabilities, learning from real deployments, and maintaining evaluation rigor has allowed them to evolve products through multiple paradigm shifts.

The progression from fine-tuned SLMs to RAG-as-a-Service to multi-agent orchestration also reflects broader industry trends: initial enthusiasm for specialized models, recognition that context and retrieval are critical for grounding, and emerging focus on complex reasoning through decomposition and agent orchestration. AI21's journey provides a concrete example of how a company has navigated these shifts while maintaining production systems serving real enterprise customers.
