MLflow addresses the challenges of moving LLM agents from demo to production by introducing comprehensive tooling for tracing, evaluation, and experiment tracking. The solution includes LLM tracing capabilities to debug black-box agent systems, evaluation tools for retrieval relevance and prompt engineering, and integrations with popular agent frameworks like Autogen and LlamaIndex. This enables organizations to effectively monitor, debug, and improve their LLM-based applications in production environments.
This case study comes from a presentation by Ben, an MLflow maintainer and software engineer at Databricks with over six and a half years of experience at the company. The presentation focuses on the practical challenges of moving from prototype GenAI agents to production-ready systems, and how MLflow has evolved its tooling to address these challenges. The talk was part of a broader discussion on MLflow in the context of generative AI, specifically focusing on the operational aspects of deploying and maintaining LLM-based agents in real-world scenarios.
The presentation begins by contextualizing the challenge: companies are collecting data at an “alarming rate,” but making that data accessible to end users who aren’t specialized in writing code, queries, or building visualizations remains difficult. Traditional analytics approaches restrict data access to technical specialists, while business users want quick, contextually accurate answers based on company data.
Agentic frameworks powered by large language models offer a potential solution by enabling natural language interfaces to complex data systems. However, the speaker identifies a critical gap: while building a demo agent might take an afternoon, moving to a production-grade system that can be deployed for real business use is substantially more complex.
The presentation outlines several specific pain points that the MLflow team identified through their own experience (“dog-fooding”) of building GenAI systems:
Black Box Debugging: When building agents that interface with LLM APIs, the system effectively becomes a black box. You define what the agent should do, but you can’t see what’s happening for individual requests. This makes debugging extremely difficult because you’re dealing with a non-deterministic system where it’s unclear why certain decisions are being made or why responses are generated in particular ways.
Retrieval Relevance Complexity: For RAG (Retrieval-Augmented Generation) systems, there are numerous considerations around document chunking, context size, and retrieval configuration. Chunks that are too large become expensive and slow to process; chunks that are too small lose important context. Determining how many documents to retrieve and what to do with them requires extensive experimentation.
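To make the chunking tradeoff concrete, here is a minimal, purely illustrative sketch (the corpus text, chunk sizes, and overlap value are assumptions, not figures from the talk) showing how chunk size changes how many pieces must be indexed and how much context each piece carries:

```python
# Illustrative sketch: comparing chunk-size tradeoffs for a RAG corpus.
# The corpus text, chunk sizes, and overlap below are assumed values.

def chunk(text: str, chunk_size: int, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with a small overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

corpus = "..." * 10_000  # stand-in for the real document collection

for size in (256, 1024, 4096):
    chunks = chunk(corpus, size)
    # Larger chunks: fewer, costlier retrievals that carry more context.
    # Smaller chunks: cheaper lookups that may lose surrounding context.
    print(f"chunk_size={size}: {len(chunks)} chunks to index")
```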
Prompt Engineering: Optimizing the prompts sent to language models is time-consuming and requires significant iteration. The speaker acknowledges this as a major area of complexity that takes considerable effort to get right.
Fast Iteration Tracking: Unlike traditional ML development, GenAI development moves extremely fast. You can test thousands of variations in a short time, and it’s easy to lose track of what worked best. The speaker describes scenarios where something 40 or 50 iterations ago actually performed better than the current approach, but reconstructing that state becomes nearly impossible without proper tracking.
Framework Selection: The GenAI ecosystem is extremely crowded with providers, tools, and frameworks. The speaker mentions that if they listed all the companies they’ve worked with or tested, it would fill 50 slides. This fragmentation makes it difficult for teams to know where to start or which tools best fit their use cases.
The MLflow team has spent approximately a year focusing on building tooling to support GenAI workflows. Their approach leverages MLflow’s existing strengths in experiment tracking while adding new capabilities specifically designed for the unique challenges of agentic systems.
The centerpiece of their GenAI tooling is MLflow Tracing, which provides visibility into what was previously a black box. Tracing allows users to see exactly what each input is going to each stage of an agent, understand the decision-making process, and analyze execution to make informed modifications.
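As a rough illustration of what instrumenting an agent looks like, the sketch below uses MLflow's fluent tracing APIs (`mlflow.trace` and `mlflow.start_span`); the retriever and LLM calls are hypothetical stand-ins for an agent's real components:

```python
# A minimal sketch of manual tracing with MLflow's fluent tracing APIs.
# lookup_documents and call_llm are hypothetical stubs, not real components.
import mlflow

def lookup_documents(question: str) -> list[str]:
    return ["placeholder document"]          # stand-in retriever

def call_llm(question: str, docs: list[str]) -> str:
    return "placeholder answer"              # stand-in LLM call

@mlflow.trace  # records this call as the root span of a trace
def answer(question: str) -> str:
    with mlflow.start_span(name="retrieval") as span:
        docs = lookup_documents(question)
        span.set_inputs({"question": question})
        span.set_outputs({"num_docs": len(docs)})
    with mlflow.start_span(name="generation") as span:
        response = call_llm(question, docs)
        span.set_outputs({"response": response})
    return response

print(answer("How much data fits on a Blu-ray disc?"))
```

Each span captures its own inputs and outputs, so the previously opaque request path becomes a step-by-step record in the tracing UI.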
The presentation includes demonstrations of the tracing UI, where users can inspect each stage of an agent's execution along with the inputs flowing into it.
A more sophisticated example shows tracing for AutoGen, a multi-turn agentic framework. The trace maps out everything the agent has done, including the initiate chat step, different assistants defined within the system, and individual calls to language models. Importantly, metadata parameters are logged as attributes, so users can see which model was used and what configurations were passed—critical information for iterating and improving responses.
MLflow offers automatic tracing through its AutoLog feature, which eliminates the need for manual instrumentation. Users simply call AutoLog and MLflow automatically applies all necessary instrumentation to wrap calls for supported agentic frameworks, with official support mentioned for frameworks such as AutoGen and LlamaIndex.
These integrations are being added to the top-level namespace of MLflow to simplify usage.
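A minimal sketch of what enabling autologging looks like follows; exact module names and framework coverage depend on the MLflow version installed:

```python
# A minimal sketch of automatic tracing via autologging. Module names and
# framework coverage vary by MLflow version.
import mlflow

mlflow.autogen.autolog()       # trace AutoGen multi-agent conversations
mlflow.llama_index.autolog()   # trace LlamaIndex query/agent pipelines

# From here on, calls made through the instrumented frameworks are captured
# as traces automatically -- no manual span management required.
```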
MLflow Evaluate has been extended to support RAG evaluation, including assessment of retrieval relevance.
The speaker acknowledges the competitive landscape here, mentioning TruLens, Giskard (which is actually a plugin for MLflow), and Ragas. They note that much of the industry is converging on using “LLM as judge” approaches for evaluation.
Interestingly, the team is working on making evaluation prompts callable and more open, specifically so they can integrate with other tools’ evaluation functions. This represents a philosophy of avoiding lock-in and making MLflow more interoperable with the broader ecosystem.
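A hedged sketch of the LLM-as-judge evaluation pattern with `mlflow.evaluate` is shown below; the judge model URI, dataframe contents, and column names are illustrative assumptions rather than values from the talk:

```python
# A hedged sketch of LLM-as-judge RAG evaluation with mlflow.evaluate.
# The judge model URI, data, and column names are assumed for illustration.
import mlflow
import pandas as pd
from mlflow.metrics.genai import faithfulness, relevance

eval_df = pd.DataFrame({
    "inputs": ["What storage capacity does a single-layer Blu-ray disc have?"],
    "context": ["A single-layer Blu-ray disc holds 25 GB of data."],
    "predictions": ["A single-layer disc holds 25 GB."],
    "targets": ["25 GB"],
})

results = mlflow.evaluate(
    data=eval_df,
    predictions="predictions",
    targets="targets",
    model_type="question-answering",
    extra_metrics=[
        faithfulness(model="openai:/gpt-4"),  # is the answer grounded in the context?
        relevance(model="openai:/gpt-4"),     # does the answer address the question?
    ],
    evaluator_config={"col_mapping": {"context": "context"}},
)
print(results.metrics)
```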
MLflow’s existing tracking capabilities extend naturally to GenAI workflows. Every iteration can be snapshotted and logged to the tracking server, capturing the parameters, prompts, and configurations tested along the way.
This addresses the fast iteration problem by maintaining a “source of truth” of what has been tested over time, making it possible to return to previous states or compare performance across experiments.
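A minimal sketch of snapshotting one iteration to the tracking server follows; the run name, parameter values, prompt, and metric are illustrative assumptions:

```python
# A minimal sketch of logging one agent iteration to the MLflow tracking
# server. Parameter values, prompt, and metric are assumed for illustration.
import mlflow

with mlflow.start_run(run_name="agent-iteration-042"):
    mlflow.log_params({
        "llm": "gpt-4",
        "chunk_size": 1024,
        "retrieval_top_k": 5,
    })
    mlflow.log_dict({"system_prompt": "You are a helpful data analyst."},
                    "prompts.json")
    mlflow.log_metric("answer_relevance", 0.87)
```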
The presentation includes a compelling demonstration where an agent answers a complex question about how many Blu-ray discs would be needed to store certain data. The agent uses Wikipedia as its retrieval corpus and performs multiple retrieval, computation, and generation steps to arrive at an answer.
This example illustrates the complexity of real-world agents that need to orchestrate multiple retrieval operations, tool calls, and generation steps to answer abstract questions. The MLflow integration makes it simple to log these complex agents and track their state over time.
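In the spirit of that example, the sketch below shows one way such an agent could be structured so that each step appears as its own span in the trace; the Wikipedia lookup, the 50 TB figure, and the answer generation are hypothetical stubs, not the demo's actual code:

```python
# A hedged sketch of a multi-step agent instrumented with MLflow tracing.
# The retrieval stub, 50 TB figure, and answer text are assumptions.
import mlflow

@mlflow.trace(span_type="RETRIEVER")
def search_wikipedia(query: str) -> str:
    return "A single-layer Blu-ray disc stores 25 GB."   # stub retrieval result

@mlflow.trace(span_type="TOOL")
def discs_needed(total_bytes: int, disc_capacity_bytes: int) -> int:
    return -(-total_bytes // disc_capacity_bytes)         # ceiling division

@mlflow.trace(span_type="AGENT")
def answer(question: str) -> str:
    context = search_wikipedia("Blu-ray disc capacity")
    count = discs_needed(total_bytes=50 * 10**12, disc_capacity_bytes=25 * 10**9)
    return f"Based on '{context}', roughly {count} discs would be needed."

print(answer("How many Blu-ray discs would 50 TB of data require?"))
```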
When asked how MLflow Tracing compares to LangSmith’s traces, the speaker is candid: “They’re fairly similar in functionality and it’s not really a differentiator—we think of it as table stakes.” They expect functionality to converge across platforms as tracing is such an immediate need for production use cases.
What differentiates MLflow, according to the speaker, is the unified ecosystem: tracking, tracing, evaluation, and deployment all in one platform. The goal is a “single unified experience” rather than requiring users to stitch together multiple tools.
The team is actively working on several enhancements to these capabilities.
The speaker mentions a blog post about using AutoGen for automatic prompt generation for DALL-E 3 integration, which shows how to log generated images and prompt-to-image index mappings.
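A hedged sketch of that logging pattern follows; the prompts, the placeholder images, and the artifact paths are illustrative assumptions rather than the blog post's actual code:

```python
# A hedged sketch of logging generated images plus a prompt-to-image index
# mapping as MLflow artifacts. Prompts, images, and paths are assumed.
import mlflow
from PIL import Image

prompts = ["a watercolor skyline at dusk", "a robot reading a newspaper"]

with mlflow.start_run(run_name="dalle3-prompt-experiments"):
    index = {}
    for i, prompt in enumerate(prompts):
        image = Image.new("RGB", (64, 64))      # stand-in for a generated image
        path = f"images/generated_{i}.png"
        mlflow.log_image(image, path)           # store the image as an artifact
        index[path] = prompt
    mlflow.log_dict(index, "prompt_to_image_index.json")
```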
While the presentation effectively highlights real challenges in productionizing GenAI agents, it’s worth noting some limitations:
The talk is primarily focused on demonstrating MLflow’s capabilities rather than providing concrete metrics on how these tools have improved production outcomes. There’s no discussion of specific customer deployments or quantified benefits (e.g., reduction in debugging time, improvement in retrieval quality).
The acknowledgment that tracing functionality is “table stakes” across platforms suggests this isn’t a unique differentiator. The value proposition rests more on the unified experience and existing MLflow adoption rather than revolutionary new capabilities.
Additionally, while the speaker mentions extensive “dog-fooding,” the presentation doesn’t go deep into specific production learnings or war stories that might provide more practical guidance for teams facing these challenges.
That said, the presentation provides valuable insight into the state of LLMOps tooling and the specific challenges that open-source projects like MLflow are trying to address for the growing community of teams building production GenAI systems.