The presentation addresses the critical challenge of debugging and maintaining agentic AI systems in production environments. While many organizations are eager to implement and scale AI agents, they often hit productivity plateaus due to insufficient tooling and observability. The speaker proposes a comprehensive rubric for assessing AI agent systems' operational maturity, emphasizing the need for complete visibility into environment configurations, system logs, model versioning, prompts, RAG implementations, and fine-tuning pipelines across the entire organization.
This case study is derived from a presentation by the CEO and co-founder of Valohai, an MLOps platform company, discussing the challenges of deploying and maintaining agentic AI systems in production. While the speaker represents Valohai, the presentation is more of an industry thought leadership piece rather than a specific customer case study. The content provides valuable insights into the current state of agentic AI adoption and proposes a maturity assessment framework for organizations looking to operationalize LLM-based agents.
It’s worth noting that this is essentially a vendor perspective, and while Valohai offers an MLOps platform, the speaker deliberately avoids direct product pitching and instead focuses on providing a conceptual framework that could be applied regardless of tooling choices. This balanced approach lends credibility to the observations, though the underlying motivation is clearly to position their platform as a potential solution.
The presentation opens with an informal audience survey that reveals telling insights about the industry’s maturity with agentic AI: by the speaker’s assessment, nearly everyone in the room was interested in or preparing for agents, but very few were running them in production.
This gap between aspiration and execution forms the core problem the presentation addresses. The speaker notes that “everybody is talking about it, everybody is kind of preparing, everybody wants to use it but still we’re pretty far from production in most organizations.”
A central thesis of the presentation is that when organizations invest in machine learning and agentic AI, they expect linear or even exponential growth in productivity as they add more people, compute, and investment. However, what actually happens is that productivity plateaus despite increased resource allocation.
The speaker identifies several root causes for this plateau, chief among them the lack of end-to-end visibility into how agentic systems produce their responses.
The speaker estimates that organizations can spend 3-4 weeks just trying to identify the source of an issue in complex agentic AI systems, with emails going back and forth between DevOps, software engineering, data science, and external consultants.
The core contribution of this presentation is a rubric for assessing the maturity of an organization’s agentic AI MLOps practices. The key principle is that for any given response that an agentic AI system produces, any team member (not just DevOps or ML engineers) should be able to immediately access and understand the complete context of how that response was generated.
The rubric covers the following dimensions:
- Environments: Can team members immediately identify all environments involved in generating a response, including each environment's configuration?
- Logs: Are the relevant system logs accessible for any given response?
- Model versions: For any response, can teams identify exactly which model versions were involved?
- Prompts: Can teams access the prompts that were in use when the response was generated?
- Inference code: Is the actual inference code running in production accessible and traceable?
- RAG: For systems using retrieval-augmented generation, can teams access the retrieval code, the indexed data, and the retrieval history behind a given response?
- Fine-tuning: For fine-tuned models in production, can teams trace the training data, configuration, and resulting model artifacts?
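Taken together, the rubric amounts to demanding that every response carry a complete lineage record. As a minimal sketch (the field names and structure are illustrative, not taken from the presentation), such a record might look like:

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ResponseLineage:
    """One record answering the rubric's questions for a single response."""
    response_id: str
    environment: dict                 # e.g. container image, dependency versions
    model: dict                       # e.g. provider, model name, version
    prompt: dict                      # e.g. template id/version, rendered text
    rag: Optional[dict] = None        # e.g. retriever version, doc ids, scores
    fine_tune: Optional[dict] = None  # e.g. base model, adapter checkpoint
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def record_lineage(sink: list, **details) -> ResponseLineage:
    """Attach a lineage record to a response and persist it (here: a list)."""
    lineage = ResponseLineage(response_id=str(uuid.uuid4()), **details)
    sink.append(json.dumps(asdict(lineage)))
    return lineage
```

With records like this keyed by response id in a log store, any team member could pull up the full context behind a reported bad response instead of reconstructing it over weeks of email.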
The speaker draws an important contrast between debugging traditional software and debugging agentic AI systems. In traditional development, fixing a production bug typically involves going to a single machine, checking system logs, finding a traceback, fixing the code, and committing to git. The workflow is well-understood and relatively straightforward.
In contrast, agentic AI debugging requires navigating a complex combination of models, prompts, retrieval systems, data versions, and distributed infrastructure.
This complexity is why the speaker argues that organizations need to build systematic approaches to observability and reproducibility.
The first example describes what the speaker identifies as “probably the most common use case for agentic AI today”: a B2B RPA (Robotic Process Automation) workflow for processing vendor invoices.
When a procurement team reports that invoices from a specific vendor are consistently processed incorrectly, debugging becomes extremely challenging. The error could exist at any point in the pipeline—the initial decision model, the OpenAI processing, the RAG containing bookkeeping procedures, or the SAP integration. Without complete lineage and logging, organizations spend weeks sending emails between DevOps, software engineering, data science, and consultants just to identify where the problem might exist.
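One common mitigation for this kind of “which stage broke?” problem is to run every invoice through the pipeline under a shared trace id, logging each stage’s outcome. A sketch, with hypothetical stage names standing in for the decision model, LLM processing, and ERP integration described above:

```python
import functools
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("invoice-pipeline")

def traced_stage(name):
    """Log entry, success, and failure of one pipeline stage under a trace id."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(trace_id, payload):
            log.info("trace=%s stage=%s status=start", trace_id, name)
            try:
                result = fn(payload)
            except Exception:
                log.exception("trace=%s stage=%s status=error", trace_id, name)
                raise
            log.info("trace=%s stage=%s status=ok", trace_id, name)
            return result
        return wrapper
    return decorator

# Hypothetical stages mirroring the pipeline described above.
@traced_stage("decision_model")
def classify_invoice(invoice):
    return {**invoice, "route": "standard"}

@traced_stage("llm_extraction")
def extract_fields(invoice):
    return {**invoice, "vendor": invoice.get("vendor", "unknown")}

@traced_stage("erp_posting")
def post_to_erp(invoice):
    return {**invoice, "posted": True}

def process(invoice):
    """Run one invoice through all stages under a single trace id."""
    trace_id = str(uuid.uuid4())
    for stage in (classify_invoice, extract_fields, post_to_erp):
        invoice = stage(trace_id, invoice)
    return invoice
```

When a vendor’s invoices start failing, filtering the logs by trace id immediately shows which stage last succeeded, replacing weeks of cross-team email with a single query.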
The second example involves a consumer-facing application built around a RAG pipeline serving distinct customer segments.
When one segment of customers experiences consistently low performance, debugging requires access to RAG history, the ability to reproduce retrieval code, understanding of the prompts in use, and more. Without this information, the system becomes what the speaker describes as “glue code that requires multiple very specific people to figure out,” significantly slowing operations.
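Capturing retrieval history is straightforward if it is designed in from the start. A minimal sketch (the event schema is my own illustration): each retrieval logs the query, the returned document ids and scores, and the retriever and prompt versions, so a low-performing segment’s traffic can be replayed later:

```python
import hashlib
from datetime import datetime, timezone

def log_retrieval(store, query, docs, retriever_version, prompt_version):
    """Append one retrieval event so problem segments can be replayed later.

    `store` is any append-able sink; `docs` is a list of
    {"id": ..., "score": ...} results from the retriever.
    """
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "query_hash": hashlib.sha256(query.encode()).hexdigest()[:12],
        "query": query,
        "doc_ids": [d["id"] for d in docs],
        "scores": [d["score"] for d in docs],
        "retriever_version": retriever_version,
        "prompt_version": prompt_version,
    }
    store.append(event)
    return event
```

With this history, investigating a struggling customer segment starts from their actual queries and retrieved documents rather than from guesswork about what the retriever returned.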
The third example extends the architecture to include LoRA (Low-Rank Adaptation) fine-tuned models instead of just RAG systems. The added complexity of fine-tuning introduces additional debugging challenges—for instance, when the entire API stops responding, the root cause could be GPU memory exhaustion on any of the machines running fine-tuned models.
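For the GPU-memory failure mode mentioned here, even a simple headroom check across the hosts running fine-tuned models narrows the search. A sketch with an injected probe function (in practice the probe might wrap nvidia-smi or torch.cuda.mem_get_info; that choice is an assumption, not something the presentation specifies):

```python
def check_gpu_headroom(probe, threshold=0.9):
    """Return hosts whose GPU memory use is at or above `threshold`.

    `probe` is a callable returning {host: (used_bytes, total_bytes)};
    injecting it keeps the check testable and backend-agnostic.
    """
    return [
        host
        for host, (used, total) in probe().items()
        if used / total >= threshold
    ]
```

Run periodically, a check like this turns “the entire API stopped responding” into “gpu-a is at 95% memory,” pointing directly at the exhausted machine.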
The speaker emphasizes that there is no “one-size-fits-all” solution for agentic AI MLOps. When asked about specific tool recommendations, they deliberately avoid pitching their own platform and instead recommend using the rubric as a diagnostic tool for finding gaps in an organization’s own stack.
The underlying message is that building intuition about what “good” looks like for agentic AI operations is more important than adopting any specific tool. Just as humans have learned over their lifetimes to quickly assess whether a house is in good or bad condition, ML practitioners need to develop similar intuition for assessing the health of their AI operations.
While this presentation offers valuable conceptual frameworks, it’s important to note several limitations, most obviously that it reflects a vendor’s perspective rather than a documented customer deployment.
That said, the challenges described—debugging complexity, cross-team coordination overhead, and the difficulty of maintaining reproducibility in multi-agent systems—align with widely reported industry challenges. The proposed rubric provides a practical starting point for organizations looking to assess their readiness for production agentic AI deployments.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Two organizations operating in highly regulated industries—Sicoob, a Brazilian cooperative financial institution, and Holland Casino, a government-mandated Dutch gaming operator—share their approaches to deploying generative AI workloads while maintaining strict compliance requirements. Sicoob built a scalable infrastructure using Amazon EKS with GPU instances, leveraging open-source tools like Karpenter, KEDA, vLLM, and Open WebUI to run multiple open-source LLMs (Llama, Mistral, DeepSeek, Granite) for code generation, robotic process automation, investment advisory, and document interaction use cases, achieving cost efficiency through spot instances and auto-scaling. Holland Casino took a different path, using Anthropic's Claude models via Amazon Bedrock and developing lightweight AI agents using the Strands framework, later deploying them through Bedrock Agent Core to provide management stakeholders with self-service access to cost, security, and operational insights. Both organizations emphasized the importance of security, governance, compliance frameworks (including ISO 42001 for AI), and responsible AI practices while demonstrating that regulatory requirements need not inhibit AI adoption when proper architectural patterns and AWS services are employed.
Digits, a company providing automated accounting services for startups and small businesses, implemented production-scale LLM agents to handle complex workflows including vendor hydration, client onboarding, and natural language queries about financial books. The company evolved from a simple 200-line agent implementation to a sophisticated production system incorporating LLM proxies, memory services, guardrails, observability tooling (Phoenix from Arize), and API-based tool integration using Kotlin and Golang backends. Their agents achieve a 96% acceptance rate on classification tasks with only 3% requiring human review, handling approximately 90% of requests asynchronously and 10% synchronously through a chat interface.