The presentation addresses the critical challenge of debugging and maintaining agentic AI systems in production environments. While many organizations are eager to implement and scale AI agents, they often hit productivity plateaus due to insufficient tooling and observability. The speaker proposes a comprehensive rubric for assessing AI agent systems' operational maturity, emphasizing the need for complete visibility into environment configurations, system logs, model versioning, prompts, RAG implementations, and fine-tuning pipelines across the entire organization.
This case study is derived from a presentation by the CEO and co-founder of Valohai, an MLOps platform company, discussing the challenges of deploying and maintaining agentic AI systems in production. While the speaker represents Valohai, the presentation is more of an industry thought leadership piece rather than a specific customer case study. The content provides valuable insights into the current state of agentic AI adoption and proposes a maturity assessment framework for organizations looking to operationalize LLM-based agents.
It’s worth noting that this is essentially a vendor perspective, and while Valohai offers an MLOps platform, the speaker deliberately avoids direct product pitching and instead focuses on providing a conceptual framework that could be applied regardless of tooling choices. This balanced approach lends credibility to the observations, though the underlying motivation is clearly to position their platform as a potential solution.
The presentation opens with an informal audience survey that reveals telling insights about the industry’s maturity with agentic AI: by the speaker’s assessment, nearly everyone in the room was interested in or preparing for agents, but very few were running them in production.
This gap between aspiration and execution forms the core problem the presentation addresses. The speaker notes that “everybody is talking about it, everybody is kind of preparing, everybody wants to use it but still we’re pretty far from production in most organizations.”
A central thesis of the presentation is that when organizations invest in machine learning and agentic AI, they expect linear or even exponential growth in productivity as they add more people, compute, and investment. However, what actually happens is that productivity plateaus despite increased resource allocation.
The speaker identifies several root causes for this plateau, chief among them the lack of end-to-end visibility into how agentic systems produce their responses.
The speaker estimates that organizations can spend 3-4 weeks just trying to identify the source of an issue in complex agentic AI systems, with emails going back and forth between DevOps, software engineering, data science, and external consultants.
The core contribution of this presentation is a rubric for assessing the maturity of an organization’s agentic AI MLOps practices. The key principle is that for any given response that an agentic AI system produces, any team member (not just DevOps or ML engineers) should be able to immediately access and understand the complete context of how that response was generated.
The rubric covers the following dimensions:
- Environments: Can team members immediately identify all environments involved in generating a response, including each environment's configuration?
- Logs: Are the relevant system logs accessible for any given response?
- Model versions: For any response, can teams identify exactly which model versions were involved?
- Prompts: Can teams access the prompts that were in use when the response was generated?
- Inference code: Is the actual inference code running in production accessible and traceable?
- RAG: For systems using retrieval-augmented generation, can teams access the retrieval code, the indexed data, and the retrieval history behind a given response?
- Fine-tuning: For fine-tuned models in production, can teams trace the training data, configuration, and resulting model artifacts?
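Taken together, the rubric amounts to demanding that every response carry a complete lineage record. As a minimal sketch (the field names and structure are illustrative, not taken from the presentation), such a record might look like:

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ResponseLineage:
    """One record answering the rubric's questions for a single response."""
    response_id: str
    environment: dict                 # e.g. container image, dependency versions
    model: dict                       # e.g. provider, model name, version
    prompt: dict                      # e.g. template id/version, rendered text
    rag: Optional[dict] = None        # e.g. retriever version, doc ids, scores
    fine_tune: Optional[dict] = None  # e.g. base model, adapter checkpoint
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def record_lineage(sink: list, **details) -> ResponseLineage:
    """Attach a lineage record to a response and persist it (here: a list)."""
    lineage = ResponseLineage(response_id=str(uuid.uuid4()), **details)
    sink.append(json.dumps(asdict(lineage)))
    return lineage
```

With records like this keyed by response id in a log store, any team member could pull up the full context behind a reported bad response instead of reconstructing it over weeks of email.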
The speaker draws an important contrast between debugging traditional software and debugging agentic AI systems. In traditional development, fixing a production bug typically involves going to a single machine, checking system logs, finding a traceback, fixing the code, and committing to git. The workflow is well-understood and relatively straightforward.
In contrast, agentic AI debugging requires navigating a complex combination of models, prompts, retrieval systems, data versions, and distributed infrastructure.
This complexity is why the speaker argues that organizations need to build systematic approaches to observability and reproducibility.
The first example describes what the speaker identifies as “probably the most common use case for agentic AI today”: a B2B RPA (Robotic Process Automation) workflow for processing vendor invoices.
When a procurement team reports that invoices from a specific vendor are consistently processed incorrectly, debugging becomes extremely challenging. The error could exist at any point in the pipeline—the initial decision model, the OpenAI processing, the RAG containing bookkeeping procedures, or the SAP integration. Without complete lineage and logging, organizations spend weeks sending emails between DevOps, software engineering, data science, and consultants just to identify where the problem might exist.
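One common mitigation for this kind of “which stage broke?” problem is to run every invoice through the pipeline under a shared trace id, logging each stage’s outcome. A sketch, with hypothetical stage names standing in for the decision model, LLM processing, and ERP integration described above:

```python
import functools
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("invoice-pipeline")

def traced_stage(name):
    """Log entry, success, and failure of one pipeline stage under a trace id."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(trace_id, payload):
            log.info("trace=%s stage=%s status=start", trace_id, name)
            try:
                result = fn(payload)
            except Exception:
                log.exception("trace=%s stage=%s status=error", trace_id, name)
                raise
            log.info("trace=%s stage=%s status=ok", trace_id, name)
            return result
        return wrapper
    return decorator

# Hypothetical stages mirroring the pipeline described above.
@traced_stage("decision_model")
def classify_invoice(invoice):
    return {**invoice, "route": "standard"}

@traced_stage("llm_extraction")
def extract_fields(invoice):
    return {**invoice, "vendor": invoice.get("vendor", "unknown")}

@traced_stage("erp_posting")
def post_to_erp(invoice):
    return {**invoice, "posted": True}

def process(invoice):
    """Run one invoice through all stages under a single trace id."""
    trace_id = str(uuid.uuid4())
    for stage in (classify_invoice, extract_fields, post_to_erp):
        invoice = stage(trace_id, invoice)
    return invoice
```

When a vendor’s invoices start failing, filtering the logs by trace id immediately shows which stage last succeeded, replacing weeks of cross-team email with a single query.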
The second example involves a consumer-facing application built around a RAG pipeline serving distinct customer segments.
When one segment of customers experiences consistently low performance, debugging requires access to RAG history, the ability to reproduce retrieval code, understanding of the prompts in use, and more. Without this information, the system becomes what the speaker describes as “glue code that requires multiple very specific people to figure out,” significantly slowing operations.
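Capturing retrieval history is straightforward if it is designed in from the start. A minimal sketch (the event schema is my own illustration): each retrieval logs the query, the returned document ids and scores, and the retriever and prompt versions, so a low-performing segment’s traffic can be replayed later:

```python
import hashlib
from datetime import datetime, timezone

def log_retrieval(store, query, docs, retriever_version, prompt_version):
    """Append one retrieval event so problem segments can be replayed later.

    `store` is any append-able sink; `docs` is a list of
    {"id": ..., "score": ...} results from the retriever.
    """
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "query_hash": hashlib.sha256(query.encode()).hexdigest()[:12],
        "query": query,
        "doc_ids": [d["id"] for d in docs],
        "scores": [d["score"] for d in docs],
        "retriever_version": retriever_version,
        "prompt_version": prompt_version,
    }
    store.append(event)
    return event
```

With this history, investigating a struggling customer segment starts from their actual queries and retrieved documents rather than from guesswork about what the retriever returned.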
The third example extends the architecture to include LoRA (Low-Rank Adaptation) fine-tuned models instead of just RAG systems. The added complexity of fine-tuning introduces additional debugging challenges—for instance, when the entire API stops responding, the root cause could be GPU memory exhaustion on any of the machines running fine-tuned models.
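For the GPU-memory failure mode mentioned here, even a simple headroom check across the hosts running fine-tuned models narrows the search. A sketch with an injected probe function (in practice the probe might wrap nvidia-smi or torch.cuda.mem_get_info; that choice is an assumption, not something the presentation specifies):

```python
def check_gpu_headroom(probe, threshold=0.9):
    """Return hosts whose GPU memory use is at or above `threshold`.

    `probe` is a callable returning {host: (used_bytes, total_bytes)};
    injecting it keeps the check testable and backend-agnostic.
    """
    return [
        host
        for host, (used, total) in probe().items()
        if used / total >= threshold
    ]
```

Run periodically, a check like this turns “the entire API stopped responding” into “gpu-a is at 95% memory,” pointing directly at the exhausted machine.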
The speaker emphasizes that there is no “one-size-fits-all” solution for agentic AI MLOps. When asked about specific tool recommendations, they deliberately avoid pitching their own platform and instead recommend using the rubric as a diagnostic tool for finding gaps in an organization’s own stack.
The underlying message is that building intuition about what “good” looks like for agentic AI operations is more important than adopting any specific tool. Just as humans have learned over their lifetimes to quickly assess whether a house is in good or bad condition, ML practitioners need to develop similar intuition for assessing the health of their AI operations.
While this presentation offers valuable conceptual frameworks, it’s important to note several limitations, most obviously that it reflects a vendor’s perspective rather than a documented customer deployment.
That said, the challenges described—debugging complexity, cross-team coordination overhead, and the difficulty of maintaining reproducibility in multi-agent systems—align with widely reported industry challenges. The proposed rubric provides a practical starting point for organizations looking to assess their readiness for production agentic AI deployments.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Two organizations operating in highly regulated industries—Sicoob, a Brazilian cooperative financial institution, and Holland Casino, a government-mandated Dutch gaming operator—share their approaches to deploying generative AI workloads while maintaining strict compliance requirements. Sicoob built a scalable infrastructure using Amazon EKS with GPU instances, leveraging open-source tools like Karpenter, KEDA, vLLM, and Open WebUI to run multiple open-source LLMs (Llama, Mistral, DeepSeek, Granite) for code generation, robotic process automation, investment advisory, and document interaction use cases, achieving cost efficiency through spot instances and auto-scaling. Holland Casino took a different path, using Anthropic's Claude models via Amazon Bedrock and developing lightweight AI agents using the Strands framework, later deploying them through Bedrock Agent Core to provide management stakeholders with self-service access to cost, security, and operational insights. Both organizations emphasized the importance of security, governance, compliance frameworks (including ISO 42001 for AI), and responsible AI practices while demonstrating that regulatory requirements need not inhibit AI adoption when proper architectural patterns and AWS services are employed.
Digits, a company providing automated accounting services for startups and small businesses, implemented production-scale LLM agents to handle complex workflows including vendor hydration, client onboarding, and natural language queries about financial books. The company evolved from a simple 200-line agent implementation to a sophisticated production system incorporating LLM proxies, memory services, guardrails, observability tooling (Phoenix from Arize), and API-based tool integration using Kotlin and Golang backends. Their agents achieve a 96% acceptance rate on classification tasks with only 3% requiring human review, handling approximately 90% of requests asynchronously and 10% synchronously through a chat interface.