Company
Various (Meta / Google / Monte Carlo / Azure)
Title
Enterprise Infrastructure Challenges for Agentic AI Systems in Production
Industry
Tech
Year
2025
Summary (short)
A panel discussion featuring engineers from Meta, Google, Monte Carlo, and Microsoft Azure explores the fundamental infrastructure challenges that arise when deploying autonomous AI agents in production environments. The discussion reveals that agentic workloads differ dramatically from traditional software systems, requiring a complete reimagining of reliability, security, networking, and observability approaches. Key challenges include non-deterministic behavior that leads to incidents such as a chatbot agreeing to sell a car for $1, massive scaling requirements as agents work continuously, and the need for new health-checking mechanisms, semantic caching, and comprehensive evaluation frameworks to manage systems where 95% of outcomes are unknown unknowns.
This panel discussion brings together senior infrastructure engineers from major technology companies to discuss the operational challenges of deploying agentic AI systems at scale. The participants include Adelia (Dia), an engineer at Meta working on data infrastructure for recommendation systems and generative AI; Anna Berenberg, an engineering fellow at Google Cloud responsible for platform solutions; Barb Moses, CEO of Monte Carlo, which focuses on data and AI observability; and Chi, corporate vice president at Microsoft Azure, responsible for Kubernetes and cloud-native services. The conversation centers on a fundamental premise: agentic AI workloads represent a paradigm shift that requires rebuilding infrastructure from the ground up. Unlike traditional software systems that execute predetermined workflows, AI agents explore massive search spaces autonomously, taking hundreds or thousands of steps to complete a task. This creates unprecedented challenges for production systems that were designed for deterministic, predictable workloads.

**Core Infrastructure Challenges**

The panelists identify several critical areas where traditional infrastructure approaches fail when applied to agentic systems. Meta's Dia explains that agents working on tasks like web navigation for shopping can require complex multi-step interactions across multiple websites, with each step potentially failing or producing unexpected results. The scale of these operations (imagine an agent organizing a summer camp and needing to purchase t-shirts in various sizes and colors with overnight delivery) demonstrates how agents can generate massive load on systems that were never designed for such intensive automated interaction.

Google's Anna highlights fundamental networking challenges that emerge with agentic systems. Traditional HTTP patterns that expect millisecond request-response cycles break down when a single session lasts seconds or minutes. Load-balancing mechanisms that distribute requests via weighted round-robin become ineffective when a single request can consume 100% of a backend's CPU, so the networking stack must now understand token counts and queue sizes rather than simple request counts. Even basic concepts like caching stop working, since agents typically generate a different response each time; this motivates semantic caching systems that match requests on meaning rather than on exact bytes.
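As a rough illustration of the semantic-caching idea Anna raises, the sketch below (not from the panel; the embedding function, similarity threshold, and in-memory store are assumptions) caches responses keyed by prompt embeddings and serves a stored answer when a new prompt is semantically close enough to one seen before.

```python
# Minimal semantic-cache sketch: serve a cached response when a new prompt's
# embedding is close enough to a previously seen prompt. The embedding function,
# threshold, and in-memory store are illustrative assumptions, not the panel's design.
from typing import Callable, List, Optional
import numpy as np


class SemanticCache:
    def __init__(self, embed_fn: Callable[[str], np.ndarray], threshold: float = 0.92):
        self.embed_fn = embed_fn           # any sentence-embedding model works here
        self.threshold = threshold         # cosine-similarity cutoff for a cache hit
        self.keys: List[np.ndarray] = []   # prompt embeddings
        self.values: List[str] = []        # cached responses

    def get(self, prompt: str) -> Optional[str]:
        if not self.keys:
            return None
        q = self.embed_fn(prompt)
        sims = [float(np.dot(q, k) / (np.linalg.norm(q) * np.linalg.norm(k)))
                for k in self.keys]
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, prompt: str, response: str) -> None:
        self.keys.append(self.embed_fn(prompt))
        self.values.append(response)


# Hypothetical usage: check the cache before calling the model, store the answer after.
# cache = SemanticCache(embed_fn=my_embedder)        # my_embedder is a placeholder
# answer = cache.get(user_prompt)
# if answer is None:
#     answer = call_llm(user_prompt)                 # call_llm is a placeholder
#     cache.put(user_prompt, answer)
```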
**Security and Reliability Concerns**

The discussion turns to alarming real-world examples of agent failures that highlight the importance of robust LLMOps practices. Barb Moses from Monte Carlo shares incidents including a Chevy dealership chatbot that was talked into selling a car for $1 and a customer-support chatbot that hallucinated non-existent policies. These examples underscore how unreliable AI systems can have immediate revenue and brand implications.

Azure's Chi discusses sophisticated attacks that exploit the agent ecosystem, including a recent GitHub incident in which attackers embedded malicious prompts in GitHub issues, compromising an MCP server integration and exposing private repositories. The growing capability of LLMs benefits both legitimate developers and malicious actors, creating new attack vectors that traditional security measures are not equipped to handle.

**Observability and Monitoring**

Monte Carlo's approach to AI system reliability centers on holistic thinking about the data and AI estate. Rather than treating AI systems as separate from data systems, they advocate integrated observability across the entire pipeline and identify four core failure modes for AI systems: incorrect data ingestion, code problems with downstream implications, system failures in orchestration, and model outputs that are inaccurate or unfit for purpose even when every other component executes correctly.

The panel emphasizes the need for comprehensive instrumentation of agent activity. Every call, prompt, and decision step must be tracked, not just for debugging but also to create training data for future agent iterations (a minimal tracing sketch appears below). This level of instrumentation goes far beyond traditional application monitoring, requiring a deep understanding of agent trajectories, prompt engineering, and model decision-making.

**Health Checking and System Reliability**

A fascinating part of the discussion centers on redefining basic infrastructure concepts like health checking for agentic systems. Traditional health checks assume deterministic behavior: a service is either healthy or unhealthy based on predictable metrics. But how do you determine whether an agent is healthy when the underlying LLM might hallucinate, or when a prompt-injection attack might hijack the agent's behavior? The panelists suggest that entirely new notions of system health must be developed, potentially including semantic understanding of agent outputs and behavior analysis over time (one hedged possibility is sketched below).

**Evaluation and Testing Frameworks**

Meta's experience with coding and web-navigation agents reveals that traditional testing approaches are inadequate for agentic systems. Dia explains that while typical projects might carry 5-10% unknown unknowns, agentic systems flip this ratio to roughly 95% unknowns. This necessitates sophisticated evaluation frameworks that test across multiple difficulty levels and system interactions, with evaluation run in secure sandboxed environments so that agents cannot corrupt production systems during testing.

The challenge extends beyond individual agent testing to workflow evaluation across multiple connected systems. When many agents work autonomously within an organization, potentially 10 agents per employee working 24/7, the combinatorial complexity of possible interactions becomes staggering. Each agent might be orders of magnitude more productive than a human worker, creating scenarios where traditional testing simply cannot cover the space of possible outcomes.
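The multi-level, sandboxed evaluation the panel describes can be organized as a small harness that runs the agent against task suites grouped by difficulty and reports a pass rate per level. The sketch below is a hedged outline only: the task format, the grading predicates, and the `run_agent_sandboxed` hook are assumptions, not Meta's actual framework.

```python
# Evaluation-harness sketch: run an agent over tasks grouped by difficulty and report
# a pass rate per level. Task format, grading, and the sandbox hook are illustrative
# assumptions; a real harness would isolate side effects far more strictly.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalTask:
    prompt: str
    check: Callable[[str], bool]   # grader for the agent's final answer
    difficulty: str                # e.g. "easy", "medium", "hard"


def run_eval(run_agent_sandboxed: Callable[[str], str],
             tasks: List[EvalTask]) -> Dict[str, float]:
    """Return the pass rate per difficulty level. `run_agent_sandboxed` is assumed
    to execute the agent in an isolated environment (container, scratch accounts)."""
    passed: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for task in tasks:
        total[task.difficulty] = total.get(task.difficulty, 0) + 1
        try:
            ok = task.check(run_agent_sandboxed(task.prompt))
        except Exception:          # crashes and timeouts count as failures
            ok = False
        passed[task.difficulty] = passed.get(task.difficulty, 0) + int(ok)
    return {level: passed.get(level, 0) / total[level] for level in total}
```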
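For the health-checking question, one possible direction (again a sketch under assumptions, not something the panel prescribes) is to probe the agent periodically with canary prompts whose outputs have checkable properties and mark it degraded when too many probes fail; the specific probes, predicates, and threshold below are illustrative only.

```python
# Canary-style agent health-check sketch: probe the agent with prompts whose outputs
# have checkable properties and report unhealthy if too many probes fail.
# The probes, predicates, and failure threshold are illustrative assumptions.
from typing import Callable, List, Tuple

Probe = Tuple[str, Callable[[str], bool]]   # (canary prompt, predicate over response)

PROBES: List[Probe] = [
    ("What is 2 + 2? Answer with a number only.", lambda r: "4" in r),
    ("Reply with the single word OK.", lambda r: r.strip().upper() == "OK"),
    ("What is the refund policy URL?", lambda r: "http" in r or "don't know" in r.lower()),
]


def agent_healthy(run_agent: Callable[[str], str], max_failures: int = 1) -> bool:
    """Return False when more than `max_failures` canary probes fail."""
    failures = 0
    for prompt, check in PROBES:
        try:
            if not check(run_agent(prompt)):
                failures += 1
        except Exception:                    # timeouts and tool errors count as failures
            failures += 1
    return failures <= max_failures
```

Such a probe cannot rule out hallucination or prompt injection in live traffic; at best it complements, rather than replaces, the semantic output monitoring the panel calls for.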
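The step-level instrumentation called for in the observability discussion can be approximated with structured, append-only trace records, one per call, prompt, or decision. The field names and JSON-lines sink below are assumptions for illustration, not the panelists' schema.

```python
# Agent-step tracing sketch: write every tool call, prompt, and model decision as one
# JSON line so trajectories can be replayed for debugging or mined as training data.
# Field names and the file sink are illustrative assumptions.
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class AgentStep:
    trace_id: str                  # one id per end-to-end agent task
    step: int                      # position within the trajectory
    kind: str                      # "llm_call", "tool_call", "decision", ...
    prompt: str = ""
    output: str = ""
    latency_ms: float = 0.0
    tokens_in: int = 0
    tokens_out: int = 0
    ts: float = field(default_factory=time.time)


class TraceLogger:
    def __init__(self, path: str = "agent_traces.jsonl"):
        self.path = path

    def log(self, step: AgentStep) -> None:
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(asdict(step)) + "\n")


# Hypothetical usage inside an agent loop (call_llm is a placeholder):
# logger = TraceLogger()
# logger.log(AgentStep("trace-001", step=0, kind="llm_call", prompt=p, output=call_llm(p)))
```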
**Data Infrastructure and Scaling**

The infrastructure implications extend to fundamental data-management challenges. Agents generate massive amounts of interaction data that must be stored, processed, and analyzed for both operational and training purposes. The traditional data stack (databases, data warehouses, ETL pipelines) must now integrate with orchestration systems, agent frameworks, prompt-management systems, and RAG pipelines. This expanded architecture requires platform teams to manage significantly more complex systems while maintaining reliability across the entire stack.

**Human-in-the-Loop Considerations**

Despite the autonomous nature of agents, the panelists emphasize the critical importance of human oversight, particularly for actions with significant consequences. Monte Carlo's experience with its troubleshooting agent illustrates the balance: users appreciate agents that automatically research and summarize incident root causes, but they strongly prefer to keep human approval in the loop for any corrective action. Effective LLMOps implementations must therefore balance automation with human control, particularly for high-stakes decisions.

**Future Predictions and Industry Outlook**

The panel concludes with bold predictions about the near-term evolution of agentic infrastructure. The participants anticipate that within 12 months the distinction between "agentic" and traditional software will disappear as AI becomes ubiquitous. They predict advances in intent-based configuration driven by natural language, potential breakthrough mathematical problem-solving by AI systems, and the expansion of agents from software into hardware robotics applications.

The discussion makes clear that while the challenges are significant, the industry is rapidly developing new approaches to handle agentic workloads. The fundamental insight is that treating agentic systems as slightly modified traditional software is insufficient; they require ground-up rethinking of infrastructure, security, monitoring, and operational practices. Success in this environment requires organizations to develop new expertise in AI evaluation, prompt engineering, agent orchestration, and semantic monitoring while maintaining the reliability and security standards expected of production systems.
