Company: Various

Title: Infrastructure for AI Agents: Panel Discussion on Production Challenges and Solutions

Industry: Tech

Year: 2025

Summary (short): This panel discussion brings together infrastructure experts from Groq, NVIDIA, Lambda, and AMD to discuss the unique challenges of deploying AI agents in production. The panelists explore how agentic AI differs from traditional AI workloads, requiring significantly higher token generation, lower latency, and more diverse infrastructure spanning edge to cloud. They discuss the evolution from training-focused to inference-focused infrastructure, emphasizing the need for efficiency at scale, specialized hardware optimization, and the importance of smaller distilled models over large monolithic models. The discussion highlights critical operational challenges, including power delivery, thermal management, and the need for full-stack engineering approaches to debug and optimize agentic systems in production environments.
## Overview

This panel discussion features infrastructure leaders from major technology companies—Igor Arsovski from Groq, Bill Dally from NVIDIA, Ramine Roane from AMD, and Chuan Li from Lambda—discussing the operational challenges and infrastructure requirements for deploying AI agents in production. The conversation provides valuable insights into the evolving landscape of LLMOps, particularly as the field transitions from training-centric to inference-centric workloads. The panel was moderated by Jared Quincy Davis and offers perspectives from across the hardware and software infrastructure stack.

## Unique Infrastructure Requirements for Agentic AI

The panelists identified several distinctive characteristics that differentiate agentic AI workloads from traditional LLM inference. Igor Arsovski from Groq emphasized that agentic AI generates significantly more tokens than non-agentic applications, typically requiring very long contexts. This is particularly evident in coding agents and research query systems. The key requirements he outlined include extremely low latency (users don't want to wait 10-15 minutes for queries to complete) and low cost, as agents constantly query models in iterative loops.

Ramine Roane from AMD brought attention to the full-spectrum nature of agentic infrastructure, spanning from edge devices to endpoints to cloud. He highlighted that different deployment contexts have radically different constraints: laptop deployments prioritize low power consumption for battery life, automotive applications demand ultra-low latency for safety-critical decisions (like detecting pedestrians), while data center deployments focus on raw compute power, memory capacity, and memory bandwidth. Additionally, all these devices are interconnected and must discover and communicate with each other, creating a more complex distributed systems challenge than previous computing paradigms.

Bill Dally from NVIDIA outlined three critical dimensions: quality, quantity, and system issues. The quantity challenge stems from tokens flowing through hierarchies of agents, each potentially employing chain-of-thought reasoning. The quality requirements demand higher bandwidth and lower latency than traditional inference workloads. System resilience becomes critical as these systems are built across many GPUs that must operate continuously. He also emphasized that agents frequently query various tools and data stores, necessitating seamless integration with databases and data lakes at scale.
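To make the quantity point concrete, here is a rough back-of-envelope sketch in Python. The agent counts, reasoning depths, context lengths, and per-step token figures are illustrative assumptions, not numbers cited by the panelists; the only point is that iterative agent loops multiply token volume quickly.

```python
# Illustrative back-of-envelope: how agent hierarchies multiply token volume.
# All numbers below are assumptions for illustration, not figures from the panel.

def tokens_per_request(agents: int, reasoning_steps: int,
                       tokens_per_step: int, context_tokens: int) -> int:
    """Rough estimate of tokens processed for one user request."""
    generated = agents * reasoning_steps * tokens_per_step
    # Each agent re-reads (prefills) its context on every reasoning step.
    prefilled = agents * reasoning_steps * context_tokens
    return generated + prefilled

# A single chat-style completion, for comparison.
single_shot = tokens_per_request(agents=1, reasoning_steps=1,
                                 tokens_per_step=500, context_tokens=2_000)

# A modest agentic workflow: 5 cooperating agents, 8 chain-of-thought
# iterations each, and a long shared context.
agentic = tokens_per_request(agents=5, reasoning_steps=8,
                             tokens_per_step=500, context_tokens=20_000)

print(f"single-shot tokens: {single_shot:>10,}")   # ~2,500
print(f"agentic tokens:     {agentic:>10,}")       # ~820,000
print(f"multiplier:         {agentic / single_shot:>10.0f}x")
```

Under these assumptions a single request expands into a few hundred times more tokens than a one-shot completion, which is why per-token cost and latency dominate the discussion.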
## The Inference-First Hardware Architecture

Igor Arsovski provided insight into Groq's approach as a hardware company that started with a software-first, inference-first philosophy in 2016. Their architecture is vertically optimized from custom silicon through custom networking to a custom software stack, all designed specifically for inference workloads. Notably, they've achieved best-in-class latencies using what he described as "10-year-old technology," suggesting that architectural choices and optimization matter more than simply using the newest fabrication processes. The company has deployed data centers across the US, Canada, Europe, and the Middle East, with approximately 1.8 million developers working on their platform.

The panel discussed the fundamental distinction between training and inference hardware requirements. Arsovski argued that the infrastructure that brought the industry to its current point—primarily training-focused GPUs—is not necessarily optimal for the future, which will be inference-dominated. He noted that inference hardware needs are approximately 20x greater than training hardware needs, requiring significantly better efficiency and real-time processing capabilities. This sparked an interesting counterpoint from Bill Dally, who argued that GPUs are actually the most efficient and commonplace platform for inference today, having been specifically optimized for distinct workloads including both training and inference.

## Heterogeneous Workload Patterns

The discussion revealed two distinct operational regimes within agentic infrastructure that require different optimization strategies. The first is the ultra-low latency, performance-sensitive regime where users pose questions and are blocked waiting for complex agentic graphs to unfold. In this scenario, every step must be as efficient as possible, and latency is the primary constraint. The second regime is asynchronous batch processing, where queries run overnight or in the background. Examples include code agents that check and improve codebases while developers are offline, or research agents that process large amounts of information asynchronously.

This bifurcation represents a more extreme version of the classic online-serving versus offline-batch trade-off, but with unique characteristics in the AI agent context. Infrastructure must either handle both regimes well or enable efficient work distribution across specialized systems. The challenge extends beyond simple resource allocation to include considerations of how to schedule and orchestrate work that includes indexing, preprocessing, serving, and postprocessing components.

Chuan Li from Lambda connected this to operational efficiency principles from cloud computing, emphasizing that elastic provisioning and load balancing—techniques refined during the early days of cloud infrastructure—remain critical. He noted that in a world where AI consumes most available compute resources, effectively distributing compute across different use cases (training, inference, agentic applications) while ensuring efficient utilization becomes paramount.

## The Renaissance of CPU Infrastructure

Ramine Roane made a provocative point that challenges conventional wisdom in the AI infrastructure space: the narrative that "new compute is only GPUs and CPUs are dead" is fundamentally flawed. While GPUs are indeed experiencing explosive growth, they're actually causing CPUs to re-emerge in importance. This is particularly true for agentic AI systems that call external tools, as these tools typically run on CPU infrastructure. The full-stack nature of agentic systems means that GPU-based inference is only one component in a larger heterogeneous computing environment.

This observation has important implications for infrastructure planning and cost modeling in production agentic systems. Organizations cannot simply provision GPU capacity and expect optimal performance; they must also consider the CPU infrastructure required to support tool execution, data processing, and orchestration logic. The interconnection between these heterogeneous computing resources becomes a critical design consideration.
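A minimal sketch of what this heterogeneity looks like in practice is shown below. The tool functions, the `call_model` stub, and the pool sizing are illustrative assumptions rather than details from the panel; the point is only that each agent step interleaves GPU-hosted inference with CPU-bound tool execution, so both resource pools have to be provisioned and monitored.

```python
# Hypothetical sketch: one agent step interleaves GPU-hosted inference with
# CPU-bound tool execution. All names and sizes are illustrative assumptions.
from concurrent.futures import ProcessPoolExecutor
import json

# CPU-side tools: in real systems these hit databases, run code, parse files, etc.
def search_docs(query: str) -> str:
    return f"top results for {query!r}"          # placeholder CPU work

def run_sql(statement: str) -> str:
    return f"rows returned for {statement!r}"    # placeholder CPU work

TOOLS = {"search_docs": search_docs, "run_sql": run_sql}

def call_model(prompt: str) -> dict:
    """Stand-in for a GPU-backed inference endpoint (e.g. an HTTP call).
    Returns either a tool request or a final answer."""
    if "results" not in prompt:
        return {"tool": "search_docs", "args": {"query": "agent infrastructure"}}
    return {"answer": "summary based on tool output"}

def agent_step(task: str, cpu_pool: ProcessPoolExecutor) -> str:
    prompt = task
    for _ in range(8):                            # bounded reasoning loop
        decision = call_model(prompt)             # GPU-bound: token generation
        if "answer" in decision:
            return decision["answer"]
        tool = TOOLS[decision["tool"]]
        # CPU-bound: execute the tool off the inference path.
        result = cpu_pool.submit(tool, *decision["args"].values()).result()
        prompt += "\n" + json.dumps({"tool_result": result})
    return "reasoning budget exhausted"

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:  # CPU capacity is its own knob
        print(agent_step("Summarize agent infrastructure requirements", pool))
```

Even in this toy form, the GPU-side budget (model calls) and the CPU-side budget (tool workers) scale independently, which is the planning problem the panelists describe.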
## The Shift from Pretraining to Posttraining and Inference

Multiple panelists observed a significant shift in where computational resources are being allocated. Roane noted that the explosive era of pretraining is likely slowing down, partly because the industry is running out of text data for training (though video and image data remain abundant). What's scaling rapidly instead is posttraining—particularly reinforcement learning for verifiable domains—and generative AI inference, which generates enormous volumes of tokens through techniques like tree-of-thought reasoning.

This shift has profound implications for LLMOps practices. Where organizations once focused infrastructure investments almost entirely on training clusters, they must now build sophisticated inference infrastructure with different optimization profiles. The operational challenges shift from managing long-running training jobs with relatively predictable resource consumption to managing highly variable inference workloads with strict latency requirements and unpredictable traffic patterns.

Bill Dally emphasized what he sees as a major error in current conventional wisdom: over-focusing on large models. While everyone worries about flagship trillion-parameter models, the practical reality of agentic systems involves distilling large models into smaller, task-specialized models. He predicts much more focus on running 8 billion or even 1 billion parameter models with extreme efficiency, rather than deploying massive monolithic models. This approach to mixture of experts operates at the agent level, with individual agents possessing just the knowledge they need for their specific tasks.

## Model Distillation and Specialization Strategies

The discussion of smaller, specialized models represents a critical LLMOps strategy for production agentic systems. The pattern that emerges is one where large, expensive models are used to train or guide smaller models that are then deployed for specific tasks within an agentic workflow. This approach offers several advantages: lower inference costs, reduced latency, easier deployment to edge devices, and the ability to optimize each component model for its specific sub-task.

This architecture also has implications for how organizations think about model development and deployment. Rather than a single model deployment pipeline, production agentic systems may require orchestrating dozens or hundreds of specialized models, each with its own deployment, monitoring, and updating requirements. The operational complexity increases, but so do the opportunities for optimization and cost reduction.

## Evaluation Challenges in Production

Chuan Li raised an important point about evaluation being simultaneously "overindexed and underutilized." The core problem is a disconnect between leaderboard evaluations and real-world performance on specific tasks. He drew an analogy to recruiting: academic credentials and test scores matter early in someone's career but become less relevant over time compared to actual project experience and demonstrated ability to solve specific problems.

For agentic systems in production, this suggests that traditional benchmark-driven evaluation approaches are insufficient. Organizations need "embodied eval" and continuous development practices that assess how well agents actually perform on the specific tasks they're deployed to solve, rather than relying solely on general-purpose benchmarks. This has implications for LLMOps workflows, suggesting the need for continuous evaluation pipelines that assess agent performance against task-specific metrics in production or production-like environments.
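One hedged sketch of what such a task-specific evaluation loop might look like appears below. The task cases, the scoring checks, the pass-rate threshold, and the `run_agent` stand-in are illustrative assumptions rather than anything described by the panelists; the structure simply ties evaluation to the agent's actual tasks instead of a public leaderboard.

```python
# Hypothetical continuous-evaluation sketch for a production agent.
# Task cases, checks, and thresholds are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class TaskCase:
    prompt: str
    check: Callable[[str], bool]   # task-specific success criterion

TASK_SUITE = [
    TaskCase("Open a ticket for the failed nightly build",
             check=lambda out: "ticket" in out.lower()),
    TaskCase("Summarize the top three customer complaints this week",
             check=lambda out: out.count("\n") >= 2),
]

def run_agent(prompt: str) -> str:
    """Stand-in for the deployed agent (would call the real system)."""
    return f"ticket created\nsummary line\nsummary line for: {prompt}"

def evaluate(suite: list[TaskCase], min_pass_rate: float = 0.9) -> bool:
    passed = sum(case.check(run_agent(case.prompt)) for case in suite)
    pass_rate = passed / len(suite)
    print(f"pass rate: {pass_rate:.0%} on {len(suite)} task cases")
    return pass_rate >= min_pass_rate   # gate a release or trigger an alert

if __name__ == "__main__":
    # In production this would run on a schedule or on every agent update.
    ok = evaluate(TASK_SUITE)
    print("release gate:", "pass" if ok else "fail")
```

Gating on task-level checks rather than benchmark scores is the "embodied eval" idea in miniature.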
The recent example of Llama 4 being "number 1 at all the evals" (referenced somewhat sarcastically by Roane) illustrates this disconnect—high benchmark performance doesn't always translate to superior real-world utility. This underscores the importance of developing evaluation frameworks that better capture the actual value delivered by agentic systems in production contexts.

## Stable Principles Across Computing Eras

Despite the rapid evolution of AI infrastructure, the panelists identified several enduring principles. Chuan Li emphasized that operational efficiency remains paramount across technology transitions. Techniques like elastic provisioning and load balancing, proven during the early cloud computing era, continue to be relevant even as the specific technologies change.

Bill Dally highlighted programmability and framework availability as persistent requirements. Users need solid programming languages (like CUDA) as foundations, along with frameworks and libraries that allow rapid prototyping without starting from scratch. The ability to quickly customize and compose existing solutions remains critical for adapting to changing customer needs.

Roane identified three fundamental axes on which technology battles are won across eras: Total Cost of Ownership (including efficiency), dependability (encompassing safety, security, and reliability), and user experience. While the specific means of achieving excellence on these dimensions evolve—for example, liquid cooling and chipletization for TCO, open-source software for user experience—the core dimensions themselves remain constant.

Igor Arsovski emphasized efficiency at scale as an enduring concern. Just as mobile processors were optimized for battery life, inference hardware must be optimized for power-performance efficiency given the massive scale at which tokens will be generated. User experience, particularly how quickly workloads can be deployed onto hardware, also remains a key differentiator across technology generations.

## Full-Stack Engineering for Production AI Systems

A recurring theme throughout the discussion was the critical importance of full-stack engineering knowledge for building and operating production AI systems. Ramine Roane argued that the most outstanding engineers are full-stack engineers who understand hardware, communication issues, models, and application layers. This breadth of knowledge proves essential for debugging problems in production systems, where issues might originate in the model, software, hardware, or communication layers.

Igor Arsovski emphasized that while chip performance (like TOPS per watt) receives most attention, equally important infrastructure questions often get overlooked: power delivery and losses, thermal cooling solutions, rack packing strategies, and overall data center design. These "less sexy" components are essential for production data centers but require understanding how the entire stack fits together; a rough power-budget sketch at the end of this section illustrates the point.

This full-stack perspective challenges the trend toward narrow specialization and suggests that organizations building production agentic systems need team members who can drive "a nail through" the entire stack—understanding not just their specific domain but how their components interact with the broader system.
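As a rough illustration of the power-delivery and cooling questions Arsovski raises, here is a hedged back-of-envelope sketch. The accelerator count, per-device power, delivery efficiency, and facility-overhead multiplier are illustrative assumptions, not numbers from the panel.

```python
# Illustrative rack power budget. All figures are assumptions for illustration.

ACCELERATORS_PER_RACK = 32          # assumed dense inference rack
WATTS_PER_ACCELERATOR = 700         # assumed device power at load
HOST_AND_NETWORK_WATTS = 4_000      # assumed CPUs, NICs, switches per rack
POWER_DELIVERY_EFFICIENCY = 0.92    # assumed losses in conversion/distribution
FACILITY_OVERHEAD = 1.3             # assumed multiplier for cooling and other facility loads

it_load_w = ACCELERATORS_PER_RACK * WATTS_PER_ACCELERATOR + HOST_AND_NETWORK_WATTS
delivered_w = it_load_w / POWER_DELIVERY_EFFICIENCY   # wall power before facility overhead
facility_w = delivered_w * FACILITY_OVERHEAD          # total draw including cooling

print(f"IT load per rack:      {it_load_w / 1e3:6.1f} kW")
print(f"After delivery losses: {delivered_w / 1e3:6.1f} kW")
print(f"With cooling overhead: {facility_w / 1e3:6.1f} kW")
# Under these assumptions, roughly 30% of the total draw goes to delivery
# losses and cooling rather than the IT load itself, which is why power
# delivery and thermal design get so much attention in data center planning.
```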
The moderator noted that infrastructure has profoundly shaped AI research, citing how the transformer architecture co-evolved with the TPU pod design. This historical example illustrates why arbitrary boundaries between infrastructure and research, or between different layers of the stack, can be counterproductive.

## Contrarian Views and Future Directions

The panel concluded with advice for researchers and practitioners, much of it contrarian to current conventional wisdom. Bill Dally encouraged rejecting conventional wisdom and thinking from first principles about how good things could be. He argued that too many people pursue incremental improvements to existing approaches, while the biggest breakthroughs come from tossing aside assumptions and starting fresh with physical arguments about performance bounds.

Chuan Li's advice to "work on small models, try to make them big models" captures the distillation and specialization trend discussed earlier. This approach focuses innovation on efficiency and capability extraction rather than simply scaling up model size.

The emphasis on full-stack knowledge, first-principles thinking, and efficiency-focused innovation provides a roadmap for advancing the field of production AI systems. The discussion makes clear that despite rapid change, fundamental systems engineering principles remain relevant, and that the next wave of innovation will likely come from rethinking architectures and workflows rather than simply scaling existing approaches.
