This panel discussion provides a comprehensive view of the current LLMOps landscape from multiple industry perspectives, featuring experts from Thinking Machines, Perplexity, Evolutionary Scale AI, and Axiom. The conversation reveals key insights about the production challenges and opportunities in deploying large language models across different domains and scales.
**Agentic Framework Proliferation and Future Direction**
The discussion begins with Horus from Thinking Machines addressing the explosion of agentic frameworks, noting that one GitHub list contained 93 different frameworks. He characterizes this as a "Cambrian explosion" typical of new technology areas with low barriers to entry, where fundamentally "an agent framework is just a string templating library." This proliferation mirrors the earlier framework wars between PyTorch and TensorFlow, suggesting a natural consolidation phase will follow.
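To make the "string templating" point concrete, a minimal agent loop really can be reduced to formatting a prompt, calling a model, and dispatching tool calls. The sketch below is illustrative only; `call_model` and the tool functions are hypothetical placeholders, not any specific framework's API.

```python
# Minimal sketch of an "agent framework" reduced to string templating plus a loop.
# call_model() and the entries in `tools` are hypothetical placeholders, not a real API.

PROMPT_TEMPLATE = """You are an assistant with access to these tools:
{tool_descriptions}

Conversation so far:
{history}

Respond either with a final answer or with a single line of the form:
TOOL: <tool_name> <argument>"""

def run_agent(task, tools, call_model, max_steps=5):
    """tools: dict mapping tool name -> Python callable; call_model: prompt text -> reply text."""
    history = f"User: {task}"
    tool_descriptions = "\n".join(f"- {name}" for name in tools)
    for _ in range(max_steps):
        prompt = PROMPT_TEMPLATE.format(tool_descriptions=tool_descriptions, history=history)
        reply = call_model(prompt)
        if reply.startswith("TOOL:"):
            # Real frameworks add parsing, validation, and error handling here.
            _, name, arg = reply.split(maxsplit=2)
            result = tools[name](arg)
            history += f"\nAssistant: {reply}\nTool result: {result}"
        else:
            return reply          # model produced a final answer
    return reply                  # step budget exhausted; return the last reply
```

Everything beyond this loop that frameworks add (memory, retries, routing, tracing) is useful plumbing, which is part of why the barrier to entry is so low and why so many variants exist.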
The panelists debate whether these frameworks have long-term viability or represent a transitional phase. Horus argues that as models improve their reasoning capabilities through post-training techniques, the need for external orchestration frameworks may diminish. He suggests that reinforcement learning and end-to-end training approaches may eventually subsume much of what current agentic frameworks attempt to accomplish through prompting and tool orchestration.
The framework discussion reveals a fundamental tension in LLMOps: the balance between providing useful abstractions and maintaining flexibility for rapidly evolving use cases. The panelists note that many frameworks fail because they make incorrect assumptions about where model capabilities are heading, producing abstractions that developers end up working around rather than building on.
**Reinforcement Learning in Production LLM Systems**
Drew from the embodied AI space provides historical context for the current RL renaissance, noting that "in the beginning, all learning was reinforcement learning." He emphasizes that RL's strength lies in scenarios with a "verifier-generator gap" where it's easier to evaluate whether something is correct than to generate it in the first place. This applies to board games, code with unit tests, and mathematical theorem proving.
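The verifier-generator gap is easiest to see with code and unit tests: checking a candidate solution is a cheap, mechanical step, while producing the solution is the hard part. A minimal sketch of best-of-n selection under this gap, with a hypothetical `generate_candidates` standing in for sampling from a model:

```python
# Sketch of selection with a cheap verifier; generate_candidates() is a
# hypothetical stand-in for sampling candidate programs from a model.

def verifier(candidate_fn):
    """Unit tests act as the verifier: far cheaper to run than writing the solution."""
    tests = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
    try:
        return all(candidate_fn(*args) == expected for args, expected in tests)
    except Exception:
        return False

def best_of_n(generate_candidates, n=8):
    """Sample n candidate programs and keep only those that pass verification."""
    return [c for c in generate_candidates(n) if verifier(c)]
```

The same asymmetry holds for board games (the rules score a finished game) and theorem proving (a proof checker validates a proof), which is why these domains yield clean reward signals for RL.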
The discussion reveals how modern RL applications differ from previous attempts. Unlike earlier RL that started from scratch, current approaches leverage pre-trained models with broad world understanding. The panelists discuss how recent successes like DeepSeek's theorem proving are technically "bandit problems" rather than full RL, since they involve internal token generation rather than interaction with external environments.
From a production perspective, the conversation highlights infrastructure challenges when deploying RL systems. Drew notes that systems interacting with external environments (bash terminals, web browsers, text editors) require different optimization approaches than purely internal generation. This creates variable-length batching challenges and the need for high-throughput parallel environment simulation, both of which are significant LLMOps infrastructure requirements.
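To make that concrete, the sketch below steps a pool of external environments in parallel and pads the resulting variable-length token sequences into a single training batch. The `Environment` and `policy` interfaces are hypothetical, not taken from the panel; only the padding utility is a real PyTorch call.

```python
# Sketch: parallel rollouts against external environments with variable-length batching.
# env.reset()/env.step() and policy() are hypothetical interfaces for illustration.

from concurrent.futures import ThreadPoolExecutor
import torch

def rollout(env, policy, max_steps=64):
    """Interact with one external environment (e.g. a shell or browser wrapper)."""
    obs, tokens = env.reset(), []
    for _ in range(max_steps):
        action = policy(obs)          # model call; may itself be batched elsewhere
        tokens.extend(action.tokens)
        obs, done = env.step(action)  # unlike pure generation, this leaves the model
        if done:
            break
    return torch.tensor(tokens)

def collect_batch(envs, policy):
    """Run all environments concurrently, then pad episodes of different lengths."""
    with ThreadPoolExecutor(max_workers=len(envs)) as pool:
        episodes = list(pool.map(lambda e: rollout(e, policy), envs))
    return torch.nn.utils.rnn.pad_sequence(episodes, batch_first=True, padding_value=0)
```

Because episodes end at unpredictable points, padding (or more sophisticated packing) and keeping many environments busy at once become throughput-critical, which is the infrastructure gap Drew is pointing at.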
**Domain-Specific Applications and Production Challenges**
Sal from Evolutionary Scale AI discusses applying LLMs to biological sequence modeling and protein design. His experience illustrates the progression from general language model capabilities to specialized scientific applications. He describes how weekend experimentation with protein design models produced results that surprised domain experts, demonstrating the rapid capability advancement in specialized domains.
The biology application reveals how LLMOps extends beyond text generation to scientific discovery. Evolutionary Scale combines "simulators" (models that understand biological sequences) with "reasoning models" that can operate in the latent space of amino acids rather than just natural language. This fusion represents a sophisticated production architecture that goes beyond simple text completion.
Karina from Axiom discusses mathematical reasoning applications, outlining four stages of AI mathematical capability: problem solving, theorem proving, conjecturing, and theory building. She emphasizes the challenge of moving beyond pattern matching to genuine mathematical reasoning without hallucination. The mathematical domain provides clear verifiable rewards for RL applications, but requires sophisticated infrastructure to convert natural language mathematical literature into formal verification systems like Lean 4.
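For a sense of what the formal target looks like, here is a small Lean 4 statement and proof (a standard library fact, not something from the panel). The LLMOps challenge Karina describes is producing statements and proofs like this automatically from informal mathematical literature, so that the proof checker can serve as the reward signal.

```lean
-- A toy Lean 4 example: because the statement is formal, the checker can verify
-- the proof mechanically, giving an unambiguous reward signal for RL training.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```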
**Infrastructure and Scaling Bottlenecks**
Yugen from Perplexity identifies infrastructure as the primary bottleneck in production LLM deployment. He notes that the trend toward larger models (from Llama 3's 70B parameters to Llama 4's 400B) requires sophisticated parallelization across hundreds to thousands of GPUs. Current libraries often force trade-offs between ease of use and comprehensive parallelism support.
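A hedged sketch of what composing parallelism strategies looks like in recent PyTorch (2.x) is shown below, using the DeviceMesh and tensor-parallel APIs. The model constructor and the layer-name-to-plan mapping are illustrative assumptions, and real 400B-scale deployments layer pipeline parallelism and more on top of this.

```python
# Sketch: 2-D parallelism with PyTorch DeviceMesh (assumes PyTorch 2.x APIs and that
# torch.distributed has already been initialized, e.g. via torchrun).
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel, RowwiseParallel
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# e.g. 64 GPUs arranged as 8-way data parallel x 8-way tensor parallel
mesh = init_device_mesh("cuda", (8, 8), mesh_dim_names=("dp", "tp"))

model = build_transformer()  # hypothetical constructor for a large transformer

# Shard selected projection layers across the tensor-parallel dimension...
model = parallelize_module(
    model,
    mesh["tp"],
    {"attention.wq": ColwiseParallel(), "attention.wo": RowwiseParallel()},
)
# ...then shard parameters and optimizer state across the data-parallel dimension.
model = FSDP(model, device_mesh=mesh["dp"])
```

The complaint in the panel is precisely that stitching such pieces together by hand, and keeping them compatible with custom attention mechanisms and rapid iteration, is still left to each team.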
The conversation reveals specific LLMOps pain points: libraries that support some parallelism strategies but not others, tools that handle all parallelism but don't support specific attention mechanisms, and systems that are too complex for rapid iteration. The addition of agentic workflows with tool calling creates additional complexity around efficient weight transfer between training and inference infrastructure.
The panelists describe building infrastructure from scratch due to inadequate existing tools, highlighting the gap between research capabilities and production tooling. Small teams find themselves implementing distributed training systems rather than focusing on model development and experimentation.
**Production Tool Calling and Model Capabilities**
Several panelists identify tool calling as a critical capability gap in production systems. Yugen notes that while open-source models are approaching API model quality for general chat, they lag significantly in tool calling capabilities. This represents a practical deployment challenge where teams must choose between model quality and API costs.
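In practice, tool calling means the model emits a structured call that the serving layer validates and dispatches, and the quality gap Yugen describes shows up as malformed JSON, wrong tool selection, or bad arguments. A minimal, library-agnostic sketch (the schema format and the example output are illustrative assumptions, not a specific vendor API):

```python
# Sketch: validating and dispatching a model-emitted tool call.
# The schema layout and the example model output are illustrative, not a real API.
import json

TOOLS = {
    "get_weather": {
        "params": {"city": str},
        "fn": lambda city: {"city": city, "temp_c": 21},  # placeholder implementation
    }
}

def dispatch(model_output: str):
    """Parse the model's JSON tool call, validate its arguments, and execute the tool."""
    call = json.loads(model_output)        # fails here if the JSON is malformed
    spec = TOOLS[call["name"]]             # fails here on an unknown tool name
    args = call["arguments"]
    for param, typ in spec["params"].items():
        if not isinstance(args.get(param), typ):
            raise ValueError(f"bad or missing argument: {param}")
    return spec["fn"](**args)

# The kind of structured output a well-behaved model should produce:
result = dispatch('{"name": "get_weather", "arguments": {"city": "Berlin"}}')
```

Models that reliably produce the final line's shape, across many tools and multi-step chains, are what the open-source ecosystem still lags on according to the panel.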
Drew envisions a future where "every digital surface that a human interacts with" becomes a tool call interface, requiring high-throughput, high-bandwidth environment libraries. This vision requires infrastructure that can handle diverse external system interactions at scale.
**Framework Consolidation and Future Directions**
The panel discussion suggests that the current framework proliferation will eventually consolidate around more principled approaches. The consensus points toward increasing model capabilities reducing the need for complex external orchestration, with RL-based post-training providing more direct paths to desired behaviors.
The conversation also reveals the need for better abstraction layers that can adapt to rapidly changing model capabilities. The panelists suggest that successful frameworks will need to provide fundamental building blocks that remain useful as model capabilities evolve, rather than high-level abstractions that become obsolete.
**Technical Infrastructure Requirements**
Throughout the discussion, several technical requirements emerge for production LLM systems: better sharded checkpointing for large models, improved tensor parallelism support, more efficient model scaling across data centers, and better tool calling capabilities in open-source models. The panelists specifically request infrastructure that makes distributed training feel like local development while handling the complexity of multi-GPU, multi-node deployments.
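On the checkpointing request specifically, recent PyTorch ships a distributed checkpoint module in which each rank saves and loads only the shards it owns. A hedged sketch, assuming PyTorch 2.x's torch.distributed.checkpoint and a sharded (e.g. FSDP-wrapped) model like the one sketched earlier:

```python
# Sketch: sharded save/load with torch.distributed.checkpoint (assumes PyTorch 2.x).
import torch.distributed.checkpoint as dcp

# Each rank contributes only the parameter shards it owns; writes happen in parallel.
state = {"model": model.state_dict()}
dcp.save(state, checkpoint_id="/checkpoints/step_1000")

# Later (possibly on a different number of GPUs), load the shards back in place.
state = {"model": model.state_dict()}
dcp.load(state, checkpoint_id="/checkpoints/step_1000")
model.load_state_dict(state["model"])
```

The panel's ask is essentially that this kind of machinery, plus resharding across cluster sizes, "just work" so that distributed training feels closer to local development.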
The discussion concludes with specific technical requests for PyTorch improvements, including kernel authoring tools, distributed tensor operations, symmetric memory management, and plug-and-play RL scaling capabilities. These requests reflect the practical challenges of moving from research prototypes to production systems at scale.
This panel discussion provides valuable insights into the current state of LLMOps across multiple domains and companies, revealing both the opportunities and significant infrastructure challenges in deploying large language models in production environments. The conversation suggests that while model capabilities are advancing rapidly, the tooling and infrastructure for production deployment remain significant bottlenecks that require focused engineering effort to resolve.