Company
Various (Alation, GrottoAI, Nvidia, OLX)
Title
Open Source vs. Closed Source Agentic Stacks: Panel Discussion on Production Deployment Strategies
Industry
Tech
Year
2025
Summary (short)
This panel discussion brings together experts from Nvidia, OLX, Alation, and GrottoAI to discuss practical considerations for deploying agentic AI systems in production. The conversation explores when to choose open source versus closed source tooling, the challenges of standardizing agent frameworks across enterprise organizations, and the tradeoffs between abstraction levels in agent orchestration platforms. Key themes include starting with closed source models for rapid prototyping before transitioning to open source for compliance and cost reasons, the importance of observability across heterogeneous agent frameworks, the difficulty of enabling non-technical users to build agents, and the critical difference between internal tooling with lower precision requirements versus customer-facing systems demanding 95%+ accuracy.
## Overview

This panel discussion provides a comprehensive exploration of the practical challenges and strategic decisions involved in deploying agentic AI systems to production. The panel features four speakers with diverse perspectives: Adele from Nvidia's enterprise product group, working on agent libraries and microservices; Olga, who leads product analytics at OLX and has experience building internal data and AI tooling; Laurel, a founding engineer who has deployed agents at multiple startups including Numberstation, Alation, and Stacklock; and Ben, CTO of GrottoAI (a multifamily vacancy loss reduction company) and former founding engineer at Galileo. The discussion is particularly valuable because it represents both builders of LLMOps tooling (Nvidia, previously Galileo) and consumers of such tooling (GrottoAI, OLX, Alation), providing a balanced perspective on what actually works in production.

## When to Choose Open Source vs. Closed Source

The panel's consensus approach to open source versus closed source tooling is notably pragmatic rather than ideological. Laurel advocates for starting with closed source foundation models (like GPT-4 or Claude) unless teams are already comfortable with model hosting, arguing that the operational overhead of self-hosting models can distract from the core task of understanding whether agents can solve the business problem at hand. She recommends using the first phase as an experimental learning period where teams try multiple open source agent frameworks (LangChain, LangGraph, CrewAI, etc.) to understand their strengths and weaknesses, then coming to stakeholders with informed recommendations about what worked and what didn't.

Adele from Nvidia frames the decision differently, emphasizing that open source and closed source systems are not mutually exclusive but rather serve different purposes within the same agentic system. He describes a typical pattern where teams initially use frontier models from Anthropic or OpenAI to demonstrate that a use case can be solved with generative AI, essentially proving the concept with readily available, high-quality models. However, two key inflection points drove Nvidia and its customers toward open source models: the release of Meta's Llama 3 and of DeepSeek's reasoning model in January (presumably 2025), which demonstrated that open source models were closing the capability gap with proprietary frontier models. More critically, once proof of concept is established, compliance and data privacy requirements become paramount. Adele notes that for many enterprise use cases at Nvidia, they simply cannot send prompts to external APIs due to compliance constraints, forcing them to deploy open source models internally. This represents a common enterprise pattern: prototype with closed source for speed, then transition to open source for compliance, cost optimization, and scale.

## Framework Selection and Abstraction Levels

A particularly nuanced discussion emerges around agent framework selection and the level of abstraction teams should accept. Laurel shares hard-won experience about starting with highly abstracted frameworks that made initial development easy but created significant debugging challenges. She describes how pre-reasoning-model frameworks relied on complex orchestrator communication protocols as a "cheap form of reasoning" to split tasks across multiple agents.
However, with the advent of advanced reasoning models like o3, their team's philosophy shifted dramatically toward simplicity: a single good reasoning model with curated tools often outperforms a complex multi-agent framework with eight agents talking to each other. The debugging and maintenance burden of such complex systems becomes nearly insurmountable when message history gets corrupted or agents become confused. This led Laurel's teams to eventually roll their own agent framework (which she admits was "probably not worth it in the long run") before ultimately selecting BAML, which she describes as "the lowest abstraction I could find" that still provides helpful utilities for model calls and basic communication without heavy abstraction layers. Her recommendation is clear: choose the lowest-level abstraction that still saves you from painful boilerplate, because when (not if) you hit edge cases and bugs, you need to understand exactly what's happening under the hood. Ben echoes this sentiment from his experience at GrottoAI, noting that many open source tools are so abstract that teams quickly hit walls after initial rapid progress, creating a sunk cost fallacy where they feel compelled to continue despite not understanding the underlying mechanics.

Adele presents Nvidia's approach to this challenge at enterprise scale through the NeMo Agent Toolkit, which takes a fundamentally different architectural approach. Rather than replacing existing frameworks, NeMo acts as a meta-framework that works alongside LangGraph, CrewAI, AutoGen (now Semantic Kernel), and custom Python implementations. Nvidia recognized that different teams across the organization were building agents using different frameworks, each with valid reasons for their choices, and that attempting to force standardization would stifle innovation. The NeMo Agent Toolkit addresses this heterogeneity through three key capabilities: interoperability via decoration of agents built on different frameworks, observability across the entire system of agents through OpenTelemetry trace collection, and profiling that enables Nvidia's unique strength in full-stack acceleration (making intelligent decisions about disaggregated versus aggregated computing). This approach acknowledges that data gravity and existing tooling ecosystems often dictate framework choices (for example, teams with data in AWS might naturally use Agent Core with Strands), so rather than fighting this reality, Nvidia provides tooling that works across all frameworks.

## Observability and Evaluation Challenges

The panel identifies observability and evaluation as perhaps even less solved than the agent orchestration stack itself. Laurel raises an important concern about observability platforms being built directly into agent frameworks, creating potentially problematic lock-in effects. She cites examples like LangSmith (built into LangChain), Pydantic's observability tools, and various other framework-specific logging systems. While these platforms often claim OpenTelemetry support and compatibility with other frameworks, the practical reality is that no engineer is eager to integrate a competitor's observability stack, leading to effective lock-in even when technical compatibility exists. Both Adele and Olga emphasize the importance of standards-based approaches.
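To make the standards-based idea concrete, below is a minimal, hedged sketch of wrapping an agent's tool calls in OpenTelemetry spans using the standard `opentelemetry-api` package. This is not the NeMo Agent Toolkit's actual API; the `traced_tool_call` helper and the attribute names are illustrative assumptions, and exporter/backend setup (via `opentelemetry-sdk`) is assumed to happen elsewhere.

```python
# Minimal, framework-agnostic sketch: wrap each agent tool call in an
# OpenTelemetry span so traces flow to whatever backend is already configured
# (Datadog, LangSmith, a plain OTLP collector, ...). Attribute names are
# illustrative, not a standard.
import json
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("agent.observability")

def traced_tool_call(tool_name, tool_fn, **kwargs):
    """Run one tool call inside a span, recording inputs, output, and errors."""
    with tracer.start_as_current_span(f"tool.{tool_name}") as span:
        span.set_attribute("agent.tool.name", tool_name)
        span.set_attribute("agent.tool.args", json.dumps(kwargs, default=str))
        try:
            result = tool_fn(**kwargs)
            span.set_attribute("agent.tool.result_preview", str(result)[:500])
            return result
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR))
            raise
```

Because the instrumentation emits plain OTel spans rather than a proprietary format, the same code can feed whichever observability platform a team already runs, which is the property the panel keeps coming back to.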
Nvidia's NeMo Agent Toolkit explicitly outputs OpenTelemetry traces rather than proprietary formats, allowing organizations to continue using their existing observability platforms (Datadog, Weights & Biases, LangSmith) without forced migration. This becomes critical at enterprise scale, where different teams have already standardized on different tools for specific agents or workflows.

Laurel shares her current approach at a small startup (four to six people): they use BAML for agent orchestration, output logs in JSON format, store everything in a lakehouse, and query it with SQL. This bare-bones approach works well at their scale, though she acknowledges it would need to evolve as the company grows.
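The panel doesn't name a specific query engine, so the following is a hedged sketch of that bare-bones pattern using DuckDB over newline-delimited JSON log files as a stand-in for a lakehouse table; the file layout and field names (`tool`, `latency_ms`, `error`) are assumptions for illustration.

```python
# Illustrative only: querying JSON agent logs with SQL, in the spirit of the
# "JSON logs in a lakehouse, queried with SQL" approach described above.
# Assumes one JSON object per line in files under ./agent_logs/, with fields
# such as tool, latency_ms, and error -- these names are hypothetical.
import duckdb

con = duckdb.connect()  # an in-memory database is enough for ad hoc analysis

# Summarize tool call volume, latency, and error rate per tool.
summary = con.sql("""
    SELECT
        tool,
        COUNT(*)        AS calls,
        AVG(latency_ms) AS avg_latency_ms,
        AVG(CASE WHEN error IS NOT NULL THEN 1 ELSE 0 END) AS error_rate
    FROM read_json_auto('agent_logs/*.json')
    GROUP BY tool
    ORDER BY calls DESC
""")
print(summary)
```

At a handful of engineers, this kind of ad hoc querying is often enough; the heavier observability platforms discussed above become worthwhile as the number of agents and teams grows.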
The key insight is that evaluation needs are highly custom to each enterprise's specific use cases, making generic solutions difficult to implement effectively. Ben notes this pattern from his experience building internal LLM tooling at multiple companies: every enterprise's evaluation needs are sufficiently unique that off-the-shelf solutions often don't fit well.

The discussion also touches on the critical importance of evaluation methodology. Laurel emphasizes that teams should spend a couple of weeks experimenting with different frameworks to understand what works and what doesn't for their specific use case before making production commitments. This experimental phase should result in clear documentation of where each framework succeeded and where it failed, providing a baseline for evaluating any future proprietary solutions. The evaluation framework at GrottoAI, for example, is deliberately simple: they maintain spreadsheets of test data with expected outputs, and if a BAML prompt achieves the target score on that dataset, it is considered ready for production deployment. This simplicity enables rapid iteration while maintaining quality standards.

## Standardization vs. Innovation in Enterprise Environments

A fascinating tension emerges in the discussion around standardization versus enabling innovation, particularly relevant for large enterprises. Ben poses the challenge directly, citing ZenML's perspective that allowing different teams to choose different tools might work for a six-person startup but becomes untenable at enterprises with 40, 50, 100, or 300 people. Adele's response provides important nuance by distinguishing between different layers of the stack and different environments within the enterprise. Nvidia's approach involves standardizing certain foundational infrastructure components that must be common across all teams: Kubernetes as the orchestration layer (specifically Red Hat's distribution internally), JFrog Artifactory for artifact storage, and common approaches to their AI Ops platform and MLOps tooling. These standardized building blocks provide the necessary consistency for security, compliance, and operational efficiency.

However, Nvidia explicitly maintains what Adele calls a "sandbox" environment that allows teams to explore different agent frameworks and tools. The critical distinction is between experimentation and operationalization. Teams can experiment with various frameworks in the sandbox, but moving to production involves a "big lift" that requires compliance approval, proper observability integration, privacy guarantees, and adherence to the standardized infrastructure components. This approach allows Nvidia to benefit from innovation happening across different teams while still maintaining operational control and compliance when agents go into production. Adele frames this within Nvidia's concept of "AI factories" - on-premises data centers designed to generate tokens (intelligence) at scale - noting that Nvidia can't credibly talk about AI factories without operating one itself and dealing with these exact challenges.

The privacy and compliance dimensions become particularly complex at this scale. Adele notes that they can't simply collect all traces into a data lake because certain disciplines (finance, chip design) have strict prohibitions on prompt and trace collection. This requires implementing differential privacy techniques and sophisticated governance, making observability significantly more challenging than just instrumenting code to output logs. This real-world constraint highlights why enterprise LLMOps is fundamentally different from startup LLMOps: the compliance and privacy requirements create architectural constraints that small companies often don't face.

## Low-Code and No-Code Agent Building

The panel provides valuable skepticism about the current state of low-code and no-code agent building tools, a topic frequently hyped in the industry. Laurel's experience is particularly illuminating: when building systems where the company owns AI quality and is directly responsible to customers, low-code capabilities are largely irrelevant because engineers need precise control over every aspect of the system. At her second company, they built a low-code agent builder for customers but discovered that most users wanted to describe their problem and have it solved for them rather than build solutions themselves. The cognitive load of debugging AI systems - reading prompts, understanding model behavior, managing conversations - represents an entirely new skill set that most business users neither have nor have time to develop.

Olga provides an important counterpoint by distinguishing between customer-facing and internal-facing use cases. For internal productivity tools where users are improving their own workflows, she's enthusiastic about low-code solutions despite the investment required in guidance and governance. The feeling of empowerment from being able to automate one's own tasks is valuable enough to justify the educational overhead. This aligns with Laurel's observation that user experience and templates matter significantly, and that workflow builders (which feel like deterministic flows with hints of AI) are often more relatable to business users than pure prompt engineering interfaces.

Ben synthesizes this into a clear framework based on tolerance for failure and precision requirements. Internal tools used by engineering and data science teams might operate effectively at 60-70% precision, providing value despite imperfection. Customer-facing agents at GrottoAI, by contrast, are built and deployed only by the engineering and data science teams and must achieve 95%+ precision, sometimes 99%+. The tooling for these high-precision systems is deliberately bare-bones: BAML for prompt development, spreadsheet-based evaluation datasets, and clear deployment criteria. This simplicity actually enables faster movement to production because the evaluation criteria are transparent and the tooling is standardized across technical teams.
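The transcript describes this gate only at the level of "a spreadsheet of test cases plus a target score," so the following is a hedged sketch of what such a deployment gate might look like; the CSV layout, the `run_agent` callable, and the threshold values are assumptions for illustration, not GrottoAI's actual implementation.

```python
# Hypothetical sketch of a dataset-based deployment gate: run a candidate agent
# (or a single BAML-style prompt function) over a spreadsheet of test cases and
# approve deployment only if accuracy clears the threshold for the use case.
# Assumed CSV layout: columns "input" and "expected".
import csv
from typing import Callable

# Thresholds mirror the panel's rough numbers; real values are per use case.
PRECISION_THRESHOLDS = {
    "internal_tool": 0.70,     # internal productivity tooling tolerates misses
    "customer_facing": 0.95,   # customer-facing agents need 95%+ (sometimes 99%+)
}

def evaluate(run_agent: Callable[[str], str], dataset_path: str, use_case: str) -> bool:
    """Return True if the agent's exact-match accuracy meets the deployment bar."""
    with open(dataset_path, newline="") as f:
        cases = list(csv.DictReader(f))
    correct = sum(
        1 for case in cases
        if run_agent(case["input"]).strip() == case["expected"].strip()
    )
    score = correct / len(cases)
    threshold = PRECISION_THRESHOLDS[use_case]
    print(f"accuracy {score:.1%} on {len(cases)} cases (threshold {threshold:.0%})")
    return score >= threshold
```

Exact-match scoring is the simplest possible check; in practice the comparison might be a rubric, a structured-output diff, or an LLM judge, but the deployment decision stays a transparent pass/fail against a fixed dataset.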
## Model Selection and the Reasoning Model Revolution

While not the primary focus, the panel discusses the impact of reasoning models on agentic system design. Adele identifies the January release of DeepSeek's open reasoning model as a watershed moment that demonstrated open source models could match proprietary model capabilities. Laurel describes how reasoning models like o3 fundamentally changed their architectural approach: instead of complex multi-agent orchestration serving as a form of reasoning, they could rely on a single powerful reasoning model with well-designed tools. This simplification dramatically reduces complexity, improves debuggability, and often achieves better results than elaborate agent coordination protocols.

This shift illustrates a broader pattern in LLMOps: improvements in foundation models can obsolete entire categories of engineering complexity. Teams that invested heavily in multi-agent orchestration frameworks designed to compensate for limited model reasoning capabilities suddenly found that simpler architectures with better models outperformed their complex systems. This highlights the importance of maintaining flexibility in agent architecture and not over-investing in compensating for model limitations that may soon be solved by better models.

## Practical Recommendations and Patterns

Several concrete patterns emerge from the discussion that represent current best practices for production agentic systems:

- **Prototyping Pattern**: Start with closed source frontier models and experiment with multiple open source agent frameworks simultaneously. Spend 2-3 weeks understanding the strengths and weaknesses of different approaches before committing to a production architecture. Document failures as carefully as successes to establish baseline requirements.
- **Production Architecture**: Choose the lowest abstraction level that still saves meaningful boilerplate. Simpler is better because debugging AI systems is fundamentally different from debugging traditional software. Complex multi-agent systems with many communicating agents are extremely difficult to debug and maintain when message history corruption or agent confusion occurs. (A minimal sketch of the "one reasoning model, few curated tools" pattern follows this list.)
- **Observability Strategy**: Standardize on OpenTelemetry for trace output to avoid lock-in to framework-specific observability platforms. At small scale, even basic lakehouse storage with SQL queries can be effective. At enterprise scale, support the multiple observability platforms that different teams have already adopted rather than forcing migration.
- **Standardization Approach**: Standardize infrastructure (Kubernetes, artifact storage, AI Ops platforms) and security/compliance requirements, but allow experimentation with different frameworks in sandbox environments. Create clear gates between experimentation and production operationalization.
- **Evaluation Methodology**: Maintain dataset-based evaluation with clear precision requirements tied to use case criticality. Internal tools can operate at 60-80% precision; customer-facing systems need 95%+ precision. Simplicity in the evaluation process enables faster iteration.
- **Model Selection**: Start with closed source models unless hosting expertise already exists or compliance requires otherwise. Transition to open source models when compliance dictates, costs become prohibitive at scale, or open source models match the required capabilities (increasingly common post-Llama 3 and the DeepSeek reasoning models).
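To make the Production Architecture recommendation concrete, here is a minimal sketch of the single-reasoning-model-with-curated-tools pattern, written against the OpenAI Python SDK's chat-completions tool-calling interface as one possible low-abstraction choice; the model name, the `search_docs` tool, and the loop itself are illustrative assumptions rather than anything the panelists specified.

```python
# Illustrative low-abstraction agent loop: one capable model, a small curated
# toolset, and a plain while-loop instead of a multi-agent framework.
# Model name and tool definitions are placeholders.
import json
from openai import OpenAI

client = OpenAI()

def search_docs(query: str) -> str:
    """Stand-in for a real retrieval tool."""
    return f"(top documents for: {query})"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_docs",
        "description": "Search internal documentation.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]
TOOL_FNS = {"search_docs": search_docs}

def run(task: str, model: str = "gpt-4.1") -> str:
    messages = [{"role": "user", "content": task}]
    while True:
        resp = client.chat.completions.create(model=model, messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        if not msg.tool_calls:                 # model is done reasoning/acting
            return msg.content
        messages.append(msg)                   # keep the assistant turn in history
        for call in msg.tool_calls:            # execute each requested tool
            args = json.loads(call.function.arguments)
            result = TOOL_FNS[call.function.name](**args)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```

The same loop could point at a self-hosted open source model behind an OpenAI-compatible endpoint, which is one way teams make the prototype-on-closed-source, ship-on-open-source transition described earlier.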
The panel ultimately presents a pragmatic, experience-driven perspective on production agentic systems, one that acknowledges the immaturity of the tooling ecosystem while offering clear patterns for navigating current challenges. The emphasis throughout is on simplicity, clear evaluation criteria, and maintaining flexibility as both models and frameworks continue to evolve rapidly.
