## Overview
This panel discussion provides comprehensive insights into three distinct production LLM deployments from AI-native companies that have been operating for approximately three years. The speakers are Sam (CTO and co-founder of Delphi, a platform for creating personalized AI profiles), Nick (co-founder of Seam AI, building sales and marketing automation agents), and Dave (head of product at APIsec, developing API security testing tools). The discussion covers the complete LLMOps lifecycle including infrastructure evolution, architectural decisions, cost management, security considerations, ROI measurement, and the philosophical shift from deterministic workflows to agentic autonomy.
## Company Use Cases and Technical Implementations
**Delphi** operates as a "living profile" platform that enables users to upload their content—from LinkedIn profiles to YouTube videos and podcasts—and creates a digital representation that learns how they think and speak. The technical challenge involves creating a generalized system that can produce accurate representations of any person rather than manually crafting specific implementations. Sam mentions they've processed over one trillion tokens through OpenAI (publicly recognized at OpenAI Dev Day), indicating massive scale in production. The platform uses knowledge graphs to represent personal information and has evolved from having many opinionated components about prompt formatting and intention detection to a much more autonomous agent architecture.
**Seam AI** positions itself as "Datadog for sales and marketing," building agents that monitor for in-market buyers and execute campaigns automatically when important moments occur. The company has progressed through three distinct phases: single-shot prompting for data enrichment in 2023, when context windows of only a few thousand tokens were the norm; workflow automation with individual LLM steps embedded in deterministic flows; and most recently fully agentic systems. Their architecture ingests structured data from go-to-market systems and has evolved from surfacing summaries for humans to act on to executing campaigns autonomously.
**APIsec** takes a unique approach by building security testing programs that uncover unknown risks through the API layer. Their system performs small units of work against customer applications, then pauses to strategize the next step using state machines for reasoning. The critical difference in their deployment is the requirement for extremely high confidence and deterministic behavior, as they're advising organizations to activate engineering resources to fix identified issues. They must minimize false positives (noise) while understanding not just what actions are possible through APIs but what actions should be allowed from a security perspective.
## Infrastructure Evolution and Architectural Decisions
The panelists describe a fundamental shift in infrastructure philosophy over three years. Sam highlights the trend toward serverless, effectively infinitely scalable infrastructure, specifically mentioning PlanetScale (a serverless database) and Temporal (durable workflow orchestration). This approach has "automatically built in a lot of what you had to think about in the past to create scalable infrastructure." The serverless paradigm lets companies focus on building features rather than managing scaling concerns.
A critical philosophical debate emerged around state machines versus model autonomy. Dave notes that APIsec relies heavily on state machines and step functions because "the deterministic nature of those state machines is very important" when advising organizations on security fixes. However, Sam challenges this approach by referencing Richard Sutton's 2019 essay "The Bitter Lesson," which argues that approaches built on hand-encoded human knowledge are consistently outperformed by general methods that simply leverage more computation. Sam advocates for removing assumptions about how systems should behave and instead clearly defining tasks, providing tools, and letting models reason autonomously.
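To make the contrast concrete, here is a minimal sketch of the state-machine side of that debate, loosely modeled on Dave's description of small units of work followed by a strategizing step. The `execute_probe` and `plan_next_probe` callables are hypothetical stand-ins for APIsec's actual probe execution and model call; the key property is that the transition graph is fixed in code, and the model only chooses which probe to run next, never the control flow.

```python
from enum import Enum, auto
from typing import Callable, Optional

class State(Enum):
    PROBE = auto()       # run one small unit of work against the target API
    STRATEGIZE = auto()  # ask the model what to try next, given findings so far
    REPORT = auto()      # return only high-confidence findings

def run_scan(
    execute_probe: Callable[[str], dict],              # hypothetical: runs one probe, returns a finding dict
    plan_next_probe: Callable[[list], Optional[str]],  # hypothetical: LLM proposes the next probe, or None to stop
    max_steps: int = 20,
) -> list:
    """Deterministic outer loop: the transition graph is fixed in code;
    the model only chooses *which* probe to run next, never the control flow."""
    state, findings, next_probe = State.PROBE, [], "baseline"
    for _ in range(max_steps):
        if state is State.PROBE:
            findings.append(execute_probe(next_probe))
            state = State.STRATEGIZE
        elif state is State.STRATEGIZE:
            proposal = plan_next_probe(findings)
            if proposal is None:
                state = State.REPORT
            else:
                next_probe, state = proposal, State.PROBE
        elif state is State.REPORT:
            break
    return [f for f in findings if f.get("confidence", 0) >= 0.9]
```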
Nick describes this as "letting go" after spending a decade in enterprise software at Okta where everything needed to be deterministic with opinions injected throughout the stack. He references Anthropic's approach with Claude Code, which eschews predefined tool calls in favor of giving the model command-line access to develop its own tools as needed. This represents a significant shift in LLMOps philosophy from heavily engineered workflows to emergent agentic behavior.
## Model Capabilities and Architectural Patterns
Sam emphasizes betting on LLMs as reasoning units rather than knowledge stores, arguing that using models for factual Q&A is "prone to hallucinations" and an "uphill battle." Instead, Delphi architectures around models interpreting situations and making decisions, then accessing skills as needed. He references Anthropic's recently released "Claude Skills" as an example—markdown files with instructions for manipulating specific file formats (PPTX, DOCX) that the agent loads when needed rather than having this knowledge embedded in the model.
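The skills pattern itself is straightforward to sketch. The snippet below shows the general idea rather than Anthropic's implementation: the agent's base prompt sees only skill names and one-line descriptions, and the full markdown instructions are loaded via a tool call only when needed. The directory layout and helper names are assumptions made for illustration.

```python
from pathlib import Path

SKILLS_DIR = Path("skills")  # hypothetical layout: skills/pptx.md, skills/docx.md, ...

def list_skills() -> list[dict]:
    """Expose only skill names and one-line descriptions to the model,
    keeping the base prompt small."""
    skills = []
    for path in sorted(SKILLS_DIR.glob("*.md")):
        first_line = path.read_text().splitlines()[0].lstrip("# ").strip()
        skills.append({"name": path.stem, "description": first_line})
    return skills

def load_skill(name: str) -> str:
    """Registered as a tool; called only when the agent decides it needs the
    full instructions (e.g. how to assemble a PPTX file)."""
    return (SKILLS_DIR / f"{name}.md").read_text()
```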
The progression from constrained to expansive architectures is evident across all three companies. In 2023, with context windows of only a few thousand tokens, companies focused on single-shot prompting for search, summarization, and data enrichment. As context windows expanded and models improved, architectures evolved to support multi-step workflows and eventually fully agentic systems. Nick emphasizes that recent months have seen models become "so good that we've truly started to build agentic style systems."
## Cost Management and Token Economics
Cost management emerges as a central concern with nuanced approaches. Sam's revelation that Delphi burned through one trillion OpenAI tokens—described as "one of the most embarrassing things to be publicly recognized for"—provides context for the scale of production LLM deployments. However, his philosophy prioritizes proving product-market fit over cost optimization, especially in consumer contexts where the primary challenge is changing user behavior rather than managing expenses.
Sam argues that in consumer applications, the focus should be on proving that people will adopt new behaviors (creating and sharing AI profiles) before optimizing costs. He acknowledges not making "unforced errors" but maintains that "cost is the last consideration" until product validation is achieved. This contrasts with the enterprise B2B approach where Nick notes they must be "conscious" of costs to avoid "drowning in infrastructure costs" by releasing products broadly before achieving the right pricing strategy.
Both panelists note that model costs have decreased approximately 100x over two years while capabilities have improved, following a trajectory better than Moore's Law. Sam argues that every time compute doubles, models get roughly 3-5x smarter. This trend informs architectural decisions—building for increased computation rather than optimizing for today's cost structure. However, for smaller companies without massive funding, Nick emphasizes running limited releases to validate ROI before scaling broadly.
The discussion touches on latency-cost tradeoffs, where Delphi uses specialized inference providers like Groq (with a q) and Cerebras that offer novel chip architectures for faster inference with open-source models. These provide better latency than flagship research lab models, though with somewhat reduced intelligence—a pragmatic tradeoff for real-time use cases. Sam specifically notes avoiding GPT-4.5 due to excessive latency despite superior capabilities.
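A rough sketch of that latency-based routing follows. Both Groq and Cerebras expose OpenAI-compatible endpoints, so a single client library can serve both paths; the base URL, model identifiers, and routing rule below are illustrative assumptions, not Delphi's actual configuration.

```python
from openai import OpenAI

# Hypothetical clients: the fast path targets a Groq-hosted open-weight model,
# the slow path a flagship model. Keys, base URL, and model ids are assumptions.
fast_client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="GROQ_API_KEY")
flagship_client = OpenAI(api_key="OPENAI_API_KEY")

def answer(prompt: str, realtime: bool) -> str:
    """Route latency-sensitive, real-time turns to a fast inference provider;
    route harder offline work to a flagship model."""
    if realtime:
        client, model = fast_client, "llama-3.3-70b-versatile"  # assumed model id
    else:
        client, model = flagship_client, "gpt-4o"                # assumed model id
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```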
## Observability, Monitoring, and Debugging
Observability infrastructure receives detailed attention as a critical LLMOps concern. Sam describes using Pydantic AI as their primary framework, praising both Pydantic as a company and their instrumentation tool Logfire. Logfire automatically instruments agent behavior and accepts OpenTelemetry records, allowing Delphi to create "full distributed traces that include all the agentic logs for all the decisions it made and the tools that it called."
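Based on that description, the setup is small. The following is a minimal sketch assuming recent versions of `logfire` and `pydantic-ai`; older releases instrument agents by passing `instrument=True` to the `Agent` constructor instead.

```python
import logfire

# One-time setup: export agent spans to Logfire. Logfire speaks OpenTelemetry,
# so the same spans can join existing distributed traces or be routed to any
# OTel-compatible backend.
logfire.configure()                # assumes LOGFIRE_TOKEN is set in the environment
logfire.instrument_pydantic_ai()   # auto-instruments agent runs, model calls, and tool calls
logfire.instrument_httpx()         # optional: also capture the underlying HTTP requests
```

After this, each agent run appears as a trace whose child spans are the individual model calls and tool invocations, which is the "full distributed trace" behavior Sam describes.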
This approach to observability reflects enterprise software best practices applied to LLM systems. Nick draws explicit parallels to Datadog for traditional software monitoring, positioning their sales/marketing agents as bringing similar observability capabilities to go-to-market functions. The ability to trace decision flows, tool calls, and reasoning steps becomes essential for debugging agentic systems that make autonomous decisions.
For Delphi's consumer-facing application, they implement a "conversations page" where users can review all interactions their digital mind has had, though recognizing users won't manually review everything. Instead, they've built an "insights dashboard" with alerts for low-confidence answers or unanswerable questions, allowing users to provide corrections that improve future responses. This creates a feedback loop where the digital profile continuously learns from user corrections.
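A hypothetical sketch of that alerting loop is below; the confidence threshold is arbitrary, and the in-memory structures stand in for Delphi's actual dashboard and knowledge store.

```python
CONFIDENCE_THRESHOLD = 0.6  # arbitrary illustrative cutoff

def triage_answer(question: str, answer: str | None, confidence: float, dashboard: list) -> None:
    """Hypothetical insights-dashboard hook: rather than asking owners to read
    every conversation, surface only the turns the profile was unsure about."""
    if answer is None or confidence < CONFIDENCE_THRESHOLD:
        dashboard.append({"question": question, "answer": answer, "status": "needs_owner_correction"})

def apply_correction(knowledge_base: dict, question: str, correction: str) -> None:
    """Owner feedback is written back to the profile's knowledge base so the
    same question is answered better next time."""
    knowledge_base[question] = correction
```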
## Security and Compliance Considerations
Dave provides unique perspective on securing AI systems while simultaneously using AI for security testing. APIsec works with government organizations with extremely high security bars including FedRAMP certification, requiring them to learn not just compliance but how to guide design partners through similar journeys. They "dogfood" their own security tools, using APIsec to secure their own environments before offering similar capabilities to customers.
An interesting technical point emerges around Model Context Protocol (MCP) and API security. Dave references a presentation showing a Scooby-Doo reveal where MCP's mask gets pulled off to reveal "API"—the point being that MCP doesn't fundamentally change security paradigms since agents still access data through standard API patterns. However, he notes that the volume of API calls from agentic systems differs significantly, requiring "multiple angles, the proactive angle, the reactive angle" rather than just traditional "watch, detect, block" approaches.
The security discussion extends to trust and determinism. Dave explains that APIsec requires "extremely high confidence" when advising organizations to activate engineering resources, making the deterministic nature of state machines valuable despite the trend toward model autonomy. This represents a key tension in LLMOps between leveraging model reasoning capabilities and maintaining sufficient control for high-stakes decisions.
## ROI Measurement and Outcome-Based Pricing
Nick provides a sophisticated framework for measuring AI ROI across three levels, explicitly ranking them by strength. **Level 1 (weakest)** is productivity increases—showing how much faster tasks complete. He argues this is weak because it's "really arbitrary" whether time savings translate to more productive activities or whether tasks should have taken that long initially. **Level 2 (stronger)** is cost savings—demonstrating that the business saves money compared to the current state. **Level 3 (strongest)** is enabling capabilities that were previously impossible, where Nick argues "you can in theory charge whatever the hell you want."
Seam AI targets Level 3 by measuring pipeline generated without human involvement as their north star metric, similar to how Uber measures rides per month or Slack measures messages per active user per day. Dave echoes this with APIsec's value proposition around eliminating engineering disruption during annual pentests—customers keep innovating without roadmap shifts when security compliance is maintained continuously rather than through disruptive annual events.
Sam adds that product success typically follows power laws where "one thing has a massive outsized impact" rather than being the sum of many features. The LLMOps challenge involves identifying and optimizing that critical flow that drives retention and value. This philosophy informs their iterative approach to features and user experience.
The discussion touches on outcome-based pricing as an evolution from traditional seat-based or platform licensing. Nick notes this becomes compelling for enterprises because generating pipeline without human time investment represents net-new capability. However, Sam references research showing that 95% of paid AI pilots fail, with the successful 5% distinguished by having clear outcomes defined at the pilot's start—emphasizing the importance of outcome definition in AI deployments.
## Framework and Tooling Decisions
When asked about navigating the complex landscape of agent frameworks, Nick provides practical guidance starting with Claude Code and Claude Desktop as entry points for understanding agentic behavior. He recommends Anthropic's paper "Building Effective Agents" as the best starting point for developers. The panelists emphasize a "Raptor engine" metaphor from SpaceX—starting with rough prototypes that barely work (Raptor 1), iterating to functional but unpolished versions (Raptor 2), and eventually reaching clean, production-ready implementations (Raptor 3).
Sam's specific tooling recommendation focuses on Pydantic AI, emphasizing that agents are fundamentally "just tools in a loop" with a system prompt and capabilities (functions, LLM calls, API calls, MCP servers) that the model repeatedly evaluates until completion. He advocates for Postgres over graph databases as a way to avoid overcomplicating infrastructure choices, though Delphi itself uses knowledge graphs for representing personal information.
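In Pydantic AI terms, that "tools in a loop" framing looks roughly like the sketch below; the model id, system prompt, and tool bodies are illustrative assumptions rather than Delphi's production code.

```python
from pydantic_ai import Agent

# A minimal "tools in a loop": a system prompt plus callable tools that the
# model keeps invoking until it decides the task is complete.
agent = Agent(
    "openai:gpt-4o",
    system_prompt="You answer questions as this user's digital profile.",
)

@agent.tool_plain
def lookup_profile_facts(topic: str) -> str:
    """Hypothetical retrieval over the user's knowledge graph."""
    return f"Stored notes about {topic}: ..."

@agent.tool_plain
def flag_low_confidence(question: str) -> str:
    """Hypothetical escalation hook for questions the profile can't answer."""
    return "flagged for the insights dashboard"

result = agent.run_sync("What's my take on outcome-based pricing?")
print(result.output)  # .data in older pydantic-ai releases
```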
The vector database discussion reveals practical production considerations. When asked about vector search versus graph databases, Sam recommends not overcomplicating infrastructure and praises Pinecone specifically for its serverless offering, which makes creating a new index simple. Both Sam and Nick emphasize Pinecone's reliability in production—they've used it for three years, rarely think about it, rarely open the dashboard, and it "just works" with fast performance and minimal downtime. Sam also mentions that costs have decreased since they started, aligning with their need for fast, inexpensive, and reliable infrastructure.
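For reference, creating a serverless index with the current Pinecone Python SDK takes only a few lines; the index name, dimension, and region here are illustrative, not Delphi's actual settings.

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="PINECONE_API_KEY")

# has_index() is available in recent SDK versions; create_index errors if the
# index already exists, so guard the call.
if not pc.has_index("profile-memories"):
    pc.create_index(
        name="profile-memories",
        dimension=1536,            # must match the embedding model in use
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

index = pc.Index("profile-memories")
index.upsert(vectors=[("mem-1", [0.0] * 1536, {"source": "podcast"})])
```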
## Product Development Philosophy and Customer Feedback
A philosophical debate emerges around customer input in product development. Sam provocatively argues against "talking to your customers" when building something new, using Christopher Nolan as an analogy—he doesn't do audience previews before shooting films. Sam advocates for "shoot the movie," their internal mantra from lead investor Keith Rabois (former COO of Square). The argument centers on people not knowing what they want when a product represents genuinely new behavior, requiring companies to "show them what they want."
However, Stanley Tang (DoorDash co-founder, appearing in Delphi's "Library of Minds" interactive podcast) challenges this, emphasizing the need to "read between the lines of what they're asking" rather than taking customer requests literally. Sam concedes the nuance—put out a product, observe reactions, understand needed changes, but don't ask customers what to build initially. This reflects the tension between vision-driven and customer-driven development in AI products.
Nick emphasizes building deeper into use cases rather than narrow features, countering traditional advice to "build something super narrow that almost looks like a feature." He argues that because building software has become "really easy" with AI assistance, feature-oriented thinking leads to wide competition. Instead, Seam AI focuses on end-to-end automation of complete workflows rather than individual steps, combining observability (monitoring for in-market buying signals) with action (running campaigns automatically). This depth creates stronger product-market fit and enables outcome-based ROI measurement.
## Fine-Tuning and Model Customization
Sam provides clear guidance against fine-tuning for most use cases, recommending it only for "very specific use cases where it's narrow and you're trying to save money or latency and you want to have a fixed cost." He argues that general-purpose tasks typically become better, cheaper, and faster through base model improvements before fine-tuning efforts complete. Additionally, fine-tuning often "retards [the model's] ability to do other things and instruction following more generally," reducing flexibility.
This anti-fine-tuning stance represents a broader LLMOps philosophy: betting on continued model improvements rather than investing in model customization. The approach assumes that foundation model capabilities will continue scaling with compute and data, making investment in model training less valuable than investment in architecture, tooling, and data pipelines. None of the three companies reported using fine-tuning or reinforcement learning in production, despite operating at significant scale.
## User Control and Determinism Tradeoffs
Tatiana (from Waypoint, building an AI agent supervisor) raises critical questions about business user control over agent behavior, particularly around brand representation. She notes enterprise customers, especially business users rather than engineers, express fear about AI as a "black box" and want direct control over agent outputs without requiring engineering intervention.
Nick acknowledges building extensive observability showing what actions agents take with approval workflows and copy review, but describes this as temporary scaffolding. He references the "config" joke from Silicon Valley where Pied Piper's interface becomes overwhelmed with thousands of configuration buttons—engineers love config, but it doesn't represent the eventual product vision. As trust builds through demonstrated outcomes, they expect to "remove a lot of the UI from the product over time" with users having less control but greater trust.
Sam describes Delphi's approach with conversation logs, insights dashboards, and alerts for low-confidence or unanswerable questions. Users can provide feedback that updates their digital profile's knowledge base, creating learning loops. However, the ideal state involves training the profile through natural conversation—asking questions, critiquing answers, explaining corrections—rather than manual configuration. This reflects the broader tension in LLMOps between control/determinism and autonomy/intelligence.
Dave provides the counterpoint from APIsec's security perspective where determinism remains critical because incorrect recommendations waste organizational resources. This highlights that the optimal balance between control and autonomy varies by use case—consumer applications and marketing automation can tolerate more autonomy, while security recommendations and compliance decisions require higher confidence thresholds and potentially more deterministic architectures.
## Scaling Challenges and Future Directions
The companies describe various scaling challenges beyond pure computational costs. Sam envisions a future where inference providers like Cerebras deliver on the order of 4,000 tokens per second, combined with "a million skills" (specialized markdown instruction files), with the model "just making a ton of decisions about what it should be doing." This represents scaling through massive parallelization of autonomous decision-making rather than monolithic model scaling.
Nick describes the evolution from building "systems for humans to click through" to "building systems for systems"—AI agents that interact with other agents rather than human users. This fundamental shift in product design thinking affects everything from UI/UX decisions to API design to observability requirements. The implication is that current production LLM systems represent an intermediate state between human-centric and fully autonomous system-to-system interaction.
All three companies emphasize the competitive advantage of being "AI-native" rather than retrofitting existing platforms. Nick explicitly contrasts Seam AI's greenfield AI-native architecture with Salesforce's "retrofitting" approach, though he jokes about "tearing down the Salesforce tower brick by brick." The fundamental advantage comes from designing for agentic behavior from inception rather than adapting systems built for human workflows.
The panel concludes with practical questions about agent definitions and terminology, acknowledging that "agents" has become somewhat meaningless marketing terminology. The technical definition they converge on involves "tools in a loop" with actions taken based on LLM outputs, distinguishing this from single-shot prompts or deterministic workflows with embedded LLM steps. The key differentiator is whether outputs trigger autonomous actions rather than simply presenting information to humans.
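Stripped of any framework, that definition reduces to a short loop. In the sketch below, `llm_step` and `tools` are hypothetical callables standing in for whichever provider's tool-calling API is in use; the contrast with a single-shot prompt is that the loop keeps acting on the model's outputs until it signals completion.

```python
from typing import Callable

def run_agent(llm_step: Callable[[list], dict], tools: dict[str, Callable], task: str, max_turns: int = 10) -> str:
    """A framework-free rendering of "tools in a loop": each turn the model
    either requests a tool call or returns a final answer."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        decision = llm_step(history)     # e.g. {"tool": "send_email", "args": {...}} or {"final": "..."}
        if "final" in decision:
            return decision["final"]     # a single-shot prompt would stop here after one pass
        observation = tools[decision["tool"]](**decision["args"])
        history.append({"role": "tool", "content": str(observation)})
    return "stopped: turn limit reached"
```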