Company
OpenAI / Various
Title
Building Production AI Products: A Framework for Continuous Calibration and Development
Industry
Tech
Year
2026
Summary (short)
AI practitioners Aishwarya Raanti and Kiti Bottom, who have collectively supported over 50 AI product deployments across major tech companies and enterprises, present their framework for successfully building AI products in production. They identify that building AI products differs fundamentally from traditional software due to non-determinism on both input and output sides, and the agency-control tradeoff inherent in autonomous systems. Their solution involves a phased approach called Continuous Calibration Continuous Development (CCCD), which recommends starting with high human control and low AI agency, then gradually increasing autonomy as trust is built through behavior calibration. This iterative methodology, combined with a balanced approach to evaluation metrics and production monitoring, has helped companies avoid common pitfalls like premature full automation, inadequate reliability, and user trust erosion.
## Overview

This case study presents comprehensive insights from AI practitioners Aishwarya Raanti and Kiti Bottom, who bring extensive experience from organizations including OpenAI, Google, Amazon, Databricks, and Kumo. Having led or supported over 50 AI product deployments and teaching the top-rated AI course on Maven, they provide a practitioner's perspective on the operational challenges of deploying LLMs in production environments.

The discussion centers on their observations from working with enterprises throughout 2025 and early 2026, revealing evolving patterns in how organizations approach AI product development. The practitioners note a significant shift from 2024 to 2025: initial skepticism has largely dissipated, and companies have moved beyond simple "chatbot on your data" implementations to fundamentally rethinking user experiences and workflows. However, execution remains highly variable as teams navigate the three-year-old field without established playbooks or textbooks. A critical observation is that traditional role boundaries between product managers, engineers, and data professionals have broken down, requiring tighter collaboration as teams now jointly examine agent traces and make product behavior decisions together.

## Fundamental Differences: Non-Determinism and Agency-Control Tradeoffs

The practitioners emphasize two fundamental differences that distinguish AI product development from traditional software engineering. The first is non-determinism operating on both the input and output sides. Unlike traditional software such as booking.com, where user intentions flow through well-mapped decision engines via predictable button clicks and forms, AI products replace this layer with fluid natural language interfaces. This creates uncertainty on the input side, since users can express intentions in countless ways, and on the output side, where LLMs are probabilistic APIs that are sensitive to prompt phrasing and operate as black boxes. The result is working with unpredictable user behavior, unpredictable model responses, and opaque internal processes simultaneously.

The second critical difference is the agency-control tradeoff, which they note is surprisingly underexplored despite the widespread obsession with autonomous agents. Every increment of decision-making capability granted to an AI system necessarily reduces human control. This creates a trust calibration challenge: systems must earn the reliability required before being granted increased autonomy. The practitioners argue this fundamentally changes product development strategy, requiring deliberate progression from constrained to autonomous operation rather than attempting full automation from day one.

## The CCCD Framework: Continuous Calibration Continuous Development

Their solution is a structured framework called Continuous Calibration Continuous Development (CCCD), explicitly designed to build AI products iteratively while minimizing risk to user experience and trust. The framework operates as a loop with two main phases.

The Continuous Development phase involves scoping capabilities, curating datasets of expected inputs and outputs, setting up the application, and designing evaluation metrics. This initial dataset serves as crucial alignment documentation, often revealing that team members hold different expectations for product behavior before any code is written.

The Continuous Calibration phase addresses behaviors that emerge only in production. After deployment, teams analyze behavior patterns, spot emerging error modes, apply fixes for observed issues, and, crucially, design new evaluation metrics for previously unpredicted failure patterns. Not all errors warrant new evaluation metrics; spot issues like poorly defined tool schemas can simply be fixed and forgotten. The framework emphasizes that evaluation metrics catch only errors teams already know about, while production monitoring reveals novel failure patterns that require behavioral recalibration.
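The CCCD loop is a process rather than a piece of software, but a small sketch can make the mechanics concrete. The following Python snippet is purely illustrative and assumes a toy ticket-routing application with exact-match scoring; the names (`EvalCase`, `EvalSuite`, `route_ticket`) are hypothetical and do not come from the practitioners' tooling. It shows the Continuous Development artifact (a curated dataset of expected inputs and outputs) and the Continuous Calibration step that promotes a failure observed in production into a permanently tracked evaluation case.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalCase:
    """One curated example of expected product behavior."""
    user_input: str
    expected_behavior: str          # e.g. the department a ticket should route to
    source: str = "design"          # "design" (pre-launch) or "production" (calibration)

@dataclass
class EvalSuite:
    cases: list[EvalCase] = field(default_factory=list)

    def run(self, app: Callable[[str], str]) -> float:
        """Continuous Development: score the current app against known expectations."""
        passed = sum(app(c.user_input) == c.expected_behavior for c in self.cases)
        return passed / len(self.cases) if self.cases else 0.0

    def calibrate(self, observed_input: str, correct_behavior: str) -> None:
        """Continuous Calibration: promote a recurring production failure into a tracked case."""
        self.cases.append(EvalCase(observed_input, correct_behavior, source="production"))

# Hypothetical stand-in for the LLM-backed application under test.
def route_ticket(text: str) -> str:
    return "billing" if "refund" in text.lower() else "general"

suite = EvalSuite([
    EvalCase("I want a refund for my order", "billing"),
    EvalCase("How do I reset my password?", "general"),
])
print(f"pre-launch pass rate: {suite.run(route_ticket):.0%}")

# A failure mode spotted in production monitoring becomes a permanent eval case.
suite.calibrate("My card was charged twice", "billing")
print(f"pass rate with calibrated cases: {suite.run(route_ticket):.0%}")
```

In a real system the exact-match check would typically be replaced by richer assertions or an LLM judge, but the shape of the loop stays the same: ship, observe, encode what you learned, and measure it from then on.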
## Progressive Agency Implementation

Central to the CCCD approach is starting with minimal agency and high human control, then progressively increasing autonomy. The practitioners provide concrete examples across multiple domains:

- Coding assistants: V1 might suggest inline completions and boilerplate snippets, V2 generates larger blocks like tests or refactors for human review, and V3 autonomously applies changes and opens pull requests.
- Marketing assistants: V1 drafts email or social copy, V2 builds and runs multi-step campaigns for approval, and V3 launches campaigns with autonomous A/B testing and cross-channel optimization.

They illustrate this with an extended customer support example that reveals real-world complexity. V1 focuses solely on routing tickets to appropriate departments. While this may seem trivial, the practitioners explain that enterprise taxonomies are often hierarchically messy, with "shoes," "women's shoes," and "men's shoes" appearing at the same level, alongside redundant deprecated nodes like "for women" and "for men" not updated since 2019. Human agents know implicit rules, like checking last-update dates to identify dead nodes, that aren't documented anywhere. Starting with routing forces teams to discover and remediate these data quality issues while maintaining high human control, since misrouted tickets can be easily corrected.

V2 transitions to a copilot model where the system suggests responses based on standard operating procedures, generating drafts that humans can modify. This phase provides free error analysis by logging how much of each draft is used versus modified, creating data for the improvement flywheel. V3 reaches end-to-end resolution, where the system drafts responses and closes tickets autonomously, but only after demonstrating that humans rarely modify the drafts, indicating readiness for increased autonomy.
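The V2 copilot phase generates the data that justifies (or delays) the move to V3, and one way to operationalize "humans rarely modify the drafts" is to score each logged draft against the reply that was actually sent. The sketch below is an assumption-laden illustration rather than the practitioners' method: it uses `difflib` similarity as a stand-in for "fraction of the draft that survived," and both thresholds are arbitrary placeholders a real team would set from business risk.

```python
import difflib

def draft_acceptance(draft: str, final: str) -> float:
    """Approximate fraction of the AI draft that survives into the human's sent reply (0..1)."""
    return difflib.SequenceMatcher(None, draft, final).ratio()

def ready_for_autonomy(samples: list[tuple[str, str]],
                       acceptance_threshold: float = 0.95,
                       required_share: float = 0.9) -> bool:
    """Hypothetical V2 -> V3 gate: graduate only if most drafts go out nearly unmodified.

    Real gates would also weigh error severity, ticket category, and business risk.
    """
    scores = [draft_acceptance(d, f) for d, f in samples]
    untouched_share = sum(s >= acceptance_threshold for s in scores) / len(scores)
    return untouched_share >= required_share

# Hypothetical logged (draft, final_sent_reply) pairs from the copilot phase.
logged = [
    ("Your refund was issued today.", "Your refund was issued today."),
    ("Please reset your password here.", "Please reset your password via the link below."),
]
print("graduate to V3:", ready_for_autonomy(logged))
```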
## Real-World Implementation Challenges

The practitioners share candid implementation experiences that highlight operational realities. They describe building an end-to-end customer support agent that eventually had to be shut down because hot-fixing cycles became uncontrollable: starting with full autonomy meant encountering numerous emerging problems simultaneously, without systematic methods to address them. They also reference the Air Canada incident, where an agent hallucinated a refund policy the company then had to legally honor, exemplifying the risks of premature autonomy.

In the underwriting domain, they built a system helping underwriters extract policies from 30-40 page loan applications. After three months of success and reported time savings, user behavior evolved unexpectedly. Underwriters grew confident enough to ask complex questions like "for a case like this, what did previous underwriters do?" While this seems like a natural extension to users, it requires fundamentally different system capabilities: parsing what "a case like this" means contextually (income range? geography?), retrieving historical documents, analyzing them, and synthesizing comparative responses rather than performing simple policy lookups. This illustrates how the evolution of user behavior necessitates continuous recalibration even after initial success.

## Evaluation Strategy: Beyond the False Dichotomy

The practitioners reject the false dichotomy between evaluation-centric and production-monitoring-centric approaches, arguing that both are essential and serve different purposes. Evaluations embody product knowledge encoded into trusted datasets representing critical behaviors the system must handle correctly. Production monitoring captures actual customer usage through both explicit signals like thumbs up/down and implicit signals like answer regeneration, which indicates the initial response failed to meet expectations.

Their philosophy emphasizes that evaluation alone cannot predict the diverse ways systems might fail in production. For high-transaction applications, examining every trace is impractical, making production monitoring essential for identifying which traces warrant detailed examination. Once failure patterns are identified through monitoring, teams determine whether they're significant enough to warrant new evaluation datasets. This creates a feedback loop where evaluation datasets expand based on production learnings rather than attempting comprehensive upfront coverage.

They note that semantic diffusion has rendered the term "evals" nearly meaningless: data labeling companies claim their experts write evals when they mean error analysis notes, product managers are told evals are the new PRDs, and practitioners use the term for everything from LLM judges to model benchmarks. One client claimed to "do evals" but simply checked LM Arena and Artificial Analysis benchmarks, which are model evaluations, not application evaluations. Despite the terminological confusion, they assert all practitioners agree on the importance of actionable feedback loops, with implementation details varying by application context.
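As a concrete illustration of that feedback loop, the sketch below shows how explicit and implicit signals might be used to shortlist traces for human review and to flag recurring failure themes as candidates for new evaluation datasets. All field names and the `intent` labeling step are hypothetical assumptions; a production system would wire this to its own tracing and feedback infrastructure.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Trace:
    """A single production interaction with whatever feedback signals were captured."""
    trace_id: str
    user_input: str
    thumbs_down: bool = False      # explicit signal
    regenerated: bool = False      # implicit signal: the user asked for another answer
    intent: str = "unknown"        # hypothetical coarse label from a cheap classifier

def traces_worth_reviewing(traces: list[Trace]) -> list[Trace]:
    """Production monitoring: surface the small subset of traces a human should inspect."""
    return [t for t in traces if t.thumbs_down or t.regenerated]

def emerging_failure_patterns(flagged: list[Trace], min_count: int = 2) -> list[str]:
    """Group flagged traces by intent; recurring groups are candidates for new eval datasets."""
    counts = Counter(t.intent for t in flagged)
    return [intent for intent, n in counts.items() if n >= min_count]

traces = [
    Trace("t1", "compare this case to past approvals", regenerated=True, intent="historical_comparison"),
    Trace("t2", "what did underwriters do for similar incomes?", thumbs_down=True, intent="historical_comparison"),
    Trace("t3", "summarize page 12", intent="summarization"),
]
flagged = traces_worth_reviewing(traces)
print("review queue:", [t.trace_id for t in flagged])
print("promote to eval dataset:", emerging_failure_patterns(flagged))
```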
## Codex Team Practices at OpenAI

Kiti provides specific insights from the Codex team, which takes a balanced approach combining evaluations with intense customer listening. Coding agents present unique challenges because they're built for customizability across diverse integrations, tools, and workflows rather than solving a fixed set of top-N workflows. This makes comprehensive evaluation datasets impractical, since the space of possible customer interactions is vast. However, the team maintains evaluations for core functionality to prevent regressions when making changes.

For their code review product, which has gained significant traction internally and externally for catching bugs, they emphasize A/B testing when making model changes or applying new reinforcement learning mechanisms. They monitor whether code reviews correctly identify mistakes and track user reactions, including extreme signals like users disabling the product entirely when annoyed by incorrect reviews. The team actively monitors social media for customer problems and responds rapidly. Additionally, each engineer maintains personal "evals": collections of hard problems they test against new models during launches to understand capability progression. This distributed approach to quality assurance complements the formal evaluation infrastructure.

## Organizational Success Factors

The practitioners describe successful AI adoption as a triangle with three dimensions: leadership, culture, and technical execution.

Leadership effectiveness strongly predicts success, particularly leaders' willingness to rebuild intuitions earned over 10-15 years. They cite the Rackspace CEO, who blocks 4-6 AM daily for "catching up with AI" without meetings, plus weekend white-boarding sessions. This isn't about implementation; it's about rebuilding intuitions to guide decisions, which requires the vulnerability to accept that established instincts may not apply and to be "the dumbest person in the room," willing to learn from everyone. Top-down support is essential, since bottom-up approaches fail when leaders lack trust in the technology or hold misaligned expectations about its capabilities.

Culture must emphasize empowerment over replacement fear. Subject matter experts are crucial for AI product success but often resist engagement when they perceive job threats. Successful organizations frame AI as augmenting capabilities, promising 10X productivity and new opportunities rather than replacement, and that tone has to be set by leadership.

The third dimension is technical execution, where successful teams obsess over understanding workflows to identify which segments are appropriate for automation and which require a human in the loop. They recognize that most automation combines machine learning models, LLM capabilities, and deterministic code rather than pure AI solutions, choosing appropriate tools for each workflow component.

## Production Lifecycle Considerations

The practitioners emphasize that reliable AI products typically require four to six months of work even with optimal data and infrastructure layers. They're explicitly skeptical of vendors promising "one-click agents" that deliver significant ROI immediately, calling this pure marketing. The challenge isn't model capability but enterprise data messiness: functions named "get_customer_data_v1" and "get_customer_data_v2" coexisting, accumulated tech debt, and complex implicit rules. Agents need time to learn these system quirks, much as human employees require onboarding.

They describe the completion indicator for calibration as minimized surprise: when daily or bi-daily calibration sessions reveal no new data distribution patterns and consistent user behavior, information gain approaches zero, signaling readiness for the next autonomy stage. However, calibration can be disrupted by events like model changes; a GPT-4 deprecation forcing migration to GPT-4.5, which has different properties, requires recalibration. User behavior also evolves as capabilities expand: users excited by solving one task naturally attempt to apply the system to adjacent problems without realizing the architectural changes required.

## Technical Patterns and Anti-Patterns

On multi-agent systems, Kiti argues the concept is misunderstood rather than overhyped. Many teams approach complex problems by decomposing them into sub-agent responsibilities and expecting coordinated success through "gossip protocol" peer-to-peer communication. This rarely succeeds with current capabilities and tooling. Successful patterns include supervisor-subagent hierarchies, where a supervisor coordinates specialized workers, or single agents with internal task decomposition. Peer-to-peer agent communication, especially in customer-facing scenarios, creates control problems because guardrails must be distributed everywhere without clear accountability for which agent responds to customers.
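A minimal sketch of the supervisor-subagent shape they favor is shown below, with hard-coded routing standing in for LLM calls. Everything here (the agent names, the keyword routing, the guardrail check) is a hypothetical illustration of the pattern rather than an implementation from the discussion; the point is structural: one coordinator delegates to specialized workers, and the guardrails and customer-facing reply live in exactly one place.

```python
from typing import Callable

# Hypothetical specialized sub-agents; in practice each would wrap an LLM call plus tools.
def billing_agent(task: str) -> str:
    return f"[billing] resolved: {task}"

def shipping_agent(task: str) -> str:
    return f"[shipping] resolved: {task}"

SUBAGENTS: dict[str, Callable[[str], str]] = {
    "billing": billing_agent,
    "shipping": shipping_agent,
}

def supervisor(request: str) -> str:
    """Supervisor-subagent pattern: one coordinator decomposes the request, delegates,
    applies guardrails, and is the only component that answers the customer.

    Routing here is keyword-based purely for illustration; a real supervisor would use
    an LLM with a constrained agent/tool-selection step.
    """
    route = "billing" if "charge" in request.lower() or "refund" in request.lower() else "shipping"
    draft = SUBAGENTS[route](request)
    # Centralized guardrail: a single place to enforce policy before anything reaches the user.
    if "refund approved" in draft.lower():
        return "Escalating to a human agent for refund approval."
    return draft

print(supervisor("I was charged twice for my order"))
```

Compared with peer-to-peer "gossip" between agents, this keeps accountability unambiguous: if the customer sees a bad answer, there is exactly one component to inspect.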
They advocate for problem-first rather than tool-first thinking, noting that it's easy to obsess over solution complexity while forgetting the problem being solved. Starting with constrained autonomy forces clarity in the problem definition. They emphasize that building has become cheap while design has become expensive: really understanding the problem, and whether AI meaningfully addresses the pain points, is increasingly valuable. The field's rapid tool proliferation tempts constant experimentation, but deep problem understanding matters more than tool proficiency.

## Future Outlook and Emerging Capabilities

Looking toward late 2026, Kiti anticipates significant progress in proactive background agents. Current limitations on AI value stem largely from insufficient context, which results from systems not being plugged into where actual work happens. As agents gain access to more workflow context and understand the metrics users optimize for and the activities they perform, the natural extension is agents prompting users rather than the other way around. ChatGPT Pulse already provides daily updates on potentially interesting topics, but this extends to complex tasks, such as a coding agent reporting "I've fixed five Linear tickets, here are patches for your morning review" or alerting about a potential site crash based on detected patterns.

Ash emphasizes multimodal experiences as the key development for 2026. While language is a late evolutionary development for humans, people are fundamentally multimodal creatures, constantly processing signals like head nods or boredom indicators that influence conversation direction, a "chain of thought behind your chain of thought." Current AI products haven't explored this dimension of expression well. Better multimodal understanding promises humanlike conversational richness and unlocks numerous mundane but valuable use cases involving handwritten documents and messy PDFs that current models cannot parse, representing vast untapped data sources.

## Career Development in the AI Era

For skill development, Ash emphasizes that implementation mechanics will become ridiculously cheap in the coming years, making judgment, taste, and design the career differentiators. The early-career phase focused on execution mechanics is rapidly shortening as AI helps people ramp up quickly. She cites a recent hire who replaced the team's expensive task-tracking subscription with a white-label custom app he built and brought to meetings, demonstrating the agency and ownership to rethink experiences. This generation associates low cost with building and enthusiastically tries new tools, though this also contributes to AI product retention problems as users constantly experiment with alternatives. The era of busy work is ending: people can no longer sit in a corner doing work that doesn't move the company's needles, which forces a focus on end-to-end workflows and impact.

Kiti emphasizes persistence as the new moat. With information at everyone's fingertips and the ability to learn almost anything overnight, competitive advantage comes from enduring the pain of learning, implementing, and understanding what works versus what doesn't. Companies that succeeded in new areas did so not by being first or by having the most popular features, but by systematically identifying non-negotiable requirements and trading them off against available model capabilities and features. Since no textbook provides this path, the organizational knowledge built through painful iteration becomes the defensible moat: converting coal into diamonds through sustained pressure.

## Practical Recommendations

The practitioners offer several practical recommendations for teams:
- Obsess over customers and problems rather than technology; AI is merely a tool. Self-described AI engineers and AI PMs spend 80% of their time understanding workflows and user behavior, not building fancy models. Software engineers new to AI often find "look at your data" revelatory, but it has always been essential: understanding users and data creates huge differentiation.
- Be deliberate about when to graduate between autonomy levels. There is no single right answer, but the goal is to minimize surprise while building improvement flywheels without ruining the customer experience.
- Recognize that comprehensive upfront evaluation datasets are impossible; emerging patterns only surface in production, so both evaluation and monitoring strategies are required.
- Ensure leadership engages deeply with the technology through hands-on experience and curated information sources rather than relying on outdated intuitions.
- Build a culture of empowerment in which subject matter experts engage without fearing replacement, enabling the tight collaboration essential for calibrating AI system behavior.

The framework and practices they describe reflect hard-won lessons from production deployments across diverse enterprise contexts, providing a pragmatic counterweight to hype-driven approaches that promise immediate autonomous solutions. Their emphasis on iterative trust-building, behavioral calibration, and tight integration of evaluation with production monitoring offers a mature perspective on the operational realities of production LLM systems in 2025-2026.
