OpenAI's Codex team demonstrates how they built and operate a production AI coding agent platform that enables developers to delegate complex software development tasks to LLMs. The team leverages their own product extensively in development, with designers writing more code than engineers did six months prior, and product managers submitting PRs directly. The solution includes multiple model tiers (GPT-5.4 for complex tasks, Codex Spark for rapid iteration at 1,200 tokens/second), a multi-agent architecture that allows parallel task execution, and an open-source harness that powers CLI, IDE extensions, and a standalone app. Results include 20-30x user growth in months, adoption across OpenAI internally as a primary development tool, and a development workflow where specs are minimal (around 10 bullets) with emphasis on rapid prototyping and community-driven iteration.
This case study provides deep insight into how OpenAI’s Codex team builds and operates a production AI coding agent platform while simultaneously using that same platform to develop itself. The discussion features Alex and Roman from the Codex team explaining their approach to building features, shipping continuously, and managing the full lifecycle of LLM-powered coding agents in production.
The Codex platform represents a sophisticated LLMOps implementation where the team has operationalized multiple LLMs for code generation and task automation. The platform serves both external developers and internal OpenAI teams, with the vast majority of OpenAI employees now using the Codex app as their primary development interface, including non-technical users.
A core aspect of the Codex LLMOps implementation is the strategic deployment of multiple models with different performance characteristics. The team operates a tiered model approach where GPT-5.4 serves as the frontier model for handling complex tasks involving millions of lines of code analysis or large-scale migrations. For interactive development and rapid iteration, they deploy Codex Spark, which achieves approximately 1,200 tokens per second and enables near-instantaneous feedback loops.
The team emphasizes that model selection is contextual and user-driven. Developers can choose between models based on task complexity and desired speed. The demonstration shows both models running in parallel on the same prompt, with GPT-5.4 given a head start but still being outpaced by Codex Spark for simpler tasks. This dual-model strategy reflects sophisticated LLMOps thinking about matching model capabilities to specific use cases in production.
The platform also implements what they call “fast mode,” which appears to be an intermediate option between the frontier model and Codex Spark, providing users with granular control over the speed-quality tradeoff. This flexibility is critical for maintaining developer flow states while ensuring complex tasks receive appropriate model capacity.
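A minimal sketch of what this tiering could look like as routing logic follows below; the model identifiers and the complexity heuristic are illustrative assumptions, since the discussion does not detail how Codex actually selects between tiers.

```python
from dataclasses import dataclass

# Hypothetical model identifiers -- illustrative only, not actual API names.
FRONTIER = "gpt-5.4-codex"       # strongest reasoning, slower
FAST = "gpt-5.4-codex-fast"      # intermediate "fast mode" tier
SPARK = "codex-spark"            # ~1,200 tokens/sec for rapid iteration

@dataclass
class Task:
    prompt: str
    files_touched: int
    interactive: bool  # user is waiting in a tight feedback loop

def pick_model(task: Task) -> str:
    """Route a task to a model tier based on complexity and latency needs."""
    if task.files_touched > 50:   # large migration / codebase-wide analysis
        return FRONTIER
    if task.interactive:          # keep the developer in flow
        return SPARK
    return FAST                   # default middle ground

print(pick_model(Task("rename this variable", files_touched=1, interactive=True)))
# -> codex-spark
```

In the product itself, this choice is surfaced to the user rather than made silently, but the same speed-quality tradeoff is what the tiers encode.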
The Codex platform is fundamentally designed around agentic task delegation, representing one of the more advanced implementations of multi-agent LLM systems in production. The architecture evolved in response to model capability improvements, particularly around the GPT-5.2 release in December, when models became sufficiently reliable to work autonomously for hours or even days.
The system supports parallel execution of multiple agents, each working on independent tasks. The team explicitly designed their application to dissociate from specific workspaces, enabling users to delegate tasks across multiple codebases or projects simultaneously. This architectural decision emerged from their long-term vision that developers would need to manage multiple agents working independently in the cloud.
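Conceptually, this is a fan-out over independent tasks. The sketch below shows the shape using Python's asyncio; the `run_agent` function is a hypothetical stand-in for the harness's actual agent loop.

```python
import asyncio

async def run_agent(workspace: str, task: str) -> str:
    """Stand-in for a real agent run: attach to a workspace, loop over
    model and tool calls, and return a summary when done."""
    await asyncio.sleep(1)  # placeholder for hours of autonomous work
    return f"[{workspace}] done: {task}"

async def main() -> None:
    # Independent tasks fan out across different codebases and run concurrently.
    results = await asyncio.gather(
        run_agent("web-frontend", "migrate buttons to the new design system"),
        run_agent("api-server", "add rate limiting to the auth endpoints"),
        run_agent("mobile-app", "fix crash on cold start"),
    )
    for result in results:
        print(result)

asyncio.run(main())
```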
The platform includes sophisticated orchestration where a single agent can coordinate multiple sub-agents. While this sub-agent functionality exists in the codebase and is being explored by power users, the team hasn’t yet made it a primary product feature, demonstrating their staged rollout approach for advanced capabilities.
Plan mode represents another dimension of the agentic architecture. When users begin planning tasks, the system automatically detects this intent and enters a specialized planning mode (triggered by shift-tab). In this mode, the agent analyzes the existing codebase, proposes next steps, and engages in interactive dialog to refine the plan before execution. This shows sophisticated prompt routing and mode-switching in production.
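One plausible way to model the mode switch, assuming plan mode mainly changes the agent's available tools and that intent is detected either from the keyboard toggle or from the prompt itself (the hint list and tool names below are invented for illustration):

```python
from enum import Enum

class Mode(Enum):
    EXECUTE = "execute"
    PLAN = "plan"

PLANNING_HINTS = ("plan", "how should we", "what's the best approach")

def detect_mode(user_input: str, shift_tab_pressed: bool) -> Mode:
    """Enter plan mode on an explicit toggle or when the prompt reads like planning."""
    if shift_tab_pressed or any(h in user_input.lower() for h in PLANNING_HINTS):
        return Mode.PLAN
    return Mode.EXECUTE

def tools_for(mode: Mode) -> list[str]:
    # In plan mode the agent only reads and proposes; edits are gated
    # behind execute mode.
    read_only = ["read_file", "search_code", "list_directory"]
    return read_only if mode is Mode.PLAN else read_only + ["edit_file", "run_command"]

mode = detect_mode("plan out the migration to the new router", shift_tab_pressed=False)
print(mode, tools_for(mode))
```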
A critical LLMOps component of Codex is the skills system, which extends the platform’s capabilities through structured tool integrations. Skills enable the coding agents to interact with external systems and services, effectively functioning as a production implementation of tool-calling for LLMs.
The platform includes native integrations with design tools like Figma, allowing agents to pull design specifications, React components, and design variables directly from Figma files and implement them in code. Deployment skills connect to platforms like Vercel, Cloudflare, and Render, enabling agents to deploy applications autonomously.
Project management integrations demonstrate the breadth of the skills ecosystem. Users can instruct Codex to create tasks in Linear, a project management tool, and the agent can subsequently implement those tasks and mark them complete autonomously. One user anecdote describes telling Codex to write all tasks to Linear, implement them overnight, and cross them off, showcasing end-to-end autonomous operation.
The skills architecture appears designed for extensibility, with the team emphasizing the ecosystem of tasks that developers can access. This suggests a plugin or marketplace model where skills can be developed and shared, though the details about the technical implementation of skill creation are not explicitly covered.
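Under that assumption, a skill might reduce to a small manifest plus a callable entry point that the agent can discover and invoke. The `Skill` shape and the Vercel handler below are hypothetical, since the discussion doesn't cover skill authoring:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Skill:
    """Hypothetical skill shape: metadata the agent reads, plus an entry point."""
    name: str
    description: str            # surfaced to the model so it knows when to use the skill
    run: Callable[[dict], str]  # invoked when the model emits a matching tool call

def deploy_to_vercel(args: dict) -> str:
    # Placeholder: a real skill would call the provider's API with stored credentials.
    return f"deployed {args['project']} to preview URL https://example.vercel.app"

REGISTRY = {
    s.name: s
    for s in [Skill("deploy_vercel", "Deploy the current project to Vercel.", deploy_to_vercel)]
}

# The agent dispatches to a skill by name when the model asks for it.
print(REGISTRY["deploy_vercel"].run({"project": "my-app"}))
```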
The Codex platform demonstrates sophisticated LLMOps architecture through a shared core harness written in Rust that powers multiple user interfaces. This open-source harness serves as the foundational layer for the CLI, IDE extensions (supporting VS Code, Cursor, Windsurf, and others), and the standalone Codex app.
This architectural pattern provides several LLMOps advantages. Code sharing across interfaces ensures consistent agent behavior and capabilities regardless of entry point. The open-source nature enables the community to contribute, fork, and experiment with cutting-edge features before they reach production. The team notes that power users frequently break unreleased features because they’re actively modifying the harness code directly.
The IDE extensions share substantial code with the app, suggesting a modular architecture where UI layers can be swapped while maintaining core agent functionality. This separation of concerns enables the team to iterate on user experience independently from agent capabilities.
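A rough sketch of that separation: a shared core that is frontend-agnostic, with thin UI adapters per surface. The class and method names are assumptions (and the real harness is Rust), but the structure is the point:

```python
from typing import Protocol

class Frontend(Protocol):
    """Any UI -- CLI, IDE extension, standalone app -- implements the same thin surface."""
    def show(self, event: str) -> None: ...
    def ask(self, question: str) -> str: ...

class AgentCore:
    """Shared engine: owns the model loop, tools, and state; knows nothing about rendering."""
    def run(self, task: str, ui: Frontend) -> None:
        ui.show(f"starting: {task}")
        # ... model calls and tool execution would happen here ...
        ui.show("done")

class Cli:
    def show(self, event: str) -> None:
        print(event)
    def ask(self, question: str) -> str:
        return input(question)

AgentCore().run("fix the flaky test", Cli())
```

Swapping `Cli` for an IDE or app adapter changes the rendering without touching agent behavior, which is what keeps capabilities consistent across entry points.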
The decision to build a standalone app separate from IDE extensions represents a significant architectural choice driven by the need to support multi-agent workflows. Traditional IDEs are tied to specific workspace folders, limiting users to working on one codebase at a time. The standalone app enables users to manage multiple agents working across different projects simultaneously, supporting the delegation-oriented workflow central to their vision.
The Codex team’s internal development process exemplifies LLMOps in practice, with the team using Codex extensively to build Codex itself. This self-hosted approach provides immediate feedback on product quality and capabilities.
Specifications are deliberately minimal, typically consisting of around 10 bullet points. The team only writes detailed specs when problems are too complex to fit in one person’s mental model or when coordinating across multiple people. Even then, documentation remains extremely concise. This approach is enabled by Codex itself, as individual contributors can accomplish far more by delegating coding to agents, reducing the need for extensive upfront planning.
Designers on the team now write more code than engineers did six months prior, demonstrating how LLM-powered coding agents blur traditional role boundaries. Product managers regularly submit pull requests directly, finding it faster to make small changes themselves than to communicate requirements and wait for prioritization. This reflects a fundamental shift in development workflows enabled by AI coding agents.
The team operates in distinct modes. Execution mode focuses on obsessing over quality, testing thoroughly, and shipping features, with heavy Codex usage for understanding feedback (via Slack summarization), analyzing code, and making changes. Strategic mode involves more thinking about direction and coordination, with less direct coding but more use of Codex for communication and analysis tasks.
The Codex team follows a distinctive approach to planning timeframes that reflects the uncertainty of working at the frontier of AI capabilities. They plan either near-term (up to eight weeks maximum) or long-term (one year visionary thinking) but explicitly avoid medium-term planning.
Near-term planning focuses on concrete deliverables that can motivate team rallying. Long-term planning involves developing “vibes” about future model capabilities and how users will work with them. For example, a year ago they envisioned models smart enough that users would want infinitely many models working independently, validating their own work, deploying code autonomously, and potentially not requiring explicit prompting.
The in-between product roadmap is considered too difficult to plan given the rapid pace of model improvement. This philosophy led to the Codex app development, which began without detailed specifications but rather with strategic goals like dissociating from specific workspaces to enable multi-agent workflows.
The Codex app emerged from multiple independent prototypes built during a hack week, with the decision to build it being more contentious than the eventual design. The team had to decide between investing in the already-popular IDE extension, the CLI, or a new app. The rationale for the app centered on it being beginner-friendly, intuitive for playing and discovering features, and the best interface for managing multiple agents simultaneously.
A key LLMOps design principle for Codex is balancing extreme configurability for power users with simplicity for beginners. The team recognizes that developers love building tools for themselves and automating their workflows, making configurability essential.
The open-source harness enables power users to modify code directly, with users forking the repository and implementing unreleased features themselves. The team receives Twitter complaints about broken features that aren’t even enabled in production yet because users are so deeply customizing their setups. This cutting-edge user behavior provides valuable signal about future directions.
The team carefully considers core primitives that define the system, ensuring these are well-architected and thoughtfully designed rather than “vibe coded.” From this foundation, they package features in highly configurable ways for power users to experiment with. Sub-agents serve as an example of functionality available for experimentation but not yet proactively triggered in the product.
The final layer involves making advanced capabilities simple for mainstream users through progressive disclosure. The Codex app exemplifies this approach: it launches as a simple chat interface, users discover the sidebar for managing multiple tasks, then uncover the skills tab for ecosystem integrations. The design mimics game progression, encouraging exploration and discovery.
The team identifies two major “vibe shifts” in Codex history that demonstrate how LLMOps products must adapt to model capability improvements. The first occurred around August with GPT-5’s release as a great interactive coding model. The team had launched Codex Cloud slightly ahead of what models at the time could reliably handle, so they pivoted to shipping the CLI and IDE extension to solve problems the models could actually solve. Growth exploded, increasing 20-30x within a few months.
The second inflection point came in December-January with GPT-5.2 and subsequent models that could reliably execute longer autonomous tasks. This capability shift enabled returning to the original vision of task delegation and autonomous agents working for hours or days. The team recognized that model improvements had created the conditions for the Codex app to succeed where the earlier cloud product had struggled.
This adaptive approach to product development based on model capabilities represents sophisticated LLMOps thinking. Rather than forcing a product vision that exceeds current model abilities, the team iterates on what’s possible now while maintaining a long-term vision they can pursue as models improve. The standing joke about Peter Steinberger, a prolific engineer who initially said he’d never use a UI beyond terminal windows but later tweeted that the app was “pretty good,” illustrates how dramatically model improvements can change user behavior and preferences.
While specific testing frameworks aren’t detailed, quality management emerges as a central concern. During execution mode, the product manager spends significant time using Codex to understand the state of quality, having the agent summarize Slack feedback and create Linear issues for tracking.
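That triage loop might look roughly like the following; the fetch, summarize, and issue-creation functions are placeholders for what the agent actually does through its Slack and Linear skills:

```python
# Illustrative sketch of the feedback-triage workflow described above.
def fetch_slack_feedback(channel: str) -> list[str]:
    # Placeholder for pulling recent messages from a feedback channel.
    return ["app crashes when resizing", "love the new sidebar", "spark feels slow on large diffs"]

def summarize(messages: list[str]) -> list[dict]:
    # In practice the agent does this classification; here a crude keyword split.
    return [{"title": m, "actionable": "crash" in m or "slow" in m} for m in messages]

def create_linear_issue(title: str) -> None:
    # Placeholder for the Linear skill creating a tracked issue.
    print(f"created issue: {title}")

for item in summarize(fetch_slack_feedback("#codex-feedback")):
    if item["actionable"]:
        create_linear_issue(item["title"])
```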
The team emphasizes that while most code is agent-generated, they invest heavily in ensuring high quality. They distinguish between the ability to generate code and the responsibility to maintain systems, noting that product managers can prototype features but shouldn't own complex systems in production, because the role by nature demands flexibility and divided attention.
For complex features, the team ensures stable ownership by more senior engineers who can maintain quality over time. This suggests a tiered approach where trivial changes can be made by anyone using Codex, but significant systems require dedicated ownership and quality oversight.
The team’s practice of using Codex extensively for their own development provides built-in quality feedback. When the vast majority of OpenAI uses the Codex app, including the team building it, quality issues surface immediately through dogfooding.
The Codex team’s open-source approach fundamentally shapes their LLMOps practice. Being open source drives transparency about everything they build, and the community rewards this openness with engagement and contributions.
The team works closely with users through various channels. Developer experience team members like Roman coordinate wide alpha testing programs, build with users to gather feedback, and develop skills for the app based on community needs. Codex ambassadors in multiple cities and countries organize local events, hackathons, and educational sessions, extending the team’s reach globally.
The community provides cutting-edge signal through their modifications and experiments with the open-source harness. Power users pull unreleased features into production for themselves, discovering issues and use cases the core team hadn’t anticipated. This distributed experimentation accelerates learning and validates potential product directions.
Social media, particularly Twitter, serves as a primary feedback channel. The team is “quite online,” with launches being feedback-oriented and anchored around the online community. Product manager Alex notes spending more time on Twitter during execution mode, using social media both for gathering feedback and energizing the community.
The open-source strategy also facilitated hiring Peter Steinberger, creator of OpenClaw, who had built over 40 open-source projects aligned with the vision of command-line interfaces for various services. His work on skills and CLI tools for calendars, tweets, Gmail, and other services created an ecosystem that Codex could leverage, demonstrating how community contributions can accelerate product capabilities.
As Codex has matured, cross-functional alignment has become increasingly important. Within the Codex team, alignment remains minimal, with the team operating as an intentional “pirate ship” where everyone works fluidly together with few formal processes.
However, Codex’s utility extends beyond coding, with users employing it for tasks across the software development lifecycle. The majority of OpenAI uses the Codex app, including non-technical users, raising questions about how Codex relates to other products like ChatGPT. This requires thoughtful cross-functional coordination to determine appropriate positioning and integration.
The developer experience team increasingly views Codex as the cornerstone of the entire OpenAI developer platform. For millions of developers building on OpenAI’s API using various modalities (image, Sora, speech-to-speech), Codex has become the primary entry point. Where a year ago the team wrote extensive guides on prompting GPT-5, they now teach developers to use Codex with skills for integrations and updates, letting the agent handle the complexity.
This shift represents a significant LLMOps evolution where the coding agent becomes the interface layer for an entire platform ecosystem, abstracting away complexity and enabling natural language interaction with sophisticated developer tools.
The hiring philosophy for the Codex team emphasizes agency above all else. They seek people who do things, self-start, and don’t require structured onboarding with predefined task sequences. New team members receive minimal direction beyond “welcome,” reflecting an environment where individuals must identify problems, propose solutions, and drive execution independently.
The team values disagreement and independent thinking, expecting new hires to challenge existing decisions that were often made accidentally rather than deliberately. They look for people who absorb incremental scope and take accountability for unknown areas, expanding the team’s overall capacity.
For technical roles, mastery of Codex itself is expected. For developer experience roles, the ideal profile combines strong engineering skills with excellence in community engagement and social media presence. The team looks for people who love spending time with developers and builders, sharing knowledge through content and events.
Agency manifests even before the interview process. When someone reaches out expressing interest, the team prioritizes reviewing links to projects they’ve built and reading their ideas over traditional CVs and cover letters explaining qualifications. They confess to not knowing where team members attended college, emphasizing demonstrated building ability over credentials.
The team believes traditional career ladders are blurring as AI tools enable designers to become more technical, product managers to prototype, and engineers to do more design. They question whether many teams still need dedicated product managers, particularly in startups with fewer than 20 engineers, viewing the role as “fill in the gaps” rather than leadership. The key remaining question is what individuals are most interested in, with tools enabling everyone to spend more time on what energizes them while delegating other work to agents.
While not extensively detailed, the demonstration includes voice-based interaction with Codex, with users dictating feature requests like “add a new screen for NASA’s Artemis mission, return to the moon” for an iOS app. This multimodal capability extends the accessibility of the platform beyond text-based prompting.
The voice interface enables more natural interaction during active development. The demonstration shows popping out the Codex conversation window above the development environment, allowing users to continue working while iterating verbally. This conversational flow supports rapid iteration cycles where developers can request changes like adding decorations and trees to a game while continuing to play it, seeing changes appear within seconds.
The integration of voice represents an LLMOps consideration for reducing friction in developer workflows, enabling hands-free interaction and potentially supporting pair programming scenarios where verbal communication feels more natural than typing.
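Architecturally, voice likely collapses into the same prompt path once transcribed. A minimal sketch, assuming a generic speech-to-text step (the `transcribe` function is a placeholder, not a real API):

```python
from typing import Callable

def transcribe(audio_chunk: bytes) -> str:
    # Placeholder for a real speech-to-text call (e.g., a hosted transcription model).
    return "add a new screen for NASA's Artemis mission"

def handle_voice(audio_chunk: bytes, submit_prompt: Callable[[str], None]) -> None:
    """Voice input reduces to the same path as typed prompts once transcribed."""
    submit_prompt(transcribe(audio_chunk))

handle_voice(b"...", submit_prompt=print)
```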
While specific infrastructure details aren’t provided, several aspects of the deployment architecture emerge. The platform supports both local and cloud execution, with agents able to work on local codebases or in cloud environments. The team emphasizes that starting in cloud can be difficult due to environment setup challenges and the inability to easily course-correct partially completed tasks.
The local-first approach provides several advantages: developer tools are immediately available, environment is already configured, and developers can easily jump in to fix or adjust agent work. However, the team maintains the long-term vision of cloud-based agents working autonomously, with the current architecture designed to eventually scale in that direction.
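A toy heuristic capturing the local-first tradeoff described above (entirely illustrative; in the product this is a user choice rather than an automatic decision):

```python
from enum import Enum

class Target(Enum):
    LOCAL = "local"
    CLOUD = "cloud"

def choose_target(cloud_env_ready: bool, needs_course_correction: bool) -> Target:
    """Local-first: start local unless the cloud environment is already set up
    and the task can run unattended without mid-flight adjustment."""
    if cloud_env_ready and not needs_course_correction:
        return Target.CLOUD
    return Target.LOCAL

print(choose_target(cloud_env_ready=False, needs_course_correction=True))
# -> Target.LOCAL
```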
The skills system suggests a service-oriented architecture where Codex agents can call out to external APIs and services like Figma, Vercel, Cloudflare, Render, and Linear. This implies robust error handling and authentication management across diverse external systems.
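A generic resilience wrapper of the kind such integrations typically need is sketched below; this is a common pattern, not Codex's actual implementation:

```python
import time
from typing import Callable, Optional

class SkillCallError(Exception):
    pass

def with_retries(send: Callable[[], dict], retries: int = 3, backoff: float = 1.0) -> dict:
    """Bounded retries with exponential backoff around an external service call."""
    last_error: Optional[Exception] = None
    for attempt in range(retries):
        try:
            return send()
        except ConnectionError as e:
            last_error = e
            time.sleep(backoff * 2 ** attempt)  # back off before retrying
    raise SkillCallError("external service unavailable") from last_error

# Usage: wrap whatever authenticated request the skill makes.
print(with_retries(lambda: {"status": "ok"}))
```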
The streaming capabilities demonstrated by Codex Spark (1,200 tokens/second) indicate sophisticated optimization for rapid token generation and delivery, likely involving model serving infrastructure tuned for low-latency responses.
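To make that latency concrete: at 1,200 tokens per second, each token arrives roughly every 0.8 ms, so a UI that renders tokens as they stream feels effectively instantaneous. A simulated consumer (the generator below mimics the throughput, not the real wire protocol):

```python
import sys
import time
from typing import Iterator

def stream_tokens(text: str, tokens_per_second: int = 1200) -> Iterator[str]:
    """Simulated token stream at Spark-like throughput (~0.8 ms per token)."""
    for token in text.split():
        time.sleep(1 / tokens_per_second)
        yield token + " "

# Render tokens as they arrive instead of waiting for the full completion;
# this is what makes the feedback loop feel near-instantaneous.
for tok in stream_tokens("def hello(): return 'world'"):
    sys.stdout.write(tok)
    sys.stdout.flush()
```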
The team articulates a clear long-term vision of personal agents that extends beyond coding. Peter Steinberger’s integration into the team focuses on building next-generation personal agents into ChatGPT, leveraging his experience building OpenClaw with extensive service integrations for calendar, email, Twitter, and other applications.
The vision involves agents that work across all aspects of life, not just software development. Users describe connecting OpenClaw to bank information, YouTube, Google services, and personal calendars, creating a truly personal agent accessible through conversation. The team sees Codex as a stepping stone toward this broader agent future, with coding serving as an initial domain where agents can demonstrate value.
The emphasis on agentic delegation, autonomous operation, and multi-agent orchestration all point toward a future where users manage teams of AI agents working on their behalf, potentially without explicit prompting, validating their own work and handling deployment and monitoring autonomously. The Codex platform serves as the proving ground for this vision, with coding providing a constrained domain for developing and refining these capabilities before expanding to broader applications.
The team’s approach of planning long-term vibes while executing short-term concrete deliverables enables them to pursue this ambitious vision while remaining grounded in current capabilities and user needs. Their open-source strategy and community engagement ensure they’re learning from cutting-edge users who pull the future forward, creating a feedback loop that accelerates progress toward their vision of ubiquitous AI agents augmenting human capability across all domains.