## Overview
This case study provides an in-depth look at OpenAI's Forward Deployed Engineering (FDE) organization through an interview with Colin Jarvis, who leads the team. The FDE team was established to address the challenge that while ChatGPT generated enormous excitement, getting enterprise value from LLMs in production environments was proving difficult and inconsistent. The team works to put OpenAI's customers in "the 5%" - the enterprises successfully deploying AI at scale - in contrast to the widely cited MIT study finding that 95% of enterprise AI deployments fail.
Colin Jarvis joined OpenAI in November 2022, the month ChatGPT launched, when the company had fewer than 200 people. The FDE practice grew from just 2 people at the start of 2024 to 39 at the time of the interview, with plans to reach 52 by year-end 2024. This rapid expansion reflects both the demand for production LLM deployments and the specialized skills required to make them successful.
## The Forward Deployed Model and Philosophy
The FDE model at OpenAI is explicitly inspired by similar practices at companies like Palantir, where team members deeply embed with customers to understand their domain and deliver working solutions. The philosophy centers on "eating pain and excreting product" - immersing in difficult customer problems to extract generalizable product insights and platform capabilities. However, OpenAI takes a strategic approach distinct from traditional consulting: the FDE team is positioned as a zero-to-one team focused on breaking the back of novel, high-value problems rather than long-term service delivery.
The team deliberately targets problems representing tens of millions to low billions in value, ensuring that solved problems have significant economic impact and that learnings justify the investment. This selective approach allows OpenAI to maintain focus on product development and research insights rather than being drawn into pure services revenue. The FDE team explicitly splits capacity along two axes: some engagements have clear product hypotheses where they seek perfect design partners, while others target industries with interesting technical problems (like semiconductors or life sciences) where they expect research learnings even without immediate product direction.
## Morgan Stanley: The Foundational Case Study
Morgan Stanley was OpenAI's first enterprise customer to deploy GPT-4 in 2023, and this engagement helped establish the FDE practice. The use case involved putting Morgan Stanley's wealth management research into the hands of all wealth advisors through an AI-powered system. This exemplifies a key pattern: successful enterprise deployments tackle genuinely high-stakes use cases at the core of the business rather than edge cases.
The technical challenge was formidable because at that time, RAG (Retrieval Augmented Generation) wasn't yet an established pattern. The team had to develop retrieval tuning techniques to ensure research reports could be accurately surfaced and trusted. The technical pipeline was built within 6-8 weeks, including retrieval optimization, guardrails, and basic evaluation frameworks. However, the critical insight was that technical readiness wasn't sufficient - it took an additional 4 months of pilots, user feedback collection, evaluation refinement, and iteration to build trust with wealth advisors.
This extended trust-building phase is particularly important in regulated financial environments where accuracy requirements are high and the technology is probabilistic. The FDE team worked closely with advisors to label data, verify outputs, and develop verification tools for cases where confidence was lower. The result was exceptional: 98% adoption among wealth advisors and a 3x increase in research report usage, demonstrating both technical success and user acceptance.
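The mechanics of that labeling and verification work can be pictured as a simple retrieval evaluation loop. The sketch below is an illustrative reconstruction, not Morgan Stanley's or OpenAI's actual code: advisors label which research reports should surface for a question, and recall@k is tracked as retrieval settings (chunking, embeddings, reranking) are tuned, then reused as a regression check as the system evolves.

```python
# Illustrative retrieval evaluation loop: advisor-labeled queries scored with recall@k.
# Names and the retrieval interface are assumptions for the sake of the example.

from dataclasses import dataclass

@dataclass
class LabeledQuery:
    question: str
    relevant_report_ids: set[str]  # report IDs labeled as relevant by wealth advisors

def recall_at_k(retrieve, labeled_queries: list[LabeledQuery], k: int = 5) -> float:
    """Fraction of labeled-relevant reports that appear in the top-k retrieved results.

    `retrieve(question)` is assumed to return a ranked list of (report_id, score) pairs.
    """
    hits, total = 0, 0
    for q in labeled_queries:
        top_k_ids = {doc_id for doc_id, _score in retrieve(q.question)[:k]}
        hits += len(top_k_ids & q.relevant_report_ids)
        total += len(q.relevant_report_ids)
    return hits / total if total else 0.0

# Usage: compare retrieval configurations against the same labeled set and only
# promote a change when recall@k (and downstream answer quality) improves.
# score = recall_at_k(my_vector_search, advisor_labeled_queries, k=5)
```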
## Semiconductor Manufacturing: Complex Verification Workflows
One of OpenAI's largest ongoing projects involves a European semiconductor company, illustrating how FDE tackles complex technical domains. The engagement started with OpenAI embedding on-site for several weeks to understand the entire value chain: chip design, verification, and performance measurement. The team identified verification as the highest-value target, as engineers spend 70-80% of their time on bug fixing and maintaining compatibility rather than new development.
The FDE team delivered 10 different use cases across the value chain, currently achieving 20-30% efficiency savings in early divisions with a target of 50% overall. A key example is the "debug investigation and triage agent" built on top of Codex. Engineers face hundreds of bugs each morning from overnight test runs. The initial solution had the model investigate bugs and write detailed tickets explaining probable causes. As trust developed, the system evolved to attempt fixes and raise pull requests automatically. The team added execution environments so the model could test its own code iteratively.
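To make the evolution from ticket-writing to automated fixes concrete, here is a minimal sketch of that triage flow. It is an assumption-laden illustration - `agent`, `repo`, and their methods are hypothetical stand-ins, not the actual forked Codex tooling - but it shows where the deterministic gate (running the tests) sits relative to the model's probabilistic steps.

```python
# Hypothetical bug triage loop: investigate each overnight failure, write a ticket,
# and only attempt an automated fix (and pull request) once auto-fixing is enabled
# and the fix passes deterministic tests.

def triage_overnight_bugs(bugs, agent, repo, auto_fix_enabled=False):
    results = []
    for bug in bugs:
        # Step 1: the model investigates logs/diffs and drafts a ticket with probable cause.
        ticket = agent.investigate(bug, context=repo.relevant_files(bug))
        if not auto_fix_enabled:
            results.append({"bug": bug.id, "ticket": ticket, "status": "triaged"})
            continue
        # Step 2 (later phase, once trust exists): attempt a fix in an execution environment.
        patch = agent.propose_fix(bug, ticket)
        test_report = repo.run_tests(patch)  # deterministic verification, not LLM judgment
        if test_report.passed:
            pr_url = repo.open_pull_request(patch, description=ticket)
            results.append({"bug": bug.id, "status": "pr_opened", "pr": pr_url})
        else:
            # Fall back to a human-readable ticket when the fix cannot be verified.
            results.append({"bug": bug.id, "ticket": ticket, "status": "fix_failed"})
    return results
```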
This case demonstrates the careful orchestration between LLM capabilities and deterministic systems. The team forked Codex and added extensive telemetry to build detailed evaluations based on expert trajectories - the actual sequence of 20+ actions a human engineer would follow. They worked with customer experts to create labeled evaluation sets before beginning development. The philosophy is that "eval-driven development" ensures no LLM-based code is considered done until verification exists. As models improve, the evaluation framework provides a consistent way to measure progress. The FDE team adds scaffolding around the model to ensure certain components remain deterministic while leveraging the LLM's strengths for tasks requiring nuance.
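One way such trajectory-based evaluation could be scored is sketched below; the metric and scoring scheme are illustrative assumptions rather than OpenAI's internal evaluation code.

```python
# Illustrative trajectory scoring: compare the agent's action sequence against an
# expert-labeled trajectory and report coverage, in-order agreement, and extra actions.

def trajectory_score(agent_actions: list[str], expert_actions: list[str]) -> dict:
    """Score how closely an agent run follows an expert-labeled action sequence."""
    expert_set = set(expert_actions)
    covered = [a for a in expert_actions if a in agent_actions]

    # Count how many leading expert steps the agent reproduced in the same relative order
    # (longest prefix of the expert trajectory that is an ordered subsequence of the run).
    it = iter(agent_actions)
    in_order = 0
    for step in expert_actions:
        if any(a == step for a in it):
            in_order += 1
        else:
            break

    return {
        "coverage": len(covered) / len(expert_actions),
        "in_order_prefix": in_order / len(expert_actions),
        "extra_actions": len([a for a in agent_actions if a not in expert_set]),
    }

# A task is only "done" when scores on the labeled eval set clear an agreed threshold,
# and the same set is re-run as new model versions ship to measure progress.
```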
The vision is that by mid-2025, engineers will arrive to find most bugs already fixed, with the hardest ones clearly documented, allowing them to focus primarily on writing new code rather than context-switching between bug fixing and development.
## Automotive Supply Chain: Balancing Probabilistic AI with Deterministic Constraints
An automotive manufacturing customer in APAC presented a complex supply chain coordination problem. Normally, disruptions like tariff changes required manual coordination across manufacturing, logistics, and procurement teams through phone calls and meetings, taking hours or days to analyze impacts and develop response plans. The FDE team built a data layer with APIs to enable LLM orchestration across these systems without moving data.
The demonstration showed a system responding to a hypothetical 25% tariff on goods from China to South Korea. The solution architecture embodies a core FDE principle: use determinism wherever possible and LLMs only where their probabilistic nature adds value. The team implemented hard constraints that must be verified deterministically - for example, always maintaining at least two suppliers for critical components like tires, meeting lead time requirements, and ensuring all materials have coverage. These constraints are checked 100% of the time through deterministic code rather than trusting the LLM.
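A minimal sketch of such a deterministic constraint layer is shown below. The field names and thresholds are illustrative assumptions; the point is that these checks are plain code that runs on every LLM-proposed plan, and a violating plan is rejected before it reaches a user.

```python
# Hypothetical deterministic constraint checks over an LLM-proposed sourcing plan.
# `plan` is assumed to be a dict with "sourcing" line items and "required_parts".

def validate_plan(plan, min_suppliers=2, critical_parts=("tires",)):
    violations = []

    # Hard rule: critical components must always have at least two suppliers.
    for part in critical_parts:
        suppliers = {s["supplier"] for s in plan["sourcing"] if s["part"] == part}
        if len(suppliers) < min_suppliers:
            violations.append(f"{part}: only {len(suppliers)} supplier(s), need {min_suppliers}")

    # Hard rule: every sourcing line must meet its lead time requirement.
    for line in plan["sourcing"]:
        if line["lead_time_days"] > line["max_lead_time_days"]:
            violations.append(f"{line['part']}: lead time {line['lead_time_days']}d exceeds limit")

    # Hard rule: all required materials must have coverage in the plan.
    uncovered = set(plan["required_parts"]) - {s["part"] for s in plan["sourcing"]}
    if uncovered:
        violations.append(f"no coverage for: {sorted(uncovered)}")

    return violations  # an empty list means the plan passes all hard constraints
```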
The system first uses the LLM for business intelligence, querying various databases and generating insights that previously required coordination across BI teams. It presents tariff impacts with explanations, provides detailed tables for verification, and offers map visualizations. The real value comes in optimization, where the system runs complex simulations to find the best combination of factories and suppliers to minimize cost and lead time. Rather than asking the LLM to optimize directly, it's given access to a simulator and allowed to explore the parameter space as an educated business user would.
In the demonstration, the system ran five optimization scenarios and recommended the best trade-off. In production deployments, this approach scales to hundreds or thousands of simulations run offline, with the agent returning well-documented recommendations. The deliberately conservative, customer-facing approach included showing reasoning explanations before actions, providing verification widgets, and exposing detailed tables for manual checking. This layered approach builds trust while democratizing access to complex analytical capabilities.
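The division of labor - the model proposes scenarios, a deterministic simulator scores them - might look roughly like the following sketch, where `llm_propose_scenarios` and `simulate` are hypothetical interfaces standing in for the real orchestration.

```python
# Hedged sketch of "give the model a simulator, not the optimization itself":
# the LLM proposes candidate factory/supplier combinations, a deterministic
# simulator scores each one, and the best verified trade-off is returned along
# with the full result table for human review.

def explore_tradeoffs(llm_propose_scenarios, simulate, disruption, n_scenarios=5):
    scenarios = llm_propose_scenarios(disruption, n=n_scenarios)  # probabilistic step
    results = []
    for s in scenarios:
        outcome = simulate(s)                                     # deterministic step
        results.append({"scenario": s, "cost": outcome.cost, "lead_time": outcome.lead_time})

    # Simple weighted trade-off; in production the weighting itself is a business decision.
    best = min(results, key=lambda r: r["cost"] + 1000 * r["lead_time"])
    return best, results  # recommendation plus the table users can verify manually
```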
## Klarna and T-Mobile: From Custom Solution to Platform Product
The journey from Klarna's customer service application to OpenAI's Agent SDK and Agent Kit illustrates how FDE extracts product from customer pain. In 2023, Klarna faced a scalability problem: manually writing prompts for 400+ policies was unsustainable. Colin Jarvis worked with them to develop a method of parameterizing instructions and tools, wrapping each intent with evaluation sets to enable scaling.
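One way to read that parameterization idea is sketched below: each policy or intent becomes a configuration entry (instructions template, tools, evaluation set) rather than a hand-written prompt, so adding the 401st policy is a data change rather than new prompt engineering. This is an illustrative reconstruction, not the actual Klarna or Swarm code.

```python
# Hypothetical intent registry: instructions are parameterized templates, tools are
# declared per intent, and every intent carries its own evaluation set.

INTENTS = {
    "refund_request": {
        "instructions": "Handle refund requests according to policy {policy_id}. "
                        "Never promise timelines beyond {max_days} days.",
        "tools": ["lookup_order", "issue_refund"],
        "eval_set": "evals/refund_request.jsonl",
        "params": {"policy_id": "P-104", "max_days": 14},
    },
    "delivery_delay": {
        "instructions": "Resolve delivery delay questions using policy {policy_id}.",
        "tools": ["lookup_order", "track_shipment"],
        "eval_set": "evals/delivery_delay.jsonl",
        "params": {"policy_id": "P-217"},
    },
}

def build_agent_config(intent: str) -> dict:
    """Expand one intent's template into a concrete agent configuration."""
    spec = INTENTS[intent]
    return {
        "instructions": spec["instructions"].format(**spec["params"]),
        "tools": spec["tools"],
        "eval_set": spec["eval_set"],  # the eval set gates whether the intent can scale
    }
```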
This pattern worked well enough that OpenAI codified it into an internal framework called Swarm, which was eventually open-sourced. The framework received significant community traction on GitHub. Meanwhile, the FDE team started an engagement with T-Mobile on customer service that was "10x more complex" in volume, policy count, and policy complexity. With some extensions, the Swarm primitives proved effective there as well, validating their generalizability.
This convergence - production success with multiple customers, open-source validation, and clear product hypothesis - led OpenAI's product team to build the Agent SDK. More recently, this evolved into Agent Kit, a visual builder that makes the underlying framework more accessible. The progression from solution architecture through FDE-style delivery to internal framework, open-source validation, product team adoption, and finally mainstream product release demonstrates the intended FDE-to-product pipeline.
The key insight is that reusability emerged from solving real customer problems rather than trying to generalize too early. The team learned that starting with high-concept generalized solutions without clear problems leads to failure, whereas deeply solving specific customer problems almost always reveals generalizable patterns.
## Evaluation-Driven Development and Trust Building
A consistent theme across deployments is the centrality of evaluation frameworks. The FDE team's approach starts with deep domain understanding, then creates detailed evaluation sets before significant development begins. For the semiconductor example, this meant working with customer experts to define trajectories - the sequence of actions an expert would take to solve specific problems. These become labeled evaluation sets against which the LLM's performance is measured.
The philosophy is that LLM-based applications aren't complete without evaluations verifying efficacy. This "eval-driven development" approach provides several benefits: it forces clarity about success criteria, enables objective measurement of progress as models improve, builds customer confidence through transparency, and creates feedback loops for iterative improvement.
The extended trust-building phases (4 months for Morgan Stanley, similar timelines elsewhere) aren't just about improving accuracy - they're about developing shared understanding between users and AI systems, establishing verification mechanisms, and building organizational confidence. In regulated environments or high-stakes applications, this investment in trust is non-negotiable and represents a significant portion of the deployment timeline even after technical readiness.
## Technical Architecture Patterns
Several technical patterns emerge across FDE deployments:
**Orchestration with Guardrails**: LLMs serve as orchestrators across complex systems, but with deterministic guardrails protecting critical constraints. The automotive supply chain example explicitly separated concerns: deterministic checks for hard requirements (supplier minimums, lead times, material coverage) and probabilistic LLM reasoning for optimization and insight generation.
**Retrieval and Data Layers**: Rather than moving data, the FDE team often builds translation or metadata layers that enable LLM access. This addresses the classic problem of whether to centralize data or use it in place. With LLMs capable of generating queries, the question becomes whether data needs to move at all. Colin identified this "metadata translation layer" as an underrated space with significant potential, drawing parallels to traditional business intelligence but adapted for LLM consumption. A rough sketch of this pattern appears after this list of patterns.
**Tool Access and Execution Environments**: Agents are given tools appropriate to their tasks, from APIs and simulators to execution environments for testing code. The semiconductor debugging agent has an execution environment to test its fixes iteratively. The supply chain system has access to simulators to explore trade-offs. This pattern of giving LLMs the tools experts would use, combined with appropriate guardrails, enables sophisticated problem-solving.
**Telemetry and Observability**: The team adds extensive telemetry to production deployments, enabling detailed understanding of model behavior, identification of failure modes, and continuous improvement. This observability is essential for both building trust and iterating toward better performance.
**Hybrid Deterministic-Probabilistic Design**: A core architectural principle is recognizing when to use determinism versus probabilistic reasoning. Critical business rules, mathematical constraints, and verification steps should be deterministic code, not LLM outputs. LLMs are most valuable for tasks requiring nuance, natural language understanding, complex reasoning, and handling variability.
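Returning to the metadata translation layer pattern above, a rough illustration follows: table and column descriptions are exposed to the model, the model generates a query, and the query runs where the data already lives. The catalog, prompt, and guard below are simplified assumptions; production layers add far more validation and access control.

```python
# Simplified sketch of a metadata translation layer: the model sees a schema catalog,
# not the data, and the generated query executes against the source system in place.

import sqlite3
from openai import OpenAI

# Hypothetical metadata catalog exposed to the model instead of the data itself.
CATALOG = """
table shipments(part TEXT, origin TEXT, destination TEXT, tariff_pct REAL, lead_time_days INTEGER)
table suppliers(part TEXT, supplier TEXT, country TEXT, unit_cost REAL)
"""

def answer_with_local_data(question: str, db_path: str, model: str = "gpt-4o") -> list[tuple]:
    client = OpenAI()
    prompt = (
        f"Schema:\n{CATALOG}\n"
        f"Write one read-only SQLite SELECT statement answering: {question}\n"
        "Return only the SQL, with no markdown fences."
    )
    sql = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip().strip("`").strip()

    if not sql.lower().startswith("select"):
        raise ValueError("Only SELECT statements are allowed")  # deterministic guard

    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql).fetchall()  # the data never leaves its source system
```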
## The Product vs. Services Tension
Colin emphasizes the strategic choice between services revenue and product development, noting that consulting firms often fail to make the transition because short-term services revenue pulls the organization away from strategic product bets. At OpenAI, the company's heart as a research-then-product organization helps the FDE team keep its focus on platform development rather than services revenue.
The team explicitly avoids being "a cast of thousands" and instead remains selective, ensuring solved problems either push research in new directions or have clear paths to platform products. They're willing to turn down lucrative services opportunities that don't advance strategic goals. However, there's nuance: sometimes economically valuable problems are pursued even without clear product hypotheses if the research learnings justify the investment - if making models better at such problem-solving would benefit OpenAI broadly.
The capacity split reflects this: some engagements target specific product hypotheses with ideal design partners, while others explore industries with interesting technical problems to extract research insights. The intended motion is zero-to-one delivery with the first customer (perhaps 20% of the work reusable), two to three further iterations that push reusability toward 50%, and then a handoff to scaled business operations for broad market deployment.
## Mistakes and Lessons Learned
The biggest mistake Colin identifies is "generalizing too early" - looking at ChatGPT features and trying to create generalized enterprise solutions without deeply solving specific customer problems. This leads to "high-concept solutions without clear problems" that don't gain traction. Conversely, going "super deep on the customer's problem" almost always yields generalizable insights.
This echoes Paul Graham's advice about doing things that don't scale early on. The FDE watchword is explicitly "doing what doesn't scale" to understand problems deeply before attempting to generalize. The Swarm-to-Agent-Kit progression illustrates this: generalization emerged naturally from solving real problems, validated by multiple customers and open-source adoption before being productized.
## Tools and Technology Stack
Several specific tools and technologies are mentioned:
**OpenAI Playground**: Colin identifies this as underrated for quickly validating use case feasibility. The ability to interact directly with the API through a simple UI enables rapid iteration and sense-checking. He describes using it to validate browser automation use cases with N=10 tests - if 7-8 succeed, the use case is likely viable to take to production (a scripted equivalent of this check is sketched after this list).
**Codex**: Described as transformative for its ability to work autonomously. Colin's "aha moment" was returning from four hours of meetings to find work completed. For the semiconductor engagement, Codex was forked and extended with domain-specific capabilities.
**Swarm Framework**: The internal-then-open-source framework for parameterizing instructions and tools, scaling from tens to hundreds of policies with evaluation wrappers. This became the foundation for the Agent SDK.
**Agent SDK and Agent Kit**: The productized evolution of Swarm primitives, with Agent Kit providing a visual builder interface for broader adoption.
**MCP (Model Context Protocol)**: Mentioned as a starting point for data connectivity, though the FDE team typically builds additional logic layers between raw MCP connectors and LLM consumption.
**DALL-E**: Featured in the Coca-Cola "Create Real Magic" campaign, an early engagement that required tuning DALL-E 3 pre-release to generate perfect Christmas imagery while managing jailbreak risks.
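The Playground heuristic above can also be scripted directly against the API. The sketch below is a hedged equivalent of that N=10 check - the model name, prompt structure, and `looks_correct` checker are placeholders, not a prescribed method.

```python
# Scripted version of the quick feasibility heuristic: run the same prompt over
# roughly 10 representative cases and count how many pass a simple success check.

from openai import OpenAI

def quick_feasibility_check(task_prompt: str, cases: list[str], looks_correct, model: str = "gpt-4o") -> int:
    client = OpenAI()
    successes = 0
    for case in cases[:10]:
        reply = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": task_prompt},
                {"role": "user", "content": case},
            ],
        ).choices[0].message.content
        successes += int(looks_correct(case, reply))
    return successes  # roughly 7-8 out of 10 suggests the use case is worth engineering further
```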
## Future Predictions and Industry Direction
Colin predicts 2026 might be the "year of fine-tuning" or "one-click to production" rather than just agents. His reasoning is that the building blocks for complex agentic systems are now in place, and the next frontier is optimization: taking orchestrated agent networks with established plumbing, generating training data from their operation, labeling it efficiently, and fine-tuning models for specific domains like chip design or drug discovery. This would move from "agents being used" to "agents being used perfectly for specialist domains."
The progression reflects maturity: first establishing that complex tasks are possible (the agent era), then optimizing for specific domains through specialization (the fine-tuning optimization era). The infrastructure for creating training data, building evaluation sets, and managing the development lifecycle is now established, enabling this next evolution.
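In that direction, the mechanics might look like the following sketch: logged, expert-reviewed agent runs are converted into chat-format training examples and submitted as a fine-tuning job. The trace schema here is an assumption; the JSONL format and fine-tuning calls follow OpenAI's public fine-tuning interface.

```python
# Sketch of turning reviewed agent traces into fine-tuning data for a specialist domain.

import json
from openai import OpenAI

def traces_to_jsonl(reviewed_traces: list[dict], path: str) -> str:
    """Write expert-approved agent runs as chat-format training examples."""
    with open(path, "w") as f:
        for trace in reviewed_traces:
            if not trace.get("approved_by_expert"):
                continue  # only train on runs a domain expert has signed off
            f.write(json.dumps({"messages": trace["messages"]}) + "\n")
    return path

def start_finetune(path: str, base_model: str = "gpt-4o-mini-2024-07-18") -> str:
    """Upload the training file and kick off a fine-tuning job; returns the job ID."""
    client = OpenAI()
    upload = client.files.create(file=open(path, "rb"), purpose="fine-tune")
    job = client.fine_tuning.jobs.create(training_file=upload.id, model=base_model)
    return job.id
```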
## Organizational and Hiring Insights
The FDE team's rapid growth, from 2 people toward a planned 52 within a single year, reflects both demand and the specialized nature of the role. Colin notes these are among the hardest positions to hire for, requiring deep technical skills, customer empathy, domain adaptability, and comfort with ambiguity. The team must navigate between research, product development, and customer delivery while maintaining strategic focus.
The origin story - starting with one or two people in Europe doing their first engagement with John Deere - illustrates the bootstrap phase before scaling. The team's structure reflects the dual focus on product-hypothesis-driven engagements and exploratory research-driven industry deployments.
## Strategic Positioning and Market Context
The interview addresses the "AI bubble" narrative and MIT study finding 95% of enterprise deployments fail. OpenAI's FDE team positions itself as "the 5% making them work" - the specialized capability turning LLM potential into production reality. This positioning emphasizes that successful enterprise AI deployment isn't just about model capabilities but requires deep domain understanding, careful engineering, trust building, and organizational change management.
The B2B-B2C pendulum at OpenAI also emerges as context. Colin describes a low point between the first Dev Day (late 2023, featuring the Assistants API) and the next one, when company focus tilted heavily toward consumer products. Despite shipping major enterprise wins like Morgan Stanley and Klarna, the FDE team felt their work wasn't prioritized. Open-sourcing Swarm was partly driven by lack of internal interest in B2B-oriented frameworks during this period. The pendulum swung back toward B2B in late 2024, leading to FDE team expansion approval.
This organizational context highlights that even within OpenAI, maintaining strategic focus on enterprise deployment required navigating competing priorities and demonstrating value through customer success and open-source validation.
## Conclusion
The OpenAI FDE practice represents a sophisticated approach to enterprise LLM deployment that balances immediate customer value with long-term product and research goals. The key principles - deep domain embedding, evaluation-driven development, strategic problem selection, careful orchestration of deterministic and probabilistic components, extended trust-building, and extracting generalizable products from specific solutions - provide a model for successful production LLM deployment. The rapid growth and high-value outcomes demonstrate both the demand for this capability and the effectiveness of the approach in making enterprise AI deployments actually work.