Company: OpenAI
Title: Forward Deployed Engineering for Enterprise LLM Deployments
Industry: Tech
Year: 2025

Summary (short): OpenAI's Forward Deployed Engineering (FDE) team embeds with enterprise customers to solve high-value problems using LLMs, aiming for production deployments that generate tens of millions to billions in value. The team works on complex use cases across industries—from wealth management at Morgan Stanley to semiconductor verification and automotive supply chain optimization—building custom solutions while extracting generalizable patterns that inform OpenAI's product development. Using an "eval-driven development" approach that combines LLM capabilities with deterministic guardrails, the team has grown from 2 to 52 engineers in 2025, bridging the gap between AI capabilities and enterprise production requirements while staying focused on zero-to-one problem solving rather than long-term consulting engagements.
## Overview

This case study provides a comprehensive look at OpenAI's Forward Deployed Engineering (FDE) practice through an interview with Colin Jarvis, who leads the team. The FDE function represents OpenAI's specialized approach to enterprise LLM deployments, serving as a bridge between research capabilities and production enterprise applications. Starting with just 2 engineers at the beginning of 2025, the team has grown to 52 by year-end, reflecting the massive demand for hands-on expertise in deploying LLMs at scale in enterprise environments.

The forward deployed engineering model at OpenAI deliberately targets high-value problems—typically generating or saving tens of millions to low billions in value—with the dual mandate of getting customers to production while extracting learnings that can inform OpenAI's product roadmap. This approach positions FDE as a strategic "SWAT team" rather than a traditional consulting organization, focusing on zero-to-one problem solving rather than long-term operational support.

## Early Examples and Pattern Establishment

The Morgan Stanley wealth management deployment in 2023 established many of the patterns that would become central to OpenAI's FDE methodology. Morgan Stanley was OpenAI's first enterprise customer to deploy with GPT-4, and the use case involved making research reports accessible to wealth advisors through an LLM-powered retrieval system. Critically, this was before RAG (Retrieval-Augmented Generation) had become a standardized approach, requiring the team to develop custom retrieval tuning methods.

The Morgan Stanley engagement revealed a fundamental insight about enterprise LLM deployment: technical capability and production readiness operate on very different timelines. While the core technical pipeline—including retrieval, guardrails, and basic functionality—was operational within six to eight weeks, it took an additional four months of piloting, collecting evaluations, and iterative refinement to build sufficient trust among wealth advisors for actual adoption. The payoff was substantial: 98% adoption among advisors and a 3x increase in research report usage. This pattern of relatively quick technical validation followed by extended trust-building would become a recurring theme in FDE engagements.

The Morgan Stanley case also highlighted the importance of selecting genuinely high-stakes use cases rather than edge cases. Wealth management represents a core business function for Morgan Stanley, and this strategic positioning ensured both organizational commitment and meaningful impact measurement. Working in a regulated environment also forced the development of rigorous accuracy frameworks suited to probabilistic technologies—lessons that would transfer to other regulated industries.

## Technical Methodology: Eval-Driven Development

OpenAI's FDE team has developed what they term "eval-driven development" as their core technical methodology. This approach recognizes that LLM applications in production cannot rely purely on probabilistic behavior but require a sophisticated interplay between LLM capabilities and deterministic verification systems.

The semiconductor company engagement provides a detailed illustration of this methodology in practice. Working with a European semiconductor manufacturer, the FDE team embedded on-site to understand the full value chain—from chip design through verification to performance measurement.
They identified verification as the highest-value target, where engineers spent 70-80% of their time on bug fixing and compatibility maintenance rather than new development.

The team's approach with the semiconductor customer demonstrates the layered nature of eval-driven development. They started by forking Codex (OpenAI's code generation model) and adding extensive telemetry to enable detailed evaluation tracking. Working with customer experts, they created labeled evaluation sets that captured the trajectory a human engineer would follow when debugging—typically a sequence of about 20 different actions checking various logs and system states.

The development process itself followed an iterative pattern of increasing automation and capability. Initially, the system would investigate bugs and create tickets for human engineers with preliminary analysis. Once this advisory mode built sufficient trust, the team progressed to having the model attempt fixes and raise pull requests. This then revealed the need for an execution environment where the model could test its own code, leading to further iteration. The result—termed the "debug investigation and triage agent"—aims to have most overnight test failures automatically resolved by mid-2026, with only the most complex issues requiring human attention and those being clearly documented. The team is currently achieving 20-30% efficiency improvements in the divisions where this has been rolled out, targeting an eventual 50% efficiency gain across the full value chain spanning ten different use cases.

## Balancing Determinism and Probabilistic Reasoning

A critical theme throughout the FDE engagements is the strategic decision of when to rely on LLM capabilities versus when to enforce deterministic rules. The automotive supply chain optimization system built for a customer in APAC demonstrates this principle clearly. The system addressed supply chain coordination challenges that previously required manual phone calls across multiple teams—manufacturing, logistics, and procurement—whenever disruptions occurred.

The FDE team built a multi-layered system that leveraged LLMs for insights and coordination while maintaining deterministic guardrails for business-critical constraints. The architecture separated concerns explicitly: business intelligence queries that required synthesizing data from multiple sources (data warehouses, SharePoint, etc.) were handled through LLM orchestration, with the model determining when to combine different data sources. However, core business rules—such as maintaining minimum numbers of suppliers for critical components or ensuring all materials remained covered—were enforced deterministically, never trusting the LLM to maintain these constraints.

The system also incorporated a simulator for supply chain optimization that the LLM could invoke to run multiple scenarios. Rather than asking the LLM to directly optimize complex trade-offs between cost, lead time, and other factors, the system gave it access to the same simulation tools a human analyst would use. The LLM would then run multiple simulations (five in the demo, but potentially hundreds or thousands in production) and present the trade-offs with its recommendation, while the final decision remained subject to deterministic validation.
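The division of labor between these layers is easier to see in code. The sketch below is illustrative only, not the customer's actual system: the part names, the minimum-supplier constraint, the toy simulator, and the plan-proposal function are all invented for the example, and the LLM step is replaced by a stub rather than a real model call.

```python
from dataclasses import dataclass
from typing import List
import random

# --- Deterministic layer: business rules that are never delegated to the model ---

MIN_SUPPLIERS_PER_CRITICAL_PART = 2  # hypothetical constraint for illustration

@dataclass
class SupplyPlan:
    part: str
    suppliers: List[str]
    cost: float
    lead_time_days: int

def validate_plan(plan: SupplyPlan) -> List[str]:
    """Hard guardrails enforced in code, regardless of what the model proposes."""
    violations = []
    if len(plan.suppliers) < MIN_SUPPLIERS_PER_CRITICAL_PART:
        violations.append(f"{plan.part}: needs >= {MIN_SUPPLIERS_PER_CRITICAL_PART} suppliers")
    if plan.lead_time_days <= 0:
        violations.append(f"{plan.part}: lead time must be positive")
    return violations

# --- Simulation layer: the same tool a human analyst would use ---

def run_simulation(plan: SupplyPlan, seed: int) -> dict:
    """Toy stand-in for a supply chain simulator; returns projected cost and risk."""
    rng = random.Random(seed)
    disruption_risk = rng.uniform(0.05, 0.4) / len(plan.suppliers)
    return {
        "projected_cost": plan.cost * rng.uniform(0.95, 1.1),
        "disruption_risk": disruption_risk,
    }

# --- Probabilistic layer (stubbed): the model proposes candidate plans ---

def propose_candidate_plans(disruption: str) -> List[SupplyPlan]:
    """In a real system this would be an LLM call that synthesizes data sources
    (ERP, data warehouse, documents) into candidate plans. Stubbed here."""
    return [
        SupplyPlan("brake-module", ["supplier-a", "supplier-b"], 1_000_000, 21),
        SupplyPlan("brake-module", ["supplier-a"], 900_000, 14),  # violates the guardrail
    ]

if __name__ == "__main__":
    for plan in propose_candidate_plans("port closure in region X"):
        violations = validate_plan(plan)
        if violations:
            print("rejected:", violations)
            continue
        results = [run_simulation(plan, seed) for seed in range(5)]  # 5 scenarios, as in the demo
        avg_risk = sum(r["disruption_risk"] for r in results) / len(results)
        print(f"accepted: {plan.part} via {plan.suppliers}, avg disruption risk {avg_risk:.2%}")
```

The important property is that the constraint checks and the simulator remain plain, auditable code paths, while the probabilistic component is free to get smarter without ever being trusted for correctness.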
This layered approach—LLMs for complex reasoning and synthesis, deterministic systems for critical constraints, and simulation tools for optimization—represents a mature pattern for production LLM systems in high-stakes environments. The application included transparency features like detailed reasoning explanations, tabular data for verification, and visual map representations, all designed to maintain human trust and enable oversight.

## From Engagement to Product: The Path to Generalization

One of the most significant aspects of OpenAI's FDE model is its explicit focus on extracting product insights from customer engagements rather than building a sustainable consulting business. The evolution of the Swarm framework and Agent SDK illustrates this path from custom solution to platform capability.

The journey began with Klarna's customer service deployment in 2023. The core challenge was scalability: manually writing prompts for 400+ policies was untenable. The FDE team developed a method for parameterizing instructions and tools, wrapping each intent with evaluation sets to enable scaling from 20 policies to 400+. This approach worked well enough that OpenAI decided to codify it as an internal framework called Swarm, which was later released as open source. When Swarm gained significant traction on GitHub, it validated market demand for the underlying patterns.

The team then took the learnings to T-Mobile for an engagement described as "10x more complex" in terms of volume and policy complexity. The fact that the Swarm patterns still worked at this scale—with some extensions—provided further validation. These successive validations gave the FDE team the evidence needed to work with OpenAI's product organization to build what became the Agent SDK, with the recent AgentKit release representing the visual-builder continuation of that original 2023 pattern.

This progression from custom solution → internal framework → validated at scale → product offering represents the idealized FDE-to-product pathway that OpenAI is trying to replicate across other domains. The team explicitly aims for a pattern where the first customer engagement might yield 20% reusable components, the next two or three engagements push that to 50% reusability, and at that point the solution gets pushed to OpenAI's scaled business operations for broader deployment. However, Jarvis emphasizes they're still "very much at the start of that journey" with most product hypotheses.

## Organizational Structure and Economics

The FDE team's economics and organizational model differ significantly from traditional consulting. Rather than relying on services revenue as a primary income stream, OpenAI views FDE investments as bets on future product revenue and research insights. This positioning gives the team permission to say no to lucrative opportunities that don't align with strategic objectives.

The team explicitly splits capacity along two axes: some engagements are driven by clear product hypotheses where OpenAI is seeking the perfect design partner for a specific capability (customer service, clinical trial documentation, etc.). Other engagements target industries with interesting technical challenges—semiconductors, life sciences—where the learning objective is more research-oriented, with the belief that improving model capabilities on these challenging problems will benefit OpenAI broadly even without immediate product extraction. This dual mandate creates organizational clarity but requires discipline.
Jarvis describes the challenge of resisting short-term consulting revenue that could pull the team away from strategic focus—a failure mode he observed in consulting firms that aspired to become product companies but got trapped by services revenue. OpenAI's foundation as a research company that evolved into a product company helps maintain this discipline.

The team operates explicitly as a zero-to-one function, not providing long-term operational support. Engagements typically involve either handing off to the customer's internal engineering teams or working alongside partners who will take over operations. This allows the FDE team to move from one hard problem to the next rather than getting bogged down in maintenance.

## Technical Patterns and Anti-Patterns

Through their work, the FDE team has identified several technical patterns that consistently lead to success and anti-patterns to avoid. The most significant anti-pattern is generalizing too early—attempting to build horizontal solutions before deeply understanding specific customer problems. Jarvis describes cases where OpenAI identified features in ChatGPT that seemed like they'd generalize well to enterprises, then went looking for problems to solve with them. These efforts typically failed because they skipped the zero-to-one discovery phase. Conversely, almost every engagement that started by going "super deep on the customer's problem" yielded generalizable insights, even when generalization wasn't the initial objective. This validates the principle of solving specific, high-value problems thoroughly rather than building abstract solutions.

On the technology side, Jarvis identifies the "metadata translation layer" as an underappreciated but crucial component of enterprise LLM deployments. Most FDE time goes into creating the translation layer between raw data and business logic that enables LLMs to make effective use of information—determining when to combine data warehouses with SharePoint contents, for instance. This echoes traditional data integration challenges but with new importance given LLMs' role as autonomous agents over that data.

The team has also converged on a technical stack that consistently appears across engagements: orchestration of workflows, comprehensive tracing and telemetry, labeled data and evaluation frameworks, and guardrails for runtime protection. Jarvis notes that enterprises strongly prefer integrated solutions over point solutions for each component, as integration complexity adds significantly to the already substantial challenge of getting LLM applications to production.

## Industry-Specific Considerations

The case studies reveal how LLM deployment patterns vary across industries based on regulatory environment, data characteristics, and organizational culture. Financial services (Morgan Stanley) required extensive trust-building in a regulated environment with high accuracy requirements, but the organization had clear KPIs and strong executive sponsorship that enabled ultimate success at scale.

Manufacturing and semiconductors (the European chip manufacturer, APAC automotive company) presented deeply technical domains with expert users who needed significant capability depth. These engagements required extensive domain learning by the FDE team—embedding on-site for weeks to understand value chains and workflows. The payoff came from targeting the highest-value bottlenecks (verification for semiconductors, supply chain coordination for automotive) rather than attempting comprehensive solutions.
The patterns also reveal that successful enterprise AI deployments target core business functions with genuinely high stakes rather than peripheral use cases. This strategic positioning ensures organizational commitment, makes impact measurement clear, and forces the development of production-grade reliability standards. Lower-stakes deployments might be easier but provide less organizational learning and commitment.

## Tooling and Infrastructure Insights

Jarvis highlights several underappreciated tools in the LLMOps landscape. The OpenAI Playground—the pre-ChatGPT interface for interacting directly with the API—remains valuable for rapid use case validation. He describes using it for quick N=10 tests, like screenshotting web pages to validate browser automation use cases: if 7-8 attempts work, production viability is plausible. This low-friction validation approach helps avoid investing in complex infrastructure before confirming basic feasibility.

Codex receives particular attention as the tool that enabled truly autonomous operation—Jarvis describes returning from four hours of meetings to find work completed autonomously for the first time. The semiconductor debugging agent built on Codex demonstrates how code generation models can be forked and extended with domain-specific tuning and execution environments to create specialized agent capabilities.

Looking forward, Jarvis speculates that 2026 might be the "year of optimization," in which the infrastructure built during the "year of agents" enables widespread fine-tuning of models for specific agentic use cases. The hypothesis is that with orchestration, evaluation, and guardrail infrastructure now established, organizations will be positioned to collect training data, rapidly label it, and fine-tune models for specialized domains like chip design or drug discovery, moving beyond general-purpose agent capabilities to highly optimized domain-specific systems.

## Challenges and Realistic Assessments

The interview provides a balanced perspective on the challenges of enterprise LLM deployment. The six-month timeline to build trust at Morgan Stanley after achieving technical functionality in 6-8 weeks illustrates the gap between capability and adoption. The semiconductor engagement targeting 50% efficiency gains but currently achieving 20-30% shows the distance between vision and reality, even in successful deployments.

Jarvis also acknowledges organizational challenges, describing a period from the first OpenAI Dev Day through much of 2024 when the company's focus swung heavily to consumer (B2C) applications, leaving the B2B-focused FDE team feeling their enterprise work wasn't receiving appropriate attention despite successful deployments at Morgan Stanley, Klarna, and others. Swarm's release as open source was driven partly by this lack of internal interest. The pendulum swung back toward enterprise in late 2024, enabling the FDE business case approval and subsequent rapid growth, but this oscillation reflects real organizational tension in a company serving both consumer and enterprise markets.

The admission that the team has "made plenty of mistakes" and the specific identification of premature generalization as the biggest error provide valuable learning for others building FDE practices. The emphasis on "doing things that don't scale"—invoking Paul Graham's startup advice—in an organization with OpenAI's resources and market position suggests this principle remains relevant even at significant scale.
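That same "do things that don't scale" instinct is behind the quick N=10 feasibility checks described in the tooling section above. As a rough illustration of what such a check might look like outside the Playground, here is a sketch using the OpenAI Python SDK; the screenshot URLs, prompt, model choice, and pass criterion are all placeholders rather than details from the engagement.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Ten representative screenshots of the target workflow (hypothetical URLs).
screenshots = [f"https://example.com/screens/step_{i}.png" for i in range(10)]

QUESTION = (
    "You are validating a browser-automation use case. "
    "From this screenshot, identify the button a user should click "
    "to submit the form. Answer with the button label only."
)

successes = 0
for url in screenshots:
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model would do for a feasibility check
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": QUESTION},
                {"type": "image_url", "image_url": {"url": url}},
            ],
        }],
    )
    answer = (response.choices[0].message.content or "").strip()
    print(url, "->", answer)
    # In practice each answer would be compared against a hand-labeled expectation;
    # counting non-empty responses is just a stand-in for that check.
    successes += bool(answer)

print(f"{successes}/10 plausible answers; roughly 7-8 suggests the use case merits a deeper build")
```

If roughly 7-8 of the 10 answers hold up against hand-labeled expectations, it is worth investing in real orchestration, tracing, and evaluation infrastructure; if not, the idea can be discarded before any of that is built.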
## Strategic Implications for LLMOps

This case study reveals several strategic insights for LLMOps practitioners. First, the separation between technical capability and production deployment is substantial and requires different skill sets—not just engineering excellence but domain understanding, trust-building with end users, and organizational change management. The FDE model explicitly addresses this gap through embedded, multi-month engagements.

Second, the evolution from custom solution to product requires multiple validation cycles at increasing scale and complexity. The Swarm-to-Agent-SDK pathway needed validation at Klarna, further validation at T-Mobile at 10x complexity, and GitHub traction before product investment made sense. Organizations should expect this multi-step validation process rather than trying to productize from a single engagement.

Third, the economics of high-touch enterprise AI deployment require either treating it as a strategic investment (OpenAI's approach) or ensuring engagement sizes justify the cost. OpenAI's focus on tens-of-millions to low-billions in value creation provides the economic headroom for substantial FDE investment. Smaller organizations would need to either target similarly high-value engagements or find ways to reduce deployment friction through better tooling and processes.

Finally, the balance between deterministic and probabilistic components represents a crucial architectural decision. The pattern of using LLMs for complex reasoning and synthesis while enforcing business-critical constraints deterministically, with simulation tools for optimization, provides a template for production LLM systems in high-stakes environments. This architecture acknowledges both the power and limitations of current LLM capabilities while creating verifiable, trustworthy systems.
