The LLMOps Database crossed 1,200 case studies this month. Since I last published a summary, I've catalogued another 400 production deployments: real systems handling real traffic, built by teams who've learned the hard way what actually works when the on-call engineer is asleep.
This post distils the key findings. I begin by examining why general engineering maturity remains the primary predictor of success. From there, I explore six specific trends defining the current landscape, among them the shift to context engineering, the migration of guardrails into infrastructure, the rise of "shadow testing," and the stabilising role of the Model Context Protocol (MCP).
For the full analysis with detailed case studies across all themes, [read the complete report →]
The Core Finding: Engineering Remains Central
A consistent pattern emerges across 1,200 case studies. Teams shipping reliable production LLM systems rely heavily on robust software engineering fundamentals. While AI-specific skills like context engineering are vital, they operate best when supported by solid infrastructure.
Generative AI does not bypass traditional engineering disciplines. Indeed, the challenge often lies less in model capability and more in the infrastructure required to deploy that capability reliably. Distributed systems, networking, durable execution, and state management provide the necessary foundation. These core skills allow specialized AI techniques to function at enterprise scale.
LinkedIn's AI team noted that while prompt engineering involves significant nuance, the surrounding system demands rigorous engineering. Building reliable products requires async pipelines, streaming response handling, and capacity management alongside model optimization.
GetOnStack's experience illustrates this balance. Their multi-agent system's weekly costs escalated from $127 to $47,000 over four weeks due to an undetected infinite conversation loop. They subsequently invested six weeks in building message queues, circuit breakers, and monitoring. These standard distributed systems components proved essential for controlling their AI agents.
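A minimal sketch of the kind of safeguard that catches this failure class early, using only the standard library. The names and thresholds are illustrative, not GetOnStack's implementation:

```python
import hashlib
from collections import deque

class LoopDetector:
    """Breaks agent-to-agent conversations that start repeating themselves."""

    def __init__(self, repeat_limit: int = 3, window: int = 50):
        self.repeat_limit = repeat_limit
        self.recent = deque(maxlen=window)

    def check(self, sender: str, message: str) -> None:
        # Near-identical messages bouncing between the same agents are the
        # signature of an infinite conversation loop.
        fingerprint = hashlib.sha256(
            f"{sender}:{message.strip().lower()}".encode()
        ).hexdigest()
        if self.recent.count(fingerprint) >= self.repeat_limit:
            raise RuntimeError("Conversation loop detected: halting agents and alerting on-call")
        self.recent.append(fingerprint)
```

Twenty lines of plain distributed-systems hygiene would have capped that $47,000 week at its first repeated message.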
Once this engineering foundation is established, however, distinct patterns emerge in how teams handle the unique challenges of LLMs. Beyond the basics of reliability and uptime, specific architectural choices set successful deployments apart.
Six Themes Emerging from Production
1. From Demos to Real Business Outcomes
The proof-of-concept era is visibly ending. The database now regularly tracks systems handling genuinely critical business functions with quantified impact.
Stripe's fraud detection improved card-testing accuracy from 59% to 97% for their largest merchants, on infrastructure processing 1.3% of global GDP. nib's chatbot generates $22 million in documented savings across 4 million interactions. Ramp's policy agent handles 65% of expense approvals autonomously.
The organisations extracting real value aren't the ones with the most innovative demos. They're the ones doing the unglamorous engineering work: building evaluation pipelines, implementing guardrails, and designing for uncertainty.
2. Context Engineering > Prompt Engineering
If 2023 was the year of prompt engineering (learning how to talk to models), 2024-2025 have marked the rise of context engineering: learning how to architect the information models consume.
The core insight: just because you can fit everything into a million-token context window doesn't mean you should. Manus found that "context rot" begins between 50k–150k tokens regardless of theoretical limits. Dropbox encountered "analysis paralysis" when exposing too many tools. The model spent more time deciding which tool to use than actually acting.
The winning pattern is "just-in-time" context injection: dynamically assembling only what's needed based on the user's immediate state. Shopify collocates instructions with tool outputs rather than loading everything upfront. Elyos AI actively removes context that's served its purpose, calling it "just-in-time in, just-in-time out."
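Conceptually, the pattern looks something like the sketch below: assemble only what this turn needs, collocate instructions with the tool outputs they govern, and prune spent context before the next turn. The function names and message structure are illustrative, not any of these teams' code:

```python
SYSTEM_PROMPT = "You are a support agent. Work only from the material provided below."

def build_context(task_docs: list[str], tool_outputs: list[dict], user_message: str) -> list[dict]:
    """Just-in-time in: assemble only the context this turn actually needs."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]

    # Only the handful of documents relevant to the current task,
    # retrieved for this turn, not the user's entire history.
    messages += [{"role": "user", "content": doc} for doc in task_docs[:3]]

    # Collocate instructions with the tool outputs they apply to, instead of
    # front-loading every possible instruction into the system prompt.
    for out in tool_outputs:
        messages.append({"role": "user", "content": f"{out['instructions']}\n\n{out['result']}"})

    messages.append({"role": "user", "content": user_message})
    return messages

def prune_after_turn(messages: list[dict], spent: set[int]) -> list[dict]:
    """Just-in-time out: drop context that has already served its purpose."""
    return [m for i, m in enumerate(messages) if i not in spent]
```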
Leaner contexts make models smarter, as well as faster and cheaper. The million-token window is a ceiling to stay under, rather than a feature to exploit.
3. Guardrails Move from Prompts to Infrastructure
The limitations of prompt-based guardrails are now well understood. As Oso's framework puts it: "what 1997 was for SQL injection, 2025 is for prompt injection."
The most reliable guardrails are implemented in code, not prompts. Oso's "session tainting" automatically blocks agents from using communication tools after they've touched both untrusted content and sensitive data, regardless of what the model tries to do. Wakam implements dual-layer permissions: one controlling what humans see, another controlling what agents access. Komodo Health stripped their LLM of all authorisation logic entirely. The APIs the LLM calls handle security, and if they return 403 Forbidden, the LLM simply says "I can't access that."
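The session-tainting idea fits in a few lines of ordinary code. This is a sketch of the concept with hypothetical tool names, not Oso's actual implementation:

```python
COMMUNICATION_TOOLS = {"send_email", "post_slack_message", "create_ticket"}

class AgentSession:
    """Tracks taint so exfiltration paths are blocked in code, not prompts."""

    def __init__(self):
        self.touched_untrusted = False   # e.g. content fetched from the open web
        self.touched_sensitive = False   # e.g. rows from a customer database

    def record_tool_result(self, untrusted: bool, sensitive: bool) -> None:
        self.touched_untrusted |= untrusted
        self.touched_sensitive |= sensitive

    def authorize_tool_call(self, tool: str) -> None:
        # Once the session has touched BOTH untrusted input and sensitive data,
        # communication tools are off-limits regardless of what the model asks for.
        if tool in COMMUNICATION_TOOLS and self.touched_untrusted and self.touched_sensitive:
            raise PermissionError(f"Session tainted: refusing to call {tool}")
```

No prompt can talk its way past a raised exception.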
The pattern is clear: architectural constraints provide guarantees that prompt engineering cannot.
4. Evaluation Grows Up
The phrase "evals are the new unit tests" has become a mantra for good reason. Ramp's approach is instructive: they discovered that user feedback requires careful interpretation. Finance teams might approve expenses that technically violate policy out of convenience. Treating user actions as ground truth would bias the system toward excessive leniency. Their solution: "golden datasets" independently labelled to establish correct decisions based solely on available information.
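The "evals as unit tests" framing translates almost literally into code. A sketch assuming a JSONL file of independently labelled cases and a `decide` function under test; none of this is Ramp's actual harness:

```python
import json

def evaluate(decide, golden_path: str, threshold: float = 0.95) -> bool:
    """Score an approval agent against independently labelled golden cases."""
    with open(golden_path) as f:
        cases = [json.loads(line) for line in f]

    correct = sum(
        # Labels come from reviewers applying the written policy,
        # not from what users happened to click.
        decide(case["expense"]) == case["correct_decision"]
        for case in cases
    )
    accuracy = correct / len(cases)
    print(f"golden-set accuracy: {accuracy:.1%} on {len(cases)} cases")
    return accuracy >= threshold  # gate deployment like a failing unit test
```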
Shadow testing before live deployment has become standard for high-stakes applications. Ramp runs agents in "shadow mode" on real transactions, comparing predictions to human decisions. They only enable live actions once shadow accuracy hits specific thresholds, testing guardrails on real financial data without risking a dollar.
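The gating logic behind shadow mode can be as simple as the sketch below; the window size and threshold are illustrative, not Ramp's numbers:

```python
from collections import deque

class ShadowGate:
    """Run the agent alongside humans; enable live actions only once it earns it."""

    def __init__(self, window: int = 1000, threshold: float = 0.98):
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold
        self.live = False

    def observe(self, agent_decision: str, human_decision: str) -> None:
        # In shadow mode the agent's decision is logged, never executed.
        self.outcomes.append(agent_decision == human_decision)
        if len(self.outcomes) == self.outcomes.maxlen:
            accuracy = sum(self.outcomes) / len(self.outcomes)
            if accuracy >= self.threshold:
                self.live = True  # sustained agreement over a full window
```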
Circuit breakers acknowledge that perfect reliability is impossible. Cox Automotive implements hard limits on cost and conversation turns; when thresholds are exceeded, the system gracefully hands off to humans. The goal isn't preventing all failures but ensuring failures are detected quickly and handled gracefully.
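A circuit breaker of this kind is deliberately boring. A sketch with illustrative limits, not Cox Automotive's implementation:

```python
class CircuitBreaker:
    """Hard limits on spend and turns; trips to a human instead of failing silently."""

    def __init__(self, max_cost_usd: float = 2.00, max_turns: int = 12):
        self.max_cost_usd = max_cost_usd
        self.max_turns = max_turns
        self.cost_usd = 0.0
        self.turns = 0

    def record_turn(self, turn_cost_usd: float) -> str:
        self.cost_usd += turn_cost_usd
        self.turns += 1
        if self.cost_usd > self.max_cost_usd or self.turns > self.max_turns:
            return "handoff_to_human"  # graceful degradation, not a stack trace
        return "continue"
```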
5. MCP Quietly Stabilises
While agent harnesses and memory systems remain in flux, the Model Context Protocol has achieved something unusual: relative stability. Enterprises are building servers, SaaS companies are exposing APIs, and a practical knowledge base is accumulating.
Loblaws wrapped 50+ internal APIs so their orchestration agent could handle complex workflows like shopping for recipe ingredients. Sentry's MCP server handles 60 million requests per month with a three-person team.
But the database also documents MCP's real challenges. CloudQuery discovered that tool naming matters more than expected: a tool named example_queries sat unused for two weeks; renaming it to known_good_queries with a verbose description made it frequently used. Tools are prompts. Digits explicitly rejected MCP for production: "all our data is internal and major security questions remain unresolved."
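Those naming lessons show up directly in server code, because the function name and docstring become the name and description the model reads. A sketch using the MCP Python SDK's FastMCP helper; the server name, tool, and query library are illustrative, not CloudQuery's code:

```python
from mcp.server.fastmcp import FastMCP

# Hypothetical library of vetted queries, keyed by topic.
QUERY_LIBRARY: dict[str, list[str]] = {}

mcp = FastMCP("warehouse-helper")

@mcp.tool()
def known_good_queries(topic: str) -> list[str]:
    """Return vetted, known-correct SQL queries for the given topic.

    Prefer adapting one of these over writing a new query from scratch;
    they are tested against the production warehouse schema.
    """
    # The name and docstring above are prompt text: they decide whether
    # the model ever picks this tool up.
    return QUERY_LIBRARY.get(topic, [])

if __name__ == "__main__":
    mcp.run()
```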
MCP is settling into its appropriate role as infrastructure for connecting agents to existing systems, rather than a solution in itself.
6. Stop Waiting for the Frontier
There's a seductive narrative that the next model release will solve your production problems. The database tells a different story.
Cubic's AI code review agent got worse when they gave it more tools. The agent became confused, generating false positives until developers stopped trusting it. Instead of a smarter model, they needed streamlining: fewer capabilities, better results. Alexa's team found that adding more few-shot examples decreased accuracy because the model overfitted to examples instead of reasoning.
Stripe's compliance lead describes end-to-end automation as a "fairytale." Instead, they decomposed complex reviews into bite-sized tasks on strict "rails," ensuring that even unexpected model behaviour is contained to a single task.
The teams shipping systems that work aren't waiting for GPT-6. Robinhood reduced P90 latencies from 55 seconds to under one second through hierarchical tuning (prompt optimisation, trajectory tuning, LoRA fine-tuning) rather than waiting for faster models. Care Access cut costs 86% by separating static medical records (cached) from dynamic questions: pure infrastructure engineering.
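The caching win is largely about prompt ordering: keep the large, unchanging portion byte-identical across calls so provider-side or local prefix caching can hit, and append only the small dynamic part. A generic sketch of that message structure, not Care Access's pipeline:

```python
def build_messages(static_record: str, question: str) -> list[dict]:
    """Keep the expensive, unchanging prefix identical across calls."""
    return [
        # Static medical record first: byte-identical on every call, so a
        # prompt cache can reuse the prefix instead of reprocessing it.
        {"role": "system", "content": "Answer strictly from the record below."},
        {"role": "user", "content": f"<record>\n{static_record}\n</record>"},
        # Dynamic question last: the only part that changes per call.
        {"role": "user", "content": question},
    ]
```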
The allure of the frontier is that it promises to make engineering unnecessary. It doesn't. Better models shift where challenges lie; they don't eliminate them.
What This Means for Practitioners
The implication is straightforward: invest in engineering.
The organisations winning with LLMs are distinguished by their infrastructure, their evaluation practices, their operational discipline, and their willingness to treat AI systems with the same rigour they'd apply to any mission-critical technology. Model access and prompt libraries matter far less than these fundamentals.
The experimentation phase has ended. The engineering phase has begun.
[Read the full report →] for detailed case studies on each theme, including specific architectural patterns, failure modes, and implementation details from companies including Stripe, Amazon, DoorDash, Shopify, LinkedIn, Manus, Ramp, Cox Automotive, and dozens more.
The LLMOps Database is maintained by ZenML. We catalogue production LLM deployments to help practitioners learn from what's actually working in the field. Explore the database →

