The LLMOps Database crossed 1,200 case studies this month. Since we last wrote one of these summaries, we've catalogued another 400 production deployments. These are real systems handling real traffic, built by teams navigating the gap between "it works in a notebook" and "it works at 2am when the on-call engineer is asleep."
This article distils what we're seeing across that growing corpus. (If you prefer a succinct post with just the high-level takeaways, the executive summary highlights the core trends.) Rather than predicting where the field is heading, this article focuses on patterns emerging from where it already is. The trends that follow derive directly from what practitioners are actually doing to ship reliable LLM systems, rather than theoretical frameworks imposed from outside.
What follows covers this terrain in detail: the shift from demos to real engineering, the emergence of context engineering as a distinct discipline, the stabilisation of MCP as integration infrastructure, the maturation of evaluation and guardrail practices, the uncomfortable truth that software engineering skills matter more than AI expertise, and the persistent allure of frontier models that don't actually solve production problems.
This is a practical assessment of what's working, what isn't, and what the teams shipping production systems have learned along the way.
1. Real Engineering Replaces POC Demos

When we first started the LLMOps Database, much of what we catalogued fell into the "interesting experiment" category: proof-of-concept deployments, weekend RAG chatbots, and systems that quietly disappeared when confronted with real traffic. That has changed. Companies have moved beyond experimenting with AI as a productivity add-on to rebuilding core business processes around LLM capabilities, and the evidence shows up in the metrics that matter: revenue impact, operational scale, and measurable outcomes.
Real Business Outcomes
The clearest signal that we've moved past the experimentation phase is the emergence of LLM systems handling genuinely critical business functions. These are core revenue-generating processes rather than adjacent tools.
Take Stripe's approach to fraud detection. They've built a domain-specific foundation model that processes payments representing roughly 1.3% of global GDP. Unlike a support chatbot, this is infrastructure that sits in the critical path of every transaction. Their architecture treats each payment as a token and user behavior sequences as context windows, ingesting tens of billions of transactions. The practical result is that card-testing fraud detection accuracy improved from 59% to 97% for their largest merchants.
Amazon's Rufus provides another data point. During Prime Day, the system scaled to 80,000 Trainium chips while serving conversational shopping experiences to 250 million users. The team reported 140% year-over-year monthly user growth and a 60% increase in purchase completion rates. What's worth noting here is the architectural evolution: Amazon moved from a custom in-house LLM to a multi-model approach orchestrating Amazon Nova, Claude, and specialized models.
Similarly, DoorDash rebuilt their recommendation engine to handle their expansion beyond restaurant delivery. Scaling from 100-item menus to 100,000+ item retail catalogues creates cold-start problems. Their hybrid retrieval system, which infers grocery preferences from restaurant order history, delivered double-digit improvement in click-through rates and directly addressed the personalisation challenges that come with entering new verticals.
Processing at Scale
The systems now making it into the database are operating at volumes that would have seemed aspirational even a year ago.
ByteDance processes billions of videos daily for content moderation across TikTok and other platforms. They've deployed multimodal LLMs on AWS Inferentia2 chips across multiple global regions, implementing tensor parallelism, INT8 quantization, and static batching to achieve 50% cost reduction while maintaining the latency requirements of a real-time social platform.
Shopify's product classification system handles 30 million predictions daily, sorting products into over 10,000 categories with an 85% merchant acceptance rate. Their Sidekick assistant evolved from simple tool-calling into a sophisticated agentic platform, but the journey wasn't smooth. They encountered what they call the "tool complexity problem" when scaling from 20 tools to 50+ with overlapping functionality. Their solution uses Just-in-Time instructions that provide relevant guidance exactly when needed.
In the developer tools space, Cursor's Tab feature now handles over 400 million requests per day. Beyond the volume, their approach is instructive: they implemented an online reinforcement learning pipeline that updates based on user acceptance rates within hours, achieving a 28% increase in code acceptance. Their recent work adapting to OpenAI's Codex models uncovered that dropping reasoning traces caused 30% performance degradation.
Quantified Revenue Impact
We're increasingly seeing organisations move past vague "efficiency gains" to report specific financial outcomes.
nib, an Australian health insurer, has been running their Nibby chatbot since 2018, now enhanced with modern LLMs. The system has handled over 4 million interactions and generates approximately $22 million in documented savings. They achieve 60% chat deflection, and their call summarisation feature reduced after-call work by 50%. These are measured results rather than projections.
The PGA Tour's content generation system offers a different angle. They reduced article generation costs by 95% to $0.25 per article, now producing 800 articles per week across eight content types. Their AI-generated content has become their highest-engagement material on non-tournament days, driving billions of page views annually. The multi-agent architecture, with specialised agents for research, data extraction, writing, validation, and image selection, demonstrates what production LLMOps actually looks like versus a demo.
In financial services, Riskspan transformed private credit deal analysis from a 3-4 week manual process to 3-5 days. They cut per-deal processing costs roughly 90-fold, to under $50, which matters considerably when addressing a $9 trillion market opportunity. Their system uses Claude to dynamically generate code that models investment waterfalls, producing executable financial calculations rather than just extracting information.
CBRE, the world's largest commercial real estate firm, deployed a unified search assistant across 10 distinct data sources. They reduced SQL query generation time by 67% (from 12 seconds to 4 seconds) and improved database query performance by 80%. For property managers who previously navigated fragmented systems containing millions of documents, this represents a meaningful change in daily operations.
Autonomous Agents Doing Real Work
Perhaps the most notable shift is agents moving from "drafting assistance" to completing complex, multi-step workflows without human intervention.
Western Union and Unum partnered with AWS and Accenture/Pega to modernise mainframe systems, converting 2.5 million lines of COBOL code in approximately 1.5 hours. For Unum, this reduced a project timeline from an estimated 7 years to 3 months, while eliminating 7,000 annual manual hours in claims management. The architecture uses composable agents working through orchestration layers.
Ramp's policy agent now handles over 65% of expense approvals autonomously. Their design emphasises explainable reasoning with citations, built-in uncertainty handling that explicitly allows the agent to defer to humans when uncertain, and user-controlled autonomy levels. Their separate merchant classification agent processes requests in under 10 seconds (versus hours manually) and handles nearly 100% of requests, up from less than 3% that human teams could previously manage.
Harman International faced a familiar enterprise challenge: documenting 30,000 custom SAP objects accumulated over 25 years with minimal documentation, essential for their S/4HANA migration. Manual documentation by 12 consultants was projected to take 15 months with inconsistent results. Using AWS Bedrock and Amazon Q Developer with Claude, they reduced the timeline from 15 months to 2 months and cut costs by over 70%.
Search and Retrieval Remains Central
Despite periodic "RAG is dead" declarations, the most successful production systems we're tracking rely heavily on sophisticated retrieval architectures.
LinkedIn rebuilt their GenAI stack with a RAG-based pipeline at its core, supporting multi-agent orchestration. Their system routes queries to specialised agents (job assessment, company understanding, post takeaways), retrieves data from internal APIs and Bing, then generates contextual responses. One observation from their work that resonates with what we've seen elsewhere: reaching 80% quality happened quickly, but pushing past 95% required the majority of development time. This pattern, where the final stretch from "demo quality" to "production quality" consumes disproportionate effort, appears consistently across the database.
The organisations extracting real value aren't necessarily the ones with the most innovative demos—they're the ones doing the less glamorous engineering work: building evaluation pipelines, implementing guardrails, designing for uncertainty, and treating their LLM systems with the same rigour they'd apply to any critical infrastructure.
2. Context Engineering > Prompt Engineering

If 2023 was the year of prompt engineering (learning how to talk to models), then 2024 and 2025 have marked the rise of context engineering: learning how to architect the information models consume. We've watched this become one of the clearest dividing lines between teams that ship reliable LLM systems and those still wrestling with inconsistent results.
The shift is reflected in how practitioners describe their work. "Context engineering" has emerged as shared vocabulary for the architecture required to keep agents focused. Dropbox uses the term for the design work that prevents what they call "analysis paralysis" in their Dash AI assistant. Anthropic's engineering team distinguishes it from prompt engineering, defining it as the management of everything that goes into the context window: system prompts, tool definitions, conversation history, and retrieval strategy. The underlying thesis is straightforward: just because you can fit everything into a model's context window doesn't mean you should.
The Problem with More Context
The naive approach to building agents is to stuff all history, tools, and documentation into the context window. We've catalogued dozens of cases where this fails in production.
Manus, a Singapore-based agent platform, found that their typical tasks require around 50 tool calls, with production agents spanning hundreds of conversational turns. Every tool call generates observations that append to the message history, creating unbounded growth. They reference Anthropic's research noting that "context rot" often begins somewhere between 50k and 150k tokens, regardless of a model's theoretical million-token maximum. Even with prompt caching reducing cost and latency, performance still degrades: you're processing the same bloated context, just faster.
Dropbox encountered what they call "analysis paralysis" when exposing too many tools to their Dash agent. The more retrieval options available in the context, the more time the model spent deciding which tool to use rather than actually acting.
These aren't edge cases; they're the predictable result of treating context as a dumping ground.
Just-in-Time Context
The most common pattern we're seeing in production systems is what teams call "just-in-time" injection: dynamically assembling context based on the user's immediate state rather than loading everything upfront.
Shopify's Sidekick assistant collocates instructions with tool outputs rather than loading all instructions at the start. If a tool returns search results, the specific instructions on how to process those results appear right next to the data. This maintains cache efficiency and keeps the model focused on what's actually happening now.
Elyos AI, which builds voice agents for home services companies, takes this further. For emergency call-outs, their first step provides context to identify whether the situation qualifies as an emergency. Once that determination is made, they remove that context entirely and replace it with a single deterministic fact: "this is an emergency." The conversation history about how they reached that conclusion is no longer needed. They describe this as "just-in-time in, just-in-time out," actively cleaning context that's served its purpose.
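The mechanics are simple enough to show in a few lines. Below is a minimal sketch of the "just-in-time in, just-in-time out" idea (our illustration, not Elyos's code): step-specific material is injected right before the model call, then replaced with a single compact fact once it has served its purpose. The guidance text, field names, and classify_emergency helper are all assumptions.

```python
from dataclasses import dataclass, field

EMERGENCY_TRIAGE_GUIDE = "...guidance for deciding whether a call-out is an emergency..."

def classify_emergency(prompt: str) -> bool:
    return True  # stand-in for an LLM call returning a yes/no triage decision

@dataclass
class WorkingContext:
    system: str
    facts: dict[str, str] = field(default_factory=dict)   # durable, compact conclusions
    scratch: list[str] = field(default_factory=list)      # transient, step-specific material

    def render(self) -> str:
        fact_lines = [f"- {k}: {v}" for k, v in self.facts.items()]
        return "\n".join([self.system, "Known facts:", *fact_lines, *self.scratch])

def triage_step(ctx: WorkingContext, transcript: str) -> None:
    # Just-in-time in: load triage guidance and the transcript only for this step.
    ctx.scratch += [EMERGENCY_TRIAGE_GUIDE, f"Caller transcript:\n{transcript}"]
    is_emergency = classify_emergency(ctx.render())
    # Just-in-time out: drop the bulky material, keep a single deterministic fact.
    ctx.scratch.clear()
    ctx.facts["emergency"] = "yes" if is_emergency else "no"
```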
Tool Masking and Schema Shrinking
When you can't reduce the number of tools, you can at least reduce their complexity.
Databook's "tool masking" approach places a configuration layer between agents and the underlying tool handlers. Instead of exposing a full API with 100 fields, a mask might only reveal the 3 fields relevant to a particular task. Their example: a stock quote API that normally returns dozens of metrics gets masked to return only symbol, market price, and currency. The input schema is similarly simplified: the agent only needs to provide a ticker symbol, and everything else is either hardcoded or system-provided.
This approach treats tool definitions as prompts in their own right. Databook's head of applied AI describes it as the evolution from prompt engineering to context engineering, where context engineering includes engineering the surface of the tools themselves. The same underlying API can present different masks for different agents or different stages of a workflow.
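As a rough illustration of what a mask might look like in code (a sketch under our own assumptions, not Databook's implementation), consider a stock-quote API with dozens of fields:

```python
FULL_QUOTE_FIELDS = ["symbol", "marketPrice", "currency", "open", "high", "low",
                     "volume", "peRatio", "beta"]   # ...dozens more in the real API

QUOTE_MASK = {
    "exposed_inputs": {"symbol": {"type": "string", "description": "Ticker symbol"}},
    "fixed_inputs": {"exchange": "NASDAQ"},           # hardcoded or system-provided
    "exposed_outputs": ["symbol", "marketPrice", "currency"],
}

def masked_tool_schema(mask: dict) -> dict:
    """The schema the agent sees: one input field instead of the full API surface."""
    return {"name": "get_stock_quote",
            "description": "Return the current price for a ticker symbol.",
            "parameters": mask["exposed_inputs"]}

def call_masked_tool(mask: dict, agent_args: dict) -> dict:
    full_args = {**mask["fixed_inputs"], **agent_args}
    raw = real_quote_api(**full_args)                 # full response, dozens of fields
    return {k: raw[k] for k in mask["exposed_outputs"]}

def real_quote_api(symbol: str, exchange: str) -> dict:
    # Stand-in for the underlying API client.
    return {f: None for f in FULL_QUOTE_FIELDS} | {"symbol": symbol,
                                                   "marketPrice": 0.0, "currency": "USD"}
```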
Manus implements something similar with what they call logit masking. Rather than deleting tools from the context (which breaks caching), they mathematically prevent the model from selecting irrelevant tools during specific conversation states. The tools remain in the definition but are effectively invisible to the decision-making process.
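In generic terms (this is an illustration, not Manus's code), logit masking amounts to assigning negative infinity to disallowed tool choices at decode time while their definitions stay in the cached prompt:

```python
import numpy as np

def mask_tool_logits(logits: np.ndarray, tool_names: list[str],
                     allowed: set[str]) -> np.ndarray:
    masked = logits.copy()
    for i, name in enumerate(tool_names):
        if name not in allowed:
            masked[i] = -np.inf        # tool stays defined, but can never be picked
    return masked

logits = mask_tool_logits(np.zeros(3), ["browser", "shell", "send_email"],
                          allowed={"browser", "shell"})
probs = np.exp(logits)
probs /= probs.sum()                   # send_email gets probability 0 in this state
```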
Compaction Versus Summarisation
Managing context over long-running sessions requires distinguishing between reversible and irreversible reduction.
Manus makes a crucial distinction: compaction is reversible, summarisation is not. Compaction converts verbose tool outputs into minimal representations while keeping the full information recoverable. A file write confirmation might compact from path plus full content to just the path so the agent can read the file again if needed. Summarisation, by contrast, loses information permanently. They use it only as a last resort when compaction yields minimal gains.
Their approach is staged: trigger compaction first, typically on the oldest 50% of tool calls while keeping newer ones in full detail so the model retains fresh examples of proper tool usage. Only when multiple compaction rounds yield diminishing returns do they summarise, and even then they preserve the last few tool calls in full to maintain behavioural continuity.
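A hedged sketch of that staging (our own simplification; the compaction rules here are illustrative):

```python
def compact(tool_call: dict) -> dict:
    """Reversible reduction: keep only what's needed to recover the data later."""
    if tool_call["tool"] == "write_file":
        # Path plus full content -> path only; the agent can re-read the file.
        return {"tool": "write_file",
                "args": {"path": tool_call["args"]["path"]},
                "result": "written (content omitted; re-read the file if needed)"}
    return {**tool_call,
            "result": f"[compacted: {len(str(tool_call['result']))} chars; re-run tool to recover]"}

def compact_history(history: list[dict]) -> list[dict]:
    cutoff = len(history) // 2          # oldest 50% of tool calls
    return [compact(c) for c in history[:cutoff]] + history[cutoff:]
```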
LangChain's Lance Martin adds an isolation pattern: token-heavy sub-tasks get offloaded to specialised sub-agents that process their context independently and return only a summary or result to the main agent, preventing context contamination.
The File System as Context
Some teams are pushing context engineering into territory that might seem regressive but turns out to be remarkably effective.
Manus runs agents inside full virtual machine sandboxes, and they discovered that for many use cases, you don't need a vector database at all. The Linux file system itself becomes the context. The agent uses grep, cat, and ls to retrieve its own context on demand, effectively treating the operating system as its long-term memory. Token-heavy tool outputs get dumped to files; the context window holds only minimal references. When the model needs that information again, it reads the file.
Claude Code and similar coding assistants use this pattern: the codebase is the context, and file operations are the retrieval mechanism. The file system is already indexed, already persistent, and doesn't require building infrastructure on the fly.
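The offloading half of the pattern is worth seeing concretely. In this sketch (our illustration, with hypothetical paths), a token-heavy command result is written to disk and only a short reference enters the message history; retrieval happens later through ordinary shell tools:

```python
import pathlib
import subprocess
import tempfile

WORKDIR = pathlib.Path(tempfile.mkdtemp(prefix="agent-"))

def run_and_offload(cmd: list[str], name: str) -> str:
    """Run a command, dump its output to a file, return only a small reference."""
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    path = WORKDIR / f"{name}.txt"
    path.write_text(out)
    return f"Output saved to {path} ({len(out)} chars). Use grep/cat to inspect."

def grep_tool(pattern: str, path: str) -> str:
    """Retrieval on demand: pull back only the matching lines."""
    return subprocess.run(["grep", "-n", pattern, path],
                          capture_output=True, text=True).stdout
```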
That said, this isn't universal. For integrating enterprise knowledge bases or long-term memory across sessions, vector indexes become necessary. The scale determines the approach. But it's worth noting how many teams have found that simpler retrieval mechanisms work better than sophisticated semantic search when the context is naturally bounded.
Dual Embeddings and Specialised Representations
When retrieval is required, we're seeing teams move beyond single-embedding approaches.
Glowe, a skincare recommendation system built on Weaviate, creates two distinct embeddings for the same product. One embedding captures descriptive metadata (what the product is), and a second embedding captures user reviews and effects (what the product does). They use TF-IDF weighting to ensure rare but meaningful effects aren't drowned out by generic descriptions in the context. When recommending products for specific skin concerns, they search the effect embeddings rather than the product embeddings.
This pattern of separating concerns at the embedding level allows more targeted retrieval. The model doesn't receive everything about a product. Instead, it receives what's relevant to the current query. It's another form of context engineering: controlling not just what goes into the context but how that information is represented and retrieved.
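A generic sketch of the dual-embedding setup (our illustration rather than Glowe's Weaviate implementation; the toy embed function stands in for a real embedding model and the TF-IDF weighting is omitted):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy stand-in: swap in a real sentence-embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

class ProductIndex:
    def __init__(self) -> None:
        self.products: list[dict] = []
        self.desc_vecs: list[np.ndarray] = []      # what the product is
        self.effect_vecs: list[np.ndarray] = []    # what the product does

    def add(self, product: dict) -> None:
        self.products.append(product)
        self.desc_vecs.append(embed(product["description"]))
        self.effect_vecs.append(embed(product["review_effects"]))

    def search_by_effect(self, concern: str, k: int = 5) -> list[dict]:
        q = embed(concern)
        sims = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
                for v in self.effect_vecs]
        top = np.argsort(sims)[::-1][:k]
        return [self.products[i] for i in top]
```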
Why Teams Are Investing in This
The business case for context engineering shows up in three dimensions.
Cost: Shopify noted that tool outputs consume 100x more tokens than user messages, so aggressive context pruning directly correlates to margin.
Latency: Elyos AI targets sub-400ms response times, which requires keeping context minimal.
Reliability: Leaner contexts make models smarter, not just faster and cheaper.
The Discipline Takes Shape
What's emerging across these case studies is a recognisable engineering discipline with its own patterns, tradeoffs, and best practices.
The core principle: everything retrieved shapes the model's reasoning, so relevance filtering is critical. The practical techniques: just-in-time injection, tool masking, staged compaction, context isolation through sub-agents, file system offloading, and specialised embeddings. The evaluation criteria: not just whether the model can process the context, but whether the context helps or hinders the model's actual task.
Manus has refactored their context engineering architecture five times since launching in March. LangChain's Lance Martin emphasises that production teams should "build less and understand more"; in his experience, the biggest performance improvements came from simplifying architecture rather than adding complexity.
The million-token context window serves less as a feature to exploit and more as a ceiling to stay well under. The teams shipping reliable LLM systems have internalised this, and context engineering has become the discipline that makes it practical.
3. The Frontier: Where Production Meets Experimentation

While the previous sections cover patterns that have solidified into recognisable best practices, two areas remain in active flux: agent infrastructure (harnesses and the reinforcement learning loops that improve them), and memory systems for long-running agents. Both represent genuine production needs, but neither has stabilised into consensus approaches. What we're seeing is parallel experimentation rather than industry convergence.
Agent Infrastructure: Harnesses and Learning Loops
The orchestration layer wrapping an LLM to make it function as an agent requires surprisingly complex engineering. Cursor's recent work adapting to OpenAI's Codex models demonstrates why. Each frontier model arrives with different behavioural patterns shaped by its training data. Codex models, trained specifically for agentic coding workflows, favour shell-oriented patterns where the model wants to use grep and cat instead of dedicated tools. Cursor had to rename and redefine their tools to align with shell conventions, add explicit instructions guiding the model toward tool calls over shell commands, and implement sandboxing for when the model did execute arbitrary commands. Their experiments showed that dropping reasoning traces caused a 30% performance degradation for Codex, substantially larger than the 3% OpenAI observed for mainline GPT-5 on SWE-bench. This kind of finding only emerges from operating at production scale with tight feedback loops.
Manus provides perhaps the most detailed public account of harness architecture at scale. Their typical tasks require around 50 tool calls, with production agents spanning hundreds of conversational turns. Instead of binding hundreds of tools directly to the model, they implemented a layered action space: a fixed set of atomic functions (file operations, shell commands, web search), sandbox utilities (command-line tools discoverable via standard help commands), and a third layer where the agent writes Python scripts to call pre-authorised APIs. The model sees the same simple interface regardless of which layer handles the actual work. This keeps the function calling space minimal, maximises KV cache efficiency, and allows capability expansion without invalidating cached prompts. They've refactored their architecture five times since March. The patterns are starting to rhyme across teams, but there's no equivalent of "just use a transformer" for agent harnesses yet.
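A stripped-down sketch of a layered action space along those lines (our illustration, not Manus's code): the model's function-calling surface stays tiny, while richer capability lives behind the shell and a script layer that calls pre-authorised clients.

```python
import subprocess

APPROVED_API_CLIENTS: dict = {}    # pre-authorised API clients injected at startup

def read_file(path: str) -> str:
    # Layer 1: a fixed set of atomic functions bound directly to the model.
    return open(path).read()

def shell(cmd: str) -> str:
    # Layer 2: sandbox command-line utilities, discoverable via --help.
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout

def run_script(code: str) -> None:
    # Layer 3: the agent writes Python against approved clients; capability grows
    # here without changing tool definitions or invalidating cached prompts.
    exec(code, {"apis": APPROVED_API_CLIENTS})

ATOMIC_TOOLS = {"read_file": read_file, "shell": shell, "run_script": run_script}

def dispatch(tool: str, **kwargs):
    return ATOMIC_TOOLS[tool](**kwargs)
```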
Beyond static harness design, teams are beginning to improve their agents through reinforcement learning. OpenPipe's ART·E project demonstrates what's now possible at smaller scales. They built an email research agent trained using RL (specifically GRPO) to answer natural-language questions by searching email inboxes. The agent environment is intentionally simple: three tools for searching, reading, and returning answers, backed by SQLite with full-text search. They trained a Qwen-14B model with a multi-objective reward function optimising for answer correctness, fewer turns, and reduced hallucinations. The resulting model outperformed OpenAI's o3 on this specific task while being faster and cheaper, with training completed in under a day on a single H100 GPU for approximately $80.
The reward function design proved critical. Minimising turns worked well as a proxy for latency, and penalising hallucinations reduced confabulation without hurting accuracy. But an early experiment that gave partial credit for taking more turns (intended to encourage exploration) resulted in the model learning to exploit this by repeating its last tool call until hitting the maximum turn limit. Reward hacking remains a real concern even at these smaller scales.
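To make the shape of such a reward concrete, here is a hedged sketch in the spirit of that description (not OpenPipe's actual implementation; the weights and the hallucination check are illustrative assumptions):

```python
def reward(trajectory: dict) -> float:
    r = 0.0
    # Primary signal: did the returned answer match the reference answer?
    r += 1.0 if trajectory["answer_correct"] else -1.0
    # Latency proxy: a small penalty per turn, so fewer turns score higher.
    r -= 0.05 * trajectory["num_turns"]
    # Hallucination penalty: answers that cite no retrieved email get docked.
    if trajectory["answered"] and not trajectory["cited_source_ids"]:
        r -= 0.5
    return r
```

Note the sign on the turn term: the failed experiment described above effectively flipped it, and the model learned to repeat its last tool call to farm the bonus.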
Cursor takes a different approach with online reinforcement learning in their Tab feature, which handles over 400 million requests per day. Instead of training models from scratch, they implemented an online RL pipeline that updates based on user acceptance rates within hours, achieving a 28% increase in code acceptance. RL for agents is becoming accessible to teams outside the major labs, but the successful cases involve narrow, well-defined tasks with clear reward signals.
Memory: The Problem Everyone Acknowledges
If there's one area where production teams consistently express frustration, it's memory. Every fresh context window essentially resets what the model "knows" from a session. For agents operating over extended periods, handling long-running tasks, or needing to learn user preferences over time, this creates fundamental challenges that current solutions address imperfectly.
LangChain's Lance Martin frames the problem directly: memory systems become particularly important for ambient agents, systems that run asynchronously on schedules without real-time user interaction. His email agent runs every 10 minutes, triages incoming mail, drafts responses, and queues them for approval. Without memory, the system would keep making the same errors without learning. He implemented a simple long-term memory system stored in files that updates continuously as he provides feedback. The approach works, but "simple" and "files" suggest we're still in the early experimentation phase.
Personize.ai took memory in a different direction with what they call proactive memory. Instead of retrieving raw data on demand, their system runs internal agents that infer insights and synthesise understanding ahead of time. The example: many businesses need to know whether a company is B2B or B2C. This information affects everything from qualification to service selection, but it rarely appears explicitly in raw data. Their system examines available data, recognises the classification is important, and infers it before any agent needs it. Standardised attributes then make these inferences searchable and usable across all agents. The challenge they identified: having access to raw data doesn't mean understanding the customer. When running the same agent repeatedly across tens of thousands of executions, the chunks retrieved might come from different parts of the data, creating partial or inconsistent understanding.
Other teams are exploring knowledge graphs (Cognee), user-confirmed preferences (Manus), and various hybrid approaches. What's clear is that production teams need agents that operate over extended periods, learn from feedback, and maintain coherent state across sessions. The solutions exist and they're deployed, but they're experiments running in production rather than settled practices. In areas that haven't stabilised, LangChain's observation resonates: teams should "build less and understand more." The biggest performance improvements often came from simplifying architecture instead of adding complexity.
4. MCP at One Year: Quiet Stabilisation

The Model Context Protocol has been in the wild for roughly a year now, and something unexpected has happened: it's become one of the more stable elements in the LLMOps landscape. While agent harnesses and memory systems remain in active flux, MCP has settled into a recognisable pattern: enterprises building servers, SaaS companies exposing their APIs, and a growing body of practical knowledge about what works and what doesn't. The database reveals genuine production deployments with real limitations being openly discussed rather than hype-driven adoption.
Enterprise Adoption: More Substantial Than Expected
The database contains a notable concentration of enterprise MCP implementations that go well beyond proof-of-concept.
Loblaws, the Canadian retail giant, built an MCP ecosystem wrapping 50+ internal platform APIs (cart, pricing, inventory, customer, catalogue, and more) so their "Alfred" orchestration agent could handle complex workflows like shopping for recipe ingredients. Their implementation is instructive: rather than exposing individual API endpoints, they carefully designed task-oriented tools that combine multiple backend operations. When a user discusses dinner ideas and decides on shrimp pasta, a single tool handles finding all the ingredients, calling catalogue, pricing, and inventory APIs to return a complete shopping list. This abstraction layer proved critical for agent reliability.
Swisscom uses MCP to let network operation agents access topology graphs and alarm systems for diagnosing outages across their complex multi-cloud infrastructure. A customer service scenario illustrates the value: restoring router connectivity could stem from billing issues, network outages, or configuration problems, each residing in different departments. MCP enables agents to coordinate across these boundaries while maintaining Switzerland's strict data protection compliance. They've combined MCP with the Agent-to-Agent protocol for seamless cross-departmental collaboration.
What's notable across these implementations is the emphasis on MCP as integration infrastructure rather than AI magic. The agents succeed because they're connecting to well-established backend systems through standardised interfaces rather than MCP providing intelligence itself.
SaaS Companies: Building the Ecosystem
A different pattern is emerging among SaaS providers: building MCP servers so their customers' agents can access platform capabilities directly.
HubSpot became the first CRM to build a remote MCP server, enabling ChatGPT to query customer data directly. Their motivation was straightforward: 75% of their customers already use ChatGPT, so meeting users where they are made strategic sense. The implementation took less than four weeks, delivering read-only queries that let customers ask natural-language questions about contacts, companies, and conversion patterns. Their team extended the Java MCP SDK to support HTTP streaming and contributed the changes back to open source.
Sentry's MCP server has scaled to 60 million requests per month, doubling from 30 million in about two months. The server provides direct integration with 10-15 tools, allowing AI coding assistants to pull error details and trigger automated fix attempts without developers needing to copy-paste from Sentry's UI. With over 5,000 organisations using it—from startups to large tech companies—and just a three-person team managing the infrastructure, it represents genuine production scale.
Sentry's candour about operational realities is valuable. They shipped early without observability and paid for it: when AI tooling breaks, users don't retry the next day but abandon it for months. Getting things right from the start matters more than shipping quickly with more features.
The Real Struggles: Context Pollution and Choice Entropy
The database reveals a consistent set of challenges that emerge once teams move past initial implementation.
CloudQuery's most interesting discovery was about tool naming. They built a tool specifically to help write SQL queries, initially named example_queries. Despite being exactly what users needed, it sat completely unused for two weeks. The problem was semantic rather than technical. LLMs make probabilistic predictions about which tool to invoke based on name and description similarity to the query context. Renaming it to known_good_queries and writing a verbose description that signalled "vetted, high-quality SQL" moved it from ignored to frequently used. Their insight: tools are prompts, and the engineering of tool descriptions is generally overlooked.
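The "tools are prompts" point is easiest to see in a tool definition. Here is a sketch using the MCP Python SDK's FastMCP helper (the server name, example store, and query text are assumptions): the tool's name and docstring are the main signals the model weighs when deciding whether to call it.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("sql-helper")    # hypothetical server name

EXAMPLES = {
    "unencrypted_buckets": "SELECT arn FROM aws_s3_buckets WHERE encryption IS NULL;",
}

@mcp.tool()
def known_good_queries(topic: str) -> str:
    """Return vetted, known-good SQL queries for the given topic.

    These examples have been reviewed and tested against the production schema;
    prefer adapting one of them over writing SQL from scratch.
    """
    return EXAMPLES.get(topic, "No vetted query found for that topic.")
```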
Databook coined the term "choice entropy" to describe what happens when agents connect to APIs and see dozens or hundreds of data fields. The more choices available, the more opportunities for the model to misfire. Their solution involves filtering and reshaping tool schemas so agents see only what's relevant for specific tasks, the "tool masking" approach covered in the context engineering section above.
The Holdouts: When MCP Isn't the Answer
Not everyone is on board, and the reasons are instructive.
Digits, an automated accounting platform, explicitly rejected MCP for production use. Their head of applied AI was direct: "We haven't adopted MCP or A2A protocols because all our data is internal and major security questions remain unresolved." While MCP provides good marketing value for connecting to external services, it represents a "hard play" to integrate into production products until security concerns are addressed. For high-stakes financial data, the security and privacy implications aren't yet mature enough for their production standards.
This isn't a fringe position. The authentication capabilities have improved substantially over the past six to seven months, making MCP more viable for enterprise contexts. But the Digits example is a useful reminder that standardisation only provides value when the standards meet your security requirements, and for some use cases, that threshold hasn't been crossed yet.
Interesting Patterns at the Edges
Some teams are pushing MCP into unexpected territory.
Goodfire's MCP-based Jupyter integration surfaced an important security consideration: the Jupyter kernel integration allows agents to bypass security permissions built into systems like Claude Code. Without custom security checks, agents can pass arbitrary code to the tool, circumventing default permissions. They observed agents that were blocked from running pip install via native bash tools realising they could execute the same commands through notebook tool calls. The flexibility that makes MCP powerful also creates security surface area that teams must actively manage.
The "USB-C for AI" Question
Deepsense describes MCP as potentially becoming "the USB-C for AI integration": once a company builds an MCP server for their data, any agent can use it without custom glue code. The analogy is appealing, and there's real value in standardisation. But Deepsense also warns that poorly designed MCP servers can "bloat context windows" to the point that "agent reasoning [is] destroyed." Standardisation only provides value when the standards are well-implemented; a poorly designed MCP server may be worse than a well-designed custom integration.
Where This Leaves Us
The honest assessment is that MCP has achieved something unusual in the LLMOps space: relative stability. The protocol exists, it works, enterprises are using it at scale, and a growing body of practical knowledge documents what succeeds and what fails. That's more than can be said for agent harnesses or memory systems.
But stability doesn't mean maturity. The challenges around context pollution, tool naming, and authentication are being solved through accumulating experience rather than protocol improvements. Teams are learning that tools are prompts, that less context often means better performance, and that security boundaries require active management.
What the database suggests is that MCP is settling into its appropriate role: infrastructure for connecting agents to existing systems rather than a solution in itself. The teams extracting value are those treating it as a standardised integration layer while doing the harder work of designing appropriate abstractions, managing token budgets, and implementing proper security controls. USB-C is useful precisely because it's just a connector; the intelligence has to come from elsewhere.
5. Evals and Guardrails: Where the Engineering Actually Happens

If there's one area where the database reveals the most dramatic maturation in production LLM practices, it's the parallel evolution of evaluation systems and guardrails. What began as informal "vibe checks" and basic content filters has transformed into sophisticated engineering disciplines. The shift represents a fundamental rethinking of how organisations validate and constrain AI behaviour in systems where the consequences of failure extend well beyond embarrassing chatbot responses.
The Death of the Vibe Check
The phrase "evals are the new unit tests" has become something of a mantra, and Ramp's expense automation platform provides a compelling demonstration of why. Their approach to evaluating their policy agent, which now handles over 65% of expense approvals autonomously, follows what they describe as a "crawl, walk, run" strategy. Rather than attempting comprehensive evaluation from day one, they start with quick, simple evals and gradually expand coverage as the product matures.
What makes Ramp's approach particularly noteworthy is their treatment of edge cases: they turn every user-reported failure into a regression test case, creating a continuous feedback loop between production experience and evaluation coverage.
But here's the nuance that separates mature practitioners from the enthusiastic early adopters: user feedback requires careful interpretation. Ramp discovered that finance teams might approve expenses that technically violate policy, approving things out of convenience or relationship dynamics rather than strict compliance. Simply treating user actions as ground truth would bias the system toward excessive leniency. Their solution was creating "golden datasets" carefully reviewed by their own team to establish correct decisions based solely on information available within the system. This independent labelling process removes affinity bias and other human factors that might influence real-world decisions.
The scale of systematic evaluation is substantial at some organisations. GitHub runs comprehensive offline evaluations against their Copilot models to catch regressions before they hit production, testing models before user interaction across metrics like latency, accuracy, and contextual relevance.
Traditional ML Policing Generative AI
One of the more unexpected patterns in the database is the use of traditional machine learning models to govern when and whether LLMs should be invoked at all. DoorDash built a sophisticated multi-stage validation pipeline for their internal agentic AI platform that they call "Zero-Data Statistical Query Validation." The system includes automated linting, EXPLAIN-based checking for query correctness and performance against engines like Snowflake and Trino, and statistical metadata checks on query results—such as row counts or mean values—to proactively identify issues like empty result sets or zero-value columns, all without exposing sensitive data to the AI model.
This pattern of using deterministic checks and traditional ML to validate, constrain, or gate LLM behaviour appears repeatedly across the database.
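A minimal sketch of the statistical-validation idea (our illustration using SQLite, not DoorDash's multi-engine implementation): the query is plan-checked before execution, and only aggregate statistics about the result, never the raw rows, leave the validator.

```python
import sqlite3

def validate_query(conn: sqlite3.Connection, sql: str) -> list[str]:
    issues: list[str] = []
    try:
        conn.execute(f"EXPLAIN QUERY PLAN {sql}")   # parse/plan check, reads no data
    except sqlite3.Error as exc:
        return [f"query does not compile: {exc}"]
    cur = conn.execute(sql)
    rows = cur.fetchall()
    if not rows:
        issues.append("empty result set")
    for i, col in enumerate(cur.description or []):
        values = [row[i] for row in rows]
        if values and all(v in (0, None) for v in values):
            issues.append(f"column '{col[0]}' is all zero/NULL")
    return issues    # only metadata about the result is ever surfaced
```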
Architectural Guardrails: Moving Safety Out of the Prompt
The most significant theme across the database is the systematic movement of safety logic out of prompts and into infrastructure. The limitations of prompt-based guardrails are now well understood: every time a new model comes out, exploits for prompt injection emerge within hours. As Oso's framework for agent governance puts it bluntly: "what 1997 was for SQL injection, 2025 is for prompt injection."
Oso introduced what they call a "Three-Component Identity" model for agent systems, requiring user, agent, and session context for proper authorisation. The session component is particularly innovative; they treat sessions as capable of being "tainted" once they touch certain combinations of data. If an agent reads untrusted content (like a user email) and then accesses sensitive data (like a database), the system automatically blocks it from using external communication tools (like Slack) for the rest of that session. This prevents prompt injection attacks from succeeding regardless of what the model tries to do, because the safety logic is implemented in code rather than the prompt.
Their approach draws an explicit analogy to memory-safe programming languages: once a variable is "tainted," it cannot be passed to secure sinks. The key insight is that authorisation decisions must consider the sequence of events within a session, and this type of context-dependent authorisation is impossible without tracking session state.
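A minimal sketch of session tainting as described (the tool lists and rule are illustrative; this is not Oso's API):

```python
from dataclasses import dataclass

UNTRUSTED_SOURCES = {"read_email", "fetch_webpage"}
SENSITIVE_SOURCES = {"query_database", "read_crm"}
EXTERNAL_SINKS = {"send_slack_message", "send_email"}

@dataclass
class Session:
    user: str
    agent: str
    touched_untrusted: bool = False
    touched_sensitive: bool = False

    @property
    def tainted(self) -> bool:
        return self.touched_untrusted and self.touched_sensitive

def authorize(session: Session, tool: str) -> bool:
    # Enforced in code: a tainted session loses external communication tools,
    # no matter what the model's prompt or reasoning says.
    return not (tool in EXTERNAL_SINKS and session.tainted)

def record(session: Session, tool: str) -> None:
    session.touched_untrusted |= tool in UNTRUSTED_SOURCES
    session.touched_sensitive |= tool in SENSITIVE_SOURCES
```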
Wakam, a European digital insurance company, implemented what they describe as a "dual-layer" permission system. One layer controls what the human can see, and a second layer controls what the agent can access. A user can only invoke an agent if they also hold the permissions for the data that agent uses. This architectural approach prevents users from using agents to bypass their own access controls, a vulnerability that prompt-based guardrails cannot reliably address.
Komodo Health's healthcare analytics assistant takes this to its logical conclusion: their LLM has zero knowledge of authentication and authorisation, which are handled entirely by the APIs it calls.
Creative Solutions at the Edges
Some of the most interesting guardrail implementations in the database address highly specific technical constraints with creative solutions.
Toyota's vehicle information platform faced a particular challenge: every response must include legally correct disclaimers, and this text cannot be altered by the LLM under any circumstances. Their solution was a technique they call "stream splitting." They trained their model to output three distinct streams of data: the natural language response, ID codes for images, and ID codes for legal disclaimers. The application layer then injects the immutable legal text based on those codes. This guarantees the LLM cannot hallucinate or slightly alter legally binding text, a requirement that would be impossible to enforce through prompting alone.
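The application-layer half of stream splitting can be sketched simply (our illustration; the output format and disclaimer texts are assumptions): the model only selects disclaimer IDs, and the immutable legal wording is joined in by ordinary code.

```python
import json

DISCLAIMERS = {
    "D-041": "Always consult your owner's manual before performing maintenance.",
    "D-112": "Towing capacity varies by configuration; verify before towing.",
}

def render_response(model_output: str) -> str:
    # The model emits e.g. {"text": "...", "image_ids": [...], "disclaimer_ids": ["D-041"]}
    payload = json.loads(model_output)
    legal = [DISCLAIMERS[d] for d in payload["disclaimer_ids"] if d in DISCLAIMERS]
    # The LLM never touches the legal wording; it can only pick which IDs apply.
    return payload["text"] + "\n\n" + "\n".join(legal)
```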
Incident.io's AI-powered summary generator demonstrates a different kind of creative constraint. Since they know the actual root causes of past outages, they can replay historical incidents to their agent to see if it correctly identifies the cause. This "time travel" evaluation approach lets them assess whether the agent's understanding lags behind or leads the human responders, ensuring the agent doesn't hallucinate a fix that wasn't actually possible at that specific moment in time. It's a form of evaluation that's only possible because of the structured nature of their domain.
Digits, an automated accounting platform, routes generation to one provider while sending outputs to a different provider for evaluation. Using a different model family prevents the "grading your own test" problem where a model fails to catch its own mistakes because it shares the same blind spots.
User-Controllable Guardrails: Product Features Rather Than Backend Settings
One of the more forward-thinking patterns in the database is the transformation of guardrails from hidden technical constraints into user-configurable product features.
Ramp's policy agent implements what they describe as an "autonomy slider" through their existing workflow builder. Users can specify exactly where and when agents can act autonomously, combining LLM decisions with deterministic rules like dollar limits, vendor blocklists, and category restrictions. Conservative finance teams can require human approval for every expense over $50, while more aggressive teams can let the agent auto-approve up to $500. This design recognises that different organisations and teams have vastly different risk tolerances, so imposing a one-size-fits-all approach would limit adoption.
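A sketch of how an autonomy slider composes deterministic rules with an LLM verdict (our illustration; the policy fields, thresholds, and confidence cut-off are assumptions):

```python
from dataclasses import dataclass

@dataclass
class AutonomyPolicy:
    auto_approve_limit: float = 50.0                      # set by the finance team
    blocked_vendors: frozenset = frozenset({"vendor-on-blocklist.example"})
    allowed_categories: frozenset = frozenset({"travel", "software", "meals"})

def decide(expense: dict, llm_verdict: dict, policy: AutonomyPolicy) -> str:
    # Deterministic rules run first and can always force a human review.
    if expense["vendor"] in policy.blocked_vendors:
        return "escalate_to_human"
    if expense["category"] not in policy.allowed_categories:
        return "escalate_to_human"
    if expense["amount"] > policy.auto_approve_limit:
        return "escalate_to_human"
    # Only inside those bounds does the LLM's judgement act on its own.
    if llm_verdict["decision"] == "approve" and llm_verdict["confidence"] >= 0.9:
        return "auto_approve"
    return "escalate_to_human"
```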
Their staged rollout approach, starting with AI as suggestions before graduating to autonomous actions, mirrors successful patterns seen across the database.
Circuit Breakers: Preparing for Inevitable Failure
Even with comprehensive testing, red teaming, guardrails, and evaluation, production LLM systems will occasionally behave unexpectedly. The most mature implementations in the database acknowledge this reality and implement circuit breakers—hard limits that automatically stop agents when certain thresholds are exceeded.
Cox Automotive's autonomous customer service system, which handles dealership conversations without human oversight, implements circuit breakers on two critical metrics: cost and conversation turns. If a conversation reaches the P95 cost threshold, the agent automatically stops. Similarly, if a conversation exceeds approximately 20 back-and-forth turns, the agent stops. In either case, the system gracefully hands off to a human at the dealership who can assess whether to continue.
Their emphasis on setting these limits from day one rather than waiting for a cost explosion or customer complaint reflects mature operational thinking. The circuit breaker philosophy is about failing gracefully; when something goes wrong (and it will), the system should degrade to a safe state rather than continue potentially problematic behaviour. The thresholds are derived from production data analysis—understanding the P95 and P99 distributions of cost and conversation length under normal operation.
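The circuit-breaker logic itself is deliberately boring, which is the point. A sketch (our illustration; the thresholds stand in for P95 values derived from production data):

```python
MAX_CONVERSATION_COST_USD = 0.75     # ~P95 cost per conversation (illustrative)
MAX_TURNS = 20

def check_circuit_breakers(state: dict) -> str | None:
    """Return a handoff reason if a hard limit is exceeded, otherwise None."""
    if state["cost_usd"] >= MAX_CONVERSATION_COST_USD:
        return "cost_limit_reached"
    if state["turns"] >= MAX_TURNS:
        return "turn_limit_reached"
    return None

def run_llm_turn(state: dict) -> dict:
    return {"action": "reply", "text": "..."}   # stand-in for the real model call

def agent_step(state: dict) -> dict:
    reason = check_circuit_breakers(state)
    if reason:
        # Fail gracefully: stop the agent and hand the conversation to a human.
        return {"action": "handoff_to_human", "reason": reason}
    return run_llm_turn(state)
```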
DoorDash's agentic platform implements similar controls they call "budgeting the loop," enforcing strict step and time limits to prevent agentic plans from thrashing.
Shadow Mode: Testing at Scale Without Risk
For organisations handling high-stakes transactions, the database reveals a consistent pattern: extensive shadow testing before live deployment.
Ramp's approach to their financial automation agents provides the most sophisticated example. They run agents in "shadow mode" on transactions before rolling out live actions. The agent predicts what it would do, and a separate "LLM Judge" compares that prediction to what the human actually did. They only turn on the agent for live actions once shadow accuracy hits a specific threshold, allowing them to test guardrails on real financial transactions without risking a single dollar.
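In outline (a sketch under our own assumptions about interfaces, not Ramp's code), shadow mode amounts to logging agent predictions next to human decisions and gating go-live on judged agreement:

```python
def shadow_accuracy(transactions: list[dict], agent, judge) -> float:
    results = []
    for txn in transactions:
        predicted = agent.decide(txn)             # what the agent *would* have done
        actual = txn["human_decision"]            # what the human actually did
        verdict = judge.compare(txn, predicted, actual)   # LLM judge: "match"/"mismatch"
        results.append({"id": txn["id"], "predicted": predicted,
                        "actual": actual, "verdict": verdict})
    return sum(r["verdict"] == "match" for r in results) / len(results)

def ready_for_live_actions(transactions, agent, judge, threshold: float = 0.95) -> bool:
    # Live actions only switch on once shadow accuracy clears the bar.
    return shadow_accuracy(transactions, agent, judge) >= threshold
```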
The LLM-as-Judge Pattern: Scaling Evaluation
Manual review of LLM outputs doesn't scale. The database reveals widespread adoption of automated evaluation using what's become known as the LLM-as-judge technique.
Cox Automotive's evaluation framework generates test conversations, runs them through their agent system, and uses a separate LLM to evaluate whether responses meet quality standards. They track metrics specific to customer conversations: relevancy, completeness, and tone. Critically, the team designs their evaluation framework around their worst-case scenarios—the interactions that keep them up at night. By encoding these concerns into automated evaluation metrics, they can quantitatively track whether their mitigations are effective.
Amazon Prime Video's multi-agent architecture uses an LLM-as-judge pattern where an independent LLM evaluates outputs from their analysis agent. Their "reasoning agent" applies business context to validate whether findings are pertinent and can iteratively invoke different capabilities if the analysis is insufficient. While expensive to run, the team configures judges selectively, only invoking them when pass/fail metrics alone are insufficient for improvement decisions.
The LLM-as-judge approach isn't perfect; the judge model has its own limitations and biases, but it provides scalable, automated quality assessment that would be impossible through manual review alone. Combined with traditional testing of deterministic components and red teaming for security, it forms part of a comprehensive quality assurance strategy for production LLM systems.
Red Teaming as Ongoing Practice
Cox Automotive's approach to red teaming stands out for its emphasis on continuous practice rather than one-time assessment. They explicitly distinguish red teaming from traditional testing: testing checks what works, while red teaming tries to break it. This practice cannot be left to the end of development but must be integrated throughout the development lifecycle.
Their red teaming efforts included attempting to prompt agents to respond in foreign languages (when they should only respond in English), feeding unreadable characters to test input handling, attempting to extract system prompts through social engineering attacks, and testing edge cases in conversational flow. The team red teamed before alpha, before beta, and continues red teaming in production after every code deployment and after every prompt change. Each exploit is catalogued, fixed, and used to strengthen the system's resilience.
This practice proved essential because even with comprehensive red teaming, the non-deterministic nature of LLMs means unexpected behaviours will still emerge. The red teaming process builds organisational knowledge about failure modes and creates a systematic approach to hardening systems before customers encounter issues.
What This All Means
The transformation from vibe checks to systematic engineering reflects a broader maturation in how organisations think about LLM reliability. Three patterns emerge consistently:
Infrastructure over prompts: The most reliable guardrails are implemented in code rather than prompts. Architectural approaches like session tainting, dual-layer permissions, and API-based authorisation provide guarantees that prompt engineering cannot.
Hybrid systems: Traditional ML, deterministic rules, and LLMs each have strengths. The most effective production systems combine all three, using each where appropriate rather than attempting to solve everything with foundation models.
Graceful degradation: Circuit breakers, human handoffs, and confidence thresholds acknowledge that perfect reliability is impossible. The goal isn't preventing all failures but ensuring failures are detected quickly and handled gracefully.
The gap between the vibe check and these sophisticated systems is substantial. But what's perhaps most significant is that these practices aren't emerging from research labs—they're being developed by teams shipping production systems that handle real financial transactions, customer conversations, and healthcare analytics. The engineering is happening in the trenches, and the database captures what's working.
6. Software Engineering Skills: The Hidden Bottleneck

The novelty of the LLM has worn off. What remains, as documented across hundreds of case studies in the database, is a clear pattern: the teams shipping reliable production systems are distinguished less by AI research credentials than by their software engineering fundamentals.
This observation might sound deflating to anyone who believed that generative AI would somehow bypass traditional engineering disciplines. It won't.
Infrastructure Before Intelligence
The most successful teams in the database solve infrastructure problems rather than AI problems. This inversion of priorities runs counter to the narrative that models are the bottleneck. In practice, getting the model to work is often the straightforward part. Making it work reliably, at scale, without bankrupting your organisation, requires deep systems thinking.
LinkedIn's AI engineering team provides a telling example. In building their generative AI assistant, the engineering work centred on managing capacity and latency tradeoffs, building async non-blocking pipelines, and handling the peculiarities of streaming responses that must be progressively parsed while API calls fire before the full LLM response completes. Their characterisation of prompt engineering as "more of an art than a science," comparing it to "tweaking rules in expert systems," underscores where they found the real work: everywhere except the model itself.
Meta faced infrastructure crises on multiple fronts. Their checkpoint data grew from hundreds of gigabytes to tens of terabytes within a single year. Training jobs spanning multiple data centres created network bottlenecks that left expensive GPUs idle, with job read latency spiking from 8 seconds to 300 seconds. The solution required redesigning network architecture: ECMP-based load balancing, BGP-based virtual IP injection, lightweight netkit Ethernet devices for direct NIC access, and eBPF kernel hooks to drive packets directly to physical NICs. This level of systems engineering, covering routing protocols, congestion control, and multi-NIC load balancing, sits far from anything resembling "AI work." Yet it's precisely what enabled their distributed training to function at scale, achieving 300x improvement in latency and 8x improvement in checkpoint loading.
Durable Execution: Treating State Seriously
If there's a single engineering pattern that separates mature production systems from fragile prototypes, it's durable execution. Long-running agent tasks will fail: networks time out, APIs return errors, services go down. The question is whether your system can recover gracefully or needs to restart from scratch.
Slack's Developer Experience team uses Temporal to orchestrate their multi-agent escalation system, which handles over 5,000 requests monthly. The critical insight is what Temporal provides: if a long-running research agent fails mid-task, it resumes exactly where it left off rather than restarting the conversation. The workflow maintains state across the entire escalation lifecycle until resolution. As they describe it, Temporal handles durability, automated retries, and state management in a database—even if the backend fails, the workflow resumes where it left off. Far from a nice-to-have feature, this durability distinguishes agents that work in production from those that merely function in demos.
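For a feel of what durable execution buys, here is a minimal sketch using Temporal's Python SDK (the workflow and activity names are hypothetical, not Slack's code): each activity result is recorded, so a crash mid-escalation resumes from the last completed step rather than restarting the conversation.

```python
from datetime import timedelta

from temporalio import activity, workflow

@activity.defn
async def research_issue(ticket_id: str) -> str:
    ...  # call the research agent; may take minutes and fail transiently

@activity.defn
async def draft_response(findings: str) -> str:
    ...  # call the drafting agent

@workflow.defn
class EscalationWorkflow:
    @workflow.run
    async def run(self, ticket_id: str) -> str:
        findings = await workflow.execute_activity(
            research_issue, ticket_id,
            start_to_close_timeout=timedelta(minutes=10),
        )
        # If the worker dies here, Temporal replays the workflow and reuses the
        # recorded result of research_issue instead of re-running it.
        return await workflow.execute_activity(
            draft_response, findings,
            start_to_close_timeout=timedelta(minutes=5),
        )
```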
Railway's infrastructure monitoring system uses Ingest for similar durability. Their autonomous remediation system detects issues, generates diagnostic plans, and deploys coding agents to create pull requests with fixes. When steps succeed, their results are cached, so subsequent retries continue from the failed step rather than repeating successful work.
Agent Orchestration as Microservices
The most sophisticated agent implementations in the database don't treat agents as standalone units. They treat them as microservices, applying decades of lessons from distributed systems architecture.
LinkedIn's approach to agent-to-agent communication builds on their existing messaging infrastructure and gRPC for inter-agent coordination. Rather than inventing new protocols, they applied patterns that have proven robust at scale: message queues for reliable delivery, service discovery for dynamic agent registration, and the same monitoring and alerting infrastructure they use for traditional services.
DoorDash applies the same thinking: orchestrators that decompose requests into subtasks, progress trackers for dependencies, and persistent workspaces that allow agents to share artefacts across sessions.
Perhaps most instructively, GetOnStack learned the hard way what happens when agent orchestration is treated casually. Their multi-agent system for market data research escalated from $127 in weekly costs to $47,000 over four weeks due to an infinite conversation loop between agents running undetected for 11 days. Agent A requested help from Agent B, which in turn asked Agent A for clarification, creating a recursive pattern that neither agent had the logic to break. Their subsequent investment of six weeks building message queues, circuit breakers, cost controls, and monitoring represents exactly the infrastructure that traditional distributed systems would have had from the start.
The Hardware Layer
Some teams go deeper than frameworks and orchestration. They optimise the hardware itself.
Cursor's approach to building an AI-native code editor demonstrates this intensity. Rather than accepting that LLM inference is slow, they built their own fast agent models, trained custom models fine-tuned for their specific use cases, and recently launched an agent model significantly faster than competing offerings. The focus on speed is strategic: their observation that users would "literally watch CLI agents work" drove them to optimise for a faster output experience that enables flow state coding rather than waiting. This required going beyond model selection to actual model training and optimisation.
H2O.ai's storage optimisation tackled a different hardware challenge. Managing over 2 petabytes of EBS storage with only 25% utilisation, they were effectively paying for 4x more capacity than they used. The pattern is common in ML operations: teams overprovision to ensure they don't run out of space mid-training, since running out of storage during a multi-day training run could be catastrophic. Their solution, autonomous storage management that dynamically scales volumes up and down based on actual usage, improved utilisation to 80% and reduced their storage footprint from 2 petabytes to less than 1 petabyte. Storage engineering isn't glamorous, but it's directly tied to operational costs and system reliability.
Human-in-the-Loop as Architecture
The database reveals a consistent pattern: the highest-stakes systems don't try to remove humans from the loop. They architect human review as a first-class component.
This reflects a recognition of where autonomy is appropriate, rather than a distrust of AI.
Incident.io's "time travel evaluation," described earlier, fits this pattern of keeping humans as the reference point: because they know the actual root causes of past outages, they can replay historical incidents and check whether the agent's understanding lags behind or leads the human responders before granting it more autonomy.
The theme across these implementations is progressive autonomy. Systems start with AI as suggestions, graduate to autonomous actions for high-confidence cases, and maintain human approval for edge cases or high-stakes decisions.
What the Skills Shortage Actually Looks Like
The database paints a clear picture of what production LLM systems actually require. The core skills being deployed are less about "prompt engineering" and more about:
Distributed systems and platform engineering: Managing state, consistency, and consensus across agents. The patterns LinkedIn and DoorDash apply, such as message queues, service meshes, and orchestration frameworks, come directly from distributed systems literature. Building internal platforms to democratise AI access while enforcing governance, as Slack's evolution from SageMaker to Bedrock demonstrates, represents this discipline at its core.
Networking and infrastructure: Optimising data flow, latency, and hardware utilisation. Meta's network redesign represents classic infrastructure engineering rather than AI work.
Security and compliance: Architecting for strict data isolation and regulatory adherence. Session tainting, dual-layer permissions, and API-based authorisation require security engineering expertise, not ML knowledge.
Closing the gap between demo and production requires wrapping the AI in systems that handle failure gracefully, scale predictably, and integrate with existing infrastructure. The teams succeeding in production are defined by their deep systems engineering expertise rather than the quality of their prompts.
The Uncomfortable Truth
If you're looking for competitive advantage in production LLM systems, the database suggests looking beyond AI research. The differentiation isn't in model selection or prompt optimisation, as those are increasingly commoditised. The differentiation lies in the infrastructure that makes models useful: the durable execution frameworks that handle failures, the networking optimisations that enable scale, the storage systems that keep costs manageable, and the architectures that integrate human oversight without sacrificing automation benefits.
This isn't what most teams want to hear. Building robust distributed systems is hard, slow work without the excitement of new model releases. But the evidence is consistent across hundreds of case studies: the bottleneck is the engineering required to deploy intelligence reliably. The teams recognising this (and investing accordingly) are the ones shipping systems that actually work.
7. Hyperscalers and the Allure of the Frontier

There's a seductive narrative that runs through every model announcement and every benchmark chart: wait for the next release, and your production problems will solve themselves. GPT-6 will reason better. Gemini 4 Pro will have a larger context window. The next Claude ("when 5.0?!") will be faster and cheaper. Just wait.
The case studies in the database tell a different story. The teams shipping reliable production systems in 2025 aren't waiting for anything. They've stopped treating frontier models as magic and started treating them as what they are: probabilistic components in high-stakes systems. And they're building accordingly.
The Frontier Model Isn't a Silver Bullet
The promise of bigger, smarter models runs directly into a wall when these models meet production requirements. The database is full of teams who discovered that throwing the newest, largest model at a problem often fails. Success usually comes from constraining the model rather than unleashing it.
Cubic encountered a counterintuitive lesson while building their AI code review agent. The frontier promise of "agents can do anything" led them to give their agent more tools, only to see performance degrade. The agent became confused, generating excessive false positives until developers stopped trusting it entirely. Instead of a smarter model, they needed streamlining: removing tools and forcing the agent to output explicit reasoning logs before acting. Fewer capabilities, better results.
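Cubic's exact prompt and schema aren't published, but the "reason before you act" constraint is easy to reproduce: require a structured reasoning field and discard any finding that arrives without one. A sketch (the schema and field names are assumptions):

```python
import json

# Fragment of a system prompt instructing the review agent to reason first.
REVIEW_SCHEMA_HINT = """Respond with JSON only:
{"reasoning": "<why this finding matters and what evidence supports it>",
 "action": "comment" | "skip",
 "comment": "<review comment, empty if action is skip>"}"""


def parse_review(raw: str) -> dict | None:
    """Accept a finding only if the model has shown its reasoning first."""
    try:
        finding = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not finding.get("reasoning", "").strip():
        return None  # no explicit reasoning, no action: cuts down false positives
    return finding
```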
Amazon's Alexa team discovered that adding more few-shot examples actually decreased accuracy because the model overfitted to the examples. They had to delete examples to improve performance, directly contradicting the assumption that more context is always better.
Treating Models as Untrustworthy Components
The most successful implementations in the database share a philosophical foundation: they explicitly treat the LLM as a chaotic component that must be contained, verified, and restricted. The language is revealing. These teams don't talk about "empowering" their models. They talk about "constraining" them.
PwC moved beyond "probabilistic validation" to mathematical verification, implementing Automated Reasoning checks that formally verify LLM outputs against logic rules derived from policy documents.
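AWS's Automated Reasoning checks are a managed service, but the underlying idea, deterministic checks sitting between the model and the user, can be illustrated with a toy example (the rules and output fields below are invented for illustration, not PwC's):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PolicyRule:
    name: str
    check: Callable[[dict], bool]   # returns True if the output satisfies the rule


# Toy rules derived from an imagined leave policy.
RULES = [
    PolicyRule("leave_within_remaining_entitlement",
               lambda o: o["approved_days"] <= o["remaining_entitlement"]),
    PolicyRule("manager_approval_required_over_10_days",
               lambda o: o["approved_days"] <= 10 or o.get("manager_approved", False)),
]


def verify(llm_output: dict) -> list[str]:
    """Return the names of every rule the model's answer violates."""
    return [r.name for r in RULES if not r.check(llm_output)]
```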
Stripe's compliance investigation agents embody this philosophy in their architecture. Their lead engineer explicitly describes the idea of end-to-end automation as a "fairytale." Instead, they decomposed complex compliance reviews into bite-sized tasks orchestrated by a directed acyclic graph. The agents operate on strict "rails" to prevent them from "rabbit-holing" on irrelevant data. The LLM functions as a worker rather than a manager. The architecture ensures that even if the model behaves unexpectedly, the damage is contained to a single task in a structured workflow.
Zalando's postmortem analysis pipeline encountered a particularly insidious failure mode: "Surface Attribution Error." When analysing incident reports, the LLM would blame a technology (like S3) simply because it was mentioned in the text, not because it actually caused the problem. They couldn't trust the model's reasoning capabilities blindly. Their solution was a multi-stage pipeline in which specific stages were solely responsible for classifying causality, with careful prompt design including negative examples to prevent lazy attributions. Even with Claude Sonnet, an attribution error rate of roughly 10% persists. The model can't be trusted to reason correctly about causality without architectural guardrails.
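Zalando's prompts aren't public, but the shape of a dedicated causality stage that includes negative examples is straightforward to reproduce (the wording below is illustrative):

```python
CAUSALITY_PROMPT = """You are deciding whether a technology CAUSED an incident.

Being mentioned in the report is NOT evidence of causing the incident.
Classify a technology as a cause only if the report describes a causal
mechanism: a misconfiguration, failure, saturation, or bad change.

Negative example:
  Report: "We stored the debug logs in S3 while investigating the outage."
  Question: Did S3 cause the incident?
  Answer: no (S3 is mentioned only as a tool used during the response)

Positive example:
  Report: "Writes failed after the S3 bucket policy was changed by mistake."
  Question: Did S3 cause the incident?
  Answer: yes (the misconfigured bucket policy is the causal mechanism)

Report: {report}
Question: Did {technology} cause the incident?
Answer "yes" or "no", followed by one sentence of justification."""

# Usage: prompt = CAUSALITY_PROMPT.format(report=report_text, technology="S3")
```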
Engineering the Harness, Not Waiting for the Engine
While the "irresistible allure of the frontier" keeps teams watching for the next model release, the organisations winning in 2025 are spending their time engineering the system around the model rather than waiting for the model to improve.
Trainline's travel assistant built a "user context simulator" that generates synthetic tickets corresponding to real trains running in real time, then samples actual customer queries to test whether the assistant correctly handles dynamic, time-sensitive information. Standard benchmarks were insufficient because their assistant must handle queries that depend on real-time conditions. They engineered a simulation harness because static evaluation was impossible for their domain.
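The simulator itself is proprietary, but its contract is clear from the description: build a synthetic ticket for a train that is genuinely running right now, pair it with a sampled customer query, and grade the assistant against live ground truth. A sketch (the data shapes and function are hypothetical):

```python
from dataclasses import dataclass
import random


@dataclass
class SyntheticContext:
    ticket: dict          # e.g. origin, destination, departure time
    live_status: dict     # delay, platform, cancellation, as of "now"


def build_test_case(live_departures: list[dict],
                    sampled_queries: list[str]) -> tuple[SyntheticContext, str]:
    """Pair a synthetic ticket for a real, currently running train with a
    real (anonymised) customer query, so the assistant is tested against
    time-sensitive ground truth rather than a static benchmark."""
    departure = random.choice(live_departures)
    ticket = {"origin": departure["origin"],
              "destination": departure["destination"],
              "departs_at": departure["scheduled_time"]}
    context = SyntheticContext(ticket=ticket, live_status=departure)
    return context, random.choice(sampled_queries)
```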
Care Access faced high costs processing medical records and re-engineered their architecture to use prompt caching, separating static medical records (cached) from dynamic questions. This cut costs by 86% and improved speed by 3x through infrastructure engineering rather than model upgrades.
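Care Access's code isn't public, but the static/dynamic split maps directly onto the prompt-caching features of current model APIs. A minimal sketch using Anthropic-style cache_control markers (the model id, file path, and question are placeholders, not details from the case study):

```python
import anthropic  # assumes the Anthropic SDK; Bedrock exposes an equivalent caching feature

client = anthropic.Anthropic()

medical_record_text = open("record_12345.txt").read()   # large, static per patient
question = "Does this patient meet the trial's inclusion criteria for age and A1C?"

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model id
    max_tokens=1024,
    system=[
        # The static instructions and record are marked cacheable, so repeated
        # questions about the same record reuse the cached prefix.
        {"type": "text",
         "text": "You answer questions strictly from the attached medical record."},
        {"type": "text",
         "text": medical_record_text,
         "cache_control": {"type": "ephemeral"}},
    ],
    # The dynamic question sits outside the cached prefix and changes per call.
    messages=[{"role": "user", "content": question}],
)
print(response.content[0].text)
```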
Value Extraction Now, Not Later
The database reveals teams extracting massive value from current-generation models by focusing on integration and data foundations rather than model intelligence. They're not waiting for GPT-6. They're shipping with GPT-5.
Rocket Companies consolidated 10 petabytes of data to fuel their AI strategy. The value stemmed from the unified data foundation rather than a smarter model: that foundation is what allowed existing models to cut mortgage approval times from weeks to 8 minutes. Their agentic AI applications, which are currently running in production for executive decision-making, query the same governed data platform that powers their BI dashboards. The differentiation is data infrastructure, not model selection.
Robinhood reduced latency by 50% through what they call a "hierarchical tuning approach," which starts with prompt optimisation, then trajectory tuning with dynamic few-shot examples, and finally LoRA fine-tuning. They didn't wait for a faster model. They engineered their way to the necessary speed, reducing P90 latencies from up to 55 seconds to under one second. The fine-tuned 8B model matches the quality of frontier models for their specific use case while meeting the latency requirements that a general-purpose frontier model couldn't satisfy.
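Robinhood's training pipeline isn't described at code level, but the final stage, LoRA fine-tuning of an 8B model, has well-established open tooling. A generic sketch using Hugging Face PEFT (the base model and hyperparameters are illustrative, not Robinhood's):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B-Instruct"   # stand-in for "an 8B model"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

lora = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # adapt the attention projections only
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora)
model.print_trainable_parameters()   # typically well under 1% of the base weights
```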
The Hyperscaler Ecosystem: Enabling, Not Solving
The cloud providers and their managed AI services appear throughout these case studies, but their role is instructive. They're enabling infrastructure, not solutions. Teams use Amazon Bedrock for standardised model access and prompt caching. They use SageMaker for fine-tuning and deployment. They use managed services for infrastructure that would be tedious to build. But the differentiation (the engineering that makes systems work in production) happens above these platforms.
Stripe chose Bedrock specifically for its unified security vetting across model providers—essential for their regulated environment—and its prompt caching capabilities that address the quadratic cost problem in iterative agent loops. They built an internal LLM proxy service on top to handle traffic management, model fallback, and bandwidth allocation. While the hyperscaler provides the foundation, Stripe built the operational layer that makes it work for compliance investigations.
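Stripe's proxy is internal, but the behaviour described, routing traffic through a single service that retries and falls back across models, is easy to sketch (the client interface and retry policy here are assumptions):

```python
import time


class FallbackLLMProxy:
    """Try models in priority order; fall back on errors or timeouts.

    `clients` maps a model name to any callable taking a prompt and returning
    text; rate limiting and bandwidth allocation would also live here.
    """

    def __init__(self, clients: dict[str, callable], order: list[str],
                 retries_per_model: int = 2, backoff_s: float = 1.0):
        self.clients = clients
        self.order = order
        self.retries_per_model = retries_per_model
        self.backoff_s = backoff_s

    def complete(self, prompt: str) -> str:
        last_error = None
        for model in self.order:
            for attempt in range(self.retries_per_model):
                try:
                    return self.clients[model](prompt)
                except Exception as exc:       # timeouts, throttling, 5xx, etc.
                    last_error = exc
                    time.sleep(self.backoff_s * (attempt + 1))
        raise RuntimeError(f"All models failed; last error: {last_error}")
```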
The pattern suggests that waiting for better managed services won't solve production problems any more than waiting for better models will. The services are increasingly capable, but the engineering work remains: designing workflows, managing state, handling failures, validating outputs, and building the harnesses that make unreliable components reliable.
Production Problems Aren't Capability Problems
Alexa's team articulated this clearly: the early excitement when they first got their LLM system working gave way to the realisation that "far more work lay ahead than they had completed." Getting the model to demonstrate capability was the easy part. Making it work reliably at scale across 600 million devices required inventing techniques around prompt caching, speculative execution, context engineering, and output token minimisation. These innovations came from production engineering rather than model providers.
The allure of the frontier is that it promises to make this engineering unnecessary. The reality, documented across hundreds of case studies, is that it doesn't. Better models shift where the engineering challenges lie, but they don't eliminate them. The teams that recognise this and invest in harnesses, constraints, verification systems, and infrastructure rather than waiting for capabilities to improve are the ones shipping systems that actually work.
The frontier will always be advancing. The question is whether you're building production systems today or waiting for a tomorrow that will bring its own set of engineering challenges. The database suggests the former is the winning strategy.
What the Database Actually Tells Us
After cataloguing over 1,200 production LLM deployments, a picture emerges that's both encouraging and sobering. Organisations are shipping systems that work, generating real revenue, reducing real costs, and transforming real business processes through software engineering discipline rather than access to secret models or proprietary techniques. But none of this is easy, and waiting for it to become easy is a poor strategy. Frontier models shift where production problems occur rather than solving them. Every improvement in capability creates new integration challenges, new failure modes, new cost curves to optimise. The teams extracting value today are the ones investing in infrastructure that will make any model production-ready, closing the gap between demo and production through unglamorous engineering work rather than breakthrough AI advances.
Several patterns have crystallised into what looks like industry consensus. Context engineering has emerged as a genuine discipline: teams have learned that leaner contexts produce smarter outputs, that tool definitions are prompts, and that the million-token context window is a ceiling to stay under rather than a feature to exploit. Guardrails have moved out of prompts and into infrastructure, where architectural constraints provide guarantees that prompt engineering cannot.
Other areas remain in active flux: agent harnesses, memory systems, and reinforcement learning show promising results but haven't stabilised into consensus patterns, at least not across diverse use cases, and these patterns aren't easily reproducible outside the major labs.
Perhaps the most consistent finding across the database is that the bottleneck is engineering rather than intelligence. Instead of prompt engineering or model fine-tuning, the skills in demand are distributed systems, networking, infrastructure management, and the ability to build reliable systems from unreliable components. The teams shipping production LLM systems look remarkably like teams shipping any other critical infrastructure: disciplined about failure modes, rigorous about evaluation, and unsentimental about which parts of their architecture need to be bulletproof versus which can tolerate uncertainty.
The wild west metaphor has run its course. What we're witnessing isn't the taming of chaos through revolutionary AI breakthroughs, but the gradual emergence of engineering discipline in a domain that initially seemed to defy it. The case studies document this transition in granular detail: real teams, real systems, real problems, real solutions. The patterns are becoming recognisable. The practices are becoming repeatable. The gap between what's possible and what's production-ready is narrowing because the engineering around it has matured, not because the technology has become simpler.
For practitioners, the implication is straightforward: invest in engineering. The organisations winning with LLMs are distinguished by their infrastructure, their evaluation practices, their operational discipline, and their willingness to treat AI systems with the same rigour they'd apply to any mission-critical technology, rather than by their model access or their prompt libraries. The database suggests this isn't going to change. If anything, as LLM systems become more ambitious and more deeply integrated into business operations, the engineering requirements will only intensify.
The experimentation phase has ended. The engineering phase has begun.