Software Engineering

LLMOps in Production: 287 More Case Studies of What Actually Works

Alex Strick van Linschoten
Jul 17, 2025
15 mins

Back in January we published summaries of the 457 case studies that constituted the LLMOps Database at the time. We've since added a bunch more, so we're publishing the latest summaries below; they're a nice way to get a sense of the variety of approaches and use cases in play in 2025.

I see four big trends in the latest batch:

Agents Are Being Used in Production (Finally!)

After years of demos and promises, we're finally seeing agent systems handling real workloads in production. But the agents that actually work look nothing like the autonomous, general-purpose systems we see in research papers (or are promised by big tech companies).

The successful production agents are surprisingly narrow. They're single-domain specialists, operating under more-or-less constant human supervision. Think of them less as autonomous entities and more as really smart, context-aware automation scripts. Deutsche Telekom's customer service platform is a perfect example - it's an agent system, sure, but one that operates within extremely well-defined boundaries with clear escalation paths to humans.

What's particularly telling is that only about 20% of the agent stories in our database involve true multi-agent architectures. And even those are often just orchestrator-worker patterns dressed up in fancy terminology. The "multi-agent" label is doing a lot of heavy lifting here - many of these systems are essentially a main controller delegating specific tasks to specialized sub-modules.
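
To make that point concrete, here's a minimal sketch of what many of these "multi-agent" systems boil down to. Everything in it is illustrative: `call_llm` is a hypothetical stand-in for whatever model client you use, and the worker prompts are toy examples.

```python
from typing import Callable

def call_llm(prompt: str) -> str:
    """Placeholder for a real model client (OpenAI, Bedrock, etc.)."""
    raise NotImplementedError

# Each "agent" is just a specialized prompt wrapped in a function.
WORKERS: dict[str, Callable[[str], str]] = {
    "billing": lambda q: call_llm(f"You handle billing questions only.\n{q}"),
    "scheduling": lambda q: call_llm(f"You handle scheduling only.\n{q}"),
}

def orchestrate(query: str) -> str:
    # The "main controller": one routing call, then one delegated call.
    route = call_llm(
        f"Classify this query as one of {sorted(WORKERS)} "
        f"or 'escalate' if unsure:\n{query}"
    ).strip().lower()
    if route not in WORKERS:
        return "Escalating to a human agent."  # the clear escalation path
    return WORKERS[route](query)
```

Strip away the terminology and it's a router plus some specialized prompt templates, with escalation as the default when the router is unsure. That's not a criticism; it's exactly the narrowness that makes these systems work.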

The vertical agents are where things get interesting though. When you constrain the problem space enough - say, to handling insurance claims or scheduling maintenance tasks - you can build systems that genuinely augment human capabilities rather than just automating rote tasks.

Evals Are the Critical Path

Perhaps this is a truism by now, but as of this writing in July 2025 it seems clear that you'll spend more time building evaluation infrastructure than you will on the actual application logic. And if you're not, you're probably shipping broken features (or you don't care about your users / product).

LLM-as-judge has emerged as the dominant pattern for reference-free scoring, and for good reason - it scales. But don't let anyone tell you it's a complete solution. Every single successful deployment we've analyzed maintains human-in-the-loop golden datasets for their critical domains. The pattern is consistent: start with LLM judges for velocity, but anchor everything to human-validated or human-aligned ground truth.
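
As a sketch of what "anchoring to human-validated ground truth" can look like in practice: before trusting an LLM judge on unlabeled traffic, measure its agreement with a small human-labeled golden set. The `call_llm` helper, the schema, and the threshold below are all illustrative assumptions, not any particular team's setup.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for a real model client

def judge(question: str, answer: str) -> bool:
    """Reference-free LLM-as-judge: PASS/FAIL on a single output."""
    verdict = call_llm(
        f"Question: {question}\nAnswer: {answer}\n"
        "Is this answer correct and well-grounded? Reply PASS or FAIL."
    )
    return verdict.strip().upper().startswith("PASS")

def judge_agreement(golden_set: list[dict]) -> float:
    """golden_set items: {'question': ..., 'answer': ..., 'human_label': bool}."""
    hits = sum(
        judge(ex["question"], ex["answer"]) == ex["human_label"]
        for ex in golden_set
    )
    return hits / len(golden_set)

# Only lean on the judge at scale once agreement with humans is high
# enough, e.g. judge_agreement(golden_set) > 0.9 (threshold is
# domain-specific and should be revalidated when prompts or models change).
```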

Cost awareness is the new reality check. Teams are discovering that running comprehensive evals on every commit can burn through their inference budget faster than actual production traffic. Notion AI's approach shows how to balance this - they've built a multi-layer eval stack that runs lightweight unit tests frequently, with more expensive offline regression tests gated behind specific triggers.
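
One way to implement that kind of gating (a sketch, not Notion's actual setup): run cheap checks on every commit, and trigger the expensive offline regression suite only when the prompts or the pinned model version actually change. The trigger paths and helper names are made up for illustration.

```python
import subprocess

def changed_files() -> set[str]:
    """Files touched by the latest commit."""
    out = subprocess.run(
        ["git", "diff", "--name-only", "HEAD~1", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line for line in out.splitlines() if line}

def should_run_full_regression() -> bool:
    # Illustrative trigger paths; adapt to your repo layout.
    triggers = ("prompts/", "model_version.txt")
    return any(f.startswith(triggers) for f in changed_files())

# run_cheap_unit_evals()           # hypothetical helper, every commit
# if should_run_full_regression():
#     run_expensive_regression()   # gated behind meaningful changes
```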

The most sophisticated teams are running what amounts to defense-in-depth for LLM outputs: unit tests for prompt templates, offline regression suites for model updates, and online guardrails that act as runtime confidence filters. It's not elegant, but it works.
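
The "online guardrail" layer of that stack might look something like this sketch: a runtime confidence filter that withholds low-confidence outputs rather than shipping them. Again, `call_llm` and the 0.7 threshold are placeholders.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for a real model client

def guarded_answer(question: str) -> str:
    answer = call_llm(question)
    score = call_llm(
        "On a scale of 0 to 1, how well-supported and safe is this answer?\n"
        f"Q: {question}\nA: {answer}\nReply with a number only."
    )
    try:
        confidence = float(score.strip())
    except ValueError:
        confidence = 0.0  # treat unparseable judgments as low confidence
    if confidence < 0.7:  # illustrative threshold
        return "I'm not confident enough to answer that; routing to a human."
    return answer
```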

RAG Is Still a Thing (But It's Getting Weird)

Remember when RAG was just "stick your docs in a vector database and hope for the best"? Those days are long gone. What we're seeing now is a wave of new RAG architectures (and a bunch of new coinages and acronyms to boot).

The complexity explosion is real. Writer's graph-based RAG system exemplifies this new breed - they're not just storing embeddings, they're maintaining knowledge graphs that evolve based on usage patterns. We're talking about systems that combine vector search, keyword matching, graph traversal, and reranking pipelines, all orchestrated by yet another LLM.
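
As a toy illustration of how those retrieval signals get combined, here's reciprocal rank fusion over three candidate lists. The retriever functions are hypothetical stand-ins, and real systems like Writer's layer knowledge-graph construction and LLM orchestration on top of this kind of merge.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: docs ranked highly by any retriever float up."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# candidates = rrf_fuse([
#     vector_search(query),    # embedding similarity
#     keyword_search(query),   # BM25-style matching
#     graph_neighbors(query),  # knowledge-graph traversal
# ])
# final = llm_rerank(query, candidates[:20])  # yet another LLM call
```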

The surprising part? This complexity sometimes pays off. The vanilla RAG pattern can hit a quality ceiling quickly (depending on your application's complexity), but these baroque architectures can push accuracy into the 90%+ range for domain-specific queries. The trade-off is operational complexity.

Data Flywheels Are the Moat

Here's the pattern that separates the winners from the also-rans: every sustained success story has figured out how to turn user interactions into training data. Not in some abstract "we'll analyze this someday" way, but as a core part of their product loop.

Cursor, Zapier Agents, Notion AI, and Propel's SNAP benefits assistant all share this DNA. When users correct the AI's output, that correction doesn't just fix the immediate problem - it feeds back into the system, updating prompts, fine-tuning models, or adjusting retrieval strategies.
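
At its simplest, the mechanics can be just logging every correction as a structured event that downstream jobs mine for few-shot examples or fine-tuning pairs. A minimal sketch (the schema is invented for illustration):

```python
import json
import time

def record_correction(query: str, model_output: str, user_fix: str,
                      log_path: str = "feedback.jsonl") -> None:
    """Append one correction event to a JSONL log for downstream mining."""
    event = {
        "ts": time.time(),
        "query": query,
        "model_output": model_output,
        "user_fix": user_fix,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")

# Downstream batch jobs can turn these events into new few-shot examples,
# fine-tuning pairs, or adjustments to retrieval strategy.
```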

The clever bit is how teams are bootstrapping these flywheels. Synthetic data generation has emerged as the go-to strategy for cold starts. Pinterest's search relevance work shows this perfectly - they used GPT-4 to generate training data for smaller models, creating a distillation cascade that dramatically reduced inference costs while maintaining quality.
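
The shape of that cold-start distillation, sketched below (model names and helpers are illustrative, not Pinterest's actual pipeline): an expensive teacher model labels unlabeled data once, offline, and a cheap student model is trained on the result.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for an expensive teacher model

def build_distillation_set(queries: list[str]) -> list[dict]:
    """Teacher labels run once offline; the output trains a small student."""
    examples = []
    for q in queries:
        label = call_llm(
            f"Rate the relevance of the top results for the query '{q}' "
            "on a 1-5 scale, with a one-line rationale."
        )
        examples.append({"input": q, "target": label})
    return examples

# train_student_model(build_distillation_set(sampled_queries))  # hypothetical
# The student then serves production traffic at a fraction of the cost.
```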

What's fascinating is that the flywheel effect compounds faster than most teams expect. Once you have even a modest feedback loop in place, the quality improvements accelerate. The hard part isn't the technical implementation - it's designing the UX to make feedback frictionless and ensuring your legal team is comfortable with the data usage policies.

All of which is to say that we don't seem to be nearing any kind of interim stability point. Through 2025 and 2026, LLMOps architectures and infrastructure look set to keep evolving as new capabilities and techniques emerge.

Case Study Summaries

In the meantime, here are the summaries of the most recent case studies!

  • 11x - 11x rebuilt its AI Sales Development Representative (SDR) product, Alice, into a sophisticated multi-agent system using LangGraph, achieving human-level 2% reply rates. This evolution from a basic AI tool to an autonomous digital worker involved iterating through ReAct and workflow-based architectures to land on a hierarchical multi-agent design.
  • 14.ai - 14.ai, an AI-native customer support platform, leverages the Effect TypeScript framework to build reliable LLM-powered agent systems that interact directly with end users. They use Effect across their stack to manage the complexities of production LLMs, including unreliable APIs and non-deterministic outputs, through strong type safety, dependency injection, and robust error handling, enabling sophisticated agent orchestration and fallback strategies.
  • 42Q - 42Q integrated an AI assistant named Arthur into its cloud-based Manufacturing Execution System (MES) to simplify system understanding and provide real-time production data insights. Leveraging AWS Bedrock and a RAG architecture, Arthur combines comprehensive documentation with live MES data, significantly improving user experience and knowledge accessibility.
  • Adobe - Adobe implemented "Unified Support," an AI-powered RAG system built on Amazon Bedrock Knowledge Bases, to provide thousands of internal developers with immediate, accurate answers from fragmented technical documentation. This solution, leveraging optimized chunking and metadata filtering, achieved a 20% increase in retrieval accuracy, enhancing developer productivity and reducing support costs.
  • Aetion - Aetion, a healthcare software provider, developed a Scientific Intent Translation System using Amazon Bedrock and Claude 3 Haiku to convert complex scientific queries into technical analytics measures. This solution, leveraging a sophisticated RAG system and robust guardrails, significantly reduced the time required for measure implementation from days to minutes.
  • Agoda - Agoda, a major e-commerce travel platform, integrated GPT into its CI/CD pipeline to automate SQL stored procedure optimization, significantly reducing the 366 man-days annually spent on manual tuning and streamlining their database development workflow. This system provides automated analysis, query refinements, and index recommendations, augmenting developer efficiency in a critical part of their operations.
  • Airbnb - Airbnb transformed its customer support with an ML-powered Interactive Voice Response (IVR) system, utilizing conversational AI, including LLM-based ranking, to understand natural language queries and provide intelligent self-service or routing. This advanced multi-model pipeline significantly improved ASR accuracy and intent detection latency, leading to increased self-resolution rates and reduced reliance on human agents.
  • Airbnb - Airbnb deployed LLMs in a sophisticated production pipeline to automate the large-scale migration of 3,500 React component test files, transforming an estimated 1.5 years of manual engineering effort into a 6-week project with a 97% automated success rate.
  • Alibaba - Alibaba developed a data-centric multi-agent platform for enterprise AI, leveraging the Spring-AI-Alibaba framework and tools like Higress and Nacos to deploy LLM-based systems at scale, achieving high resolution rates for customer issues.
  • Alipay - Alipay optimized its Fund and Insurance Search systems by implementing an advanced generative retrieval framework that significantly reduces LLM hallucinations. This is achieved through a novel combination of knowledge distillation reasoning and a decision agent for post-processing, leading to improved search quality and conversion rates.
  • Amazon - Amazon developed a comprehensive security framework to deploy generative AI applications like Rufus and internal chatbots at scale, addressing unique LLM security challenges with automated testing (FAST), layered guardrails, and secure-by-default architectures. This approach enabled a secure transition from experimental AI to robust production systems through continuous monitoring and refinement.
  • Amazon Logistics - Amazon Logistics implemented a multi-agent LLM system, powered by Claude 3.7 via Amazon Bedrock, to optimize complex package delivery planning. This solution combines graph-based analysis with AI agents to process vast data, capture tribal knowledge, and potentially save up to $150 million by improving planning accuracy.
  • An Garda Síochána - An Garda Síochána (Irish Police Force) is digitally transforming its operations with AI-enhanced body cameras and cloud-based digital evidence management, with future plans to integrate LLM capabilities for automated report generation and language translation. This initiative navigates the complexities of deploying AI in a highly regulated law enforcement environment, balancing operational effectiveness with stringent privacy and security requirements.
  • ANNA - ANNA, a UK business banking provider, implemented a hybrid ML and LLM approach for cost-effective transaction categorization in its AI accountant system. By strategically combining offline processing, context window optimization, and prompt caching, they achieved a 75% reduction in LLM costs while maintaining high accuracy for complex business-specific rules.
  • Anomalo - Anomalo developed an LLMOps platform on AWS to address the critical challenge of unstructured data quality for enterprise AI, leveraging Amazon Bedrock for LLM-based analysis to automate document processing, anomaly detection, and PII governance. This solution ensures enterprises can reliably transform vast amounts of unstructured data into high-quality assets for production AI systems.
  • Anterior - Anterior, a healthcare AI company, developed a scalable LLM evaluation system for prior authorization decisions, achieving nearly 96% accuracy while processing high volumes of medical requests. Their innovative approach uses LLMs as judges for real-time, reference-free evaluation, dynamically prioritizing cases for human expert review and significantly reducing the need for large clinical teams.
  • Anthropic - Anthropic tackled the "integration chaos" stemming from rapidly scaling LLM tool calling by implementing a Model Context Protocol (MCP) gateway. This centralized infrastructure standardizes LLM integrations, streamlining authentication, credential management, and routing, which significantly reduces engineering overhead and enhances security and operational efficiency.
  • Anthropic - Anthropic developed a multi-agent research system for Claude's Research feature, employing an orchestrator-worker pattern where a lead agent coordinates specialized subagents to conduct parallel information retrieval and synthesis across diverse sources. This sophisticated architecture scales LLM capabilities beyond single-agent limitations, achieving significant performance improvements for complex information tasks.
  • Anthropic - Anthropic developed Claude Code, a CLI-based coding assistant that leverages their Sonnet LLM to enhance software development workflows. This tool, built with an emphasis on simplicity and composability, has demonstrated significant developer productivity gains (2x-10x) while operating on a cost-effective pay-as-you-go model.
  • Anthropic - Anthropic developed the Model Context Protocol (MCP) to standardize how AI applications integrate with external tools and services, akin to the Language Server Protocol for IDEs. This open, community-driven protocol, leveraging JSON-RPC, simplifies the extension of AI capabilities and has seen adoption from companies like Microsoft and Shopify.
  • Anthropic - Anthropic, a rapidly growing LLM developer, details its sophisticated LLMOps practices for scaling and operating frontier models like Claude. This includes tackling distributed computing challenges, pioneering Constitutional AI for safety, and evolving training pipelines to ensure model coherence and reliability in production.
  • Anthropic - Anthropic developed an LLM-powered agent that plays Pokemon, showcasing Claude's capabilities in long-running, complex decision-making and serving as a valuable internal tool for model evaluation. This project highlights advancements in long-horizon planning, context management, and learning from experience.
  • Aomni - Aomni, a provider of AI agents for enterprise sales teams, continuously evolves its LLMOps practices by simplifying agent architectures and removing guardrails as language model capabilities improve. This "don't bet against the model" philosophy enables them to build more flexible and powerful agents with significantly less code.
  • Apoidea Group - Apoidea Group, a FinTech ISV, operationalized fine-tuned multimodal models like Qwen2-VL-7B-Instruct to revolutionize banking document processing, reducing manual effort from hours to minutes and achieving an 81.1% TEDS score in a highly regulated environment.
  • Apollo Tyres - Apollo Tyres implemented an agentic AI Manufacturing Reasoner, powered by Amazon Bedrock's multi-agent architecture, to automate root cause analysis for their tire curing processes. This solution leverages real-time IoT data and generative AI to reduce manual analysis time by 88%, transforming a 7-hour task into a sub-10-minute automated process and enabling real-time bottleneck identification.
  • Apple - Apple's Apple Intelligence features a large-scale deployment of generative AI, powered by a hybrid architecture combining a compact 3-billion parameter on-device model and a server-based mixture-of-experts model. This sophisticated LLMOps approach enables the delivery of advanced consumer AI capabilities to hundreds of millions of users while maintaining strict privacy, efficiency, and quality.
  • Art Institution - An art institution deployed a sophisticated multimodal search system for its 40 million art assets, leveraging LLMs and vector databases to enable complex text and image-based queries. This production-grade solution successfully optimized for cost and performance, delivering high-quality search results for a specialized domain.
  • Articul8 - Articul8 developed a generative AI platform for manufacturing and supply chain optimization, exemplified by its use at a European automotive manufacturer. This platform, leveraging a "model mesh" and knowledge graph technology, automated root cause analysis for vehicle defects, reducing incident dissemination time by 3x and preserving critical expert knowledge.
  • Articul8 - Articul8, a generative AI company, uses Amazon SageMaker HyperPod to scale the training of its domain-specific models (DSMs). This distributed infrastructure enables them to develop specialized LLMs that significantly outperform general-purpose models in accuracy and efficiency, leading to a 4x reduction in AI deployment time and 5x lower total cost of ownership.
  • AskNews - AskNews developed an automated news analysis and bias detection platform that processes 500,000 articles daily using open-source LLMs like Llama 2 and 3.1 for fact extraction, bias assessment, and knowledge graph creation. This system leverages edge computing to cost-effectively provide nuanced understanding of global news perspectives and identify contradictions across sources.
  • AstraZeneca - AstraZeneca implemented a "Development Assistant," an interactive AI agent designed for natural language querying of clinical trial data. This solution evolved from a single-agent to a scalable multi-agent architecture on Amazon Bedrock, enabling robust data analysis across diverse R&D domains while addressing challenges like performance and domain-specific terminology.
  • AWS - AWS Sales implemented an AI-powered account planning assistant, leveraging Amazon Bedrock and a sophisticated RAG architecture, to streamline the creation of detailed account plans. This enterprise-grade solution integrates diverse data sources, substantially reducing planning time and enabling sales teams to focus more on customer engagement.
  • AWS - Optimizing GPU memory transfer to 3200 Gbps on AWS SageMaker Hyperpod is shown to be critical for achieving high-performance, production-ready large language model training and inference.
  • Bell - Bell, a major telecommunications company, engineered a modular and scalable RAG system featuring a hybrid batch and incremental processing architecture. This solution, built on Cloud Composer and Apache Beam, efficiently manages both static and dynamic knowledge bases, enabling rapid deployment of diverse RAG applications.
  • Benchling - Benchling built a RAG-powered Slackbot using Amazon Bedrock and Claude 3.5 Sonnet to provide their infrastructure team with instant, grounded answers to complex Terraform Cloud questions, effectively streamlining knowledge access from diverse internal and public sources.
  • Bismuth - Bismuth, a software agent startup, developed SM-100, a novel benchmark to evaluate AI agents' capabilities in software bug detection and maintenance tasks, an area where existing benchmarks fall short. Their findings reveal that while popular agents excel at feature development, they struggle significantly with real-world bug identification, often achieving low accuracy and high false positive rates, highlighting the current limitations of LLMs in complex software maintenance.
  • BlackRock - BlackRock's Aladdin Copilot, an AI-powered assistant, integrates generative AI into its core investment management platform using a supervised agentic architecture built on LangChain and GPT-4 function calling. This system helps users navigate complex financial workflows and democratizes access to investment insights, all while meeting the stringent accuracy and compliance demands of the financial services sector.
  • BMW Group - BMW Group deployed a generative AI solution powered by Amazon Bedrock Agents to automate root cause analysis for its extensive connected vehicle fleet, significantly reducing diagnosis times and achieving 85% accuracy in identifying complex cloud incidents. This sophisticated LLMOps approach leverages custom tools for architecture, logs, metrics, and infrastructure analysis.
  • Bolbeck - This case study distills practical lessons from 18 months of deploying GenAI applications, highlighting the architectural complexities, infrastructure demands, and critical considerations for ensuring response accuracy, managing costs, and implementing robust observability in production LLM systems.
  • Bosch Engineering / AWS - Bosch Engineering, in collaboration with AWS, developed a next-generation AI-powered in-vehicle assistant with a hybrid edge-cloud architecture, leveraging LLMs to provide intelligent conversational interfaces that handle complex multi-step requests and integrate seamlessly with external services. This solution, implemented on Bosch's Software-Defined Vehicle demonstrator, also incorporates robust LLMOps for continuous model improvement and deployment at scale.
  • Box - Box, a B2B unstructured data platform, evolved its LLM-powered document data extraction from a simple linear pipeline to a sophisticated multi-agent architecture. This redesign addressed critical production challenges like context window limitations and OCR quality, enabling robust, enterprise-scale processing with improved accuracy and maintainability.
  • Box - Box, a B2B unstructured data platform, initially deployed a straightforward LLM-based metadata extraction system that, despite early success, struggled with enterprise-scale complexity. To overcome these limitations, they evolved to a multi-agent architecture, enabling intelligent field grouping, adaptive processing, and robust quality feedback for more reliable and scalable document processing.
  • BT - British Telecom (BT) is revolutionizing its extensive mobile network operations by implementing AI, ML, and generative AI to achieve an autonomous "Dark NOC" vision. This transformation involves building robust data foundations with AWS, upskilling engineering teams, and deploying an agentic AI framework for automated analysis, predictive maintenance, and self-healing network capabilities.
  • Build.inc - Build.inc developed Dougie, a sophisticated multi-agent system, to automate complex commercial real estate development workflows for data center projects. This hierarchical architecture, leveraging LangGraph for orchestration, reduces a four-week manual process to just 75 minutes by employing over 25 specialized agents in parallel.
  • Bundesliga / Harness / Trice - A roundtable of DevOps and AI experts discussed the practical integration of generative AI into production workflows, highlighting successful applications in areas like code generation and test automation. The discussion also covered critical considerations for security, managing non-deterministic outputs, and ensuring effective team adoption while maintaining human oversight.
  • ByteDance - ByteDance deployed multimodal LLMs on AWS Inferentia2 to efficiently process billions of videos daily for content moderation and understanding. This large-scale implementation achieved a 50% cost reduction while maintaining high accuracy, leveraging techniques like tensor parallelism and model quantization.
  • Capgemini - Capgemini is modernizing automotive software development with its LLM-powered "amplifier" accelerator, which leverages AWS Bedrock to convert whiteboard ideas into formal requirements and automate virtual testing, drastically cutting development cycles from weeks to hours.
  • Capital One - Capital One's Enterprise AI team developed robust input guardrails for LLM-powered applications, employing an LLM-as-a-Judge approach enhanced by Chain-of-Thought fine-tuning. This method, combined with techniques like SFT, DPO, and KTO, significantly improved attack detection rates by over 50% across various open-source models, demonstrating effective safety measures with minimal data and computational resources.
  • Casetext - Casetext successfully deployed GPT-4 in production to create CoCounsel, an AI legal assistant that transformed legal work by establishing rigorous test-driven development and prompt engineering practices to ensure LLM reliability for mission-critical applications, leading to a $650 million acquisition.
  • Choco - Choco, a food supply chain technology company, built an LLM-based system to automate order processing from diverse formats by leveraging few-shot learning with dynamically retrieved examples and rich context injection. This approach prioritizes prompt-based improvements and human-in-the-loop systems for continuous adaptation and high accuracy in production.
  • Choco - Choco developed Choco AI, a system leveraging a modular LLM architecture to automate and scale unstructured order processing for food and beverage distributors, achieving over 95% prediction accuracy and enabling customers to reduce manual order entry time by 60%.
  • Circle - Circle, a fintech company, developed an experimental AI-powered escrow agent system that leverages OpenAI's multimodal models with their USDC stablecoin and smart contract infrastructure. This solution automates agreement parsing and work verification, enabling near-instant, programmable money settlement while maintaining human oversight.
  • Cisco - Cisco implemented a multi-agent AI platform, built on LangChain, to transform its customer experience operations across a 20,000-person organization. This sophisticated system integrates traditional machine learning with LLMs, achieving 95% accuracy in risk recommendations and automating 60% of their annual support cases.
  • Clario - Clario, a clinical trials data solutions provider, successfully deployed a generative AI solution to automate the creation of complex clinical trial documentation. Leveraging a RAG architecture with Amazon OpenSearch and Claude 3.7 Sonnet via Amazon Bedrock, the system significantly enhances accuracy and efficiency in a highly regulated production environment.
  • Cleric - Cleric is developing AI Site Reliability Engineering (SRE) agents that leverage LLMs and multi-layered knowledge graphs to diagnose and troubleshoot production system issues. This system aims to reduce engineer workload and improve incident response by providing reliable, actionable insights.
  • Cleric / DOCETL - Cleric and DOCETL deploy LLM-powered systems in production for automated alert root cause analysis and natural language data processing, respectively. These case studies reveal critical challenges in productionizing AI, such as the lack of ground truth for validation and bridging the "gulf of specification," which they address through sophisticated feedback loops and simulation environments.
  • ClimateAligned - ClimateAligned developed a RAG-based system to analyze climate-related financial documents for major financial institutions, leveraging LLMs, hybrid search, and human-in-the-loop processes. This approach enabled them to achieve 99% accuracy and reduce analysis time from two hours to 20 minutes per company, showcasing effective LLM deployment in production.
  • Cognition - Cognition developed Devin, an autonomous AI software engineer designed to integrate into large-scale codebases by generating pull requests from tickets. It achieves this through DeepWiki, a real-time codebase indexing system, and specialized model training using reinforcement learning, enabling it to function as an AI teammate that learns and adapts to organizational code patterns.
  • Coursera - Coursera implemented a robust LLMOps framework to evaluate and assure the quality of its AI-powered educational tools, including the Coursera Coach chatbot and an AI-assisted grading system. This structured approach, combining heuristic checks and LLM-as-judge evaluations, significantly increased development confidence and accelerated feature deployment.
  • Crew AI / Galileo - Crew AI and Galileo are at the forefront of building production-ready AI agent systems, addressing the complex challenges of multi-agent orchestration and LLMOps at scale. Their work focuses on developing robust evaluation and observability frameworks, managing non-deterministic agent behavior, and establishing governance for enterprise-grade deployments.
  • CrowdStrike - CrowdStrike developed Charlotte AI, an agentic AI system, to automate complex cloud security incident detection, investigation, and response workflows, addressing the escalating volume and speed of cloud-based threats. Leveraging LLMs and deep integration with its Falcon platform, Charlotte AI provides automated triage, correlates multi-layered cloud data, and generates detailed, actionable incident reports, significantly enhancing security operations.
  • Cursor - Cursor, an AI-powered code editor, has scaled to over $300 million in revenue by deploying sophisticated LLMOps practices, including multi-model integration and custom retrieval systems, to power features like real-time code completion and agentic workflows for millions of developers.
  • Cursor - Cursor, a code editing platform, is implementing reinforcement learning (RL) to train its AI-assisted coding models and agent systems in production, addressing unique challenges in reward signal design and building sophisticated LLMOps infrastructure for high-throughput training and long context handling.
  • Cursor - Cursor developed an AI-powered code editor that leverages a hybrid LLM strategy, combining frontier models with custom-trained models for features like instructed edits and codebase indexing. This pragmatic approach, focused on user experience and rapid iteration, enabled them to achieve significant growth and redefine AI-assisted software development.
  • Cursor - Cursor, an AI-assisted coding platform, scaled its infrastructure to process 100 million daily model calls for its custom LLMs, navigating complex challenges in global deployment, including database refactoring and multi-provider rate limit management.
  • Decagon - Decagon has built a comprehensive AI agent system for customer support, powered by an "AI Agent Engine" that leverages LLMs for multi-channel interactions and tool calling. This production-grade solution features intelligent routing, agent assist capabilities, and robust LLMOps practices including extensive testing and monitoring.
  • Deepgram / LangChain - This case study explores building voice-enabled AI assistants in production, detailing how to achieve sub-second latency for natural conversations using technologies like Deepgram for speech processing and LangChain for memory management, while addressing key deployment considerations.
  • Deutsche Telekom - Deutsche Telekom developed LMOS, a comprehensive multi-agent LLM platform, to automate customer service across 10 European countries and multiple channels. This sophisticated system has successfully handled over 1 million customer queries with an 89% acceptable answer rate and dramatically improved development velocity.
  • DoorDash - DoorDash engineered a production-grade LLMOps system to automatically generate personalized menu descriptions for restaurants, leveraging a three-pillar architecture that integrates multimodal data retrieval, adaptive content generation, and continuous evaluation with human feedback.
  • DoorDash - DoorDash developed AutoEval, an LLM-powered, human-in-the-loop system for automated search result quality assessment at scale, replacing inefficient manual evaluations. This system leverages sophisticated prompt engineering and fine-tuned models to achieve a 98% reduction in evaluation turnaround time while matching human rater accuracy.
  • DoorDash - DoorDash built an LLM-based menu transcription system, employing a novel guardrail framework that uses traditional machine learning to predict transcription accuracy and route low-confidence outputs for human review, ensuring high-quality data through a robust hybrid AI-human pipeline.
  • DoorDash - DoorDash leverages large language models (LLMs) to build and maintain a dynamic product knowledge graph, structuring vast amounts of food delivery data for enhanced search, recommendations, and comprehensive menu understanding.
  • Dotdash Meredith - Dotdash Meredith, a major digital publisher, developed its AI-powered Decipher platform, leveraging LLMs and a strategic partnership with OpenAI, to deeply understand content and user intent for privacy-focused ad targeting. This system processes billions of visits, outperforming traditional cookie-based methods by combining domain expertise with real-time content grounding to drive better business outcomes.
  • Dovetail - Dovetail, a customer intelligence platform, developed an MCP (Model Context Protocol) server to enable AI agents to securely access and utilize its proprietary customer feedback data. This strategic infrastructure allows teams to integrate real-time customer intelligence into AI-powered workflows, driving automated content generation and faster, data-informed decisions.
  • Dropbox - Dropbox developed Dash, an AI-powered universal search and knowledge management product, to unify fragmented business information using a sophisticated RAG system and custom AI agents. This solution highlights robust LLMOps practices, including a secure, purpose-built interpreter for agents and a hybrid RAG implementation optimized for enterprise-grade performance and reliability.
  • Duolingo - Duolingo's "Video Call with Lily" offers AI-powered language learning conversations, leveraging structured LLM prompts, chunked processing, and dynamic quality controls to manage context and prevent overload. This robust system ensures a personalized and effective speaking practice experience for users.
  • Duolingo - Duolingo scaled its DuoRadio feature, a podcast-like audio learning experience, by implementing an AI-driven content generation pipeline. This automated system leverages LLMs for script generation and evaluation, combined with Text-to-Speech technology, to massively expand content and reduce costs.
  • Elastic - Elastic developed ElasticGPT, an internal generative AI assistant, leveraging their own Elasticsearch for RAG-based vector search combined with OpenAI's GPT models to provide secure, context-aware knowledge discovery for employees and serve as a reference architecture for clients.
  • Elastic - Elastic implemented a generative AI solution to enhance its customer support operations, starting with a Vertex AI-powered proof of concept for automated case summaries and draft replies. Based on learnings, they evolved to a production-ready RAG architecture leveraging Elasticsearch to integrate domain-specific knowledge and improve response accuracy.
  • Elastic - Elastic's Field Engineering team developed a production-grade RAG-based customer support chatbot, leveraging Elasticsearch to manage a vast knowledge base and ELSER for embeddings. This system employs a hybrid search approach and AI-generated content enrichment to deliver accurate and trustworthy responses.
  • Elastic - Elastic's Field Engineering team developed a production-ready Support Assistant chatbot, focusing on the crucial UI/UX challenges of deploying LLM-powered applications. They implemented innovative solutions for managing response latency with custom animations and advanced timeout handling, and designed a novel UI for sophisticated context management.
  • Elastic - Elastic's Field Engineering team developed and optimized a production GenAI customer support chatbot, leveraging a hybrid search approach and AI-generated summaries to achieve a 78% improvement in RAG search relevance for complex queries like CVEs and version-specific issues.
  • Elastic - Elastic developed a production-grade customer support assistant using generative AI and RAG, with a strong focus on comprehensive observability for LLMs in production. This system leverages the Elastic Stack for detailed monitoring, alerting, and performance analysis, ensuring reliability and continuous improvement.
  • Elastic - Elastic implemented a comprehensive quantitative framework for evaluating and improving its production GenAI features in security applications, such as an AI Assistant and Attack Discovery. This robust LLMOps approach, utilizing LangGraph, LangSmith, and LLM-as-judge techniques, ensures consistent quality and enables data-driven optimization for enterprise-scale deployments.
  • Entelligence - Entelligence is developing an AI-powered platform that leverages multiple LLMs (including Claude, GPT, and Deepseek) to automate code reviews, manage documentation, and provide context-aware search, significantly streamlining engineering operations and improving code quality by learning from team feedback.
  • Exa - Exa developed a sophisticated multi-agent web research system, evolving from a search API to autonomously find and deliver structured information using LLMs. Orchestrated with LangGraph and observed via LangSmith, the system dynamically generates parallel tasks, optimizes token usage through intelligent content retrieval, and processes hundreds of daily queries with structured JSON outputs.
  • Exa.ai - Exa.ai built a sophisticated, large-scale GPU infrastructure, combining 224 NVIDIA A100 and H200 GPUs, to train neural web retrieval models. Their five-layer LLMOps stack, utilizing Pulumi, Alluxio, and Flyte, enables efficient, reproducible, and reliable operation of complex AI workloads at scale.
  • Factiva - Factiva, a Dow Jones business intelligence platform, implemented "Smart Summaries," an enterprise-scale LLM solution powered by Google's Gemini model, to enable natural language querying across its vast repository of nearly 3 billion licensed articles. This deployment meticulously manages intellectual property rights by integrating a new GenAI licensing framework, ensuring transparent attribution, and tracking royalty compensation for thousands of content providers.
  • Factory - Factory is developing an AI agent-driven software development platform for large enterprise engineering teams, aiming to automate end-to-end tasks from tickets to pull requests. Their "droids" leverage sophisticated LLMOps principles including advanced planning, contextual decision-making, and deep environmental grounding with existing development tools.
  • Factory.ai - Factory.ai developed an enterprise-focused autonomous software engineering platform, leveraging specialized AI "droids" to independently handle complex coding tasks and large-scale migrations. This browser-based system integrates with existing enterprise tools and achieves dramatic time savings by optimizing for delegation rather than collaborative AI workflows.
  • Fastmind - Fastmind built a scalable, LLM-powered chatbot platform designed for thousands of users, emphasizing cost-efficiency, performance, and multi-layered security. Their architecture leverages edge computing with Cloudflare Workers, robust rate limiting, and Cohere's AI models to ensure reliable and secure GenAI operations.
  • fewsats - Fewsats enhanced its domain management AI agents by modifying its HTTP SDK to expose complete error details from API responses, rather than just status codes. This crucial change enabled LLM-powered agents to self-correct and effectively resolve API errors in production, eliminating "doom loops."
  • Figma - Figma implemented an AI-powered visual search solution to help users locate specific designs and components across large organizations. Leveraging the CLIP multimodal embedding model and AWS services like SageMaker and OpenSearch, the system processes billions of entries at scale, incorporating extensive cost optimization strategies.
  • Figma - Figma implemented AI-powered visual and semantic search within its design platform to help users efficiently find existing designs. This involved deploying LLMs in production, utilizing RAG, managing billions of embeddings, and developing robust quality evaluation systems to solve practical challenges in a user-centric manner.
  • Fintool - Fintool, an AI equity research assistant, implemented a comprehensive LLMOps workflow to process vast amounts of unstructured financial data, ensuring high accuracy and trustworthiness for institutional investors. This involved continuous evaluation with automated LLM-as-a-judge systems, dynamic golden datasets, and human-in-the-loop oversight, enabling them to scale financial insights in a highly regulated sector.
  • FloQast - FloQast developed an AI-powered accounting automation solution using Anthropic's Claude 3.5 Sonnet on Amazon Bedrock, leveraging Bedrock Agents and Textract to streamline complex transaction matching and document annotation, resulting in significant reductions in reconciliation and audit times.
  • Formula 1 - Formula 1's AI-powered root cause analysis assistant, built with Amazon Bedrock, leverages generative AI and RAG to streamline the resolution of critical IT issues during live race events. This system significantly reduces troubleshooting time from weeks to minutes by enabling engineers to quickly diagnose problems and receive remediation recommendations.
  • Fujitsu - Fujitsu developed an AI-powered sales proposal automation system using a multi-agent architecture, orchestrating specialized AI agents via Azure AI Agent Service and Semantic Kernel. This sophisticated solution integrates with existing knowledge bases, leading to a 67% productivity improvement in proposal creation.
  • Furuno - Furuno, a marine electronics leader, is applying an ensemble AI model, enhanced by LLMs, to integrate invaluable fishermen's domain knowledge for sustainable fishing practices. This innovative approach tackles challenges like limited data and edge deployment, improving fish identification and operational efficiency on vessels.
  • FuzzyLabs - FuzzyLabs, an MLOps consultancy, developed an autonomous SRE agent using Anthropic's Claude and a custom FastMCP client to automate cloud infrastructure incident diagnosis, integrating with Kubernetes, GitHub, and Slack for end-to-end troubleshooting and reporting. This proof-of-concept demonstrates significant LLMOps optimizations like tool caching for cost reduction, while acknowledging ongoing challenges in effectiveness, security, and cost optimization for production readiness.
  • Gardenia Technologies - Gardenia Technologies developed Report GenAI, an agentic AI solution built on Amazon Bedrock, to automate complex ESG reporting by integrating multi-modal data sources for enterprise sustainability compliance. This production-ready system, showcasing advanced LLMOps practices, enabled Omni Helicopters International to reduce their CDP reporting time by 75%.
  • GEICO - GEICO implemented Retrieval Augmented Generation (RAG) for its customer service LLMs to combat hallucinations and "overpromising." They developed the novel "RagRails" approach, which guides LLM responses with specific instructions in retrieved context, significantly improving accuracy and enabling reliable deployment in a regulated insurance environment.
  • Geminus - Geminus develops AI-driven digital twins for industrial infrastructure, using a hybrid approach of synthetic and real data to train ML models that reduce development time from months to days. Their system acts as a trusted advisor to human operators, providing optimization insights while integrating with existing control systems and ensuring safety in high-stakes environments.
  • Georgia-Pacific - Georgia-Pacific, a forest products manufacturer, implemented an "Operator Assistant" using RAG and AWS Bedrock to bridge critical knowledge gaps for new employees operating complex machinery, significantly improving operational efficiency and reducing waste across 45 facilities.
  • GitHub - GitHub developed and continuously evolved robust evaluation systems for Copilot, their AI code completion and chat tool, leveraging methods ranging from objective unit test-based validation with "harness lib" to LLM-as-judge for conversational quality, alongside A/B testing and algorithmic evaluations. This systematic approach was instrumental in transforming Copilot into a reliable production system.
  • GitHub - GitHub built Copilot, a global code completion service that handles hundreds of millions of daily requests with sub-200ms latency using LLMs. Its custom proxy architecture optimizes for performance, reliability, and security through innovations like HTTP/2, multi-region deployment, and efficient request cancellation.
  • GitHub - GitHub developed and scaled Copilot secret scanning, an AI-powered system leveraging LLMs to detect generic passwords in code repositories, overcoming the limitations of traditional regex. This was achieved through advanced prompt engineering, rigorous testing, and sophisticated resource management, resulting in a 94% reduction in false positives and successful enterprise-scale deployment.
  • Glean - Glean implements enterprise search and RAG applications by developing custom, fine-tuned embedding models for each customer. Their solution combines traditional and semantic search, leveraging continued pre-training on company-specific language and LLM-generated synthetic data to continuously improve search quality.
  • GoDaddy - GoDaddy enhanced its product categorization system for 6 million items by leveraging Amazon Bedrock's batch inference capabilities with Claude and Llama 2 models, achieving 97% category coverage and an 8% cost reduction through optimized prompt engineering and robust LLMOps practices.
  • Google - Google Research developed a hybrid LLM-optimization system for its "AI trip ideas in Search" feature, leveraging Gemini models for initial itinerary generation and a two-stage optimization algorithm to incorporate real-world constraints. This ensures practical and feasible travel plans, demonstrating a robust LLMOps approach for deploying LLMs in production for complex planning tasks.
  • Google - Google has developed a three-generation AI system, culminating in the use of its Veo video generation model, to transform 2D product images into interactive 3D visualizations for Google Shopping. This advanced system, fine-tuned on synthetic 3D assets, generates realistic 360-degree product spins from as few as one to three input images, significantly enhancing the online shopping experience and making 3D visualization scalable.
  • Google Deepmind - Google Deepmind's Deep Research is an autonomous AI agent, powered by Gemini 1.5 Pro, designed to conduct comprehensive web research and provide deep topic understanding. It employs a robust asynchronous platform and sophisticated orchestration to manage long-running tasks, iterative planning, and hybrid context management for efficient information retrieval.
  • Grab - Grab implemented an enterprise-scale AI Gateway to centralize and streamline access to multiple GenAI providers for its internal developers. This system provides a unified API, robust cost management, and enhanced security, enabling over 300 diverse LLM-powered use cases across the organization.
  • Gusto - Gusto, a company specializing in payroll and HR solutions, developed a practical method to mitigate LLM hallucinations in customer support by using token log-probabilities as a confidence metric. This approach allows them to automatically filter low-quality responses, achieving a 76% accuracy for high-confidence outputs compared to 45% for low-confidence ones.
  • HackAPrompt / LearnPrompting - Sander Schulhoff's work with HackAPrompt and LearnPrompting addresses critical LLMOps challenges by creating the first AI red teaming competition platform and comprehensive prompt engineering educational resources. HackAPrompt collected 600,000 prompts, becoming a standard dataset for AI companies, and revealed that traditional defenses are ineffective against prompt injection attacks in production.
  • Harvey - Harvey, a legal AI company, deploys large language models at enterprise scale for legal professionals, serving top law firms with a platform for document analysis and drafting. Their unique "lawyer in the loop" methodology integrates domain experts throughout development, supported by a multi-layered evaluation strategy and custom tooling to ensure accuracy in this high-stakes environment.
  • Harvey - Harvey, a legal AI company, has built a robust LLMOps framework for its AI tools, addressing the legal domain's unique complexity through a "lawyer-in-the-loop" development philosophy and a multi-modal evaluation system that includes custom benchmarks like BigLawBench. This systematic approach has enabled them to achieve significant market penetration and accelerate product iteration for legal professionals.
  • HDI - HDI, a German insurance company, developed and optimized a production-grade RAG system to empower customer service agents with natural language access to complex insurance documents. This LLM-based solution, built on AWS and OpenSearch, achieved an 88% recall rate and 6ms query latency, significantly improving information retrieval efficiency.
  • Hexagon - Hexagon implemented a secure, production-grade enterprise AI assistant for its EAM products, deploying an open-source LLM on custom Amazon EKS infrastructure with RAG to ensure data privacy, reliability, and accurate, context-aware responses.
  • Hitachi - Hitachi is evolving its industrial AI solutions by integrating generative AI and LLMs into critical processes like fleet maintenance and automated fault tree extraction. This approach addresses unique industrial challenges such as small data and high reliability, leveraging domain-specific models to augment traditional AI.
  • HubSpot - HubSpot engineered a production-ready CRM integration for ChatGPT, becoming the first CRM provider to implement the Model Context Protocol (MCP) for enterprise AI agent integration. This involved building a custom Java-based MCP server with OAuth-based user permissions, a distributed service discovery system, and a specialized query DSL to enable secure, scalable AI-driven CRM searches for over 250,000 businesses.
  • Hugging Face - Hugging Face developed a production-ready Model Context Protocol (MCP) server to seamlessly integrate AI assistants with its vast ecosystem of models and applications. This involved navigating rapid protocol evolution and making key architectural decisions, including adopting Streamable HTTP and a stateless design for scalable, customizable tool access with integrated authentication and resource management.
  • IBM - IBM's Watson X platform offers a comprehensive enterprise LLMOps solution, providing access to diverse models and emphasizing customization and innovative API design to optimize LLM consumption for regulated industries.
  • IBM Research - IBM Research developed the open-source BeeAI Framework, a TypeScript-based library, to enable full-stack developers to build and deploy production-ready LLM-powered AI agents. The framework focuses on production-grade reliability and transparency, even demonstrating advanced reasoning capabilities with open-source LLMs like Llama 3-70B-Chat.
  • IBM Research / The Zig / Augmented AI Labs - This panel discussion, featuring experts from IBM Research, The Zig, and Augmented AI Labs, delves into the practical challenges of deploying AI agents in enterprise production environments. They share insights on managing costs at scale, implementing human-in-the-loop oversight, navigating the gap between prototypes and production realities, and the importance of evolving technical standards like Agent Communication Protocol.
  • IDIADA - IDIADA optimized its production LLM chatbot, AIDA, by implementing multi-model classification, leveraging embedding-based SVM and ANN models with Cohere embeddings to intelligently route diverse requests. This systematic approach achieved 95% routing accuracy and a 20% increase in team productivity.
  • Impel - Impel, an automotive retail AI company, migrated its Sales AI product to a fine-tuned Meta Llama model deployed on Amazon SageMaker, moving from a third-party LLM. This transition delivered 20% improved accuracy across personalized customer engagement features, while also enhancing cost predictability, security, and operational control.
  • INRIX - INRIX, a transportation intelligence company, partnered with Caltrans to develop an AI-powered solution leveraging Amazon Bedrock and generative AI to identify high-risk road locations and rapidly visualize safety countermeasures. This system utilizes RAG with Claude models and Nova Canvas for image generation, significantly reducing transportation planning design cycles from weeks to days.
  • Instacart - Instacart developed the LLM-Assisted Chatbot Evaluation (LACE) framework to systematically assess its AI-powered customer support chatbot, leveraging multi-dimensional LLM-based evaluation methods, including agentic debate, to ensure quality. This framework enables continuous monitoring and targeted improvements by identifying nuanced issues like context maintenance failures and inefficient responses.
  • Intel / LMSYS - Intel PyTorch Team collaborated with the SGLang project to enable cost-effective CPU-only deployment of large Mixture of Experts (MoE) models like DeepSeek R1 on Intel Xeon 6 processors, leveraging AMX and highly optimized kernels for attention and MoE computations. This solution achieves significant speedups (6-14x TTFT, 2-4x TPOT) over existing CPU frameworks, making large LLM deployment more accessible by supporting various quantization formats.
  • Intercom - Intercom developed Fin, an autonomous AI customer support agent, scaling it from prototype to production by leveraging GPT-4 and a complex, multi-process architecture including custom RAG and ranking. This robust system successfully increased resolution rates from 25% to nearly 60%, demonstrating effective LLM deployment for real-world customer interactions.
  • Intercom - Intercom scaled Fin, their AI customer support chatbot, to production, handling over 13 million conversations for 4,000+ customers with high resolution rates. This was achieved by leveraging multiple LLM providers, including Amazon Bedrock, and implementing advanced reliability engineering practices like cross-region inference and model fallbacks.
  • Intuit - Intuit implemented a GenAI-powered dual-loop system for automated technical documentation management, leveraging LLMs for continuous document improvement via analysis, enhancement, and RAG-based augmentation, alongside semantic search and answer synthesis for enhanced knowledge retrieval.
  • J.P. Morgan Chase - J.P. Morgan Chase's Private Bank developed "Ask David," a multi-agent AI system leveraging RAG and specialized agents to automate investment research and provide real-time insights. Despite its advanced architecture, the system incorporates human-in-the-loop processes, acknowledging the critical need for human oversight in high-stakes financial decisions.
  • JonFernandes - Independent AI engineer Jonathan Fernandez developed a production-ready RAG stack through 37 iterations, demonstrating a robust approach for financial services requiring on-premises deployment. This sophisticated system leverages LlamaIndex, Qdrant, and open-source models to deliver accurate, monitored question-answering capabilities.
  • LiftOff LLC - LiftOff LLC evaluated self-hosting DeepSeek-R1 models on AWS EC2 to reduce reliance on commercial AI services, but found that despite technical feasibility, the operational costs and performance challenges of larger models made it economically unviable for startup-scale operations compared to SaaS alternatives.
  • LinkedIn - LinkedIn's JUDE platform operationalizes fine-tuned LLMs at scale to generate high-quality semantic embeddings for job recommendations, significantly improving key metrics like qualified applications. This comprehensive LLMOps implementation demonstrates how to deploy LLMs in high-stakes, real-time environments for over a billion users.
  • LinkedIn - LinkedIn transformed its job search for 1.2 billion members into an AI-powered semantic system, leveraging a multi-stage LLM architecture with model distillation, GPU-optimized exhaustive search, and synthetic data generation to enable nuanced natural language queries and deliver highly relevant results at scale.
  • LinkedIn - LinkedIn developed a comprehensive Python-based platform to deploy multi-agent systems at scale, strategically pivoting from Java to leverage the generative AI ecosystem. Their LinkedIn Hiring Assistant, a production agent, showcases this architecture by automating recruiter workflows through a supervisor multi-agent design and specialized infrastructure for agent communication and memory.
  • LinkedIn - LinkedIn developed a collaborative prompt engineering platform, leveraging Jupyter Notebooks and LangChain, to enable both technical and non-technical teams to build and iterate on production-grade LLM features. This platform streamlined prompt management, data integration, and version control, leading to successful deployments like AccountIQ, which drastically cut research time.
  • Loka / Domo - Loka and Domo showcase advanced agentic AI systems in production, with Loka's drug discovery assistant orchestrating multiple specialized AI models and databases for complex scientific workflows, and Domo applying agentic solutions to automate business intelligence and financial analysis with human oversight.
  • Love Without Sound - Love Without Sound built a production-grade AI system, leveraging NLP and specialized LLMs, to standardize music metadata and recover billions in unallocated royalties. This modular, data-private solution processes vast datasets in real-time, demonstrating effective LLMOps practices and delivering significant financial recovery for artists.
  • Luna - Luna, a project management AI company, developed an AI-powered Jira analytics system using GPT-4 and Claude 3.7, uncovering critical lessons for production LLM reliability, including the paramount importance of data quality, explicit temporal context, and chain-of-thought prompting.
  • MaestroQA - MaestroQA enhanced its customer service quality assurance platform by integrating Amazon Bedrock, enabling sophisticated analysis of millions of customer interactions through open-ended queries. This flexible solution leverages multiple foundation models to improve compliance detection and sentiment analysis for enterprise clients.
  • Manulife - Manulife implemented a production-grade RAG system to enhance call center operations, innovatively handling both structured and unstructured data sources by directly embedding structured data, which significantly reduced response times and improved CSR efficiency.
  • Merantix - Merantix implements LLM and AI systems in production, emphasizing human-AI synergy and progressive automation across domains like pharmaceutical research and document processing. Their solutions, which leverage foundation models, learn from human input to gradually achieve autonomous operation while maintaining high accuracy in critical applications.
  • Meta - Meta Reality Labs developed a production edge AI system for Ray-Ban Meta smart glasses, utilizing a four-part architecture to deliver real-time multimodal processing. This system overcomes significant wearable AI challenges like power and thermal management, enabling advanced conversational AI and contextual awareness powered by Meta's multimodal models.
  • Meta - Meta's AI infrastructure team developed a sophisticated LLM serving platform, leveraging techniques like continuous batching, distributed inference, and hierarchical caching, to efficiently power Meta AI, smart glasses, and extensive internal ML workflows at scale. This comprehensive approach addresses the complex challenges of productionizing large language models, ensuring high performance and cost efficiency.
  • Meta - Meta engineered a sophisticated AI system for automatic video translation and lip-syncing at scale, orchestrating multiple AI models, including its Seamless universal translator, to preserve original voice characteristics and emotions. This robust solution has significantly boosted content impressions by expanding language accessibility.
  • Meta - Meta details its comprehensive strategy for scaling AI infrastructure to support massive LLM training and inference, covering innovations in distributed systems, data center design, and power management to handle a GPU fleet rapidly growing past 100,000 units.
  • Meta - Meta transformed its global backbone network to manage the unprecedented demands of scaling AI workloads, which caused over 100% year-over-year growth in cross-region data traffic. By optimizing data placement, improving caching, and expanding network capacity, Meta significantly reduced cross-region reads and built a more resilient infrastructure for its global AI systems.
  • Meta - Meta implemented an AI-assisted root cause analysis system, powered by a fine-tuned Llama model and a unique election-based ranking approach, to accelerate incident investigations and enhance context building for responders in their large-scale monorepo.
  • Meta - Meta successfully deployed an AI-powered image animation feature across its apps, demonstrating advanced LLMOps practices for generative AI at scale. This involved sophisticated model optimizations, such as combined distillation and precision reduction, alongside robust infrastructure strategies like regional traffic management and intelligent GPU resource allocation, to efficiently serve billions of users.
  • Meta - Meta developed AI Lab, a pre-production framework for continuous performance testing and optimization of ML workflows, specifically focusing on Time to First Batch (TTFB). This systematic approach enabled up to a 40% reduction in TTFB and proactively prevented performance regressions in their ML infrastructure.
  • Meta / Google / Monte Carlo / Microsoft - A panel of experts from Meta, Google, Monte Carlo, and Microsoft discussed the profound infrastructure challenges of deploying autonomous AI agents in production, emphasizing how their multi-step, non-deterministic nature demands entirely new approaches to networking, security, observability, and evaluation compared to traditional software.
  • Microsoft - Microsoft's AI platform team has engineered a highly optimized network architecture and communication infrastructure to support the training and inference of massive large language models across hundreds of thousands of GPUs. Their innovations, including rail-optimized cluster designs and smart communication libraries like TAL, enable industry-leading performance and cost-effective scaling for cutting-edge AI workloads.
  • Microsoft - Microsoft successfully implemented LLMOps for its LLM applications within a highly restricted network environment, leveraging Azure Machine Learning and Prompt Flow. They tackled the challenge of lengthy evaluation pipelines by introducing an innovative opt-out mechanism, ensuring secure and efficient model deployment while maintaining evaluation rigor.
  • Microsoft - Microsoft engineered an enterprise-scale framework for deploying and managing generative AI projects in production, leveraging LLMOps and Infrastructure as Code to automate setup and significantly reduce project initiation time from weeks to hours while ensuring consistent security and compliance.
  • Microsoft - To tackle the challenge of manually processing high volumes of customer feedback, a retail organization deployed an LLM-based system, built with Azure OpenAI, that automates theme and sentiment extraction, providing consistent, actionable insights through careful prompt engineering and data pipeline design.
  • Microsoft - Microsoft optimized its production multimodal RAG system to effectively answer domain-specific queries using both text and image content, achieving improved retrieval and generative accuracy by strategically employing GPT-4V for image enrichment and GPT-4o for inference.
  • Microsoft - Microsoft's ISE team advises against "unearned complexity" in production LLM systems, highlighting how premature adoption of multi-agent architectures and frameworks like LangChain can introduce significant reliability, debugging, and security challenges. They advocate for starting with simpler, explicit designs and only adding complexity when clearly justified, emphasizing careful dependency and version management.
  • Microsoft - Microsoft developed a robust evaluation system to ensure product image integrity in AI-generated advertising content, combining traditional computer vision techniques like template matching and MSE with deep learning-based cosine similarity. This solution enables scalable 1:1 ad personalization by verifying that generative AI models do not inadvertently modify original product representations.
  • Mistral - Mistral, an AI company, focuses on building and deploying enterprise-grade LLMs, offering comprehensive solutions from custom fine-tuning to on-premise deployment and efficient inference optimization. Their experience highlights the critical importance of data processing, infrastructure stability, and tailored solutions for bringing LLMs to production at scale.
  • Modal - Modal engineered a robust production system for generating aesthetically pleasing, scannable QR codes by combining comprehensive evaluation systems with inference-time compute scaling. The approach, which paired automated evaluation with generating multiple QR codes in parallel and selecting the best, let them hit a 95% scan-rate service-level objective while maintaining aesthetic quality (sketched in code after this list).
  • Monday.com - Monday.com, a work OS platform, developed a digital workforce using multi-agent AI systems built on LangGraph and LangSmith to automate tasks at scale. Their production strategy emphasizes user trust through features like previews and guardrails, leading to significant month-over-month growth in AI usage.
  • Monday.com - Monday.com built a digital workforce of AI agents using LangGraph to help manage the billion work tasks its platform handles annually, prioritizing user trust and control through granular autonomy and human-in-the-loop previews. This multi-agent system, including their "Monday Expert," has seen 100% month-over-month AI usage growth by emphasizing explainability and robust production guardrails.
  • Morgan Stanley / Grab - Morgan Stanley and Grab successfully deployed LLM and GenAI solutions by adopting evaluation-driven LLMOps, with Morgan Stanley optimizing its RAG-based internal document search for financial advisors and Grab enhancing mapping accuracy through advanced computer vision and vision fine-tuning. Both companies demonstrated that starting with simple evaluation frameworks and progressively scaling them is key to rapid iteration and significant performance improvements.
  • Moveworks / NVIDIA - Moveworks optimized their enterprise Copilot's production latency and throughput by integrating NVIDIA's TensorRT-LLM engine, achieving significant performance gains such as a 2.3x increase in token processing speed and a 2.35x reduction in average request latency, which enabled more responsive and scalable conversational AI.
  • Neon - Neon, a serverless Postgres provider, implemented a comprehensive evaluation framework to ensure LLMs reliably select tools from a large set for complex database migration workflows. This framework, leveraging "LLM-as-a-judge" scoring and database integrity checks, enabled them to reach a 100% tool selection success rate through iterative prompt engineering (a minimal eval harness is sketched after this list).
  • Netflix - Netflix developed an automated pipeline using LLMs to generate show and movie synopses, significantly streamlining a previously manual process. Orchestrated by Metaflow, this system integrates LLM-based content summarization and synopsis generation with robust human-in-the-loop quality control and LLM-as-judge evaluation, augmenting creative professionals and boosting efficiency.
  • Netflix - Netflix is enhancing its large-scale entertainment knowledge graph, a foundational system for content understanding and recommendations, by integrating LLMs to infer complex relationships and entity types from unstructured data. This allows for more sophisticated content analysis and enrichment, leveraging a hybrid architecture orchestrated by Metaflow.
  • Netflix - Netflix developed a foundation model for its large-scale personalized recommendation system, adapting LLM-inspired techniques to process hundreds of billions of user interactions. This unified approach addresses the complexity of multiple models, improving recommendation quality while meeting strict production latency and cold-start requirements.
  • Netsertive - Netsertive, a digital marketing solutions provider, implemented an AI-powered call intelligence system using Amazon Bedrock and Amazon Nova Micro to automate the analysis of customer call tracking data, addressing the time-consuming manual review process. This solution processes real-time call transcripts and performs aggregate analysis, delivering actionable insights like sentiment, summaries, and coaching suggestions, significantly reducing analysis time from days to minutes.
  • NewDay - NewDay, a UK financial services company, deployed NewAssist, a generative AI agent assist chatbot leveraging RAG on AWS serverless with Claude 3 Haiku, dramatically reducing customer service answer retrieval time from 90 to 4 seconds and achieving over 90% accuracy through iterative data quality optimization and user-centric refinement.
  • Nimble Gravity / Hiflylabs - A joint study by Nimble Gravity and Hiflylabs provides insights into implementing multi-agent LLM systems in production, detailing architectures such as orchestrator and agent-to-agent patterns, and highlighting a customer service automation use case that delivered $1M in annual savings.
  • Notion AI - Notion AI, serving over 100 million users with its AI-powered workspace features, prioritizes rigorous evaluation and observability, dedicating 90% of its AI development time to these areas. Leveraging platforms like Braintrust and custom LLM-as-a-judge systems, Notion ensures product reliability, supports rapid model switching, and addresses complex multilingual challenges across its diverse AI product suite.
  • Nubank - Nubank, a major digital bank, integrated large-scale transformer-based foundation models into its AI platform to enhance predictive banking decisions by processing sequential customer data. This initiative, accelerated by the Hyperplane acquisition, achieved an average 1.20% AUC lift across benchmark tasks and successfully deployed these models to production, serving over 100 million customers while preserving existing governance.
  • Nubank - Nubank, a leading bank serving 120 million users, has built an AI private banker using large-scale LLM systems for customer service and agentic money transfers, significantly enhancing efficiency and user experience. Leveraging LangChain, LangGraph, and LangSmith, their robust LLMOps infrastructure includes an advanced LLM-as-a-judge evaluation system that achieves near-human accuracy for critical financial operations.
  • NVIDIA - NVIDIA optimized its internal employee support AI agent by implementing a data flywheel approach, fine-tuning smaller models (1B-8B) to achieve 94-96% routing accuracy, matching larger 70B models while delivering 98% cost savings and 70x lower latency. This continuous optimization loop ensures the agent efficiently routes employee queries across various enterprise domains.
  • NVIDIA - NVIDIA developed an innovative system that leverages the DeepSeek-R1 LLM in a closed-loop, iterative process to automatically generate and optimize GPU kernels for attention mechanisms, achieving high success rates on benchmarks by incorporating "inference-time scaling" and robust verification.
  • OfferUp - OfferUp enhanced its local search capabilities by migrating from traditional keyword-based search to a multimodal AI system, leveraging Amazon Bedrock's Titan Multimodal Embeddings and OpenSearch Service to process both text and images into vector embeddings. This transformation significantly improved search relevance and user engagement, leading to a 54% reduction in geographic spread for more local results and a 6.5% increase in search depth.
  • Onity Group - Onity Group, a mortgage servicing company, deployed an intelligent document processing solution using Amazon Bedrock's multimodal foundation models to automate the handling of millions of complex legal and financial documents. This system, which dynamically routes tasks between Amazon Textract and Bedrock, achieved a 50% reduction in extraction costs and a 20% improvement in accuracy.
  • OpenAI - OpenAI successfully scaled its ChatGPT Images feature to an additional 100 million users in one week, generating 700 million images, by rapidly re-architecting its synchronous image generation system to an asynchronous one while in production. This massive scaling effort demonstrated the importance of robust system isolation, resource management, and pragmatic engineering in deploying large-scale GenAI applications.
  • OpenAI - OpenAI has evolved its AI agent development, shifting from manually designed workflows to end-to-end trained agents that leverage reinforcement learning for products like Deep Research, Operator, and Codex CLI. This approach enables their agents to discover more robust solutions and recover effectively from failures in complex production environments.
  • OpenAI - OpenAI successfully developed and deployed GPT-4.5, a frontier large language model, by overcoming unprecedented scaling challenges in LLMOps, including coordinating tens of thousands of GPUs and fostering deep co-design between ML and systems teams.
  • OpenPipe - OpenPipe's ART·E project showcases advanced production LLMOps by developing a specialized LLM agent for email search using reinforcement learning and synthetic data. This agent outperforms general-purpose models like OpenAI's o3 in accuracy, speed, and reliability, demonstrating a cost-effective approach to building domain-specific AI.
  • OpenRouter - OpenRouter developed a multi-model LLM API marketplace and infrastructure platform to tackle the fragmentation and operational complexities of deploying diverse language models in production. It normalizes APIs across hundreds of models, offering intelligent routing, custom middleware, and performance optimizations for developers to efficiently leverage a multi-LLM strategy.
  • OpenRouter - OpenRouter developed a multi-model LLM marketplace and routing platform, serving as a unified API gateway for over 400 models from 60+ providers. This infrastructure addresses LLMOps challenges by normalizing diverse APIs, intelligently routing requests for optimal performance and uptime, and enabling model enhancements through a unique middleware system (see the routing sketch after this list).
  • Orbital - Orbital, a real estate technology company, leverages an agentic AI system, Orbital Co-pilot, to automate legal due diligence by processing billions of tokens monthly. Their experience highlights the concept of "prompt tax," the significant operational overhead of continuously migrating and optimizing over a thousand domain-specific prompts across rapidly evolving LLM architectures.
  • Orbital Materials / Hum.AI - Climate tech startups, including Orbital Materials and Hum.AI, are leveraging Amazon SageMaker HyperPod to develop specialized foundation models for environmental AI applications. These companies train custom models from scratch on massive environmental datasets, enabling advancements like tenfold performance improvements in carbon capture materials and the ability to analyze underwater ecosystems from satellite imagery.
  • OSRAM - OSRAM, a manufacturing technology provider, implemented an LLM-powered chatbot on AWS using Amazon Bedrock and Claude to centralize and manage critical institutional knowledge. This solution leverages RAG and a hybrid search approach to provide accurate, context-aware responses, significantly improving access to technical documentation and achieving over 85% accuracy.
  • Outropy - Outropy developed an AI-powered Chief of Staff system, scaling its agent architecture from a monolith to a distributed system by addressing complex LLMOps challenges like state management, event processing, and API rate limits, ultimately leveraging Temporal for robust workflow orchestration.
  • Outropy - Outropy, while building an AI-powered assistant for engineering leaders, evolved its LLM inference pipeline architecture from monolithic to robust task-oriented pipelines, leveraging Temporal for reliable workflow management and scaling to 10,000 users.
  • Patch - Patch implemented an AI-powered system for local news generation, enabling them to scale coverage to 30,000 communities. This system leverages sophisticated data aggregation and pre-processing from diverse verified sources to produce community-specific newsletters, achieving high user satisfaction and maintaining editorial quality.
  • Pattern - Pattern's Content Brief system leverages LLMs and AWS services like Amazon Bedrock to process trillions of ecommerce data points, providing actionable insights for optimizing product listings and driving significant revenue and traffic improvements.
  • Pinterest - Pinterest improved its ads engagement modeling by deploying a Multi-gate Mixture-of-Experts (MMoE) architecture, which, combined with mixed precision inference and knowledge distillation, significantly reduced inference latency by 40% while enhancing model performance.
  • Pinterest - Pinterest improved its search relevance by implementing a large language model (LLM) teacher-student architecture, using knowledge distillation to train a lightweight model for production serving. This approach led to a 2.18% improvement in search feed relevance and increased fulfillment rates, successfully generalizing across multiple languages.
  • Pinterest - Pinterest significantly enhanced its Homefeed recommendation system by deploying advanced embedding-based retrieval techniques, including sophisticated feature crossing and novel multi-embedding and conditional retrieval approaches, which led to measurable gains in user engagement.
  • Pinterest - Pinterest deployed a large-scale learned retrieval system using a two-tower architecture to enhance content recommendations for its 500M+ users. This system, featuring automated retraining and careful version synchronization, successfully replaced heuristic methods, significantly improving user engagement and content discovery.
  • Prolego - Prolego's engineering team details the practical challenges of building production-ready Retrieval Augmented Generation (RAG) systems, highlighting the complexities of document processing, chunking strategies, and robust evaluation methods. Their insights demonstrate that successful RAG implementation requires addressing numerous technical nuances beyond typical tutorial examples.
  • Propel - Propel, a company serving 5 million monthly SNAP users, deployed AI-powered tools to reduce benefit interruptions, leveraging LLMs for both code generation in a structured triage system and a nationwide conversational AI assistant powered by Decagon. These solutions demonstrated faster benefit restoration and improved user experience, showcasing practical LLMOps in government services.
  • Propel - Propel developed a comprehensive, automated LLM evaluation framework for handling SNAP (Supplemental Nutrition Assistance Program) benefit inquiries, ensuring accuracy and accessibility in a critical public service domain. This dual-purpose framework leverages Promptfoo for automated testing and employs AI models as judges for complex response evaluation, addressing challenges like knowledge cutoff with RAG for robust production deployment.
  • Propel - Propel is developing a systematic evaluation framework for LLMs to ensure accuracy and safety in high-stakes SNAP benefits administration. Their approach includes a custom testing infrastructure, like a Slackbot for comparing frontier LLMs, and integrates domain expertise to build nuanced evaluation criteria for responsible GenAI deployment.
  • Propel - Leveraging Anthropic's Claude 3.5 Sonnet, Propel is developing an AI-powered system to help SNAP recipients interpret complex government notices. This solution provides clear, actionable guidance, demonstrating a careful approach to deploying LLMs in a high-stakes environment where user outcomes directly affect essential benefits.
  • ProPublica - ProPublica, a nonprofit investigative journalism organization, responsibly integrated LLMs into its workflows to efficiently analyze large datasets, such as National Science Foundation grants, by employing careful prompt engineering and human-in-the-loop verification to ensure accuracy and journalistic integrity.
  • Prosus - Prosus developed the "Token Data Analyst" agent, an LLM-powered SQL query generator, to democratize data access across its portfolio companies. This system significantly reduced query response times by 74% and increased data insights, enabling non-technical users to retrieve information from various databases via natural language.
  • Prosus - Prosus engineered production web agents powered by LLMs to automate complex e-commerce interactions, specifically for food ordering, on websites lacking traditional APIs. This involved a modular architecture separating planning and execution, enabling reliable navigation and task completion with an 80% success rate.
  • Prosus / Google / Canonical - This case study illuminates the complex production challenges of deploying voice AI agents powered by LLMs, highlighting the necessity for real-time processing, rigorous testing across diverse linguistic and emotional contexts, and nuanced prompt engineering for successful user interaction.
  • Qovery - Qovery developed an agentic DevOps copilot, evolving from basic intent mapping to a dynamic agentic system with resilience and conversation memory, which now leverages Claude 3.7 Sonnet to autonomously execute complex infrastructure tasks and optimize DevOps workflows.
  • Qualtrics - Qualtrics developed Socrates, a sophisticated ML platform built on SageMaker and Bedrock, to operationalize AI and LLMs at scale for experience management. This unified system supports the entire ML lifecycle, from development to production, and has delivered substantial cost savings and performance gains for generative AI workloads.
  • QuantumBlack - QuantumBlack developed AI4DQ Unstructured, a toolkit designed to enhance data quality for generative AI applications by tackling unstructured data challenges. Leveraging advanced NLP, custom embeddings, and a comprehensive quality assessment framework, the solution significantly improved RAG pipeline accuracy and reduced data storage costs.
  • Quic - Quic deployed over 30 GenAI agents in production for customer experience, leveraging RAG, API integration, and LLM-based testing to achieve a 60% resolution rate for tier-one support issues with higher quality than human agents.
  • Quora - Quora developed Poe as a unified platform, akin to a "web browser for AI," providing consumers with access to multiple large language models and custom AI agents. This multi-model architecture addresses complex LLMOps challenges, enabling creators to build and monetize diverse AI applications at scale.
  • QyrusAI - QyrusAI has built an AI-powered shift-left testing platform that orchestrates multiple specialized LLM agents, each leveraging different foundation models via Amazon Bedrock, to automate and enhance various stages of software testing, leading to significant reductions in defect leakage.
  • Ragas - Ragas presents a systematic evaluation-driven development methodology to help AI engineers rigorously improve LLM applications in production. This approach replaces subjective "vibe checks" with objective, data-driven iteration, leveraging techniques like LLM-as-judge systems and structured experimentation for continuous enhancement.
  • Ramp - Ramp, a fintech company, developed an open-source Model Context Protocol (MCP) server to enable natural language queries over their business financial data, initially by exposing their RESTful API to conversational AI. The solution evolved to an in-memory SQLite database with a SQL interface, significantly improving scalability and letting LLMs accurately analyze tens of thousands of spend events through their strong SQL generation capabilities (sketched after this list).
  • Ramp - Ramp successfully deployed LLM-powered agents for automated expense management, achieving over 65% automated approvals by building user trust through transparent reasoning, explicit uncertainty handling via "escape hatches," and a collaborative context management system that allows users to refine policies. This approach enabled the company to integrate AI into a high-stakes financial environment while maintaining accuracy and user confidence.
  • Ramp - Ramp, a financial technology company, implemented an LLM-powered AI agent with multimodal RAG and vector embeddings to automate merchant classification for corporate card transactions. This solution reduced manual intervention from hours to under 10 seconds per request, achieving 99% accuracy in classifications and significantly cutting operational costs.
  • Ramp - Ramp, a financial technology company, built an in-house RAG system to standardize customer industry classification using NAICS codes, resolving inconsistencies and enhancing data quality for critical business functions. This solution leverages embeddings and a two-stage LLM prompting approach to deliver accurate and auditable classifications.
  • RBC - RBC developed Arcane, a RAG system, to streamline access and interpretation of complex investment policies for its financial specialists. This LLM-powered solution efficiently navigates vast, semi-structured documentation, reducing search time and ensuring consistent, compliant answers in a highly regulated environment.
  • Remitly - Remitly, a global financial services company, developed an AI-powered system leveraging LLMs and custom-engineered prompts to automate marketing compliance, analyzing content against regulatory guidelines to provide real-time feedback and significantly reduce review cycles.
  • Replit - Replit evolved its AI-powered autonomous coding agent, leveraging a multi-model architecture with Claude 3.5 Sonnet, to enable non-technical users to build complete software applications without writing code. This advancement from short-burst to extended runtime operations emphasizes performance and cost optimization, supported by robust observability for managing complex agent behaviors at scale.
  • Reuters - Reuters, a global news organization, has implemented a comprehensive AI strategy, including generative AI, to streamline content production and verification while upholding journalistic integrity. Their LLMOps approach integrates AI tools for tasks like fact extraction, CMS enhancements, and content packaging, all designed with human-in-the-loop processes and strict ethical guidelines.
  • RHI Magnesita - RHI Magnesita, a global mining and manufacturing company, implemented an AI agent to address $3 million in annual losses from customer service errors, leveraging LLMs to consolidate data from multiple systems and standardize order processing. This solution has significantly improved operational efficiency, error prevention, and evolved CSR roles into hybrid analyst positions.
  • Roblox - Roblox built a robust hybrid cloud AI infrastructure to unify fragmented ML efforts and scale large-scale inference for hundreds of production models. Leveraging Kubeflow, a custom feature store, and vLLM, their platform now efficiently processes billions of tokens weekly for AI assistants and handles a billion daily personalization requests.
  • Rocket Companies - Rocket Companies, a FinTech leader, deployed Rocket AI Agent, a conversational AI built on Amazon Bedrock Agents, to provide 24/7 personalized guidance for the complex home buying journey. This solution significantly boosted web-to-loan conversion rates threefold and drastically reduced customer service transfers by 85%, demonstrating the power of domain-specific AI in streamlining client engagement.
  • Rogo - Rogo has implemented a sophisticated multi-model LLM architecture in production to scale financial research and analysis for investment bankers and private equity firms. This tiered system, which includes GPT-4, intelligently routes tasks to optimize performance and cost, enabling significant time savings for analysts.
  • Roots - Roots, an insurance AI company, successfully deployed fine-tuned 7B Mistral models using the vLLM framework for specialized insurance document processing. The models beat generic alternatives like GPT-4 on domain-specific accuracy while delivering 130 tokens/second throughput on A100 GPUs and cost-effective self-hosting for millions of documents annually.
  • Rosco - Rosco rebuilt its product around AI agents for enterprise data analysis, enabling them to query enterprise data warehouses by reasoning through discrete tool calls rather than relying on large context windows. Their production deployment emphasized secure, user-specific data access and optimized agent performance by refining the "Agent Computer Interface" and selecting Claude 3.5 for its balance of speed, cost, and decision-making quality.
  • Salesforce - Salesforce's AI Model Serving team leveraged Amazon SageMaker AI and Deep Learning Containers to build a robust LLMOps framework for high-performance LLM deployment, achieving up to 50% reduction in deployment time through optimized infrastructure and advanced inference techniques.
  • Samsung - Samsung is developing an autonomous semiconductor fabrication system using multi-modal LLMs and reinforcement learning to enhance manufacturing processes. This comprehensive LLMOps approach integrates diverse data types, from sensor readings to engineering notes, to automate equipment control, defect analysis, and process optimization.
  • Scotiabank - Scotiabank implemented a hybrid NLU and LLM-powered chatbot for customer service, utilizing an innovative "AI for AI" approach with custom ML models to automate the review and improvement of chatbot responses. This system, which includes LLM-powered conversation summarization for human agent handovers, achieved significant efficiency gains and marked the bank's first production use of generative AI.
  • Scoutbee - Scoutbee evolved its LLMOps architecture to successfully deploy LLMs in production for enterprise supplier discovery, tackling challenges like hallucinations and domain adaptation through techniques such as RAG, chain-of-thought prompting, and custom guardrails.
  • Sentry - Sentry developed a Model Context Protocol (MCP) server, hosted on Cloudflare, to provide LLMs with real-time access to application monitoring data. This gives AI assistants 16 tools for tasks like error analysis and triggering Sentry's Seer AI agent, integrating critical context directly into development workflows for more effective debugging.
  • Shopify - Shopify's Augmented Engineering team developed Roast, an open-source framework that orchestrates structured AI workflows, to reliably apply AI agents to complex developer productivity challenges like code quality and test coverage by breaking tasks into manageable, deterministic steps.
  • Shopify - Shopify implemented and scaled Vision Language Models (VLMs) for large-scale product classification and understanding across its e-commerce platform, processing over 30 million predictions daily with an 85% merchant acceptance rate. This sophisticated system leverages a comprehensive product taxonomy and advanced optimization techniques like FP8 quantization and in-flight batching for efficient production deployment.
  • Skysight - Skysight demonstrated large-scale content classification on Hacker News, leveraging Small Language Models (SLMs) to efficiently process 40 million stories and billions of tokens. This case study highlights the practical viability of SLMs for cost-effective batch processing in production, challenging the common assumption that larger models are always required.
  • Snorkel - Snorkel developed an AI agent evaluation platform for commercial insurance underwriting, leveraging LangGraph and ReAct agents to simulate complex enterprise environments. This revealed significant challenges with frontier models, including a 36% tool use error rate, hallucinations of external domain knowledge, and a clear tradeoff between accuracy and computational cost.
  • Snorkel AI - Snorkel AI developed an agentic AI copilot for insurance underwriting, using LangGraph and multi-tool integration to assist junior underwriters. Their comprehensive benchmark revealed significant performance variations across frontier models, alongside common production challenges like tool use errors and domain-specific hallucinations, underscoring the complexities of deploying LLMs in specialized enterprise environments.
  • Snowflake - Snowflake optimized vLLM for high-throughput embedding inference on its Cortex AI platform, addressing CPU-bound bottlenecks in tokenization and serialization. By implementing a two-stage pipeline, optimized data serialization, and multiple model replicas per GPU, they achieved up to 16x throughput gains and significant cost reductions (the pipeline shape is sketched after this list).
  • Square - Square implemented a large-scale merchant classification system using the RoBERTa language model to accurately categorize tens of millions of merchants daily. This production-grade LLM solution, built with robust data pipelines and optimized for GPU inference, achieved a 30% improvement in classification accuracy and became central to Square's business operations.
  • Statista / Urial Labs / Talent Formation - Statista, a global data platform, engineered and optimized a production-grade RAG-based LLM system to significantly enhance its search and discovery capabilities. This systematic optimization, involving techniques like query rewriting and dynamic model selection, led to a 140% improvement in answer quality and a 65% reduction in operational costs.
  • StoryGraph - StoryGraph, a book recommendation platform, successfully scaled its LLM and ML infrastructure to handle 300M monthly requests by self-hosting, significantly reducing costs and enhancing control over its diverse AI features while prioritizing data privacy.
  • Stride / Aila Science - Stride developed an AI-powered, text message-based healthcare treatment management system for Aila Science, leveraging LLM-powered agents and a sophisticated human-in-the-loop confidence-scoring system to automate patient interactions and achieve a 10x capacity increase.
  • Swisscom - Swisscom, Switzerland's leading telecommunications provider, implemented an AI-Powered Network Operations Assistant leveraging Amazon Bedrock and a multi-agent RAG architecture to automate complex data gathering and analysis for network engineers, significantly reducing manual effort and enhancing operational efficiency.
  • Tabs - Tabs, a vertical AI company, is developing a revenue intelligence platform for B2B sellers that leverages ambient AI agents to automate financial workflows. These agents operate autonomously in the background, utilizing a "commercial graph" to monitor communications and trigger actions, aiming to create self-operating financial software.
  • Taralli - Taralli significantly improved its LLM-powered food tracking and nutritional analysis system by implementing a systematic evaluation framework and optimizing with DSPy's BootstrapFewShotWithRandomSearch, boosting accuracy from an initial 17% to 76% using Gemini 2.5 Flash (a minimal DSPy sketch follows this list).
  • Telus - Telus developed Fuel X, an enterprise-scale LLM platform that centralizes the management of multiple AI models and services, enabling the creation of over 30,000 customized AI copilots for 35,000+ active users. This platform has delivered significant operational efficiencies, including a 46% self-resolution rate for internal support queries.
  • The Institute of Science Tokyo / National Institute of Advanced Industrial Science and Technology (AIST) - The Institute of Science Tokyo successfully developed Llama 3.3 Swallow, a 70-billion-parameter Japanese large language model, by leveraging Amazon SageMaker HyperPod and advanced distributed training techniques to achieve superior performance on Japanese benchmarks. This project showcases a comprehensive, production-ready LLMOps pipeline for large-scale model training.
  • Thinking Machines / Perplexity / Evolutionary Scale AI / Axiom - A multi-company panel featuring experts from Thinking Machines, Perplexity, Evolutionary Scale AI, and Axiom delved into the current LLMOps landscape, discussing the proliferation of agentic frameworks and the role of reinforcement learning in production LLM systems. They highlighted significant infrastructure and scaling bottlenecks, particularly for large models requiring hundreds of GPUs, and the challenges of tool calling in open-source models.
  • Thomson Reuters - Thomson Reuters implemented a robust LLMOps framework for its legal AI assistant CoCounsel, rigorously evaluating and deploying long-context LLMs for complex legal document analysis. Their multi-LLM strategy and extensive testing revealed that full document context often outperformed RAG for deep analysis, leading to a hybrid approach for handling lengthy legal texts.
  • Tokyo Electron - Tokyo Electron, a leader in semiconductor manufacturing equipment, is implementing LLM-powered Small Specialist Agents (SSAs) to optimize complex production processes, orchestrating domain-specific knowledge and planning to deploy smaller, specialized models for enhanced scalability and security in industrial environments.
  • Trae - Trae achieved state-of-the-art performance in automated software issue resolution, reaching 70.6% accuracy on the SWE-bench Verified benchmark. Their system orchestrates multiple LLMs, including Claude 3.7, Gemini 2.5 Pro, and OpenAI o4-mini, within a sophisticated multi-stage pipeline that generates, tests, and intelligently votes on candidate code patches.
  • TransPerfect - TransPerfect, a global leader in language solutions, implemented a production-grade LLM system powered by Amazon Bedrock to enhance translation workflows. This solution automates post-editing of machine translations and provides AI-assisted transcreation, leading to significant productivity gains for linguists and substantial cost savings.
  • Travelers Insurance - Travelers Insurance, in collaboration with AWS, implemented an automated email classification system using Anthropic's Claude models on Amazon Bedrock and Amazon Textract, achieving 91% accuracy through advanced prompt engineering to efficiently categorize millions of customer service emails.
  • Treater - Treater developed a multi-layered LLM evaluation pipeline for production content generation, leveraging deterministic checks, LLM-based quality assessments, and human feedback analysis to ensure high-quality outputs and continuous system improvement. This robust approach systematically reduces the gap between LLM-generated and human-quality content, demonstrating effective LLMOps for complex workflows.
  • Trellix - Trellix optimized its AI-powered security threat investigation system, Trellix Wise, by implementing a multi-model LLM strategy on Amazon Bedrock. This approach, combining Amazon Nova Micro and Claude Sonnet with a RAG architecture, achieved significantly faster inference and lower costs while maintaining high quality through a multi-pass technique.
  • Trellix / AWS - Trellix, in partnership with AWS, has implemented an AI-powered Security Operations Center (SOC) utilizing a sophisticated multi-agent LLM system on Amazon Bedrock to autonomously investigate the overwhelming volume of security alerts. This agentic AI solution employs a tiered model strategy and dynamically correlates diverse security data, significantly improving threat detection and response efficiency by generating detailed incident reports.
  • Trunk - Trunk engineered an AI DevOps agent to reliably perform root cause analysis for CI test failures, addressing the inherent nondeterminism of LLM outputs. Their approach involved applying robust software engineering practices, including pragmatic model selection (switching to Gemini for deterministic tool calling) and comprehensive testing, resulting in a production-ready system that provides actionable insights to developers.
  • Uber - Uber's Developer Platform team built a suite of AI-powered developer tools using LangGraph, including Validator for code quality and AutoCover for automated test generation, to enhance productivity for its 5,000 engineers. These agentic solutions, deeply integrated into existing workflows, have led to thousands of daily code fixes and an estimated 21,000 developer hours saved.
  • Uber - Uber developed Genie, an internal on-call copilot, using an enhanced agentic RAG (EAg-RAG) architecture to provide accurate, real-time support for engineering security and privacy queries across Slack channels. This system significantly improved response quality by reimagining document processing and integrating LLM-powered agents for query optimization and context refinement, leading to a 27% increase in acceptable answers and a 60% reduction in incorrect advice.
  • Uber - Uber's developer platform team leverages LangGraph and multi-agent systems to build AI-powered tools like Validator for real-time code quality enforcement and AutoCover for automated test generation, significantly boosting developer productivity and code coverage for its 5,000 engineers.
  • Uber - Uber developed FixrLeak, a system leveraging generative AI (GPT-4) and AST analysis to automatically detect and fix Java resource leaks. Integrated into their development workflow, FixrLeak achieved a 91% success rate on eligible cases, significantly reducing manual intervention and enhancing code quality.
  • Uber / Microsoft - A comprehensive study analyzed over 2,000 prompt templates from production LLM applications, including those from Uber and Microsoft, revealing key design patterns that optimize performance and significantly reduce operational costs.
  • UC Berkeley - UC Berkeley researchers developed DocETL, a framework designed to help engineers build reliable LLM data processing pipelines by systematically addressing data understanding and intent specification before optimizing for accuracy, moving beyond common ad-hoc prompt iteration.
  • UniFi - UniFi developed an AI agent system for automated B2B research and sales pipeline generation, leveraging LLMs and evolving its architecture with advanced browser automation and deep internet research tools. This system processes billions of tokens monthly, achieving significant cost reductions through strategic model optimization and highlighting the need for human-in-the-loop evaluation for complex agent behaviors.
  • Untold Studios - Untold Studios, a visual effects and animation studio, deployed a secure AI assistant powered by Amazon Bedrock and Claude 3.5 Sonnet, integrated into Slack, to streamline artists' access to internal resources. This solution, leveraging RAG and custom function calling, has reduced information search time from minutes to seconds while maintaining strict security and reducing support team load.
  • Vericant - Vericant, an educational testing company, leveraged LLMs to develop an AI-powered video interview analysis system that reduced interview review time from 15 minutes to 20-30 seconds, achieved through iterative prompt engineering and systematic evaluation with minimal resources.
  • Verisk - Verisk, a leading data analytics company in the insurance industry, automated complex insurance policy review using a production-grade generative AI system. This RAG-based solution, leveraging Amazon Bedrock and Anthropic's Claude 3 Sonnet, reduced review time from days to minutes with high accuracy, demonstrating robust LLMOps practices.
  • Verisk - Verisk developed PAAS AI, a RAG-based generative AI assistant, to help insurance premium auditors efficiently navigate over 40,000 classification guides in a heavily regulated industry. Leveraging Amazon Bedrock with Claude, OpenSearch, and ElastiCache, the system achieved a 96-98% reduction in document processing time while ensuring accuracy.
  • Vimeo - Vimeo developed a production-grade video Q&A system that leverages LLMs, a multi-level RAG implementation for video transcripts, and an innovative speaker detection system to enable natural conversation with video content. This solution provides accurate, timestamped answers, making video content more accessible and interactive.
  • Wealthsimple - Wealthsimple, a Canadian FinTech, engineered a secure and scalable LLM gateway and platform to safely integrate generative AI within its regulated environment, utilizing self-hosted models, RAG, and a multi-provider strategy to boost employee productivity and optimize client service.
  • Wealthsimple - Wealthsimple, a Canadian fintech company, developed an internal LLM Gateway with custom PII redaction and multi-model support to securely integrate generative AI into its operations. This platform enables secure LLM usage for employee productivity, information retrieval, and operational improvements like automated customer support ticket triaging.
  • Wealthsimple - Wealthsimple, a Canadian financial services platform, developed a secure and scalable LLM gateway to enable enterprise-wide GenAI adoption, evolving from initial audit trails and PII redaction to incorporating self-hosted models, multimodal support, and cloud integrations like Amazon Bedrock. This comprehensive platform now serves over half the company, demonstrating successful and secure GenAI deployment.
  • Weights & Biases - Weights & Biases developed and optimized AI programming agents, achieving the top position on the SWE-bench benchmark by applying systematic MLOps practices, including extensive experimentation and custom infrastructure like Weave and W&B Launch.
  • Windsurf - Windsurf, an AI coding toolkit company, addresses the complex challenge of generating contextually relevant code by optimizing context selection rather than simply expanding context windows. Leveraging its GPU optimization expertise, Windsurf's system intelligently combines real-time user behavioral data with codebase information to deliver highly personalized and accurate code suggestions.
  • Windsurf - Windsurf developed an enterprise AI-powered software engineering platform, leveraging a custom IDE, multi-modal agent architecture, and advanced retrieval systems like Riptide, a custom LLM for code retrieval. This comprehensive system, which includes custom models and integrations across the full developer workflow, has achieved significant improvements in code acceptance rates and demonstrated frontier performance.
  • Windsurf - Windsurf, initially a GPU infrastructure provider, pivoted to create an AI-powered development environment that leverages LLMs and sophisticated evaluation systems to deliver advanced code completion and understanding for hundreds of thousands of users, including major enterprises.
  • Wix - Wix implemented Anna, a sophisticated multi-agent RAG system powered by LLMs, to simplify enterprise data discovery within their extensive data mesh architecture. This system innovatively uses RAG by embedding synthetic business questions, leading to an 83% success rate in retrieving relevant data dimensions for non-technical users.
  • Wix - Wix developed an LLM-based automation solution to update over 2,000 code samples in their Velo API documentation, using GPT-4 for classification and GPT-3.5 Turbo for conversion, validated by TypeScript compilation. This approach transformed weeks of manual work into a single morning, demonstrating high accuracy in code transformations.
  • Woowa Brothers - Delivery Hero's Woowa Brothers division developed an AI API Gateway to streamline GenAI service development and manage multiple LLM providers, including Amazon Bedrock and Azure OpenAI. This centralized infrastructure standardizes access, reduces redundant development, and aims to democratize AI usage across the organization.
  • Writer - Writer, an enterprise AI company, delivers full-stack GenAI solutions for Fortune 500 clients, leveraging their Palmyra models, a scalable graph-based RAG system, and self-evolving models designed for real-time adaptation and high accuracy in complex workflows. Their platform focuses on "action AI" for workflow automation in sectors like healthcare and finance, integrating seamlessly for both business and IT teams.
  • Yelp - Yelp implemented LLMs to enhance search query understanding, focusing on query segmentation and review highlights. Their systematic approach from POC to production involved a tiered model strategy, from GPT-4 for development to fine-tuned smaller models and BERT/T5 for production, optimizing for cost, latency, and quality to improve search relevance and user engagement.
  • Yuewen Group - Yuewen Group, a global online literature platform, leveraged Amazon Bedrock's automated Prompt Optimization feature to significantly enhance its LLM-based text processing. This solution improved character dialogue attribution accuracy to 90%, overcoming initial challenges where unoptimized LLM prompts underperformed traditional NLP models.
  • Zapier - Zapier developed Zapier Agents, an AI-powered platform enabling non-technical users to automate business processes, and discovered that building production AI agents necessitates a robust "data flywheel" for continuous improvement. This involved implementing comprehensive LLMOps, including detailed instrumentation, sophisticated feedback collection, and a hierarchical evaluation framework spanning unit tests to A/B tests, to manage the inherent non-determinism of AI systems.
  • Zed - Zed, an AI-enabled code editor, adapted its rigorous engineering practices to integrate non-deterministic LLM-powered "Agentic Editing" by developing a multi-layered testing strategy that combines stochastic and deterministic evaluations to ensure reliable code generation and editing.
  • ZURU Tech - ZURU Tech, a construction technology company, developed a text-to-floor plan generation system using LLMs, employing both prompt engineering with Claude 3.5 Sonnet on Amazon Bedrock and fine-tuning Llama models on Amazon SageMaker. This advanced LLMOps approach translates natural language descriptions into accurate floor plans, achieving significant improvements in instruction adherence and mathematical correctness.
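
A few of the patterns above are concrete enough to sketch in code. The snippets that follow are illustrative sketches under stated assumptions, not the companies' actual implementations. First, Modal's inference-time compute scaling: sample several stylized QR codes in parallel, gate on scannability, then rank on aesthetics. Here `generate_qr_image`, `decodes_correctly`, and `aesthetic_score` are hypothetical stand-ins for a diffusion pipeline, a QR decoder check, and a learned scorer.

```python
from concurrent.futures import ThreadPoolExecutor

N_CANDIDATES = 8

def generate_qr_image(url: str, style_prompt: str):
    """Hypothetical: one diffusion-model sample of a stylized QR code."""
    ...

def decodes_correctly(image, url: str) -> bool:
    """Hypothetical: decode the image (e.g. with pyzbar) and compare payloads."""
    ...

def aesthetic_score(image) -> float:
    """Hypothetical: learned scorer for visual quality; higher is better."""
    ...

def best_of_n(url: str, style_prompt: str):
    # Spend more inference-time compute: sample N candidates in parallel.
    with ThreadPoolExecutor(max_workers=N_CANDIDATES) as pool:
        candidates = list(pool.map(
            lambda _: generate_qr_image(url, style_prompt),
            range(N_CANDIDATES),
        ))
    # Hard gate first (this is what the scan-rate SLO measures)...
    scannable = [img for img in candidates if decodes_correctly(img, url)]
    if not scannable:
        raise RuntimeError("no scannable candidate; retry with more samples")
    # ...then rank the survivors on aesthetics.
    return max(scannable, key=aesthetic_score)
```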
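
Neon's tool-selection evals combine deterministic checks with LLM-as-a-judge scoring. A minimal harness might look like the sketch below; the judge model, the prompt, and the `agent`/`db` interfaces are assumptions, not Neon's actual framework.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY in the environment

JUDGE_PROMPT = """You are grading an AI agent's tool call.
Expected tool: {expected}
Actual call: {actual}
Reply with JSON: {{"correct": true or false, "reason": "..."}}"""

def judge_tool_call(expected: str, actual: dict) -> dict:
    """LLM-as-a-judge: grades whether the call matches the expected tool."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            expected=expected, actual=json.dumps(actual))}],
    )
    return json.loads(resp.choices[0].message.content)

def run_eval(cases, agent, db) -> float:
    """cases: [{"prompt": ..., "expected_tool": ...}]; agent and db are hypothetical."""
    passed = 0
    for case in cases:
        call = agent.select_tool(case["prompt"])    # returns {"name": ..., "args": ...}
        ok = call["name"] == case["expected_tool"]  # deterministic check first
        ok = ok and judge_tool_call(case["expected_tool"], call)["correct"]
        ok = ok and db.integrity_check()            # e.g. schema/row-count assertions
        passed += ok
    return passed / len(cases)
```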
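
OpenRouter's value proposition is one OpenAI-compatible API in front of many providers. A client-side sketch, assuming the standard `openai` SDK and OpenRouter's documented fallback field (`models`); check the current docs before relying on the exact parameter:

```python
import os
from openai import OpenAI

# One gateway, many providers: point the standard SDK at OpenRouter.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="anthropic/claude-3.5-sonnet",  # primary model
    extra_body={
        # Assumed fallback list: tried in order if the primary errors out.
        "models": [
            "openai/gpt-4o",
            "meta-llama/llama-3.1-70b-instruct",
        ],
    },
    messages=[{"role": "user", "content": "Summarize this incident report."}],
)
print(resp.choices[0].message.content)
```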
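
Ramp's MCP insight was that LLMs write good SQL, so materializing spend events into an in-memory SQLite table beats having the model page through a REST API. A sketch of that shape follows; the schema and `fetch_spend_events` are invented, and the real thing is Ramp's open-source MCP server.

```python
import sqlite3

def fetch_spend_events() -> list[dict]:
    """Hypothetical: page through the REST API once, up front."""
    ...

def build_db(events: list[dict]) -> sqlite3.Connection:
    # Materialize everything into one SQL-queryable table.
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE spend_events (
        id TEXT PRIMARY KEY, merchant TEXT, category TEXT,
        amount_cents INTEGER, occurred_at TEXT)""")
    conn.executemany(
        "INSERT INTO spend_events VALUES "
        "(:id, :merchant, :category, :amount_cents, :occurred_at)",
        events,
    )
    return conn

def run_sql(conn: sqlite3.Connection, query: str) -> list[tuple]:
    """The single tool exposed to the model: read-only SQL over the table."""
    if not query.lstrip().lower().startswith("select"):
        raise ValueError("read-only: SELECT statements only")
    return conn.execute(query).fetchall()
```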
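
Snowflake's fix for CPU-bound embedding serving decouples tokenization from GPU inference so the GPU never idles waiting on the tokenizer. The two-stage shape, with `tokenize` and `embed_batch` as stand-ins for the real tokenizer and vLLM call:

```python
import queue
import threading

BATCH_SIZE = 64
SENTINEL = None

def tokenize(text: str):
    """Hypothetical CPU-side tokenizer call."""
    ...

def embed_batch(batch) -> list:
    """Hypothetical GPU-side vLLM embedding call."""
    ...

def tokenizer_stage(texts, q):
    # Stage 1 (CPU): tokenize and batch, feeding the GPU through a queue.
    batch = []
    for text in texts:
        batch.append(tokenize(text))
        if len(batch) == BATCH_SIZE:
            q.put(batch)
            batch = []
    if batch:
        q.put(batch)
    q.put(SENTINEL)

def gpu_stage(q, out):
    # Stage 2 (GPU): drain batches as fast as they arrive.
    while (batch := q.get()) is not SENTINEL:
        out.extend(embed_batch(batch))

def embed_all(texts) -> list:
    q, out = queue.Queue(maxsize=8), []
    stages = [threading.Thread(target=tokenizer_stage, args=(texts, q)),
              threading.Thread(target=gpu_stage, args=(q, out))]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return out
```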
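
Finally, Taralli's jump from 17% to 76% accuracy came from DSPy's optimizer rather than hand-tuned prompts. `BootstrapFewShotWithRandomSearch` is a real DSPy optimizer; the signature, metric, and training examples below are invented for illustration.

```python
import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# The model named in the write-up, via DSPy's LiteLLM-style naming.
dspy.configure(lm=dspy.LM("gemini/gemini-2.5-flash"))

class FoodLog(dspy.Signature):
    """Extract structured nutrition data from a free-text food log entry."""
    entry: str = dspy.InputField()
    calories: int = dspy.OutputField()

program = dspy.Predict(FoodLog)

def exact_calories(example, pred, trace=None):
    # Metric the optimizer maximizes over candidate few-shot programs.
    return int(pred.calories) == int(example.calories)

trainset = [
    dspy.Example(entry="two eggs and toast", calories=320).with_inputs("entry"),
    # ...more labeled examples in practice
]

optimizer = BootstrapFewShotWithRandomSearch(
    metric=exact_calories,
    max_bootstrapped_demos=4,
    num_candidate_programs=8,
)
optimized = optimizer.compile(program, trainset=trainset)
print(optimized(entry="grilled chicken salad").calories)
```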
