Software Engineering

LLMOps in Production: Another 419 Case Studies of What Actually Works

Alex Strick van Linschoten
Dec 15, 2025
18 mins

Seventeen months ago, we began a project to document how companies were actually putting Large Language Models and GenAI workflows into production. We weren't interested in hype or Twitter demos; we wanted to find the engineering reality behind the buzz. What started as a collection of 300 entries has now grown into the largest curated repository of its kind.

Today, the ZenML LLMOps Database holds 1,182 production case studies.

This represents a massive corpus of engineering wisdom:

  • 7.3 million tokens of source material read and analyzed.
  • 4 million tokens of summaries generated.
  • 17 months of continuous curation.

In our previous updates (back in January and later in the spring), we shared batches of these summaries to help practitioners get a sense of the variety of approaches in the wild. Since our last update, we’ve added 419 new entries, covering everything from multi-agent architectures in manufacturing to HIPAA-compliant RAG systems in healthcare.

We know that navigating over a thousand case studies is a tall order. While we are currently working on a dedicated "State of LLMOps" analysis, a deep dive into the patterns and insights from recent months that we will publish later this week, sometimes the raw data is just as valuable as the synthesis.

Below, we are publishing the latest batch of high-level summaries. These are short, scannable abstracts of the problems companies faced and the specific solutions they engineered. Whether you are looking for architectural inspiration or just curious about what your peers are building, this list offers a transparent window into the current state of the industry.

Here are the latest additions to the database:

  • 11X - 11X developed Alice, an AI Sales Development Representative, to automate personalized sales email outreach at a massive scale (50,000 emails/day). This was achieved by building an advanced RAG knowledge base that replaced a manual system, leveraging specialized parsing vendors (Llama Parse, Firecrawl, CloudGlue) for multi-modal content ingestion, strategic waterfall chunking, Pinecone for vector storage, and deep research agents for sophisticated context retrieval. This architecture enables Alice to generate highly contextual and personalized emails, significantly outperforming human SDR volume.
  • A large energy supplier - An energy utility faced high call volumes and long handling times for its technical help desk supporting field technicians. Infosys Topaz implemented an LLMOps solution leveraging Amazon Bedrock (Claude Sonnet) and AWS services, building a RAG system with OpenSearch Serverless to process call transcripts and provide an AI assistant. This resulted in the AI assistant handling 70% of calls, a 60% reduction in average handling time, and a 30% increase in customer satisfaction.
  • A large public healthcare company that creates software for radiologists - A healthcare company utilized PromptQL's domain-specific LLM platform to automate complex medical procedure code selection during patient appointment scheduling, addressing a bottleneck where operators spent 12-15 minutes per call navigating varied rules. This solution enables non-technical healthcare administrators to express business logic in natural language, which is converted into deterministic, executable code, bridging the gap between domain expertise and technical implementation. The implementation significantly reduces call times and training costs, projecting a $50-100 million business impact by improving operational efficiency and agility in managing thousands of clinic-specific rules.
  • AArete - AArete developed Doxy AI, a generative AI solution built on AWS Bedrock using Anthropic Claude models, to extract structured metadata from complex healthcare and financial services contracts. This system replaced unscalable manual and rules-based methods, achieving 99% accuracy and processing up to 500,000 documents per week. The solution resulted in a 97% reduction in manual effort and generated $330 million in client savings by enabling efficient contract analysis and claims overpayment identification.
  • AbbVie - AbbVie's Gaia platform leverages generative AI on AWS serverless infrastructure to automate the creation of highly regulated clinical and regulatory documents in pharmaceutical R&D, addressing a massive documentation burden. It features a modular architecture with a document orchestrator, enterprise prompt library, real-time integration with over 90 data sources, and multi-model LLM access via AWS Bedrock, employing human-in-the-loop workflows for quality assurance. The platform has automated 26 document types, saving 20,000 annual hours, with plans to scale to 350+ types by 2030, targeting over 115,000 hours in savings.
  • Abrigo - Private equity portfolio companies are deploying LLMs in production, exemplified by Abrigo using GenAI for intelligent workflows in banking to automate tasks such as credit memo generation and fraud alert summarization. Successful implementations prioritize high-friction business problems, leveraging rapid experimentation and pragmatic data readiness, rather than technology-first approaches. This transformation has made AI a mandatory consideration in investment diligence, despite ongoing challenges in talent, data security, and organizational change management.
  • Abundly.ai - Abundly.ai developed a production-grade platform for deploying autonomous AI agents in enterprises, enabling them to initiate actions, utilize various tools, and interact across multiple channels. The platform features a runtime environment, retrieval-based context management, dynamic UI generation, and robust guardrails, emphasizing "context engineering" for reliable operation. Successful deployments demonstrate significant efficiency gains, such as 95% time savings in investment screening and improved decision quality, by treating agents as digital colleagues with defined autonomy and oversight.
  • Accenture - Accenture's Spotlight platform utilizes Amazon Nova foundation models and Amazon Bedrock Agents to automate video content analysis and highlight generation, transforming workflows from hours or days to minutes. This multi-agent LLM system addresses scalability challenges in media production, achieving 10x cost savings and maintaining quality through a serverless AWS architecture and human-in-the-loop validation. The platform supports diverse applications, including sports editing, social media content creation, and real-time retail personalization.
  • Agoda - Agoda drove a company-wide GenAI transformation, initiated by a hackathon engaging 200+ developers and prototyping 40+ ideas, which scaled to over 200 production applications. This was underpinned by a centralized GenAI Proxy providing intelligent routing, governance, cost attribution, and an internal GenAI-agnostic Chat Assistant Platform. This "Inside-Out" strategy fostered internal skill development, achieving 73% employee adoption and deploying sophisticated tools like AskGoda, which automates 50% of tech support tickets.
  • AI21 Labs - AI21 Labs evolved its LLM product strategy, which began with task-specific, fine-tuned models wrapped in pre/post-processing to address specific business needs. Recognizing the challenge of context identification, they developed a RAG-as-a-Service offering featuring semantic chunking and configurable retrieval. This culminated in Maestro, a multi-agent orchestration platform that decomposes complex queries into subtasks, orchestrates agents and tools, and provides full traceability for advanced reasoning and enterprise requirements.
  • Airtable - Airtable developed Omni, an AI-powered Q&A assistant for complex database research, addressing LLM limitations like hallucinations and context window issues with large schemas and ambiguous queries. Their solution employs an agentic framework with multi-step reasoning, contextual schema exploration, planning/replanning mechanisms, hybrid search, and a token-efficient citation system. A robust evaluation framework, combining curated test suites and production feedback, ensures the system's reliability and enables continuous iteration.
  • Airtable - Airtable developed a custom asynchronous event-driven agentic framework to power advanced AI features like Omni and Field Agents, moving beyond simple LLM capabilities. This framework utilizes a state machine with a context manager, tool dispatcher, and LLM-backed decision engine to enable dynamic decision-making, tool execution, and self-correction. It addresses critical LLMOps challenges such as multi-layered context management, structured error handling, and context window limitations through trimming and LLM-based summarization strategies.
  • Alan - Alan, a healthcare company, deployed AI agents to automate complex customer service for 1 million members, achieving 30-35% automation with human-comparable quality and processing 60% of reimbursements in under 5 minutes. Their solution utilizes a multi-agent architecture with specialized agents employing a ReAct loop for tool calling, transitioning from deterministic workflows to flexible playbooks. Critical to their success was building custom orchestration and extensive internal tooling that empowered domain experts to configure, debug, and maintain agents without engineering bottlenecks.
  • Amazon - Amazon rearchitected Alexa to evolve its scripted voice assistant into Alexa Plus, a generative AI-powered conversational system capable of complex multi-step planning and real-world actions for over 600 million devices. This transformation required a multi-model architecture, extensive prompt engineering, prompt caching, speculative execution, and API refactoring to balance accuracy, sub-2-second latency, and the interplay between determinism and creativity at massive scale.
  • Amazon - Amazon scaled Rufus, an AI-powered conversational shopping assistant, to 250 million users, initially deploying a custom in-house LLM on AWS silicon for specialized shopping queries. They transitioned to a multi-model architecture, integrating Amazon Bedrock's foundation models (e.g., Amazon Nova, Claude Sonnet) with their custom model to accelerate development and enable intelligent query routing, context management, and agentic tool use for live data and actions. This hybrid approach, leveraging web grounding and optimizations like prompt caching, balances specialization with agility for a massive-scale production LLM system.
  • Amazon - Amazon developed an AI-powered multi-agent system on Amazon Bedrock AgentCore Runtime to automate global compliance screening for approximately 2 billion daily transactions. This three-tier system employs fuzzy matching, vector embeddings, traditional ML, and specialized LLM agents that investigate potential matches by following strict SOPs and utilizing various tools. The system achieves 96% accuracy with 100% recall, automating decision-making for over 60% of cases while maintaining full auditability and escalating complex scenarios to human reviewers.
  • Amazon - Amazon Finance implemented an AI assistant to streamline financial data discovery and business intelligence for analysts struggling with vast, disparate datasets. This RAG solution utilizes Amazon Bedrock with Anthropic's Claude 3 Sonnet and Amazon Kendra Enterprise Edition for retrieval, achieving 83% precision and 88% faithfulness in knowledge search tasks. The system significantly reduced information discovery time by 85%, improving efficiency and accuracy over traditional methods.
  • Amazon - Amazon Health Services developed an LLMOps solution on AWS to overcome the limitations of traditional e-commerce search for complex healthcare queries. This system integrates a query understanding pipeline using ML and LLMs, an LLM-enhanced product knowledge base for semantic search, and a hybrid human-LLM relevance optimization system for Retrieval Augmented Generation (RAG). The solution now efficiently processes daily health searches, significantly improving customer discovery of relevant healthcare services and products.
  • Amazon - To scale their conversational AI shopping assistant, Amazon's Rufus team developed a multi-node LLM inference solution using AWS Trainium, vLLM, and ECS, as single-node capacity was insufficient. The architecture features a leader/follower design, hybrid context/data parallelism, and network topology-aware node placement, leveraging Neuron Distributed Inference with EFA for high-bandwidth communication. This enabled successful deployment across tens of thousands of Trainium chips, supporting high-traffic events and larger, more capable models for millions of customers.
  • Amazon Prime Video - Amazon Prime Video deployed two AI-powered solutions to manage content at scale: an artwork quality moderation system and a streaming quality management system. The artwork system uses multimodal LLMs and Strands agents to automatically detect defects like safe zone violations and mature content in partner submissions, reducing manual review by 88% and evaluation time from days to minutes. Concurrently, a multi-agent AI system, also built with Strands and Amazon Bedrock, autonomously detects, diagnoses, and mitigates streaming quality issues in real-time for their global audience.
  • Amplitude - Amplitude developed an internal AI agent platform, "Moda," to democratize access to its vast, siloed enterprise data and accelerate product development. Built with a custom framework ("Langley") leveraging Glean API for enterprise search, Moda provides multi-interface access (Slack bot, web app) and uses advanced agent orchestration for tasks like thematic analysis and multi-stage Product Requirements Document (PRD) generation, achieving rapid viral adoption and significant workflow compression.
  • Anterior - Anterior developed "Scalpel," a custom review dashboard, to monitor and improve their production LLM system for automating medical decision-making. Scalpel optimizes human review by surfacing contextual information hierarchically, streamlining the review workflow, and enabling domain experts to identify failure modes and suggest direct system improvements like prompt modifications or knowledge base additions. This approach generates actionable data, allowing for efficient, high-quality evaluation of AI outputs and tight integration of feedback into the LLM development lifecycle.
  • Anterior - Anterior developed an Adaptive Domain Intelligence Engine to address the "last mile problem" in applying LLMs to healthcare insurance administration for medical necessity reviews. This system leverages custom tooling for domain experts to systematically identify and categorize AI failure modes (e.g., medical record extraction, clinical reasoning), inject domain knowledge, and iteratively refine the LLM's performance. This process enabled Anterior to achieve 99% accuracy in care request approvals, improving upon a 95% baseline from initial model development.
  • Anthology - Anthology, an education BPO, transformed its contact center to an AI-first solution using Amazon Connect to address extreme seasonality, legacy system reliability issues (12 outages/peak), and repetitive student inquiries across 8 million annual interactions. The implementation leveraged AI virtual agents for self-service, AI agent assist for real-time guidance to human agents, and Contact Lens for AI-powered analytics and 100% automated quality assurance. This resulted in a 50% reduction in wait times, a 14-point increase in response accuracy, a 10% decrease in agent attrition, and improved system reliability, reducing unplanned outages from 12 to 2 during peak periods.
  • Anthropic - Anthropic's platform strategy for production agentic LLM systems, exemplified by Claude Code, focuses on maximizing performance through three pillars: exposing model capabilities via API features, advanced context window management, and robust agent infrastructure. API features include extended thinking with token budgets and reliable tool use, while context management leverages Model Context Protocol, memory tools for selective retrieval, and context editing, yielding a 39% performance improvement. For autonomous operation, Anthropic provides a secure code execution tool with sandboxed environments, container orchestration, and session persistence, complemented by agent skills for executing specific tasks.
  • Anthropic - Anthropic's experience building production AI agents, like Claude Code, demonstrates a critical architectural shift from rigid workflow-based systems to flexible agentic architectures where LLMs operate in a loop with tools to autonomously solve open-ended problems and recover from errors. Key technical challenges involve "context engineering" to manage context window limits and "context rot" through optimized system prompts, progressive tool disclosure, and memory systems for long-horizon tasks. This approach, encapsulated in the Claude Agent SDK, focuses on robust, cost-efficient infrastructure and anticipates agents gaining full computer access for broader applications beyond software engineering.
  • Anthropic - Anthropic addressed the challenge of LLM agents failing on long-running, multi-context software development tasks due to context window limitations and lack of persistent memory. Their solution uses a dual-agent harness: an initializer agent sets up a structured environment with detailed JSON feature lists and version control, while a coding agent works incrementally, making clean commits, and using browser automation for robust testing. This approach enables sustained, multi-session development of production-quality web applications by providing explicit behavioral constraints and structured context management.
  • Anthropic - Anthropic developed and open-sourced the Model Context Protocol (MCP) to standardize how LLMs connect to external data sources and tools, addressing the problem of custom, duplicated integration efforts across production applications. MCP evolved from requiring local servers to supporting remote-hosted endpoints and native API connectors, significantly reducing developer friction and enabling sophisticated use cases. Production deployment emphasizes careful tool description as prompt engineering and selective context management to optimize model performance, cost, and leverage emergent cross-tool capabilities.
  • Anthropic - This case study presents a methodology for scaling LLM application observability by using LLMs to automatically analyze user conversations, addressing the challenge of data paralysis from millions of complex interactions. The approach involves an LLM-powered pipeline for summarizing, embedding, and hierarchically clustering conversations to identify common issues and usage patterns (sketched after this list). This enables teams to develop LLM-as-judge classifiers for continuous monitoring, prioritize high-leverage fixes based on data, and quantitatively track product improvements, as demonstrated by Anthropic's CLEO system and the open-source Kura library.
  • Anthropic - Anthropic's Claude Developer Platform supports production-ready autonomous agentic systems by "unhobbling" models, enabling them to autonomously select tools and manage workflows. The platform provides the Claude Code SDK as a general-purpose agentic harness, automating tool calling and context management, alongside features like web search, code execution, prompt caching, and agentic memory. This approach aims to maximize model capabilities by minimizing developer-imposed constraints while offering critical production features for observability and control.
  • Anthropic - Anthropic's Claude Code utilizes a single-threaded master loop ("nO") for autonomous coding, prioritizing debuggability and transparency over complex multi-agent systems. This architecture integrates real-time steering ("h2A"), context compression to Markdown files, comprehensive sandboxed tools (e.g., GrepTool, diff-based editing, Bash with safety), and controlled sub-agent parallelism. Its pragmatic design, featuring a flat message history and robust safety, proved highly effective, leading to continuous user engagement and the implementation of usage limits.
  • App.build - App.build developed production AI agents for software development, distilling six principles for robust deployment. These principles emphasize detailed system prompt engineering, strategic context management, simple and idempotent tool design, and actor-critic feedback loops for systematic validation. They also highlight using LLMs for meta-agentic error analysis and attributing agent failures to system design rather than solely model limitations, reflecting mature LLMOps practices.
  • Arcade - Arcade identified a critical security gap in the Model Context Protocol (MCP) where AI agents lacked secure mechanisms to obtain third-party credentials for external services, forcing insecure workarounds. They extended MCP's elicitation framework with a new URL mode (PR #887) to enable secure OAuth 2.0 flows. This solution leverages browser redirects for authentication, establishing clear security boundaries between trusted servers and untrusted clients, thereby facilitating production-ready AI agent deployments with proper scoped access.
  • Arize AI - Arize AI developed Alyx, an AI agent embedded in their observability platform to democratize expert debugging and optimization of ML/GenAI applications by codifying solutions architect expertise. Initially, Alyx used a structured, "on-rails" tool-calling architecture with GPT-3.5, offloading complex mathematical computations to traditional code, and is now evolving towards a more autonomous, planning-based system. Its development involved iterative prototyping with real customer data, extensive internal dogfooding, and a comprehensive, multi-level evaluation framework to ensure reliability and effectiveness.
  • Articul8 - Articul8 developed a domain-specific generative AI platform leveraging public models, proprietary data, and a "model mesh" for intelligent runtime orchestration of LLM and non-LLM models. Built on AWS with SageMaker HyperPod for distributed training, it processes vast multimodal data to create knowledge graphs, enabling automated root cause analysis and supply chain optimization. For an automotive manufacturer, this platform reduced incident dissemination time from 90 to 30 seconds, automating expert functions and improving production yield by connecting incident data with supplier and inventory information.
  • AstraZeneca - AstraZeneca deployed enterprise-wide agentic AI platforms with AWS, including a Clinical Development Assistant and an AZ Brain commercial platform, to accelerate drug development and commercial operations. The Development Assistant uses a multi-agent system integrating 16 data products for 1000+ R&D users, while AZ Brain employs 500+ AI models and agents on a unified data foundation to provide personalized commercial insights. These production-grade LLM deployments have reduced time-to-market for workflows from months to weeks and increased prescription generation by 2x for commercial teams, demonstrating significant ROI in a regulated environment.
  • Atlassian - Atlassian implemented an ML-based comment ranker to improve the quality of LLM-generated code review comments by filtering out noisy or unhelpful suggestions. This system uses a fine-tuned ModernBERT model, trained on proprietary user interaction data where "code resolution" (actual code changes in response to a comment) serves as the ground truth. It has significantly increased code resolution rates from approximately 33% to 40-45%, nearing human performance, and operates robustly across various underlying LLMs for over 10,000 monthly active users.
  • AWS - AWS implemented Account Plan Pulse, an LLMOps solution built on Amazon Bedrock, to automate and optimize internal sales account planning by processing CRM data, evaluating plans against business criteria, and generating recommendations. The system employs sophisticated preprocessing, structured output prompting, and a statistical Coefficient of Variation (CoV) analysis across multiple model runs to manage LLM non-determinism and stabilize output variability for automated review thresholds (sketched after this list). This implementation resulted in a 37% improvement in plan quality and a 52% reduction in processing time.
  • AWS - Japan's GENIAC program utilized AWS to provide 12 organizations with 127 P5 and 24 Trn1 instances for large-scale foundation model training. Success required a comprehensive operational framework beyond raw compute, encompassing structured cross-functional support, pre-validated reference architectures (AWS ParallelCluster, SageMaker HyperPod, FSx for Lustre), robust monitoring (Prometheus/Grafana), and extensive enablement programs. This systematic approach facilitated the successful training of multiple 100B+ parameter models, demonstrating that large-scale AI development is fundamentally an organizational and systemic challenge.
  • AWS / Vercel - The case study identifies inadequate platform architecture as a key reason 46% of AI POCs fail to reach production, stressing the need for robust foundations in model switching, evaluation, and observability. AWS Bedrock provides unified APIs, guardrails, and Agent Core for building and evaluating durable, scalable agents, which Vercel leverages with its AI SDK, AI Gateway, and Workflow Development Kit to deploy production applications like V0 and Vercel Agent.
  • AWS / WHOOP - AWS Support transformed from a reactive model to a proactive, AI-powered system leveraging Amazon Bedrock and Connect, implementing multi-tiered AI, graph-based RAG, and structured SOPs to manage complex cloud workloads. This enabled faster incident response and proactive guidance, driven by comprehensive context and rigorous evaluation. Consequently, customer WHOOP achieved 100% availability during a major product launch, reducing critical case response times from 8 to under 2.5 minutes and improving quarterly availability from 99.85% to 99.95%.
  • Bank CenterCredit - Bank CenterCredit (BCC) deployed generative AI and machine learning workloads using a hybrid multi-cloud architecture centered on AWS Outposts and AWS KMS with External Key Store to meet strict regulatory compliance for data encryption and anonymization. This setup enabled fine-tuning an ASR model, achieving a 23% accuracy improvement by processing sensitive data on-premise and training in the cloud, and a hybrid RAG HR chatbot that handles 70% of requests by keeping the knowledge base on Outposts while leveraging Amazon Bedrock.
  • Bayezian Limited - Bayezian Limited deployed a multi-agent AI system, utilizing custom Python pipelines and FAISS for semantic retrieval, to monitor clinical trial protocol deviations by having specialized agents check rules like visit timing and medication use, aiming to augment human reviewers. While the system improved efficiency and early pattern detection, it faced significant production challenges including inter-agent handover failures, memory lapses regarding contextual rules, and difficulties with real-world data ambiguities. Iterative improvements like structured memory snapshots and explicit handoff signals enhanced its utility, demonstrating agentic AI's value for structured, traceable checks while clarifying its current limitations in complex inference and coordination.
  • Baz - Baz is an AI-powered code review platform that combines Abstract Syntax Trees (ASTs) for structured code traversal with LLMs for semantic understanding. Its core innovation is extensive context gathering from code structure, project details, ticketing systems, and CI/CD logs to provide highly relevant reviews. This enables it to identify complex issues like performance problems or schema changes, even when reviewing AI-generated code.
  • Beams - Beams developed a semantic search system to efficiently analyze massive volumes of unstructured, jargon-filled aviation safety reports. The system uses Azure OpenAI embeddings stored in PostgreSQL with pgvector for similarity search, enhanced by a two-stage retrieval and reranking pipeline for precision (sketched after this list). It also integrates complex structured filtering with semantic search, deployed on AWS Lambda and RDS, to enable rapid identification of trends and risks.
  • BlackRock - BlackRock implemented a modular LLM framework to accelerate custom AI application development for investment operations, reducing development time from 3-8 months to days. This framework provides a Sandbox for domain experts to configure prompts, extraction templates, and LLM strategies, coupled with an App Factory for automated production deployment. It emphasizes human-in-the-loop processes, multi-LLM strategy selection, and robust validation to ensure compliance and quality control in a regulated financial environment.
  • Bloomberg Media - Bloomberg Media developed an AI-driven platform to analyze and leverage its 13 petabytes of video archives, integrating task-specific models, vision language models (VLMs), and multimodal embeddings for comprehensive content understanding. This platform features a federated search with LLM-driven intent analysis, knowledge graphs for contextual relationships, and orchestrated AI agents for automated, platform-specific content assembly. Adopting a "disposable AI strategy," the architecture prioritizes modularity, extensive versioning, and parallel production/non-production tiers to ensure adaptability, continuous improvement, and rapid content distribution.
  • Bonnier News - Bonnier News, a major Swedish publisher, deploys production AI systems for content personalization and journalistic workflows across its 200+ brands. Their core personalization engine uses embedding-based vector similarity and user reading patterns to deliver scalable, white-label recommendations that match human curation. Additionally, they leverage LLMs for features like trigger questions and news aggregation summaries, and are actively researching domain-adapted Swedish Llama models via continued pre-training for internal, sensitive journalistic applications.
  • Booking.com - Booking.com deployed a GenAI agent to streamline partner-guest messaging, automating responses to 250,000 daily inquiries previously handled manually. This agent, built as a Kubernetes microservice using LangGraph and GPT-4 Mini, employs semantic search with MiniLM embeddings and Weaviate for template retrieval, alongside GraphQL for property and reservation data, all protected by PII redaction and topic guardrails. The system autonomously suggests or generates replies, leading to a 70% increase in user satisfaction, reduced follow-up messages, and faster response times for tens of thousands of daily messages.
  • Booking.com - Booking.com implemented a GenAI agent to automate partner-guest messaging, addressing the bottleneck of manual responses that caused delays and potential booking cancellations. This agent, built as a Kubernetes microservice using LangGraph and GPT-4 Mini, leverages semantic search for template retrieval and integrates with property/reservation data, operating behind an internal LLM gateway with PII redaction. Handling tens of thousands of daily interactions, pilot results show a 70% improvement in user satisfaction, reduced follow-up messages, and faster response times for partners.
  • Booking.com - Booking.com implemented an LLM-as-a-judge framework to automate the evaluation of generative AI applications at scale, addressing the impracticality of human evaluation and limitations of traditional metrics for open-ended text generation. This framework uses a powerful LLM to continuously assess the outputs of target LLMs in production, trained on high-quality "golden datasets" created with rigorous human annotation protocols and iterative prompt engineering (a minimal judge is sketched after this list). It enables automated monitoring for critical issues like hallucination and instruction-following, significantly reducing human oversight and operational costs across various LLM-powered use cases.
  • Bosch Engineering / AWS - Bosch Engineering and AWS developed a next-generation AI-powered in-vehicle assistant with a hybrid edge-cloud architecture. This system processes simple, offline queries at the edge, while complex, multi-step requests are routed to the cloud, leveraging Amazon Bedrock and an API steward for external API integration (e.g., booking, diagnostics). It incorporates robust LLMOps for continuous model improvement via edge metric capture and over-the-air updates, deployed on Bosch's Software-Defined Vehicle platform with resilient connectivity management.
  • Box - Box evolved its enterprise document data extraction from simple single-shot LLM prompting to a sophisticated agentic AI architecture to overcome limitations with complex documents, OCR variability, and multilingual needs. Initially, off-the-shelf LLMs showed promise, but struggled with large documents and numerous fields. The agentic system orchestrates multiple AI models and tools in a directed graph, employing multi-step processing, validation, and iterative refinement to achieve high accuracy and reliability for diverse enterprise content.
  • BrainGrid - BrainGrid deployed a multi-tenant Model Context Protocol (MCP) server on serverless platforms, encountering constant re-authentication and high JWT validation latency due to stateless instances. To solve this, they implemented a Redis-based session store with AES-256-GCM encryption and a fast-path/slow-path authentication pattern, caching validated JWTs to persist sessions across ephemeral serverless instances (sketched after this list). This approach significantly reduced authentication overhead, eliminated re-authentication fatigue, and enabled secure, scalable multi-tenant operation for their AI-assisted development tools.
  • Brex - Brex deployed an AI-powered financial assistant using Amazon Bedrock and Claude models to automate corporate expense management, addressing inefficiencies from manual processes and policy compliance. The solution features a custom LLM Gateway for intelligent routing and multi-model orchestration, a dual-layer AI compliance judge, and integrates external data for contextual reasoning. This implementation achieved 75% expense workflow automation, saving hundreds of thousands of hours monthly and improving compliance rates from 70% to over 90%.
  • BT - British Telecom partnered with AWS to deploy agentic AI systems for autonomous network operations across its 5G standalone mobile network, aiming to reduce high operational costs and complexity in managing 20,000 macro sites and 11,000 weekly changes. The solution leverages AWS Bedrock Agent Core, SageMaker for multivariate anomaly detection, Neptune for network topology graphs, and domain-specific community agents to perform root cause analysis and service impact assessment. This initiative focuses on improving service level agreements through faster issue detection, enhancing change efficiency, and enabling proactive network optimization.
  • Bundesliga - Bundesliga deployed Gen AI solutions on AWS to scale content production and enhance fan engagement for over 1 billion global fans. This involved automating match reports and short-form stories, localizing video content with 75% time savings and 3.5x cost reduction, and developing an AI-powered fan companion (MatchMade) using dynamic routing for text-to-SQL queries and proactive nudging. Leveraging Amazon Nova for cost optimization, these systems achieved significant efficiency gains, including 90% time savings for editors and a 35% cost reduction in chatbot services, while serving over 100,000 users in production.
  • Bundesliga - Bundesliga deployed production-scale generative AI and LLMOps solutions to enhance global fan engagement, built on a robust data infrastructure processing real-time match data. Key implementations include an AI-powered live ticker generating multilingual, styled commentary within 7 seconds, an Intelligent Metadata Generation system using multimodal AI to tag 9+ petabytes of archival footage, and automated content localization and mobile story creation from existing articles. These systems significantly increased app usage, content consumption, and processing efficiency for over 1 billion fans.
  • Canada Life - Canada Life transformed its contact center by migrating 21 business units to Amazon Connect in 7 months to address long wait times and poor self-service. This involved implementing AI capabilities like LLM-powered call summarization, automated authentication, and a chatbot, Cali, with an 83% containment rate. The initiative resulted in a 92% reduction in average speed to answer (to 18 seconds), a 10% reduction in average handle time, and $7.5 million in savings in H1 2025.
  • Capital One / RBC / Visa - Financial institutions like Capital One and RBC are deploying agentic AI systems in production for tasks such as automotive purchasing assistance, investment research, and fraud detection. These multi-agent systems, built on fine-tuned open-source models and proprietary data, require 100-200x more compute than single-turn generative AI due to their reasoning and action capabilities. This approach yields significant benefits like 60% faster report generation and 10x more data analysis, necessitating robust AI factory infrastructure for continuous training and high-scale inference.
  • Care Access - Care Access addressed an LLMOps scalability challenge in processing hundreds of medical records daily with LLMs, where repeatedly sending entire static records for multiple analysis questions resulted in high costs and slow processing. They implemented prompt caching on Amazon Bedrock, caching the large medical record content as a static prefix and only processing dynamic analysis questions (sketched after this list). This optimization achieved an 86% reduction in inference costs and 66% faster processing times, enabling efficient and compliant scaling of their health screening program.
  • Caylent - Caylent, a development consultancy, deploys production LLM systems across diverse verticals like environmental management, multimodal video search, and logistics document processing. They leverage AWS Bedrock/SageMaker, PostgreSQL with pgvector, and custom silicon, prioritizing prompt engineering, context optimization, and inference economics. Their pragmatic LLMOps approach emphasizes understanding user context, hybrid search, and avoiding LLMs for simple computational tasks to ensure reliable and cost-effective deployments.
  • CBRE - CBRE implemented an LLMOps solution using Amazon Bedrock to unify fragmented property data across 10 sources, enabling natural language search and a digital assistant within their PULSE system. The architecture leverages Amazon Nova Pro for SQL generation and Claude Haiku for RAG-based document interaction, employing sophisticated prompt engineering, two-stage tool selection, and multi-layered security with Redis for granular access control. This resulted in a 67% reduction in SQL query generation time, 80% database query performance improvement, 60% token usage reduction, and 95% search accuracy.
  • CDL - CDL deployed production AI agents on Amazon Bedrock for insurance policy management, utilizing a supervisor-agent architecture with specialized domain agents and an anti-corruption layer to integrate with existing APIs. The system employs dynamic prompt engineering with OpenAPI specifications for tool use, rigorous continuous model evaluation via LLM-as-a-judge and human experts, and Bedrock Guardrails for safety, including topic-based filtering and comprehensive logging.
  • ChromaDB - ChromaDB's technical report evaluated 18 LLMs, including GPT-4.1 and Claude 4, for performance degradation with increasing input token counts, challenging the assumption of uniform context processing. The study found significant reliability decreases with longer inputs, even for simple retrieval and replication tasks, due to factors like needle-question similarity, distractors, and surprising impacts of haystack structure. This research highlights the need for careful context engineering and improved benchmarks to address real-world LLM limitations in long-context production deployments.
  • Circle - Circle developed an experimental AI-powered escrow agent system integrating OpenAI's multimodal models (GPT-4 Vision) with its USDC stablecoin and smart contract infrastructure on Ethereum-compatible networks like Base. This system uses AI to parse PDF contracts, extract structured terms, and programmatically deploy escrow smart contracts. It also employs AI for work verification via image analysis, enabling automated, near-instant settlement of funds with human oversight.
  • Cires21 - Cires21 developed MediaCoPilot, a unified AI-powered video workflow orchestration platform on AWS, to address broadcasters' fragmented application ecosystems causing slow content delivery and high costs. The serverless platform integrates custom AI models on SageMaker for audio/video processing (ASR, scene detection) with Amazon Bedrock for generating higher-level metadata like summaries and subtitles, orchestrated by AWS Step Functions. It also incorporates AWS Agent Core-powered AI agents to automate complex multi-step tasks, exposing capabilities via an API-first strategy for seamless client integration.
  • Cisco - Cisco's Outshift developed a multi-agent AI system to automate and improve network change validation, addressing high failure rates in production environments. This system uses specialized AI agents with ReAct loops, a knowledge graph-based digital twin built on OpenConfig schema, and a natural language interface, integrating with ITSM tools for automated impact assessment and test plan generation. Performance was significantly optimized by fine-tuning the query agent for direct knowledge graph interaction, reducing token consumption and response times compared to RAG.
  • Clario - Clario developed an AI-powered system to automate the review of Clinical Outcome Assessment (COA) interviews in clinical trials, addressing challenges of manual, time-consuming, and variable human assessment. The solution leverages AWS services for secure data ingestion, custom speaker diarization, multi-lingual transcription (Whisper), semantic retrieval (OpenSearch with Titan embeddings), and a graph-based agentic AI (Claude 3.7 Sonnet) to systematically evaluate interviews against standardized criteria. This architecture aims to reduce manual review effort by over 90%, achieve 100% data coverage, and decrease turnaround time from weeks to hours, while ensuring regulatory compliance and improving data quality.
  • Clario - Clario automated its manual, error-prone clinical trial software configuration process, which involved extracting data from PDF forms and integrating with study databases. They developed the Genie AI Service, leveraging Anthropic's Claude 3.7 Sonnet via Amazon Bedrock and orchestrated on Amazon ECS, to extract structured data from transmittal forms and generate standardized XML configurations. This LLM-powered system, incorporating a human-in-the-loop validation workflow, significantly reduced configuration execution time, improved data quality, and minimized transcription errors in a highly regulated healthcare environment.
  • Clay - Clay is an AI-powered sales intelligence platform leveraging LLM agents like Claygent and Navigator to perform on-the-fly research, extracting unique, custom data points from unstructured web sources for go-to-market strategies. Operating at over one billion agent runs annually, its serverless AWS Lambda architecture provides resilient and scalable execution for these agents, addressing challenges like rate limiting and failure recovery. The platform's design embraces data imperfection, offering tools for iterative validation, transparency via session replay, and transforming raw web data into actionable, structured insights to achieve "go-to-market alpha."
  • Cleric - Cleric developed an AI agent to automate root cause analysis for production alerts by analyzing observability data, logs, and metrics across cloud infrastructure. This agent operates asynchronously, planning and executing diagnostic tasks via API calls to various tools, then iteratively reasoning to distill findings into actionable root causes. Key challenges include establishing ground truth for evaluation, developing robust simulation environments for testing, and managing the inherent complexity of distributed production data and model reliability.
  • CloudQuery - CloudQuery built a Go-based Model Context Protocol (MCP) server to enable LLMs like Claude to query their cloud infrastructure database, encountering challenges with tool selection, context window limits, and non-determinism. They addressed this by rewriting tool descriptions to be verbose and domain-specific, embedding multi-tool workflows, renaming tools for semantic clarity, and implementing schema filtering in the MCP server to reduce token usage by 90% (sketched after this list). These technical adjustments enabled the LLM to reliably discover and execute tools, transforming its behavior from hallucinating queries to systematically following a data discovery-to-execution pipeline.
  • Cognee - Cognee developed an AI memory layer combining knowledge graphs (Kuzu) with LanceDB, a file-based vector database, to solve the "isolation problem" in LLMOps. This architecture enables per-workspace vector store isolation, simplifying parallel development and CI/CD by treating each instance as a separate directory, while co-locating data and embeddings to prevent synchronization issues. The system uses an Extract-Cognify-Load pipeline with incremental updates and hybrid graph/vector retrieval to enhance multi-hop reasoning and provide a consistent local-to-production workflow.
  • Cognition - Cognition's autonomous AI software engineer, Devin, addresses LLM limitations in large codebases by using DeepWiki, a knowledge graph that extracts and maps codebase concepts from code and metadata. It further employs Devin Search for grounded codebase research and custom post-training via multi-turn reinforcement learning with automatic verification. This approach enabled Kevin 32B, a specialized model, to achieve 91% correctness on CUDA kernel generation, outperforming larger frontier models, and allows Devin to autonomously handle tasks from ticket to pull request.
  • Coinbase - Coinbase scaled customer support, compliance, and developer productivity by implementing a Gen AI platform on AWS Bedrock, standardizing LLM access via OpenAI API and data access with Model Context Protocol. This platform deployed multi-layered agentic chatbots for customer support, an AI-powered Compliance Assist tool for investigations, and developer tools for code generation, PR review, and UI testing. These solutions led to 65% customer contact automation, 40% AI-influenced code, and significant annual savings in employee hours and operational costs, alongside improved resolution times and quality.
  • Commonwealth Bank of Australia - Commonwealth Bank of Australia (CBA) developed "Lumos," a multi-agent AI platform, to accelerate the migration and modernization of legacy Windows 2012 applications to cloud-native architectures, addressing bottlenecks like poor documentation and slow manual processes. Leveraging AWS Bedrock, OpenSearch, and agent orchestration frameworks, Lumos automates application analysis, code transformation (using a hybrid AI/deterministic approach), AI-driven UI test generation, and deployment, integrating with CBA's existing DevOps platform. This system increased modernization velocity by 2-3x, enabling 20-30 applications per quarter while ensuring quality, security, and compliance through confidence scoring and human-in-the-loop validation.
  • Commonwealth Bank of Australia - Commonwealth Bank of Australia (CBA) migrated 61,000 on-premise data pipelines (10 petabytes) to an AWS-based data mesh to modernize its infrastructure for AI/ML workloads. This large-scale migration was accelerated by AI and generative AI systems that automated legacy code transformation, performed error checking, and ensured 100% data accuracy through 229,000 output tests. The new CommBank.data platform provides a federated architecture with self-service data access for 40 business units, enabling scalable AI-driven innovation under strict governance.
  • Commonwealth Bank of Australia - Commonwealth Bank of Australia (CommBank) faced challenges scaling AWS Well-Architected Reviews due to their time-intensive nature and reliance on numerous subject matter experts. To address this, CommBank partnered with AWS to develop the GenAI-powered "Well-Architected Infrastructure Analyzer." This solution leverages AWS Bedrock to analyze infrastructure-as-code (CloudFormation, Terraform), architectural diagrams, and organizational documentation, automatically mapping resources against Well-Architected best practices to generate comprehensive reports and recommendations. This automation significantly reduces the time and expertise required, enabling continuous architectural assessment across all workloads and fostering proactive improvement.
  • Condé Nast - Condé Nast automated its manual, error-prone contract processing with an AWS-based multi-stage LLM pipeline. This system leverages Amazon Bedrock with Claude 3.7 Sonnet for PDF-to-text conversion via visual reasoning, metadata extraction through structured prompting, and template matching using RAG against an OpenSearch vector store. The solution reduced contract processing time from weeks to hours, improved rights analysis accuracy, and enabled domain experts to drive AI application development.
  • Control Plain - Control Plain addressed AI agent unreliability in production by implementing "intentional prompt injection," a dynamic technique that avoids unwieldy "franken-prompts" by semantically matching user input to a database of policy rules. At runtime, relevant rules are injected directly into the user message, leveraging the LLM's recency bias for contextual guidance (sketched after this list). This approach significantly improved an airline support agent's success rate from 80% to 100% on complex tasks, ensuring reliable and maintainable agent behavior.
  • ConverseNow - A multi-company panel discussed production LLM deployment strategies, highlighting ConverseNow's use of extensively fine-tuned small language models for high-accuracy, real-time voice AI in restaurants, as general-purpose models proved insufficient. The discussion emphasized that fine-tuned small models (1-12B parameters) can outperform larger ones in specific domains, driven by advancements in open-source models, sophisticated fine-tuning techniques, and optimizations like Mamba architectures and FP4 quantization. Key infrastructure considerations include transitioning from token-based pricing to TCO for high-volume applications and the convergence of training and inference systems for efficient scaling.
  • Cosine - Cosine addressed the challenge of deploying high-performance coding agents in resource-constrained, regulated enterprise environments by developing a multi-agent LLM architecture. This system leverages specialized orchestrator and worker models, optimized through techniques like model distillation, supervised fine-tuning (SFT), preference optimization, and reinforcement fine-tuning (RFT), including a multi-LoRA approach for GPU footprint reduction. The solution achieved a 31% performance increase on SWE-bench Freelancer, 3X latency improvement, 60% GPU footprint reduction, and 20% fewer errors, enabling deployment on minimal hardware while outperforming larger frontier models.
  • Coveo - Coveo developed an enterprise RAG system integrating its AI-Relevance Platform with Amazon Bedrock Agents to ensure accurate, permission-aware LLM responses by grounding them in enterprise knowledge. This system leverages Coveo's Passage Retrieval API, employing a two-stage hybrid search (semantic and lexical) with machine learning for precise passage extraction and relevance optimization from a unified index. The architecture uses AWS Lambda for scalable API bridging, CloudFormation for IaC, and implements early-binding permission management to enforce data access controls at crawl time.
  • Cox Automotive - Cox Automotive scaled AI agents to production for autonomous customer service, addressing the challenge of after-hours lead response and complex operational requirements. They rapidly deployed multiple agentic systems in five weeks using Amazon Bedrock Agent Core and the Strands framework, implementing multi-agent orchestration, comprehensive red teaming, two-tier guardrails, LLM-as-judge evaluation, and circuit breakers. This enabled autonomous customer interactions, with three products reaching production beta and positive dealer feedback on timely customer responses.
  • Cresta - Cresta, founded by Stanford AI Lab PhDs with OpenAI experience, developed an AI copilot for contact center agents, providing real-time suggestions to improve performance. Their technical journey evolved from custom LSTMs and transformers to fine-tuning large foundation models like GPT-3/4 on domain-specific conversation data, addressing MLOps challenges for enterprise-scale production. This system demonstrated significant ROI for Fortune 500 clients through rigorous A/B testing, proving the value of augmenting human agents with AI.
  • CrowdStrike - CrowdStrike's Charlotte AI is an agentic AI system integrated into their Falcon platform, designed to automate cloud security detection, investigation, and response workflows. It ingests multi-layered security data from cloud control planes, workloads, and Kubernetes to perform automated alert triage, correlate events, and generate detailed incident reports with actionable recommendations. This system significantly reduces manual effort for security operations by providing rapid, context-rich analysis and response to complex cloud threats.
  • cubic - cubic's AI code review agent initially generated excessive false positives, eroding user trust due to low-value comments. To address this, they implemented architectural changes including requiring explicit reasoning logs, streamlining the agent's tooling to essential components, and transitioning to specialized micro-agents. These refinements led to a 51% reduction in false positives without sacrificing recall, significantly improving the agent's precision and utility in production.
  • Cursor - Cursor optimized its agent harness for OpenAI's GPT-5.1-Codex-Max model by adapting prompt engineering, aligning tool naming with shell commands, and carefully managing reasoning summaries. A critical finding was that preserving reasoning traces, the model's internal chain-of-thought, was essential, as their omission led to a 30% performance drop on coding tasks. This highlights the necessity of model-specific prompt tuning and state management for production LLM agents.
  • Cursor - Cursor developed Composer, a specialized coding agent model, to overcome the "airplane Wi-Fi problem" of slow existing agents by balancing near-frontier intelligence with 4x faster token generation. This was achieved through extensive reinforcement learning (RL) training in a production-matched environment, leveraging custom kernels for speed, parallel tool calling, and semantic search with custom embeddings. The result is a model that enables developers to stay in flow state with synchronous, fast interactions for real-world software engineering tasks.
  • Cursor - Cursor's case study on scaling AI-assisted coding reveals that effective LLM integration requires developers to master new skills like precise task decomposition and rigorous context management, including starting fresh chat windows for new tasks to prevent performance degradation. Successful enterprise deployments leverage semantic search with concrete code examples for brownfield development, implement deterministic hooks for non-negotiable rules, and continuously optimize agent harnesses, which significantly boosts performance. Ultimately, developers must maintain full responsibility for all AI-generated code, using LLMs as learning tools rather than abdicating strategic decision-making.
  • Cursor - Cursor, an AI-native code editor, successfully competed in a crowded market by forking VS Code for deep AI integration and developing "Cursor Tab," a next-line code completion feature that delivered immediate, intuitive value. They built custom, fine-tuned models optimized for speed and user experience, rather than solely relying on commercial APIs, enabling rapid iteration through intensive dogfooding and a product-led growth strategy. This approach allowed them to outmaneuver larger competitors by focusing on near-term developer needs and efficient LLMOps.
  • Cursor - Cursor developed Composer, an agent-based LLM for coding, to achieve both high intelligence and four times faster token generation than comparable models, addressing the challenge of maintaining developer flow with interactive AI. They trained a mixture-of-experts model using reinforcement learning on a custom LLMOps infrastructure, integrating low-precision MXFP8 kernels for speed and microVM-based production environments for realistic rollouts. This enabled Composer to learn efficient parallel tool calling and deliver rapid, accurate code modifications, demonstrating the scalability of RL for specialized coding tasks.
  • Cursor - Cursor enhanced their AI coding agent by developing a custom semantic search system to improve code retrieval in large codebases, addressing the limitations of traditional regex-based search. This involved training a specialized embedding model using production agent session traces and integrating it with fast indexing pipelines, complementing existing grep functionality. The solution yielded a 12.5% average increase in offline question-answering accuracy, a 0.3% rise in code retention (2.6% for large codebases), and a 2.2% reduction in dissatisfied user requests in A/B tests.
  • Cursor - Cursor deployed Cursor Tab, an LLM-based code completion system handling over 400 million daily requests, to address noisy suggestions and improve user experience. They implemented online reinforcement learning with policy gradient methods, using real-time user acceptance/rejection as a reward signal to teach the model when to show suggestions. This required sophisticated infrastructure for rapid model deployment and on-policy data collection, resulting in 21% fewer suggestions and a 28% higher accept rate.
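The online-RL mechanic is compact enough to ground in code. A minimal sketch, assuming a tiny logistic policy over invented features and hand-picked reward values (none of this is Cursor's actual implementation):

```python
import math
import random

weights = [0.0, 0.0]   # linear policy over two illustrative features
LEARNING_RATE = 0.05

def show_probability(features):
    score = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

def update(features, accepted: bool):
    """REINFORCE-style update from one logged accept/reject signal."""
    reward = 1.0 if accepted else -0.4          # illustrative reward shaping
    p = show_probability(features)
    # for a logistic policy, grad of log pi(show) w.r.t. w is (1 - p) * x
    for i, x in enumerate(features):
        weights[i] += LEARNING_RATE * reward * (1.0 - p) * x

# one simulated interaction: [model confidence, length penalty]
feats = [0.9, -0.2]
if random.random() < show_probability(feats):   # policy decides to show
    update(feats, accepted=True)                # user accepted -> reinforce
```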
  • Databook - Databook tackled "tool surface pollution" in enterprise agentic AI systems, where LLMs exposed to full APIs via MCP suffered from excessive irrelevant data, increased costs, and reduced reliability due to "choice entropy." Their solution, "tool masking," introduces an intermediate configuration layer that filters and reshapes tool input/output schemas, customizes tool interfaces for specific agents, and enables prompt engineering of the tools themselves. This approach yields more reliable, cost-effective, and faster agents with enhanced self-correction and agile adaptation, crucial for production-scale enterprise workflows.
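Tool masking is easiest to picture as a configuration layer sitting between the raw API schema and what each agent sees. A minimal sketch with invented tool and field names:

```python
FULL_TOOLS = {
    "search_accounts": {
        "description": "Search accounts by any of 40+ fields.",
        "parameters": {
            "name": {"type": "string"},
            "industry": {"type": "string"},
            "revenue_min": {"type": "number"},
            "internal_sync_id": {"type": "string"},  # never useful to an agent
        },
    },
}

MASK = {  # the intermediate, per-agent configuration layer
    "sales_agent": {
        "search_accounts": {
            "allowed_params": ["name", "industry"],
            "description_override": "Look up a customer account by name or industry.",
        },
    },
}

def masked_tools(agent: str) -> dict:
    """Filter and reshape tool schemas before exposing them to one agent."""
    out = {}
    for tool, cfg in MASK.get(agent, {}).items():
        spec = FULL_TOOLS[tool]
        out[tool] = {
            # "prompt engineering the tool itself": the description is rewritten
            "description": cfg.get("description_override", spec["description"]),
            "parameters": {k: v for k, v in spec["parameters"].items()
                           if k in cfg["allowed_params"]},
        }
    return out

print(masked_tools("sales_agent"))  # the agent never sees the other fields
```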
  • Databricks - Databricks developed an AI-powered agentic platform to streamline debugging of thousands of MySQL OLTP instances across multi-cloud environments, addressing fragmented tooling and context-gathering overhead. This platform unifies metrics, logs, and operational workflows via an interactive chat assistant, leveraging a central-first sharded architecture and a rapid iteration framework for agent development. The solution significantly reduced debugging time by up to 90% and enabled new engineers to quickly initiate investigations, demonstrating measurable impact on operational efficiency.
  • Daytona - Daytona developed an "agent-native runtime" infrastructure specifically for autonomous AI agents, recognizing that traditional human-centric development tools fail without human intervention. This system provides secure, elastic sandboxes that spin up in 27 milliseconds, featuring an API-first design for programmatic control, declarative image building, shared volumes for data, and parallel execution capabilities. These technical features enable agents to autonomously manage their environments and execute tasks efficiently without human oversight.
  • DeLaval / Arelion - Kolomolo developed two production-grade LLM solutions: Unity Ops, a multi-agent system for DeLaval, automates incident response and root cause analysis for dairy farm equipment, leveraging RAG and serverless AWS architecture to reduce SRE reactive work by 80%. For Arelion, a hybrid ML/LLM system classifies and extracts critical information from vendor maintenance emails, achieving 97% accuracy and an 80% reduction in manual workload by combining XGBoost for classification with Bedrock for entity extraction and automated prompt optimization. Both implementations prioritize cost efficiency through strategic model selection and intelligent routing, demonstrating scalable LLMOps for complex operational challenges.
  • Delivery Hero - Delivery Hero developed an AI-powered image generation system to address low conversion rates caused by 86% of food products lacking images. The system utilized self-hosted Stable Diffusion models for text-to-image and inpainting, featuring extensive MLOps, automated quality evaluation, and significant model optimization that reduced generation costs to under $0.003 per image. This solution generated over 1 million images, leading to a 6-8% improvement in conversion rates for products with AI-generated visuals.
  • Deloitte - Deloitte developed an AI-augmented cybersecurity triage system using AWS's Graph RAG Toolkit to manage the overwhelming volume of cloud security alerts. This "AI for Triage" solution leverages hierarchical lexical graphs for long-term organizational memory and document graphs for short-term operational data, employing a hybrid RAG approach with entity network contexts to provide nuanced insights. The system reduced 50,000 security issues to approximately 1,300 actionable items, generating structured triage records and automation recipes while maintaining human oversight and accountability.
  • Delphi / Seam AI / APIsec - Three AI-native companies (Delphi, Seam AI, APIsec) have evolved their production LLM deployments over three years, transitioning from single-shot prompting to fully agentic systems. They leverage serverless infrastructure, Pydantic AI, and Pinecone, balancing deterministic state machines for high-confidence tasks with autonomous model reasoning. This involves managing massive token consumption, prioritizing product-market fit over immediate cost optimization, and measuring ROI via outcome-based metrics, generally avoiding fine-tuning in favor of base model improvements.
  • DevCycle - DevCycle built a production-ready MCP server to enable AI agents to manage feature flags via natural language directly within developer workflows, eliminating context switching. Key technical insights included designing input schemas with explicit descriptions for agent context, implementing descriptive error handling for agent self-correction, and consolidating tool calls for LLM efficiency. This integration resulted in a 3x increase in SDK installation during onboarding by facilitating in-editor feature flag creation and management.
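Those schema-design lessons translate directly into MCP tool code. A minimal sketch using the official MCP Python SDK's FastMCP helper (DevCycle's server is not necessarily Python, and the tool and field names here are invented):

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("feature-flags")

FLAGS: dict[str, bool] = {}  # stand-in for the real feature-flag backend

@mcp.tool()
def set_feature_flag(key: str, enabled: bool) -> str:
    """Create or update a feature flag.

    key: kebab-case flag identifier, e.g. 'new-checkout-flow'.
    enabled: desired state after the call.
    """
    if not key or " " in key:
        # descriptive errors give the agent enough context to self-correct
        return "Error: 'key' must be a non-empty kebab-case string like 'new-checkout-flow'."
    FLAGS[key] = enabled
    return f"Flag '{key}' is now {'on' if enabled else 'off'}."

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```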
  • Digital asset market makers - An agentic LLM-based platform was developed for digital asset market makers to rapidly analyze streaming news and social media, enabling sub-10-second response times for risk management. It leverages fine-tuned BGE-M3 embeddings for efficient deduplication and reasoning models like DeepSeek for sentiment and impact classification. Critical inference optimization, progressing from SageMaker JumpStart to vLLM and finally SGLang, achieved 180 output tokens per second, making real-time analysis viable.
  • Digits - Digits, an accounting automation company, deployed production-scale LLM agents using Kotlin and Golang to automate workflows like vendor hydration and client onboarding. Their robust architecture incorporates LLM proxies for failover, sophisticated memory services, OpenTelemetry-compatible observability, and guardrails with separate models for generation and evaluation. This system achieves a 96% acceptance rate on classification tasks, demonstrating reliable and auditable automation for financial operations.
  • DoorDash - DoorDash implemented an LLM-powered voice assistant to replace DTMF IVR for restaurant hour verification, using a factory pattern for backward compatibility and prompt engineering to extract structured data from natural language conversations. Concurrently, they developed an LLM-based personalized alcohol recommendation system that generates item suggestions and dynamic carousel titles via a two-stage LLM pipeline. Both systems integrated into existing DoorDash infrastructure, emphasizing abstraction, incremental rollout, and structured data extraction for production deployment.
  • DoorDash - DoorDash addressed behavioral silos in multi-vertical recommendations by implementing an LLM-powered hierarchical RAG (H-RAG) pipeline. This system translates user behavior from data-rich verticals into cross-vertical semantic affinity features, which are then integrated into their multi-task ranking models. The approach delivered approximately 4-5% relative improvements in AUC-ROC and MRR both offline and online, particularly benefiting cold-start users, while maintaining cost efficiency through model selection and prompt optimizations.
  • DoorDash - DoorDash implemented a GenAI system to replace its limited, heuristic-based homepage carousels with personalized content for millions of users. This system leverages LLMs to generate unique carousel titles and rich, user-specific metadata, which then drives embedding-based retrieval (exact KNN on GPU) to populate carousels with relevant stores and dishes, integrated via a blocked re-ranking approach. A/B tests demonstrated double-digit improvements in click rates, increased conversion, and enhanced merchant discovery, improving user engagement and platform health.
  • DoorDash - DoorDash developed an internal agentic AI platform to consolidate fragmented enterprise knowledge across diverse sources like wikis and databases, progressing from deterministic workflows to hierarchical multi-agent systems. The platform employs hybrid search with RRF re-ranking, schema-aware SQL generation using pre-cached examples, and zero-data statistical query validation for accuracy and trust. Integrated into existing tools like Slack and Cursor, it enables users to access complex data and automate tasks directly within their workflows, supported by LLM-as-judge evaluation and robust guardrails.
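Of the techniques named above, RRF re-ranking is the most self-contained: each retriever contributes 1/(k + rank) per document, and the fused scores decide the final order. A minimal sketch with the conventional k = 60 and invented document IDs:

```python
def rrf(result_lists, k: int = 60):
    """Reciprocal rank fusion over several ranked lists of document IDs."""
    scores = {}
    for ranked in result_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["doc_a", "doc_b", "doc_c"]   # e.g., keyword search results
semantic = ["doc_c", "doc_a", "doc_d"]  # e.g., vector search results
print(rrf([lexical, semantic]))         # doc_a and doc_c rise to the top
```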
  • DoorDash - DoorDash developed a hybrid GenAI engine to solve context loss in item search, where nuanced user queries resulted in generic recommendations. This system combines FAISS-based embedding retrieval for rapid candidate generation with LLM-based reranking and dynamic carousel generation to provide personalized, context-aware recommendations. The approach achieved approximately six-second end-to-end latency and improved user satisfaction by balancing speed, cost, and personalization, proving more practical than pure LLM or deep neural network methods.
  • DoorDash - DoorDash's SafeChat platform utilizes AI and LLMs for real-time content moderation of millions of daily text messages, images, and voice calls between users. Its multi-layered architecture evolved to combine an efficient internal model with a precise external LLM, handling 99.8% of content with low latency and cost, while routing only 0.2% to more expensive models. This system has achieved a 50% reduction in low to medium-severity safety incidents, demonstrating effective scaling and cost optimization.
  • DoorDash - DoorDash developed a hybrid LLM-assisted personalization framework for multi-vertical retail discovery, combining traditional machine learning for scalable retrieval and ranking with LLMs for semantic understanding, collection generation, query rewriting, and knowledge graph augmentation across a vast product catalog. This framework strategically deploys LLMs at specific points in the pipeline to enhance user experience across familiarity, affordability, and novelty dimensions, while addressing scaling challenges through Hierarchical RAG and Semantic IDs for efficient, context-aware recommendations. The system is deployed across various discovery surfaces to provide personalized product suggestions.
  • DoorDash - DoorDash tackled the cold start problem for grocery recommendations by developing an LLM-based system that infers customer preferences from their restaurant order history. This solution leverages LLMs' semantic understanding and world knowledge to translate implicit culinary tastes and dietary patterns from restaurant orders into explicit, personalized grocery item suggestions. The system combines statistical analysis with LLM inference within a scalable, evaluation-driven pipeline to deliver relevant cross-domain recommendations from the first interaction.
  • DoorDash - DoorDash implemented an LLM-powered feature extraction system for never-delivered orders, utilizing a fine-tuned DistilBERT model that achieved superior F1 (0.8289) and lower latency compared to Llama 3 for binary classification of customer-Dasher conversations. Additionally, they developed a scalable RAG chatbot-as-a-service infrastructure, providing a Knowledge Base Management Service for automated embedding generation and a unified API for deploying customizable, knowledge-based chatbots with isolated collections and model migration capabilities.
  • DoorDash - DoorDash developed an LLMOps pipeline to automatically enhance its customer support knowledge base by identifying content gaps from escalated chat transcripts. This system uses semantic clustering to group similar issues, then employs LLMs to classify these clusters and generate draft knowledge base articles from agent resolutions, which are subsequently reviewed by human specialists. Deployed via RAG, this approach significantly reduced escalation rates for high-traffic clusters from 78% to 43%, demonstrating a mature human-in-the-loop LLM system.
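The clustering step can be sketched with off-the-shelf tooling. Assuming a recent scikit-learn, with TF-IDF standing in for whatever embedding model DoorDash actually uses and an illustrative distance threshold:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

transcripts = [
    "dasher never arrived and order shows delivered",
    "order marked delivered but nothing arrived",
    "promo code not applied at checkout",
    "discount code did not reduce my total",
]

vectors = TfidfVectorizer().fit_transform(transcripts).toarray()
labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=1.0,  # threshold chosen for illustration
    metric="cosine", linkage="average",
).fit_predict(vectors)

for label, text in zip(labels, transcripts):
    print(label, text)  # each cluster then feeds one LLM classification pass
```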
  • DoorDash - DoorDash transitioned from opaque numerical embeddings to LLM-generated natural language profiles for consumers, merchants, and food items to enhance personalization and explainability. This system synthesizes structured data like order history and reviews via engineered prompts, creating human-readable descriptions that enable transparent recommendations, editable preferences, and richer input for downstream ML models. The production deployment leverages a "code for facts, LLM for narrative" principle, requiring robust data preparation, prompt engineering, and continuous evaluation to manage consistency, cost, and latency at scale.
  • DoorDash - DoorDash addressed challenges in scaling personalization and product catalog management for 100M+ SKUs across diverse retail verticals, particularly for cold-start scenarios, by integrating LLMs into their ML infrastructure. They employed fine-tuned LLMs for attribute extraction, RAG systems for categorization and hierarchical personalization, and LLM agents for ambiguous data, optimizing inference with model cascading and distillation. This approach automated product knowledge graph construction, enabled contextual personalization without historical data, and improved efficiency and accuracy while laying the groundwork for future agentic shopping experiences.
  • Dovetail - Dovetail, a customer intelligence platform, developed an MCP (Model Context Protocol) server to securely integrate its proprietary customer feedback data with external AI agents and tools. This server, built on JSON-RPC, exposes Dovetail's data as Resources, actions as Tools, and templates as Prompts, enabling AI-driven content generation and faster decision-making for product, success, and design teams. The solution addresses the challenge of connecting domain-specific data to enterprise AI workflows, positioning Dovetail as an AI-native data provider.
  • Dropbox - Dropbox's Dash AI, evolving into an agentic system, experienced "analysis paralysis" and performance degradation when given too many tool options and excessive context, leading to high token consumption and reduced accuracy. To address this, they implemented "context engineering" by consolidating all retrieval tools into a single universal search index, filtering context for relevance using a knowledge graph, and employing specialized agents for complex subtasks like query construction. These strategies improved decision-making efficiency, reduced token usage, and maintained the model's focus on the primary task.
  • Dropbox - Dropbox implemented a systematic, evaluation-first methodology for its conversational AI, Dropbox Dash, to combat unpredictable regressions in its complex multi-stage LLM pipeline. This involved curating diverse datasets, pioneering LLM-as-judge for actionable metrics beyond traditional NLP scores, and integrating structured metric enforcement with an evaluation platform. Automated evaluation was implemented across the entire development-to-production pipeline, using layered gates and continuous live-traffic scoring to ensure reliability and drive continuous improvement.
  • DTDC - DTDC replaced its rigid logistics agent with DIVA 2.0, a conversational AI agent built on Amazon Bedrock, to manage over 400,000 monthly customer queries. This solution leverages Amazon Bedrock Agents orchestrating Anthropic's Claude 3.0, RAG with OpenSearch-backed knowledge bases, and AWS Lambda for real-time API integrations with backend systems like tracking and pricing. The system achieved 93% response accuracy and reduced customer support workload by 51.4%, improving efficiency and user experience.
  • Duolingo - Duolingo developed an internal LLMOps platform to scale AI-assisted code changes beyond individual developer tools, enabling any employee to create and deploy AI coding agents without custom code. This platform uses JSON forms for workflow definition, a unified CodingAgent library to abstract LLM providers like Codex and Claude, and Temporal for robust orchestration of tasks like cloning repositories, making changes, and opening pull requests. It facilitates rapid deployment of agents for routine engineering tasks such as managing feature flags and infrastructure changes, allowing engineers to focus on higher-value work.
  • Duolingo - Duolingo implemented an AI agent to automate the removal of obsolete feature flags from their Python and Kotlin codebases, addressing technical debt. This agent leverages the Codex CLI, orchestrated by Temporal workflows, allowing engineers to trigger automated code modifications via a self-service UI. The system clones repositories, uses AI to identify and remove flags, and automatically creates pull requests; built in about one week, it establishes a foundational pattern for future autonomous coding agents.
  • Dust - This case study from Dust, based on experience with over 1,000 companies, argues that enterprises should generally buy rather than build AI agent infrastructure. Building custom solutions often leads to underestimated technical complexity, significant ongoing maintenance, and time-to-value that runs 6-12 months longer than planned, diverting engineering resources from the core business. In contrast, buying a platform enables rapid deployment (e.g., functional agents in 20 minutes, 70-95% adoption in 2-3 months) with enterprise-grade security, allowing companies to focus on strategic differentiation.
  • Dust.tt - Dust.tt developed synthetic filesystems, mapping disparate enterprise data sources like Notion and Slack into Unix-inspired hierarchies, after observing their AI agents spontaneously attempting structural navigation using filesystem-like syntax. This system provides agents with commands such as list, find, and cat (with context window management) to both structurally explore and semantically search information. This enables complex, multi-step investigative workflows, transforming agents into knowledge workers capable of contextual understanding beyond isolated semantic search.
  • Dust.tt - Dust.tt developed a distributed agent systems architecture to support complex, long-running AI agents, moving beyond traditional synchronous, stateless web architectures. Their solution employs a database-driven communication protocol between agent components, a versioning system across PostgreSQL tables for idempotency and state persistence, and Temporal for durable workflow orchestration. This enables reliable, scalable, and fault-tolerant deployment of AI agents capable of multi-step tasks while surviving failures and preventing duplicate actions.
  • eBay - eBay's Mercury is an agentic AI platform designed for deploying LLM-powered recommendation systems at industrial scale across its two billion active listings. It uses a modular agent framework, integrates Retrieval-Augmented Generation (RAG) for real-time data, and features a custom Listing Matching Engine that maps LLM text outputs to live inventory via hybrid retrieval methods including semantic search. The platform employs a near-real-time, distributed queue-based execution system for efficient resource management and adheres to rigorous engineering practices for prompt management and deployment.
  • Electrolux - Electrolux developed "Infra Assistant," a multi-agent AI system using Amazon Bedrock to augment their SRE team, addressing bottlenecks in developer support and infrastructure operations. This system evolved from RAG-based knowledge agents to a supervisor-orchestrated architecture integrating specialized agents for API calls, custom actions (e.g., AWS CLI), and onboarding automation, dynamically configured via inline agents and Model Context Protocol. It successfully answers context-specific questions, executes operations, and troubleshoots cloud issues, improving efficiency despite challenges with latency and accuracy.
  • ElevenLabs - ElevenLabs optimized their production RAG system, which ran on every query, by addressing a bottleneck in query rewriting that caused over 80% of RAG latency due to a single external LLM. They implemented a model racing architecture, sending queries to multiple models in parallel, including self-hosted Qwen3-4B and Qwen3-30B-A3B, and using the first valid response. This strategy reduced median RAG latency by 50% (from 326ms to 155ms) and enhanced system resilience through redundancy.
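The racing pattern itself is a few lines of asyncio. A minimal sketch with simulated model latencies and a placeholder validity check (the model names follow the summary; everything else is invented):

```python
import asyncio
import random

async def call_model(name: str, query: str) -> str:
    await asyncio.sleep(random.uniform(0.05, 0.4))  # simulated latency
    return f"{name}: rewritten({query})"

def is_valid(response: str) -> bool:
    # placeholder for real validation (schema checks, non-empty rewrite, ...)
    return "rewritten(" in response

async def race_rewrite(query: str) -> str:
    tasks = [asyncio.create_task(call_model(m, query))
             for m in ("qwen3-4b", "qwen3-30b-a3b", "external-llm")]
    try:
        for fut in asyncio.as_completed(tasks):
            result = await fut
            if is_valid(result):
                return result        # first valid response wins
        raise RuntimeError("no model produced a valid rewrite")
    finally:
        for t in tasks:              # losers and stragglers are cancelled
            t.cancel()

print(asyncio.run(race_rewrite("past invoices for acme")))
```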
  • Elyos AI - Elyos AI developed specialized voice AI agents for home services to automate 24/7 customer interactions like emergency booking and payments, addressing the challenge of achieving human-like conversation latency and reliability. They achieved this by employing a cascade architecture and sophisticated orchestration focused on latency optimization (e.g., regional clustering, dynamic LLM routing), just-in-time context engineering, state machine-based workflows, and real-time parallel monitoring. This approach resulted in approximately 85% call automation, with human involvement for complex or high-value scenarios.
  • eSpark - eSpark developed an AI-powered teacher assistant to align its K-5 adaptive learning content with mandated core curricula, addressing a post-COVID shift in which administrators prioritized core textbooks. This RAG-based system uses Pinecone for semantic search of eSpark activities (enhanced by LLM-generated metadata) and direct retrieval for curriculum text, enabling teachers to quickly find relevant supplemental materials. The solution, built with Braintrust for LLMOps, evolved from an open conversational interface to a structured workflow with AI-generated follow-up questions, significantly improving teacher efficiency and eSpark's relevance.
  • Etsy - Etsy implemented an LLM-powered system to generate structured buyer profiles from user behavioral data, such as searches and purchases, to personalize experiences for its nearly 90 million buyers. Through significant LLMOps optimizations including data pipeline improvements, prompt engineering, and batch processing, the system achieved a 94% cost reduction per million users and faster profile generation, enabling applications like query rewriting and refinement pills at scale.
  • Etsy - Etsy Engineering implemented a Retrieval-Augmented Generation (RAG) architecture with embeddings-based search to provide AI-assisted employee onboarding, focusing on prompt engineering rather than fine-tuning. Leveraging foundation models, the system achieved 86% accuracy for internal Travel & Entertainment policy questions and 72% for external community forum support. Chain-of-thought reasoning and source citation were crucial prompt engineering techniques to mitigate hallucinations and improve answer reliability in production.
  • EV Trading Cards / Snorlax Spar Breaks - An e-commerce case study details implementing Amazon Fraud Detector, a managed ML service, for fraud detection, addressing fraud types such as promo abuse and account hijacking. The system uses historical data to generate fraud likelihood scores (0-1000) for orders, necessitating human review for GDPR compliance due to its black-box nature. Operational aspects include evaluating true/false positive rates, feature engineering, deploying via SageMaker or Lambda, and adaptive model retraining based on fraud trend velocity, despite a batch processing limitation.
  • Exa - Exa developed a multi-agent web research system using LangGraph for orchestration and LangSmith for observability, processing hundreds of daily queries with response times from 15 seconds to 3 minutes. Its architecture features a Planner for dynamic task generation, independent Task agents with specialized tools producing structured JSON, and an Observer for system-wide context management. The system optimizes token usage by initially reasoning on search snippets and only retrieving full content when necessary, ensuring structured JSON outputs for API consumption.
  • Exa.ai - Exa.ai developed a search engine specifically for AI agents, moving beyond human-centric keyword search to semantic understanding and raw data retrieval. They achieved this by owning their GPU cluster, building a proprietary web index, and training custom models, enabling full-stack optimization for latency, privacy, and research flexibility. Their offerings include a tiered API for various AI application needs and an agentic tool for complex, multi-criteria web research.
  • Explai - Explai, building AI analytics agents, initially struggled with context window pollution and instruction following degradation in production due to pre-loading extensive information into LLM contexts. They addressed this by implementing strategic prompt engineering tactics: reversing RAG to use pull-based document retrieval with concise triggers, writing structured artifacts to a backend instead of raw data into context, and enabling full code generation in sandboxed environments for specific tasks. These methods significantly reduced token consumption, improved context management, and enabled more robust, autonomous multi-step analytical workflows for enterprise data.
  • FanDuel - FanDuel developed AAI, an in-app AI betting assistant, to address customer friction where users left the app for external research, causing distractions and missed betting opportunities. Built on AWS Bedrock with a RAG architecture, serverless components (Lambda, DynamoDB for context, Redis for caching), and multi-intent routing, AAI reduced complex bet construction time from hours to seconds. The system incorporates responsible gaming safeguards, manages conversation history, and uses a rigorous evaluation framework for incremental deployment in a highly regulated sports wagering environment.
  • FemmFlo - FemmFlo rapidly developed an AI-powered hormonal health platform in eight weeks using AWS Bedrock and managed services to address long diagnostic delays for women. The platform features an AI agent, Gabby, which provides personalized care, interprets lab results, and offers culturally relevant health guidance, all built with robust LLMOps practices including systematic evaluation and a controlled testing environment. This enabled a small team to deploy a production-ready system for real users in a regulated healthcare domain.
  • Fidelity Investments - Fidelity Investments built CENTS, an event-driven data pipeline, to ingest, enrich, and route cloud health events and support cases from 2,000+ AWS accounts, establishing a robust foundation for operational intelligence. On top of CENTS, they developed the MAKI framework using Amazon Bedrock, which leverages generative AI for event summarization, aggregate trend analysis, and agentic workflows, including proactive vulnerability detection and automated code fixes. This system achieved a 57% cost reduction, improved targeted notifications, and enabled proactive incident prevention by correlating patterns across their vast infrastructure.
  • Fitbit - Fitbit, in collaboration with Google, developed an AI-powered personal health coach using Gemini models within a multi-agent framework to deliver personalized, adaptive guidance. This system features a conversational orchestrator, a data science agent for numerical reasoning on physiological time series data, and domain expert agents. It underwent extensive validation via the SHARP framework, involving over a million human annotations and 100,000 hours of expert evaluation, before its public preview for Fitbit Premium users.
  • Fitch Group - Fitch Group deploys agentic AI in financial services by focusing on augmentation, not full automation, integrating LLMs with traditional ML and knowledge graphs for numerical accuracy and grounding. Production readiness demands rigorous upfront evaluation frameworks, comprehensive observability with multi-stage testing and logging, and a significant "data prep tax" for data preparation and versioning. They emphasize hybrid architectures, human-in-the-loop validation, and strong business-technical partnerships to define success and manage the non-deterministic nature of agent systems in a highly regulated environment.
  • Fortive - Capgemini and AWS developed "Fort Brain," a multi-tenant AI chatbot platform for Fortive's industrial conglomerate to standardize AI capabilities across its independent operating companies, addressing disparate data sources and manual update challenges. Built on AWS Bedrock with a serverless architecture (Fargate, Lambda, API Gateway), it uses a multi-agent system and Model Context Protocol (MCP) to dynamically query live structured databases, unstructured documents via Bedrock Knowledge Bases, and software repositories. This enables non-technical users to access real-time operational data across all OpCos, eliminating manual schema remapping and providing rapid responses.
  • Georgia-Pacific - Georgia-Pacific deployed an "Operator Assistant" using a RAG architecture on AWS Bedrock to address critical knowledge transfer gaps in manufacturing operations. This system integrates structured time-series data from PI historians with unstructured documentation and tacit knowledge captured via a custom Docgen tool, providing real-time operational guidance to factory operators. Rapidly scaled from concept to production in 6-8 weeks, it now serves 500+ users across 45 sites, demonstrating improved operational efficiency and reduced waste, with future plans for autonomous agents.
  • GetOnStack - GetOnStack's multi-agent LLM system for market data research incurred a $47,000 cost disaster within a month due to an undetected 11-day infinite conversation loop between agents. This failure exposed critical infrastructure gaps, including a lack of real-time cost monitoring, circuit breakers, and conversation tracing, highlighting the immaturity of current multi-agent LLMOps tooling. Consequently, GetOnStack developed extensive production infrastructure and is now building a platform to provide these essential safeguards and observability for other multi-agent deployments.
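The missing safeguard is straightforward to sketch: a circuit breaker that meters spend and turn count on every agent exchange. Budgets and per-token prices below are illustrative:

```python
class CostCircuitBreaker:
    """Trips when an agent conversation exceeds a spend or turn budget."""

    def __init__(self, max_usd: float, max_turns: int):
        self.max_usd, self.max_turns = max_usd, max_turns
        self.spent_usd, self.turns = 0.0, 0

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.turns += 1
        # illustrative prices; a real meter would read provider billing data
        self.spent_usd += prompt_tokens * 3e-6 + completion_tokens * 15e-6
        if self.spent_usd > self.max_usd or self.turns > self.max_turns:
            raise RuntimeError(
                f"circuit open after {self.turns} turns, ${self.spent_usd:.2f} spent"
            )

breaker = CostCircuitBreaker(max_usd=50.0, max_turns=200)
try:
    while True:  # stands in for an unbounded agent-to-agent loop
        breaker.record(prompt_tokens=4_000, completion_tokens=800)
except RuntimeError as err:
    print(err)   # halted after ~200 turns instead of running for 11 days
```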
  • GetYourGuide - GetYourGuide scaled product categorization for 250,000 products across 600 categories by transitioning from manual and semantic NLP methods to a hybrid LLM-based system. This solution leverages OpenAI's GPT-4o mini with structured outputs for one-product-one-category classification, optimized by embedding-based pre-ranking, OpenAI batch jobs for cost efficiency, and custom early stopping, all orchestrated via Apache Airflow. The implementation significantly improved categorization quality (higher MCC, expanded category coverage) and business metrics, yielding a 1.3% conversion rate increase and reduced bounce rate in A/B tests.
  • GitHub - GitHub Copilot's evaluation team developed a comprehensive system for its large-scale AI product, employing a code-testing harness, A/B testing, and implicit user behavior metrics to assess model changes. This revealed a critical gap between data science-focused evaluation tools and engineering workflows, leading to a production-first philosophy that prioritizes online evaluation, automated trace analysis, and rapid iteration over extensive offline testing to align with real-world development practices.
  • Glean / Deloitte / DocuSign - A panel from Glean, Deloitte, and DocuSign discussed enterprise LLM and agentic AI deployment challenges, emphasizing that organizational complexity and data silos are greater hurdles than technical ones, necessitating human-in-the-loop oversight due to trust issues with full autonomy. Key recommendations included early security team involvement, robust governance for agent creation and sharing, leveraging data lakes for secure data access, and prioritizing business value measurement through reimagined workflows rather than focusing solely on specific LLM models.
  • GlowingStar Inc. - GlowingStar Inc. develops emotionally aware AI tutoring agents by integrating multimodal affect detection into an expanded agent architecture. This system uses a multimodal perception layer to analyze voice, facial expressions, and interaction patterns, feeding into an explicit emotional modeling module that estimates user affective states. These states inform reasoning, planning, and emotional tagging in memory, enabling personalized learning adaptation while addressing challenges like real-time signal fusion and ethical data handling.
  • Goodfire - Goodfire deployed AI agents, termed "experimenter agents," for interpretability research across domains like genomics and diffusion models, distinguishing them from "developer agents." Their technical solution involves MCP-based Jupyter notebook integration, providing agents with interactive, stateful access to execute code iteratively and autonomously conduct complex experiments, such as rediscovering scientific features. Despite successes in diverse research tasks, significant challenges persist in validating agent outputs, preventing reward hacking, and managing context, necessitating human oversight and critic systems.
  • Google - The case study details the rapid evolution of production LLM agents from basic function calling to complex multi-step reasoning, driven by model advancements like Gemini. This necessitates continuous architectural rebuilds of agent harnesses, where old defensive code is removed as models gain capabilities, while new orchestration is added for more ambitious tasks. Robust, automated evaluation infrastructure is highlighted as a critical competitive differentiator, enabling rapid iteration, informing architectural decisions, and adapting to fast-paced model improvements and diverse deployment scenarios.
  • Google - Google Research developed Wayfinding AI, built on Gemini 2.5 Flash, to help users navigate complex health information by proactively asking clarifying questions to understand user context, rather than immediately providing comprehensive answers. This system uses sophisticated prompt engineering to orchestrate question generation, best-effort answers, and transparent reasoning, presented in a two-column interface. User studies demonstrated Wayfinding AI was significantly preferred over a baseline LLM for helpfulness, relevance, and tailoring, validating that a context-seeking conversational approach enhances user experience in sensitive domains.
  • Google - Google Photos transitioned from on-device machine learning using small, specialized models for features like background blur to cloud-based generative AI for its Magic Editor. This shift enabled complex image manipulations such as object relocation and scene reimagination, necessitating a complete architectural overhaul to manage cloud infrastructure, latency, and develop new evaluation methodologies for inherently subjective generative outputs. To ensure production reliability, they focused on guided user experiences and constrained problem scopes, blending generative AI with traditional engineering for grounded, memory-preserving edits.
  • Google Cloud / Microsoft / InWorld AI - This case study details how Google Cloud, Microsoft, InWorld AI, and IUD/Prosus are hardening AI agents for e-commerce, shifting from prompt engineering to post-training methods like DPO and PEFT for reliability. Google Cloud used DPO to boost support agent policy adherence from 45% to 90%, while Microsoft relies entirely on post-training for Copilot but faces challenges with UI localization for computer-use agents and high token costs. InWorld AI improved voice agent tool calling with cascaded architectures, and IUD grapples with balancing personalization in multi-channel e-commerce agents.
  • Google DeepMind - Google DeepMind launched Antigravity, an agent-first AI development platform leveraging Gemini 3 Pro for complex, long-running software tasks. It features multi-surface orchestration across an AI editor, agent-controlled browser, and central agent manager, introducing "artifacts" as dynamic primitives for agent output organization and asynchronous human feedback. Developed through a tight research-product feedback loop, the platform faced immediate capacity constraints post-launch despite its advanced LLMOps design.
  • Google DeepMind - Google DeepMind integrated native image generation directly into Gemini 2.5 Flash, enabling "interleaved generation" where the model maintains full multimodal context across conversation turns for iterative image refinement. This architecture allows pixel-perfect editing and consistent character rendering across poses, with images generated in approximately 13 seconds. Production challenges were addressed through a multi-faceted evaluation combining human preference with proxy metrics like text rendering quality, and by systematically incorporating real user failure cases into benchmarks.
  • Government of the City of Buenos Aires / B - The City of Buenos Aires enhanced its "Boti" WhatsApp AI assistant using LangGraph and Amazon Bedrock to help citizens navigate complex government procedures. The system features custom input guardrails and a novel reasoning retrieval approach that generates comparative summaries and uses LLM-based disambiguation for similar procedures, achieving 98.9% top-1 retrieval accuracy. This advanced agentic architecture processes 3 million conversations monthly, delivering culturally localized responses.
  • Gradient Labs - Gradient Labs' AI agent system, deployed on Google Cloud Run with Temporal workflows, experienced production incidents due to high memory usage and container crashes. Investigation revealed the root cause was an oversized Temporal workflow cache, which was resolved by tuning cache parameters. However, this fix inadvertently caused auto-scaling issues, as Cloud Run's instance scaling, previously triggered by the crashes, now under-provisioned resources, leading to increased agent latency.
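For readers on Temporal's Python SDK, the relevant knob is the worker's sticky workflow cache (Gradient Labs' stack may use a different SDK, and the cap below is illustrative): bounding it caps memory at the cost of occasional workflow replays.

```python
import asyncio
from temporalio import workflow
from temporalio.client import Client
from temporalio.worker import Worker

@workflow.defn
class Ping:                      # placeholder workflow so the worker can start
    @workflow.run
    async def run(self) -> str:
        return "pong"

async def main():
    client = await Client.connect("localhost:7233")
    worker = Worker(
        client,
        task_queue="agent-tasks",
        workflows=[Ping],
        max_cached_workflows=200,  # bound the sticky cache to fit the container
    )
    await worker.run()

if __name__ == "__main__":
    asyncio.run(main())
```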
  • Groq / NVIDIA / AMD / Lambda - A panel of infrastructure leaders discussed the unique production challenges for AI agents, highlighting requirements for extremely low latency, high token generation, and diverse infrastructure spanning edge to cloud. They emphasized a shift from training-centric to inference-centric hardware, advocating for specialized, efficient architectures, smaller distilled models, and the critical role of full-stack engineering and task-specific evaluation for optimizing these complex, heterogeneous systems.
  • H2O.ai - H2O.ai, operating an enterprise AI platform on Kubernetes with AWS EBS, faced significant storage overprovisioning, utilizing only 25% of their 2 petabytes, leading to high costs and operational inefficiency for LLM and AI workloads. They adopted Datafi's autonomous storage management solution, which dynamically scales EBS volumes without downtime and integrates with their existing Terraform and GitOps workflows. This resulted in improved storage utilization from 25% to 80%, reducing their footprint from 2 petabytes to under 1 petabyte, while also enhancing customer performance.
  • HackAPrompt / LearnPrompting - Sander Schulhoff's work through LearnPrompting and HackAPrompt established foundational resources for production LLM security and prompt engineering. LearnPrompting provided systematic education on prompt engineering, while HackAPrompt created the first AI red teaming competition platform, collecting 600,000 attack prompts that became an industry standard dataset. This work revealed the ineffectiveness of traditional prompt-based defenses against prompt injection, highlighted the probabilistic nature of LLM security, and underscored the need for continuous red teaming and model-specific optimization in production deployments.
  • Handmade.com - Handmade.com automated product description generation to address scalability and quality issues for its 60,000+ unique items, which previously required 10 hours of manual effort weekly. They implemented an LLMOps pipeline using Amazon Bedrock with Anthropic Claude 3.7 Sonnet for multimodal content generation, Amazon Titan Text Embeddings V2, and Amazon OpenSearch Service for vector storage. This solution leverages Retrieval Augmented Generation (RAG) with 1 million existing product embeddings and persona-based prompt engineering, significantly reducing manual processing, improving content quality, and enabling sub-one-hour listing times.
  • Harman International / Axfood - Harman International leveraged AWS Bedrock and Amazon Q Developer with Anthropic Claude models to automatically generate documentation for 30,000 poorly documented custom SAP ABAP objects during an S/4HANA migration. Through iterative prompt engineering, this generative AI solution transformed initial technical outputs into structured, multi-tier documentation suitable for business, functional, and technical stakeholders. This approach dramatically reduced the documentation timeline from 15 months to 2 months, achieving a 6-7x speed improvement and over 70% cost reduction compared to manual efforts.
  • Harvey - Harvey, a legal AI platform, rapidly integrated OpenAI's Deep Research API within 12 hours of its release, showcasing efficient LLMOps. This was enabled by their AI-native architecture, featuring a modular Workflow Engine for orchestrating agent behavior and composing AI building blocks. Their approach also leverages AI-assisted development and incorporates transparency features like "thinking states" and robust citation systems to ensure reliability and user understanding in legal applications.
  • Harvey - Harvey developed a sophisticated AI infrastructure for legal AI applications, processing billions of prompt tokens daily across multiple models. This infrastructure centers on a centralized Python library for model orchestration, featuring intelligent endpoint selection with weighted algorithms, Redis-backed distributed rate limiting, and a proxy service for secure developer access. Comprehensive observability and parallel deployments ensure high availability, performance, and cost tracking for their enterprise-scale legal AI products.
  • Harvey - Harvey, a legal AI platform, implements enterprise-grade Retrieval-Augmented Generation (RAG) systems to process sensitive legal documents across 45 countries, prioritizing data privacy and compliance. They utilize LanceDB Enterprise for scalable vector storage, enabling decentralized data isolation in customer-controlled cloud buckets, and achieve sub-2-second latency for 15 million embeddings. This architecture, combined with domain expert collaboration, produced a Tax AI Assistant whose answers were preferred over ChatGPT's 91% of the time.
  • Harvey - Harvey, a legal AI company, developed a three-pillar evaluation strategy for its high-stakes legal AI systems, combining direct expert-led reviews with automated evaluation pipelines for continuous monitoring and rapid iteration. This approach includes specialized techniques like a Knowledge Source Identification system achieving over 95% accuracy in citation verification using custom embeddings and LLM matching, alongside a dedicated data service for secure, versioned evaluation data. This comprehensive methodology ensures rigorous quality and enables statistically significant improvements in a complex, high-consequence domain.
  • Harvey - Harvey developed a Microsoft Word Add-In for AI-powered, document-wide editing of 100+ page legal documents through a single query. The system uses a reversible mapping to translate complex OOXML to natural language for LLM processing and an orchestrator-subagent architecture to overcome long-context limitations by decomposing tasks into bounded chunks. This transforms hours of manual legal editing into seamless interactions, enabling complex operations like contract conformance and template creation.
  • Harvey - Harvey implemented a large-scale RAG system with LanceDB to process legal documents, ranging from small uploads to tens of millions of documents, using an AI-native multimodal lakehouse architecture. This system leverages LanceDB's format for efficient storage and GPU indexing of billions of vectors, enabling complex legal queries while adhering to strict evaluation, security, and privacy standards.
  • HoneyBook - HoneyBook transformed its small business CRM onboarding with an AI agent, replacing a static questionnaire with a personalized conversational experience. This agent utilizes RAG with dynamic retrieval modes for knowledge, employs action-execution tools to generate tailored contracts and invoices, and manages conversation flow with explicit goals, all orchestrated on Temporal infrastructure with custom tool strategies. The implementation led to a 36% increase in trial-to-paid subscription conversion rates.
  • HubSpot - HubSpot developed a remote, stateless Model Context Protocol (MCP) server as a Java Dropwizard microservice to enable AI agents like ChatGPT to securely access CRM data for millions of users. This involved extending the Java MCP SDK for HTTP streaming, integrating with existing REST APIs, and implementing OAuth 2.0 for authentication and user permission mapping, delivered in under four weeks. The read-only solution provided scalable, enterprise-grade access, allowing natural language queries for CRM analytics while navigating an evolving protocol landscape.
  • HubSpot - HubSpot developed the first production-ready CRM integration for ChatGPT using the Model Context Protocol (MCP), building an in-house remote MCP server with Java and Dropwizard. This solution democratizes AI access for over 250,000 businesses by implementing OAuth-based user-level permissions, a distributed service discovery system for automatic tool registration, and a custom query DSL for reliable CRM search generation by AI models. They also engineered custom Streamable HTTP support to align with their stateless infrastructure, ensuring enterprise-grade security and scalability.
  • Huron Consulting Group - Huron Consulting Group implemented generative AI using Amazon Bedrock and the Nova LLM within their AWS architecture to address delayed patient experience feedback and manual unstructured data analysis in healthcare. Their solution performs sentiment analysis on patient rounding notes and extracts insights from business operations text, processing over 10,000 notes weekly with 90% accuracy. This enables real-time intervention for patient dissatisfaction and scalable identification of revenue opportunities, directly impacting hospital funding and operational efficiency.
  • iFood - iFood developed Ailo, an AI-powered food ordering agent for millions of users in Brazil, designed to combat decision paralysis through hyperpersonalized recommendations and autonomous actions like applying coupons and managing carts. This agentic system, deployed across the iFood app and WhatsApp, features a multi-agent-like architecture with domain-specific tools, sophisticated context management, and optimizations that reduced P95 latency from 30s to 10s. The team also addressed prompt bloat by improving tool naming and implemented a multi-layered evaluation framework including natural language scenario testing.
  • iHeart Media - iHeart Media, managing a vast digital media platform with billions of monthly requests on AWS, automated incident response to address slow manual triage, engineer burnout, and tribal knowledge dependencies. They deployed a multi-agent AI system using Amazon Bedrock AgentCore and Strands Agents, where a coordinator agent delegates tasks to specialized sub-agents that leverage isolated context windows to perform deep dives and return concise summaries. This architecture reduced incident triage time to 30-60 seconds, improved operational efficiency by automating root cause analysis and safe remediation, and systematically preserved institutional knowledge.
  • Incident.io - Incident.io developed an AI SRE product that automates incident investigation using a multi-agent system mimicking human reasoning. This system performs parallel searches across diverse data sources like code changes, logs, and historical incidents to generate findings, formulate hypotheses, and ask clarifying questions via sub-agents. It then presents actionable insights and recommendations in Slack within minutes, significantly reducing incident response time and cognitive load.
  • Indegene - Indegene developed an AI-powered social intelligence solution for life sciences to analyze complex medical discussions on social media at scale, addressing the challenge of pharmaceutical companies struggling to derive insights from healthcare professional digital engagement. This solution employs a sophisticated four-layer LLMOps architecture leveraging Amazon Bedrock for foundation models (fine-tuning, RAG, intelligent routing, guardrails) and a taxonomy-based query generator using medical terminology databases. It transforms unstructured data into actionable business intelligence, enabling brand monitoring, adverse event detection, and competitive intelligence while ensuring strict regulatory compliance and improving time-to-insight.
  • Infosys - Infosys and AWS developed a multimodal RAG solution to process complex, diverse technical documentation in the oil and gas industry, which includes text, images, charts, and diagrams. This solution leverages Amazon Bedrock, OpenSearch Serverless, and advanced techniques like parent-child hierarchy chunking, hybrid search, and multi-vector retrieval to accurately extract insights and handle domain-specific terminology. The system achieved 92% retrieval accuracy, sub-2-second response times, and delivered significant operational efficiencies, including a 40-50% reduction in manual processing costs.
  • INRIX - INRIX developed an AI-powered solution for Caltrans to improve transportation safety by identifying high-risk locations for vulnerable road users and generating safety countermeasures. Leveraging INRIX's 50 petabyte data lake and Amazon Bedrock (Claude for RAG-based recommendations, Nova Canvas for AI-generated visualizations), the system automates the process of proposing and visualizing interventions. This technical approach drastically reduces design cycles from weeks to days, accelerating the deployment of empirically validated safety measures and enhancing planning efficiency.
  • Instacart - Instacart developed Maple, a centralized batch processing platform, to efficiently handle millions of LLM prompts for internal applications like catalog enrichment and search ranking, overcoming the limitations and high costs of real-time APIs. This platform orchestrates large-scale LLM jobs using S3/Parquet for optimized data storage, Temporal for fault tolerance, and an AI Gateway for multi-provider abstraction, automatically managing batching, retries, and provider-specific encoding to achieve significant cost savings and reliable, scalable LLM operations.
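The batch-side plumbing can be sketched in a few lines, with pandas plus pyarrow for the Parquet staging and a stand-in for the gateway submission (the batch size, schema, and `submit` function are invented):

```python
import pandas as pd

BATCH_SIZE = 50_000  # illustrative per-provider batch limit

# Stage prompts as Parquet, the storage format named in the summary.
prompts = pd.DataFrame({
    "row_id": [0, 1, 2],
    "prompt": ["classify: apples", "classify: oat milk", "classify: basil"],
})
prompts.to_parquet("job_input.parquet", index=False)

# Split the staged job into provider-sized batches; a real gateway would
# also own retries, fault tolerance, and provider-specific encoding.
staged = pd.read_parquet("job_input.parquet")
batches = [staged.iloc[i:i + BATCH_SIZE] for i in range(0, len(staged), BATCH_SIZE)]

def submit(batch: pd.DataFrame) -> str:
    return f"submitted {len(batch)} prompts"

for batch in batches:
    print(submit(batch))
```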
  • Instacart - Instacart integrated LLMs into its grocery e-commerce search to enhance query understanding, particularly for tail queries and product discovery, by augmenting existing ML models like query-to-category classifiers and query rewrite systems. This hybrid approach combined LLM capabilities with Instacart-specific domain knowledge and user behavior data, employing a dual-serving architecture that pre-computed results for common queries and used real-time inference for long-tail ones. The implementation yielded an 18-percentage-point improvement in precision and a 70-percentage-point improvement in recall on tail queries, substantially reducing zero-result queries and increasing engagement with discovery-oriented content.
  • Intel / LMSYS - Intel PyTorch Team and SGLang developed a CPU-only deployment solution for large Mixture of Experts (MoE) models, like DeepSeek R1 (671B parameters), on Intel Xeon 6 processors to provide a cost-effective alternative to expensive GPU-based inference. This involved optimizing SGLang with native CPU backend support, leveraging Intel AMX, and implementing advanced techniques for attention mechanisms, MoE layers, and multi-NUMA parallelism, supporting BF16, INT8, and FP8 quantization. The solution achieved 6-14x faster time-to-first-token and 2-4x faster time-per-output-token compared to llama.cpp, demonstrating competitive performance for deploying massive LLMs on commodity hardware.
  • Intercom - Intercom transformed from a struggling SaaS company to an AI-first agent business by rapidly developing Fin, an AI customer service agent, in six weeks after GPT-3.5's launch. This involved a $100M investment, a complete business model shift to outcome-based pricing (99 cents per resolved ticket), and solving LLMOps challenges to optimize unit economics from negative to profitable. This strategic pivot and operational overhaul led to Fin's 300% year-over-year growth and projected $100M ARR, revitalizing the company.
  • Intercom - Intercom rapidly developed Fin Voice, a production voice AI agent for customer support, in 100 days, extending their existing text-based Fin agent. It employs a speech-to-text, language model, text-to-speech (STT-LM-TTS) architecture with RAG for knowledge retrieval, integrating with real-time APIs and existing telephony. The system addresses voice-specific challenges like latency and response length, focusing on LLMOps for workflow integration, evaluation, and internal tooling to achieve significant cost savings and 24/7 availability.
  • Intuit - Intuit deployed a large-scale LLM assistant, Intuit Assist, on its proprietary GenOS platform for TurboTax, serving 44 million annual tax returns to provide explanations of tax situations, deductions, and refunds. This system uses a multi-model strategy (Claude for core explanations, GPT for Q&A), integrates with proprietary tax engines via traditional and Graph RAG, and employs strict safety guardrails to separate LLM explanations from actual calculations. A comprehensive, multi-phase evaluation framework involving human tax experts and LLM-as-judge systems ensures accuracy and compliance in the highly regulated financial domain.
  • Jefferies - Jefferies Equities deployed an AI Trade Assistant on Amazon Bedrock to enable front-office traders to query millions of fragmented trading records using natural language. This solution, leveraging Amazon Titan embeddings and Strands Agents, integrates into their existing BI platform, generating SQL queries against an in-memory database (GridGain) and deterministic visualizations via a Python library. A beta rollout demonstrated an 80% reduction in time spent on routine analytical tasks and high user adoption, significantly improving data accessibility and reducing IT burden.
  • Jellyfish - Jellyfish analyzed 20 million pull requests from 1,000 companies, revealing that median AI coding tool adoption grew from 22% to 90%, correlating with approximately 2x gains in PR throughput and a 24% reduction in cycle time without quality degradation. Crucially, productivity gains varied significantly by architecture, with centralized systems seeing 4x improvements while highly distributed architectures showed minimal impact due to AI tools' context limitations across multiple repositories, and autonomous agents saw less than 2% production usage.
  • JetBlue - JetBlue addressed challenges in manually tuning complex, multi-stage LLM pipelines for applications like customer feedback classification and RAG chatbots by adopting DSPy. Integrated with Databricks Model Serving and Vector Search, DSPy enabled automated optimization of prompts and in-context learning examples against defined metrics, replacing manual prompt engineering. This systematic approach resulted in 2x faster RAG chatbot deployment compared to previous Langchain implementations and improved the reliability and efficiency of their LLM applications.
  • Jimdo - Jimdo developed an AI-powered business assistant, Jimdo Companion, for its solopreneur customers to improve website traffic and conversions. Leveraging LangChain.js and LangGraph.js for multi-agent orchestration and LangSmith for observability, the system integrates a dashboard querying 10+ data sources and a conversational assistant that adapts to user tone. This implementation resulted in a reported 50% increase in first customer contacts and 40% more overall customer activity.
  • Komodo Health - Komodo Health developed a multi-agent healthcare analytics assistant enabling natural language queries against its proprietary medical events database. The system evolved to a hybrid architecture where an agentic supervisor intelligently routes user requests to either deterministic workflows or specialized sub-agents, ensuring final analytical outputs originate directly from database APIs to prevent hallucinations. This approach balances the flexibility of autonomous agents with the control and cost-efficiency of deterministic code, prioritizing trust and accuracy in a high-stakes domain.
  • LangChain - LangChain developed five evaluation patterns for their production "Deep Agents" (stateful, multi-step AI agents) to address the limitations of traditional LLM testing. These patterns include bespoke test logic with custom assertions, single-step decision validation, full end-to-end turn testing, multi-turn conditional conversations, and reproducible environments with API mocking. Leveraging LangSmith's integrations, this approach enables flexible, debuggable, and efficient assessment of agent trajectories, final outputs, and state artifacts.
  • LangChain - LangChain rebuilt its production documentation chatbot, which previously used vector embeddings and suffered from fragmented context and reindexing issues, by adopting a multi-agent architecture. This new system features a fast CreateAgent for simple queries and a Deep Agent with specialized subgraphs for complex investigations, both leveraging direct API access to structured content (docs, KB, codebase) for iterative, human-like search instead of similarity-based retrieval. This approach resulted in sub-15-second responses with precise citations, eliminated reindexing overhead, and improved internal adoption for complex technical troubleshooting.
  • LangChain - LangChain's LangSmith platform offers comprehensive LLMOps tooling for managing production-grade AI agents, addressing challenges in scaling LLM applications beyond prototyping. Key features include automated Insights, which discovers patterns and anomalies from millions of production traces to understand user behavior and agent performance, and thread-based evaluations, enabling assessment of multi-turn interactions and complete user sessions. These capabilities aim to bring rigor to LLM application development and deployment, facilitating the transition from informal testing to methodical production operations.
  • LangChain / Manus / Anthropic - This case study details the evolution of LLMOps from model training to orchestrating rapidly improving foundation models, necessitating continuous system rearchitecture due to the "Bitter Lesson." It emphasizes critical context engineering techniques (reduce, offload, isolate) to manage token usage, cost, and prevent context degradation, alongside architectural choices between workflows for predictable tasks and agents for open-ended problems. Effective evaluation for non-deterministic LLMs relies on robust tracing, user feedback, and continuously evolving evaluation sets rather than static benchmarks, underscoring the need for systems built for rapid change.
  • Lexbe - Lexbe developed Lexbe Pilot, an AI-powered Q&A assistant for legal document review, to analyze massive document collections (100k-1M+) where traditional search is insufficient. The solution employs a RAG-based LLMOps architecture on Amazon Bedrock, utilizing Titan Text v2 for embeddings, OpenSearch for indexing, and Sonnet 3.5 for generation, deployed via AWS Fargate. Through an eight-month iterative optimization process, including reranker technology, the system achieved a 90% recall rate, enabling deep automated inference and comprehensive report generation across multilingual documents.
  • LexMed - LexMed developed an AI platform leveraging LLMs and RAG to automate legal document analysis and hearing transcription for Social Security disability law. The system analyzes thousands of pages of medical records to map clinical findings to complex regulatory requirements using "mega prompts" and audits administrative hearings by transcribing and using function calling to cross-reference vocational expert testimony against job databases, identifying procedural errors and outdated job suggestions.
  • LiftOff LLC - LiftOff LLC self-hosted DeepSeek-R1 models (1.5B-16B) on AWS EC2 GPU instances using Docker, Ollama, and OpenWeb UI to evaluate replacing commercial AI services. While technically deployable, larger models faced significant memory, stability, and performance issues with longer contexts, requiring extensive tuning and quantization. The economic analysis showed self-hosting was not cost-effective for startup scale, costing $414/month for a single g5g.2xlarge instance compared to $20/user/month for SaaS, making commercial LLMs a superior value proposition.
  • Linear - Linear developed a Slack-integrated LLM agent for issue creation and data querying, addressing challenges like LLM accuracy and context management within platform constraints. They implemented early intent classification to route requests to specialized subsystems, provided localized conversation context, and programmatically handled complex business logic and formatting, reserving the LLM for generative tasks like summarization. This approach improved issue creation accuracy, response times, and overall reliability by leveraging the LLM's strengths while mitigating its weaknesses with deterministic code.
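The "classify early, route to deterministic code" pattern Linear describes is worth seeing in miniature. In this sketch, one cheap constrained LLM call picks an intent, and everything except the genuinely generative steps is handled by ordinary code; the intents, `llm` helper, and API functions are hypothetical stand-ins, not Linear's implementation.

```python
# Minimal sketch of intent-first routing; all names below are illustrative.
def llm(prompt: str) -> str: ...                                    # placeholder chat-completion call
def create_issue_via_api(title: str, team: str) -> str: ...         # placeholder issue-tracker API wrapper
def run_structured_query(message: str, context: dict) -> str: ...   # deterministic query path

INTENTS = ["create_issue", "query_issues", "summarize_thread", "other"]

def classify_intent(message: str) -> str:
    # One constrained call up front instead of a free-form agent loop.
    label = llm(f"Classify this message into one of {INTENTS}. "
                f"Reply with the label only.\n\nMessage: {message}").strip()
    return label if label in INTENTS else "other"

def handle(message: str, context: dict) -> str:
    intent = classify_intent(message)
    if intent == "create_issue":
        # Deterministic code owns validation, formatting, and the API call;
        # the LLM only drafts text.
        title = llm(f"Write a one-line issue title for: {message}")
        return create_issue_via_api(title=title, team=context["team_id"])
    if intent == "query_issues":
        return run_structured_query(message, context)  # no LLM needed at all
    if intent == "summarize_thread":
        return llm(f"Summarize this thread:\n{context['thread_text']}")
    return "Sorry, I can't help with that yet."
```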
  • LinkedIn - LinkedIn's AI agent, Hiring Assistant, faced high latency generating long, structured outputs (1,000+ tokens) from large inputs, necessitating inference optimization. They implemented n-gram speculative decoding within their vLLM serving stack, a technique that drafts and verifies multiple tokens in parallel, leveraging the structured and repetitive nature of their outputs for acceleration. This resulted in 4x higher throughput and a 66% reduction in P90 end-to-end latency, with no degradation in output quality.
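The core trick behind n-gram speculative decoding is simple enough to sketch: because structured outputs repeat phrases from their inputs, the tokens that followed a matching n-gram earlier in the sequence are good guesses for the next few tokens. The toy functions below illustrate only the drafting step; the production version lives inside vLLM's serving stack, and `verify` here is a stand-in for the target model's parallel verification pass.

```python
def draft_ngram(tokens: list[int], n: int = 3, k: int = 5) -> list[int]:
    """Propose up to k draft tokens by matching the trailing n-gram earlier in the sequence."""
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # Scan right-to-left for a previous occurrence of the trailing n-gram,
    # skipping the trailing occurrence itself.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            return tokens[i + n:i + n + k]  # the tokens that followed it last time
    return []

def speculative_step(tokens: list[int], verify) -> list[int]:
    drafts = draft_ngram(tokens)
    # The target model checks all drafts in one parallel forward pass;
    # accepted tokens are appended, and the first mismatch truncates the draft.
    accepted = verify(tokens, drafts)
    return tokens + accepted
```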
  • LinkedIn - LinkedIn deployed a multi-agent system, Hiring Assistant, in production, utilizing a supervisor pattern with four specialized agents to streamline the recruiting workflow. The case study details operational challenges in scaling to autonomous agents, focusing on memory isolation for multi-user contexts, robust tool discovery and safety validation for destructive actions, and computational efficiency via complexity-based request routing to optimize GPU usage.
  • LinkedIn - LinkedIn developed an agentic LLM-powered AI Hiring Assistant integrated into LinkedIn Recruiter to streamline candidate screening for enterprise recruiters. This assistant orchestrates complex workflows, maintains conversational memory, and reasons over diverse data sources including external ATS, employing transparency mechanisms and human-in-the-loop design for trust and control. Through a curated rollout and continuous learning from user signals, the system achieved significant efficiency gains, reducing application review time by 48% and increasing InMail acceptance rates by 69%.
  • LinkedIn - LinkedIn extended its existing GenAI platform to build production-scale AI agents capable of complex, long-running tasks with human oversight. They leveraged existing gRPC for agent definitions and a central skill registry, using their messaging system for multi-agent orchestration to ensure consistency, scalability, and resilience. The platform integrates LangGraph, sophisticated observability via OpenTelemetry and LangSmith, experiential memory, and robust security measures to support autonomous and semi-autonomous agent workflows.
  • LinkedIn - LinkedIn deployed vLLM across thousands of hosts to power over 50 GenAI applications like Hiring Assistant and AI Job Search, addressing requirements for thousands of QPS and sub-600ms p95 latency. Through a five-phase evolution, they leveraged vLLM's PagedAttention, optimized parameters like --num-scheduler-steps and ENABLE_PREFIX_CACHING, and re-architected for an OpenAI-compatible API. This resulted in approximately 10% TPS improvements and savings of over 60 GPUs for specific workloads, demonstrating mature LLMOps for high-throughput LLM serving.
  • LinkedIn - LinkedIn developed a 150B parameter foundation model, "Brew XL," to unify fragmented recommendation systems across its platform, using "promptification" to convert user data into LLM-processable prompts. This large model was then gradually distilled and optimized through multi-stage pruning, mixed precision quantization, and attention sparsification to achieve production-ready models (e.g., 3B parameters) capable of high QPS and sub-second latency. The system demonstrated zero-shot generalization for new tasks and improved cold-start user performance, significantly reducing latency and increasing throughput.
  • Loblaws - Loblaws Digital developed Alfred, a production-ready agentic orchestration layer to deploy AI workflows across its e-commerce and retail platforms. It addresses the challenge of moving agent prototypes to enterprise production by providing a template-based architecture utilizing LangGraph, FastAPI, and GCP, integrating with 50+ internal APIs via the Model Context Protocol (MCP). This system enables rapid deployment of conversational commerce applications with built-in security, privacy, observability, and cost management.
  • Loka / Domo - Loka and Domo demonstrate production-ready agentic AI systems that orchestrate multiple models and data sources for complex tasks. Loka's Advanced Drug Discovery Assistant (ADA) integrates specialized AI models (e.g., AlphaFold, ESM) with external databases (KEGG, STRING DB) to automate pharmaceutical research workflows like protein folding and molecular docking. Domo implements agentic systems for real-time business intelligence, such as call center optimization and financial analysis, leveraging multi-source data integration and human-in-the-loop oversight.
  • London Stock Exchange Group - LSEG developed an AI-powered "Surveillance Guide" system leveraging Amazon Bedrock and Anthropic's Claude Sonnet 3.5 to automate the analysis of 250,000 RNS articles for price sensitivity, addressing the manual and resource-intensive process of correlating suspicious trading activity with news. The system employs a two-stage classification architecture with sophisticated prompt engineering for explainability and conservative decision-making. This solution achieved 100% precision in identifying non-sensitive news and 100% recall in detecting price-sensitive content on its evaluation dataset, significantly reducing analyst workload and enhancing regulatory compliance.
  • LSEG - LSEG Risk Intelligence implemented generative AI on AWS Bedrock to accelerate content curation for its WorldCheck financial crime detection platform, which processes thousands of global sources. Adopting a phased LLMOps maturity model, they progressed from prompt-only summarization and entity extraction to RAG and multi-agent orchestration with human-in-the-loop validation. This approach reduced content curation time from hours to minutes, enhancing efficiency and scalability while maintaining accuracy and regulatory compliance through human oversight.
  • Lucid Motors - Lucid Motors, an EV manufacturer, rapidly deployed agentic AI solutions across its finance organization using AWS Bedrock and PwC's Agent OS to prepare for significant growth. In 10 weeks, they developed 14 proof-of-concept use cases, integrating data from SAP, Redshift, and Salesforce to enable real-time predictive analytics for demand forecasting, investor insights, and operational efficiency. This initiative aims to transform finance into a strategic competitive advantage, providing data-driven decision-making capabilities.
  • Luna - Luna developed an AI-powered Jira analytics system using GPT-4 and Claude 3.7 to extract actionable insights from project management data, aiming to track progress and predict delays. Key technical lessons included prioritizing data quality over prompt engineering, pre-processing temporal context due to LLM "time blindness," and optimizing temperature settings (0.2-0.3) for balanced analytical judgment. Further improvements were achieved by implementing chain-of-thought reasoning, constraining output scope, explicitly prompting reasoning models, and counteracting the LLM's "yes-man" bias for critical analysis.
  • Manchester Airports Group - Manchester Airports Group (MAG) implemented an agentic AI system using Amazon Bedrock AgentCore Runtime to automate complex unplanned absence reporting and shift management for its 9,000 airport staff. This multi-agent solution, featuring text and speech interfaces, dynamically authenticates users, classifies absence types, and updates HR/rostering systems, replacing manual processes and third-party helplines. The implementation resulted in 99% consistency in absence reporting and a 90% reduction in recording time, leading to significant cost savings.
  • Manus - Manus addresses context explosion and performance degradation in long-running AI agents by implementing a comprehensive context engineering framework. This framework leverages a full virtual machine sandbox to offload tool outputs and other context to the file system, enabling file-based retrieval and a layered action space for tool management. It employs a staged reduction strategy, prioritizing reversible compaction before irreversible structured summarization, to manage token limits and optimize KV cache efficiency while maintaining agent performance.
  • Manus AI - Manus AI developed a production AI agent system that leverages context engineering on frontier models instead of fine-tuning to execute complex, multi-step tasks at scale. Their technical approach emphasizes KV-cache optimization for cost efficiency, managing tools via logit masking to preserve cache, treating the file system as an externalized persistent context, and using task recitation to maintain agent focus and prevent goal drift.
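A sketch of the file-system offloading idea referenced in both Manus entries: large tool outputs are written to disk and replaced in the model's context by a small, reversible stub the agent can re-read later. The paths, threshold, and stub format below are assumptions for illustration, not Manus's implementation.

```python
import hashlib
import json
from pathlib import Path

WORKSPACE = Path("/tmp/agent_workspace")  # illustrative sandbox location
MAX_INLINE_CHARS = 2_000                  # illustrative truncation threshold

def offload_if_large(tool_name: str, output: str) -> str:
    """Return the output inline if small, else persist it and return a stub."""
    if len(output) <= MAX_INLINE_CHARS:
        return output
    WORKSPACE.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha1(output.encode()).hexdigest()[:8]
    path = WORKSPACE / f"{tool_name}_{digest}.txt"
    path.write_text(output)
    # The stub is reversible: the agent can re-read any slice of the file
    # later (e.g. via a read_file tool), so compaction loses no information.
    preview = output[:200].replace("\n", " ")
    return json.dumps({
        "truncated": True,
        "file": str(path),
        "bytes": len(output),
        "preview": preview,
    })
```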
  • Matillion - Matillion developed Maya, an AI-powered agentic system built on Spring AI and multiple LLMs (e.g., Claude Sonnet 3.5) to generate DPL-based data pipelines from natural language prompts. They evolved their evaluation from informal methods to structured LLMOps, implementing simple constrained tests, LLM-as-judge with human validation, and automated testing with Langfuse for observability, which enabled confident model upgrades and addressed PII leakage in traces. This rigorous approach facilitated Maya's successful enterprise deployment and integrated MLOps practices into their traditional software engineering workflow.
  • McLeod Health / Memorial Sloan Kettering - Three major healthcare systems successfully deployed generative AI ambient clinical documentation (AI scribes) at scale, utilizing rigorous evaluation methodologies including randomized controlled trials and bias-reducing pilots to select vendors. These deployments, which necessitated deep EHR integration and continuous clinician training, yielded significant outcomes such as measurable reductions in physician burnout, daily time savings of 1.5-2 hours, improved patient satisfaction, and positive financial ROI through enhanced coding accuracy. The case study underscores the critical importance of human-in-the-loop oversight, flexible adoption strategies, and encounter-based pricing models for effective enterprise LLM deployment in high-stakes clinical environments.
  • Mercedes-Benz - Mercedes-Benz migrated its critical Global Ordering mainframe system (5M+ lines of COBOL/Java) to AWS cloud, employing a hybrid strategy that included agentic AI for code transformation. Their GenRevive tool, guided by human experts and "cookbooks," transformed 1.3 million lines of COBOL for the pricing service into maintainable Java code in months, significantly accelerating the process. This AI-powered refactoring, validated through parallel testing and subsequent manual performance tuning, resulted in zero-incident production deployment, improved performance, and reduced mainframe costs.
  • Met Office - The UK Met Office, in collaboration with AWS, automated the generation of the Shipping Forecast by fine-tuning Amazon Nova vision-language models (VLMs) and LLMs to transform complex multi-dimensional weather data into structured text. This involved processing raw gridded data into video inputs for VLMs or summarized text for LLMs, achieving 52-62% F1 accuracy with VLMs and 62% with LLMs against expert forecasts within four weeks. The solution reduced forecast generation time from hours to under 5 minutes, establishing a scalable LLMOps framework for critical, high-accuracy government services.
  • Meta - Meta developed Privacy Aware Infrastructure (PAI) to scale privacy for GenAI products like AI glasses, addressing complex data flows from sensor input to LLM inference and model training. PAI uses automated data lineage tracking across its entire technology stack to provide comprehensive observability and programmatically enforce privacy policies via APIs. This system embeds privacy controls directly into the infrastructure, ensuring compliance and accelerating GenAI product innovation across thousands of microservices.
  • Meta - Meta deployed AI-powered Video Super-Resolution (VSR) models at massive scale to enhance over a billion daily video uploads, addressing low-quality content from various sources. The solution involved a multi-platform strategy, utilizing both CPU-based (Intel RVSR SDK) and GPU-based VSR models for different use cases like ad enhancement and generative AI features (MovieGen in Restyle). Through extensive subjective human evaluation, Meta optimized VSR deployment by identifying effective proxy metrics (VMAF-UQ) and targeting only videos that meaningfully benefit, balancing quality improvements with compute cost and resource constraints.
  • Meta - Meta's Feed Deep Dive scaled an AI feature to provide context for Facebook posts, addressing challenges like LLM quality, latency, and user engagement against the main Feed. They achieved this by evolving to agentic models for dynamic context and reasoning, implementing online auto-judges for real-time quality control, and using smart caching with ML-driven user targeting. This approach improved user engagement and quality, leading to product-market fit and paving the way for monetization and advanced social AI integrations.
  • Meta - Meta rapidly scaled its Express Backbone network 10x to meet exponential AI workload growth, accelerating 2030 capacity plans to 2024-2025. This involved pre-building metro architectures, platform scaling through larger chassis, 800Gbps interfaces, and expanded backbone planes, and integrating IP with optical transport using coherent transceivers for 80-90% power and space efficiency. Additionally, they developed AI-specific infrastructure like the Prometheus project, employing direct fiber for short-range and DWDM for long-range connections between geographically distributed training clusters.
  • Meta - Meta scaled its AI network infrastructure from 24K to over 100K GPUs for LLaMA 3 and LLaMA 4 training, evolving from single-building to multi-building Clos topologies. This involved deploying deep buffer switches, optimizing communication libraries for cross-building traffic, and implementing robust monitoring. Operational challenges like network congestion, PFC issues, and firmware inconsistencies were addressed through proactive design and rapid, on-the-fly debugging and fixes to maintain training performance.
  • Meta - Meta addressed severe network bottlenecks and GPU idle time in distributed AI training, caused by massive checkpoint data growth (hundreds of GBs to tens of TBs), which led to job read latencies of 300 seconds. Their solution involved a bidirectional multi-NIC utilization strategy: ECMP-based load balancing for egress traffic using netkit/eBPF, and BGP-based virtual IP injection for ingress traffic, leveraging existing rack switches. This resulted in significant performance gains, including a 300x reduction in job read latency (to 1 second), an 8x improvement in checkpoint loading (to 100 seconds), and a 4x throughput increase by fully utilizing all network interfaces.
  • Meta - Meta developed a multi-agent LLM system to manage secure and scalable data warehouse access, addressing the complexity of cross-domain data needs for AI applications. This system features specialized data-user agents for discovery, low-risk exploration, and access negotiation, alongside data-owner agents for security operations and access configuration. It incorporates advanced capabilities like query-level access control, context-aware partial data previews, and rule-based risk management, supported by continuous evaluation and feedback loops.
  • Meta - Meta addresses critical hardware reliability challenges, especially silent data corruptions (SDCs), which cause over 66% of AI training interruptions and impact inference quality across its massive infrastructure supporting models like Llama 3. They employ sophisticated detection mechanisms (Fleetscanner, Ripple, Hardware Sentinel) and multi-layered mitigation strategies (reductive triage, hyper-checkpointing, algorithmic fault tolerance) to ensure robust operation of thousands of accelerators and maintain high reliability for large-scale AI systems.
  • Modal - Modal engineered a production-ready system for generating scannable and aesthetic QR codes by addressing the inconsistency of initial generative AI models. They developed a rigorous evaluation system, validated against human judgment, to automatically measure scan rate and aesthetic quality. This was combined with inference-time compute scaling, generating multiple QR codes in parallel and selecting the best one, ultimately achieving a 95% scan rate service-level objective within sub-20-second response times.
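Conceptually, the inference-time scaling step reduces to best-of-N selection under a hard scannability gate. A minimal sketch, with `generate_qr` and `score` as placeholders for the image model and the human-validated evaluator described above:

```python
from concurrent.futures import ThreadPoolExecutor

def generate_qr(prompt: str, url: str): ...              # placeholder: one image-model sample
def score(image) -> tuple[bool, float]: ...              # placeholder: (is_scannable, aesthetic_score)

def best_of_n(prompt: str, url: str, n: int = 8):
    # Fan out n independent generations; quality variance across samples is
    # exactly what makes selection worthwhile.
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda _: generate_qr(prompt, url), range(n)))
    scored = [(c, *score(c)) for c in candidates]
    # Hard-gate on scannability, then maximize aesthetics among survivors.
    scannable = [(c, aesthetics) for c, scans, aesthetics in scored if scans]
    if not scannable:
        raise RuntimeError("no scannable candidate; retry with a larger n")
    return max(scannable, key=lambda pair: pair[1])[0]
```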
  • Moody's - Moody's developed AI Studio, a multi-agent AI platform, to automate complex financial workflows such as credit memo generation. This system employs specialized agents for parallel processing, integrating proprietary and third-party financial data, reducing a 40-hour manual task to 2-3 minutes. The platform is deployed commercially as a service for financial institutions and internally across 40,000 employees, demonstrating mature LLMOps for efficiency and competitive advantage.
  • Moody's Analytics - Moody's Analytics deployed a multi-agent AI system on AWS to provide high-stakes financial intelligence, evolving from basic RAG to address complex unstructured data and diverse customer needs. Their serverless architecture utilizes a custom orchestrator, specialized agents, and a multi-modal PDF processing pipeline with Bedrock Data Automation to handle documents and complex queries. This production system processes over 1 million tokens daily, delivering 60% faster insights and a 30% reduction in task completion times for critical financial decisions.
  • Moveworks - Moveworks developed "Brief Me," an agentic AI system integrated into their Copilot platform, enabling conversational interaction with uploaded enterprise documents (PDF, Word, PPT) for tasks like summarization, Q&A, and insight extraction, significantly reducing manual review time. This system employs a two-stage pipeline with sub-10s ingestion, an operation planner, hybrid search using custom-trained embeddings, and a map-reduce algorithm for long context handling, achieving high accuracy metrics like 97.24% correct actions and 89.21% groundedness with granular citations.
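The map-reduce step for long documents is easy to picture in code: summarize chunks in parallel (map), then answer over the distilled notes (reduce), which also makes chunk-level citations natural. A rough sketch with an `llm` placeholder and an arbitrary chunk size:

```python
from concurrent.futures import ThreadPoolExecutor

def llm(prompt: str) -> str: ...  # placeholder chat-completion call

def chunk(text: str, size: int = 8_000) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def map_reduce_answer(document: str, question: str) -> str:
    pieces = chunk(document)
    # Map: each chunk is distilled independently, so this parallelizes freely.
    with ThreadPoolExecutor(max_workers=8) as pool:
        notes = list(pool.map(
            lambda p: llm(f"Extract facts relevant to '{question}':\n{p}"), pieces))
    # Reduce: the final call sees only the notes, staying well under the
    # context limit, and chunk indices double as citation anchors.
    joined = "\n".join(f"[chunk {i}] {n}" for i, n in enumerate(notes))
    return llm(f"Using only these notes, answer '{question}' and cite chunk "
               f"numbers.\n{joined}")
```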
  • Mowie - Mowie is an AI marketing platform automating end-to-end marketing for SMBs by ingesting public and business-specific data to construct a "hierarchy of documents" (loosely structured markdown artifacts) that serve as LLM context. This architecture enables the generation of personalized content calendars and posts across channels, with customer approval via a multi-layered human-in-the-loop design and evaluation based on actual sales impact.
  • Multiplayer - Multiplayer launched a Model Context Protocol (MCP) server to provide rich engineering context from their session recording and debugging platform to AI coding agents, enabling contextually-aware assistance for bug fixing and feature development. They designed minimal, use-case-driven MCP tools that abstract complex API calls, prioritizing read-only operations for security and formatting data for LLM readability. This approach, validated through a gradual rollout, demonstrated effective integration while highlighting the need for focused tool design and security considerations in LLM agent deployments.
  • Musixmatch - LyricLens is an AI-powered platform by Musixmatch that uses Amazon Bedrock's Nova foundation models to analyze over 11 million music lyrics, extracting deep semantic meaning, themes, entities, and content moderation signals. The system processes billions of tokens, enabling real-time semantic search, trend analysis, and granular content filtering through a knowledge graph. Employing sophisticated LLMOps practices like LLM-as-judge evaluation and model distillation, the platform achieved over 30% cost savings while maintaining accuracy.
  • Myriad Genetics - Myriad Genetics addressed high costs and manual effort in healthcare document processing, which previously used Amazon Textract and Comprehend, by deploying AWS's open-source GenAI IDP Accelerator with Amazon Bedrock (Amazon Nova Pro for classification, Nova Premier for extraction). Leveraging advanced prompt engineering, multimodal processing, and few-shot learning, the solution increased classification accuracy from 94% to 98%, reduced classification costs by 77%, and decreased processing time by 80% (from 8.5 to 1.5 minutes per document). This automation achieved 90% extraction accuracy, matching human performance, and is projected to save $132K annually while reducing prior authorization processing time by two minutes per submission.
  • Navismart AI - Navismart AI developed a multi-agent AI system to automate complex, high-stakes immigration processes, starting with the stringent US regulatory framework. The solution uses a modular microservices architecture with specialized agents communicating via REST APIs, deployed on Kubernetes for scalable orchestration and fault isolation. It integrates Google OCR, stateful session management, end-to-end encryption, and human-in-the-loop capabilities, leveraging a custom orchestration framework with LangGraph for monitoring.
  • Needl.ai - Needl.ai's AskNeedl, a RAG-based enterprise knowledge system, faced user trust issues due to "hallucination-adjacent failures" like missing citations and vague answers, crucial for compliance and financial services users requiring auditability. To address this, they implemented a lightweight, high-signal manual feedback loop, categorizing failures, creating themed QA sets, and collaborating with users to refine quality. This enabled targeted optimizations to retrieval, prompting, and citation formatting without LLM retraining, prioritizing fixes for high-stakes queries to enhance output reliability and user confidence.
  • Neople - Neople developed AI-powered "digital co-workers" that automate customer support and business processes, evolving from providing AI suggestions to human agents to fully automating responses and executing multi-step actions across external systems. Its technical architecture features an agentic RAG system with multi-strategy search and specialized embeddings, robust external system integrations including browser automation, and an extensive LLM-based evaluation pipeline to ensure response quality and prevent hallucinations. This approach has significantly improved first response rates and resolution times for 200+ customers, enabling non-technical users to configure complex workflows and expanding AI use beyond customer service into finance and operations.
  • Netflix - Netflix addressed its fragmented recommendation system, comprising dozens of specialized models, by developing a unified autoregressive transformer-based foundation model for personalization. This model adapted LLM principles like multi-task learning and long context windows to learn user representations from interaction sequences, scaling to billions of parameters and demonstrating that scaling laws apply to recommendation systems. The approach yielded significant performance improvements and operational efficiencies, providing high leverage across downstream applications through centralized learning and streamlined fine-tuning.
  • Netguru - Netguru developed Omega, a multi-agent AI sales assistant embedded in Slack, to address inefficiencies from scattered information and repetitive tasks in their growing sales team. It uses AutoGen for specialized agent orchestration, runs on serverless AWS infrastructure, and integrates with tools like Google Drive and BlueDot to provide context-aware assistance for tasks such as preparing call agendas, summarizing conversations, and generating proposals. This system automates routine work and reinforces sales processes, improving efficiency and consistency directly within the team's existing workflow.
  • NewDay - NewDay, a financial services company, implemented NewAssist, a generative AI agent assist chatbot, to help customer service agents quickly find answers from 200 knowledge articles, reducing answer retrieval time from 90 to 4 seconds. The solution utilizes a RAG architecture on AWS serverless infrastructure with Anthropic's Claude 3 Haiku via Amazon Bedrock, achieving over 90% accuracy through custom data processing, iterative experimentation, and user feedback. This approach enabled cost-effective scaling, with running costs under $400 per month.
  • NFL / AWS - NFL NextGen Stats and AWS Generative AI Innovation Center developed a production fantasy football AI assistant in 8 weeks for NFL Plus subscribers, utilizing an agentic system on Amazon EKS with Amazon Bedrock and the Strands Agent framework. This system provides analyst-approved advice by accessing NFL NextGen Stats via Model Context Protocol, employing a semantic data dictionary and consolidated tools for efficient data retrieval and reasoning. It achieves sub-5-second initial responses and 90% analyst approval, handling peak Sunday traffic with zero incidents through deep observability, fallback models, and a sliding window caching strategy.
  • nib - nib, an Australian health insurance provider, enhanced contact center efficiency and agent experience by deploying AWS generative AI solutions. Key implementations include a conversational AI assistant "Nibby" built on Amazon Lex, LLM-powered call summarization reducing after-call work by 50%, and a RAG-based internal knowledge GPT for agents. These initiatives resulted in 60% chat deflection and $22 million in savings from Nibby, alongside improved agent productivity and faster information retrieval.
  • Nippon India Mutual Fund - Nippon India Mutual Fund enhanced their AI assistant's accuracy and reduced hallucinations by transitioning from naive RAG to an advanced implementation for processing complex financial documents. Their solution leveraged Amazon Bedrock Knowledge Bases, incorporating multi-pronged parsing, semantic chunking, LLM-driven query reformulation for multi-query RAG, and results reranking. This resulted in over 95% accuracy improvement, 90-95% hallucination reduction, and decreased report generation time from two days to ten minutes.
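A stripped-down version of the multi-query RAG flow might look like the sketch below: the LLM rewrites the question into several search queries, retrieval results are pooled and deduplicated, and a reranker orders the pool against the original question. The helper functions are placeholders for the managed Bedrock retrieval and reranking services named above.

```python
def llm(prompt: str) -> str: ...                                          # placeholder chat-completion call
def vector_search(query: str, top_k: int) -> list[tuple[str, str]]: ...   # placeholder: (doc_id, text) hits
def rerank(question: str, docs: list[tuple[str, str]]) -> list[tuple[str, str]]: ...

def multi_query_rag(question: str, k: int = 5) -> str:
    # Step 1: reformulation -- paraphrases catch documents the original
    # wording would miss.
    raw = llm(f"Rewrite this question as 3 different search queries, one per "
              f"line:\n{question}")
    queries = [q.strip() for q in raw.splitlines() if q.strip()][:3]

    # Step 2: retrieve for every variant and pool unique hits.
    seen, pooled = set(), []
    for q in [question] + queries:
        for doc_id, text in vector_search(q, top_k=k):
            if doc_id not in seen:
                seen.add(doc_id)
                pooled.append((doc_id, text))

    # Step 3: rerank against the *original* question, then answer grounded
    # only in the surviving context.
    context = "\n\n".join(text for _, text in rerank(question, pooled)[:k])
    return llm(f"Answer strictly from the context.\n\nContext:\n{context}\n\n"
               f"Question: {question}")
```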
  • An industry panel - A panel of AI leaders discussed the significant gap between rapid agentic AI prototyping and production deployment in enterprises. Key technical bottlenecks include pervasive data quality and governance issues, limitations in information retrieval and function calling, and challenges in ensuring security, privacy, and verifiability of agent actions at scale. The panelists argued that widespread enterprise adoption of autonomous agents is years away, largely due to these operational and organizational hurdles rather than just AI model capabilities.
  • Notion - Notion AI scales its product development for over 100 million users by dedicating 90% of its AI development time to rigorous evaluation and observability. They utilize Braintrust as their primary platform for rapid model switching, modular prompt management, and custom LLM-as-a-judge systems, supported by specialized data specialists who create targeted evaluation datasets. This infrastructure ensures consistent quality, cost optimization, and reliability across diverse features, including multilingual support and complex agentic workflows.
  • Novartis - Novartis partnered with AWS and Accenture to build a GXP-compliant, next-generation data platform, integrating diverse data sources via a modular data mesh architecture to accelerate drug development. This platform enables LLMOps for generative AI applications like clinical protocol drafting, achieving 83-87% acceleration in generating regulatory-acceptable protocols. Initial deployment in the patient safety domain demonstrated significant efficiency gains, including a 72% reduction in query execution time and 60% lower storage costs.
  • Nvidia - A panel of experts discussed deploying agentic AI, advocating for prototyping with closed-source models for speed before transitioning to open-source for compliance and cost, while favoring low-abstraction frameworks for debuggability. They emphasized standardizing on OpenTelemetry for observability across heterogeneous frameworks and allowing framework experimentation in sandboxes, with strict gates for production deployment. The discussion also noted that advanced reasoning models simplify agent architecture and expressed skepticism about low-code tools for high-precision, customer-facing applications.
  • NVIDIA / Lepton AI - Yangqing Jia's lecture outlines the evolution of AI systems and LLMOps, emphasizing the shift to "AI-native" infrastructure optimized for massive computation and tight coupling, exemplified by rack-scale GPU systems like NVIDIA NVL72. It details model advancements (MoE, test-time scaling, RL), application patterns (RAG, agentic AI for consumer/enterprise), and operational challenges including GPU supply chain management, utilization, and the need for AI-specific platforms like Ray to abstract Kubernetes complexity. The talk stresses that successful production LLMOps integrates novel AI techniques with established software engineering and operational best practices.
  • Octus - Octus migrated its Credit AI RAG application from a multi-cloud architecture (OpenAI on Azure, data on AWS) to a unified AWS architecture centered on Amazon Bedrock to address scalability, cost, and operational complexity. This involved leveraging Bedrock's managed services for embeddings, knowledge bases, and LLM inference (e.g., Cohere, Claude), alongside AWS services like S3, OpenSearch, and Textract for data ingestion and retrieval. The migration resulted in a 78% reduction in infrastructure costs, an 87% decrease in cost per question, and improved document sync times from hours to minutes, while enabling multi-tenancy and SOC2 compliance.
  • ONA - ONA developed an AI coding agent platform for highly regulated environments, addressing data security and compliance barriers that often lead to AI bans and shadow IT. The platform runs entirely within the customer's Virtual Private Cloud (VPC), using isolated, disposable virtual machines for agents and a pull-based orchestration model. This architecture ensures all code and sensitive data remain within the customer's network, leveraging client-side encryption and direct LLM provider connections without ONA accessing customer data.
  • OpenAI - OpenAI deployed an autonomous code review system, built on GPT-5-Codex models, to verify high volumes of AI-generated code exceeding human review capacity. This agentic system provides repository-wide context and code execution, is specifically trained for precision over recall to maintain developer trust, and processes over 100,000 external PRs daily. It demonstrates significant impact, with internal comments leading to code changes in over half of cases, integrating LLMs as core infrastructure for software quality.
  • OpenAI - OpenAI's Forward Deployed Engineering (FDE) team embeds with enterprise customers to build and deploy production-grade LLM applications, addressing complex, high-value problems. They employ a methodology of deep domain understanding, evaluation-driven development, hybrid deterministic-probabilistic architectures, and extended trust-building phases to achieve significant efficiency gains and high user adoption. This approach allows OpenAI to extract generalizable product insights and frameworks, like Agent Kit, from specific customer challenges, ensuring successful LLM integration at scale.
  • OpenAI / Manus - A Carnegie Mellon study critically examined commercial AI agents, categorizing them into orchestration, creation, and insight types based on market analysis. Empirical user testing with 31 participants revealed five critical usability barriers, including misalignment with user mental models, premature trust assumptions, and overwhelming communication overhead, highlighting a significant gap between marketed capabilities and practical user experience. This underscores the need for LLMOps to prioritize human-centered design and robust UX testing alongside technical performance metrics for successful agent deployment.
  • Oso - Oso's platform tackles production-ready AI agent challenges by implementing deterministic, code-based guardrails through LangChain 1.0 middleware, establishing a three-component identity model (user, agent, session) for granular authorization. This system proactively filters tools and reactively blocks actions based on session context to prevent prompt injection and data exfiltration, enabling real-time governance with monitoring, dynamic tool management, and quarantine capabilities for robust production control.
  • Otomoto - Prosus developed an AI agent for Otomoto car dealers to optimize listings from complex platform data, but an initial chat-based agent suffered from low engagement due to "chat fatigue." To address this, they implemented a dynamic UI with context-aware action buttons, interactive responses, and purpose-built data aggregation tools using CSV for token efficiency, alongside streaming for perceived latency reduction. This iterative approach significantly boosted user engagement, demonstrating that guided interactions via buttons led to more follow-up questions than open-ended chat.
  • Otto - Otto integrates autonomous AI agents directly into spreadsheet cells, enabling non-technical users like finance teams to automate complex workflows by treating each cell as an independent agent capable of executing tasks and processing unstructured data into structured outputs. Their LLMOps strategy includes a three-tiered model selection framework to optimize cost and performance for different task complexities, alongside pragmatic evaluation methods like internal task-based testing and canary deployments with customers to validate real-world utility.
  • Outropy - Outropy, a failed AI startup, provided key architectural lessons for production GenAI systems by treating workflows as multi-step data pipelines and agents as coarse-grained distributed objects. Their sophisticated LLMOps included an event-sourced memory architecture using semantic events and a probabilistic graph database, orchestrated by durable workflows for resilience against unreliable LLM APIs. This approach highlighted the need for rigorous software engineering principles adapted to AI's unique stateful and non-deterministic characteristics, challenging traditional microservices and Twelve-Factor App paradigms.
  • Owkin - Owkin developed a healthcare copilot over four months to assist biology and life science researchers in navigating complex data, addressing challenges like strict regulations and semantic ambiguity. This system features a text-to-SQL tool for structured biological databases, leveraging Polars with enhanced metadata and Pydantic for robust query generation, and a RAG-based literature search tool for PubMed, employing hybrid search, metadata filtering, and reranking. Deployed with Langfuse/OpenTelemetry monitoring and custom expert-validated benchmarks, the copilot aims to augment research capabilities while managing costs and ensuring data compliance.
  • Patho AI - Patho AI developed a Knowledge Augmented Generation (KAG) system, extending traditional RAG by integrating structured knowledge graphs to perform complex reasoning for competitive intelligence and strategic advisory. This system addresses limitations of vector-based RAG in numerical reasoning and multi-hop queries by using a "wisdom graph" architecture, orchestrated via Node-RED with Neo4j as the graph database, achieving 91% accuracy in internal benchmarks for structured data extraction. The KAG system employs a hybrid approach for knowledge graph population, combining automated LLM extraction with human expert validation to ensure accuracy and maintain domain expertise.
  • Payfit / Alan - Payfit and Alan deployed Dust.tt's enterprise AI platform to enhance productivity across various departments using LLM-powered assistants. The implementation leveraged retrieval-augmented generation (RAG), multi-agent workflows, and an abstraction layer for seamless LLM switching (e.g., Claude 3.5 Sonnet) to integrate with existing tech stacks. This strategy resulted in approximately 20% productivity gains in areas like sales, customer support, and HR, supported by a structured adoption framework and continuous iteration.
  • PayPay - PayPay developed GBB RiskBot, an automated RAG-enhanced code review bot, to address inconsistent code quality and knowledge silos in its rapidly growing codebase. The system ingests historical incident data, embeds it using OpenAI's text-embedding-ada-002, and stores it in ChromaDB. When a pull request is opened, the bot retrieves relevant incidents via semantic search and uses GPT-4o-mini to generate contextual risk comments, operating at a remarkably low cost of approximately $0.59 per month for hundreds of analyses.
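Because the entry names its exact components, a condensed sketch of the flow is possible, though the collection layout and prompts below are our own illustration rather than PayPay's code:

```python
import chromadb
from openai import OpenAI

oai = OpenAI()
collection = chromadb.PersistentClient(path="./incident_db").get_or_create_collection("incidents")

def embed(text: str) -> list[float]:
    return oai.embeddings.create(model="text-embedding-ada-002", input=text).data[0].embedding

def index_incident(incident_id: str, postmortem: str) -> None:
    # Historical incident write-ups become the retrieval corpus.
    collection.add(ids=[incident_id], documents=[postmortem], embeddings=[embed(postmortem)])

def review_pr(diff: str) -> str:
    # Retrieve the incidents most semantically similar to the proposed change.
    hits = collection.query(query_embeddings=[embed(diff)], n_results=3)
    past = "\n---\n".join(hits["documents"][0])
    resp = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
            f"Past incidents:\n{past}\n\nPR diff:\n{diff}\n\n"
            "Flag any change that resembles a past incident, with a short risk note."}],
    )
    return resp.choices[0].message.content
```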
  • PayU - PayU, a regulated financial services provider, implemented a secure enterprise AI assistant to mitigate risks from employees using public AI tools, which violated data residency and security requirements. Their solution leverages Amazon Bedrock for foundation models, Open WebUI for the frontend, and AWS PrivateLink for private VPC connectivity, ensuring data remains within their regulated environment. This multi-agent system, featuring RAG and text-to-SQL capabilities with granular access control and Bedrock Guardrails, reportedly improved business analyst productivity by 30% while maintaining strict compliance.
  • PerformLine - PerformLine developed an AI-powered marketing compliance system to efficiently monitor complex product pages with overlapping content. Leveraging a serverless AWS architecture with Amazon Bedrock (Nova Pro, Claude Haiku), they implemented a multi-pass inference strategy and prompt management to extract contextual information from 1.5-2 million pages daily. Intelligent content deduplication cut human evaluation workload by 15% and overall analyst workload by more than 50%.
  • Perk - Perk, a business travel platform, automated proactive hotel payment verification calls using an AI voice agent system to prevent virtual credit card (VCC) failures that caused poor customer experiences. This system, built with OpenAI LLMs, text-to-speech, and Twilio, replaced 10,000 weekly manual calls by autonomously contacting hotels to confirm VCC receipt and request payment processing. Through iterative prompt engineering and a multi-stage conversational design, it now handles tens of thousands of calls weekly across multiple languages, achieving human-level performance and providing valuable operational insights.
  • Personize.ai - Personize.ai developed a multi-agent personalization engine for batch processing that addresses challenges of consistency, cost, and deep customer understanding inherent in traditional RAG/function calling for large customer databases. Their solution, "Cortex," employs a proactive memory system to infer and synthesize customer insights into standardized, shared attributes, enabling centralized recall and compressed context for all agents. This approach facilitates rapid, deep customer understanding and generates high-quality, domain-specific personalized content at scale, significantly reducing deployment time and operational costs for autonomous personalization.
  • Petco - Petco transformed its contact center, handling over 10,000 daily interactions, by consolidating on Amazon Connect and deploying AI/LLM capabilities to balance cost efficiency with customer satisfaction. They implemented call summarization, automated QA, AI-supported agent assistance, and a generative AI chatbot (Amazon Q in Connect) for high-frequency use cases like order status and grooming call routing. This resulted in reduced agent handle times, improved routing efficiency, and enhanced self-service, with plans for further expansion into conversational IVR and direct appointment booking.
  • PGA Tour - PGA Tour addressed the challenge of generating timely, engaging, and accurate golf content at scale from vast data by implementing two AWS Bedrock-based AI systems. These systems include a multi-agent platform producing up to 800 articles weekly across eight content types and a real-time shot commentary system, both leveraging sophisticated validation and queue-based architectures for accuracy and efficiency. This solution achieved a 95% cost reduction, enabled content publication within 5-10 minutes of events, and generated billions of annual page views, significantly boosting fan engagement and SEO.
  • Philips - Philips partnered with AWS to develop a cloud-native integrated diagnostics platform, leveraging AWS Health Imaging to manage 134 petabytes of multi-modal medical imaging data, including gigapixel pathology slides. This platform integrates AI-assisted algorithms for tasks like quality control and pre-diagnosis, and supports multi-modal model training on consolidated datasets, dramatically reducing pathology report times from over 11 hours to 36 minutes. The architecture also incorporates generative AI via AWS HealthScribe for clinical note generation, enabling real-time collaboration and a unified patient view across specialties.
  • Picnic - Picnic automated grocery inventory counting using a multimodal LLM-based computer vision system in their fulfillment center, replacing inefficient manual checks. They deployed camera setups to capture tote images, leveraging Google Gemini models enhanced by domain-specific supply chain reference images for accurate counting. Cost-effectively, they fine-tuned mid-tier models to achieve high performance, deploying via a Fast API service with LiteLLM for model interchangeability and continuous validation through selective manual checks.
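The model-interchangeability point is concrete: LiteLLM accepts OpenAI-format multimodal messages, so swapping Gemini tiers or vendors is a one-string change. A hedged sketch, in which the model name, prompt, and integer-only response convention are assumptions:

```python
import base64
import litellm

def count_items(image_path: str, sku_description: str) -> int:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = litellm.completion(
        model="gemini/gemini-1.5-flash",  # interchangeable via LiteLLM routing
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Count the units of '{sku_description}' in this tote. "
                         "Reply with an integer only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    # Production systems would validate this against selective manual checks.
    return int(response.choices[0].message.content.strip())
```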
  • Pictet Asset Management - Pictet Asset Management initially implemented a centralized AWS Bedrock architecture with a custom "Gov API" to govern diverse generative AI use cases and ensure compliance in a regulated financial environment. This centralized model, however, encountered significant scalability issues, resource contention, and complex cost allocation problems as the number of projects grew. Consequently, they pivoted to a federated architecture where individual teams manage their own Bedrock services, while a central team maintains oversight through cross-account monitoring and standardized guardrails, enabling better scalability, cost ownership, and faster iteration within compliance boundaries.
  • Pinterest - Pinterest developed a hybrid LLM-powered system to identify user journeys (long-term goals) from user activity, aiming to transform into an "inspiration-to-realization" platform despite limited training data. The system uses streaming inference to extract and embed keywords from user data, clusters them into journey candidates, and leverages LLMs for tasks like journey naming, expansion (using GPT for data generation and fine-tuning Qwen for inference), and relevance evaluation, complemented by traditional ML for ranking. This approach led to significant improvements, including an 88% higher email click rate and 32% higher push open rate for journey-aware notifications, demonstrating the value of understanding user intent.
  • Pinterest - Pinterest democratized GenAI by implementing a multi-layered LLMOps platform with a multi-vendor model strategy, a centralized proxy for operational control, and a "Prompt Hub" for streamlined prompt development, evaluation, and one-button production deployment. This platform includes an "AutoPrompter" system that uses LLM agents for automated prompt optimization through iterative critique, enabling rapid iteration and empowering non-technical employees to achieve significant performance improvements and cost reductions.
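An AutoPrompter-style loop can be sketched in a few lines: score a prompt on a labeled dev set, have an LLM critique the failures, and ask it to rewrite the prompt, keeping the best scorer. Pinterest's production system is surely richer; `llm` and the prompt template convention (an `{input}` placeholder) are assumptions here.

```python
def llm(prompt: str) -> str: ...  # placeholder chat-completion call

def evaluate(prompt: str, devset: list[tuple[str, str]]) -> tuple[float, list[str]]:
    """Run the candidate prompt over (input, expected) pairs and collect failures."""
    failures = []
    for inp, expected in devset:
        out = llm(prompt.format(input=inp))  # prompt must contain "{input}"
        if out.strip() != expected:
            failures.append(f"input={inp!r} expected={expected!r} got={out!r}")
    return 1 - len(failures) / len(devset), failures

def autoprompt(seed_prompt: str, devset, rounds: int = 5) -> str:
    best_prompt, best_score = seed_prompt, 0.0
    prompt = seed_prompt
    for _ in range(rounds):
        score, failures = evaluate(prompt, devset)
        if score > best_score:
            best_prompt, best_score = prompt, score
        if not failures:
            break
        # The critique step is itself an LLM call: show it the failures and
        # ask for a revised prompt.
        prompt = llm("These cases failed:\n" + "\n".join(failures[:10]) +
                     "\n\nCritique this prompt and rewrite it to fix them. "
                     f"Return only the new prompt.\n\nPROMPT:\n{prompt}")
    return best_prompt
```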
  • Pinterest - Pinterest integrated LLMs into their search relevance pipeline using a two-tier architecture: a high-performing cross-encoder teacher model (e.g., Llama 8B, achieving 12-20% improvement) and an efficient bi-encoder student model optimized for production via knowledge distillation. This system processes billions of monthly searches across many languages, incorporating visual captions and user engagement signals to enhance content understanding. The deployment improved search relevance metrics globally and generated valuable reusable semantic embeddings for other platform features.
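The teacher-student setup distills a slow cross-encoder (which reads query and pin together) into a bi-encoder whose pin embeddings can be precomputed, leaving only a dot product at query time. A rough PyTorch sketch, with `teacher_score` and the student encoders as placeholders for the models in the study:

```python
import torch
import torch.nn.functional as F

def teacher_score(query: str, doc: str) -> float: ...  # placeholder cross-encoder relevance score

def distillation_step(student_q, student_d, optimizer, queries, docs):
    """One step of score distillation; student_q/student_d are placeholder
    encoder modules mapping a list of texts to (batch, dim) tensors."""
    with torch.no_grad():
        # The teacher is accurate but too slow to run per request at
        # billions of searches per month.
        targets = torch.tensor([teacher_score(q, d) for q, d in zip(queries, docs)])
    # The student encodes each side independently, so document embeddings
    # can be precomputed offline and reused across features.
    q_emb = F.normalize(student_q(queries), dim=-1)
    d_emb = F.normalize(student_d(docs), dim=-1)
    preds = (q_emb * d_emb).sum(dim=-1)   # cosine similarity per pair
    loss = F.mse_loss(preds, targets)     # match the teacher's relevance scores
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```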
  • Portia AI / Rift.ai - Portia AI and Rift.ai address production challenges for AI agents by implementing robust guardrails, such as structured task-based workflows and RAG-based conditional example prompting. Both companies prioritize context engineering, finding that smaller, precisely managed context windows with dynamic tool-based retrieval outperform large context windows for precision. Security is managed through human-in-the-loop approvals, explicit access controls, and a recognized need for agent-specific identity and authorization standards.
  • Portola - Portola developed Tolan, an AI companion app, facing the challenge of ensuring subjective conversation quality and emotional authenticity, which traditional automated LLM evaluations could not adequately measure due to the system's complex, multimodal architecture. Their LLMOps workflow empowered non-technical domain experts to identify issues from production logs, curate problem-specific datasets, manually iterate on prompts in a playground, and deploy changes directly via a prompts-as-code infrastructure. This approach significantly accelerated prompt iteration velocity and led to systematic qualitative improvements in conversation quality, memory, and brand voice by centering human judgment.
  • Predibase / Rubrik - Predibase, an LLMOps platform specializing in model fine-tuning and efficient inference via frameworks like LoRAX for multi-LoRA serving, was acquired by Rubrik, a data security and governance company. This merger addresses the critical challenge of over 50% of GenAI pilots failing to reach production by integrating Predibase's advanced post-training and serving capabilities with Rubrik's secure data infrastructure. The combined platform aims to provide an end-to-end solution for deploying enterprise generative AI applications securely, with high model quality, low latency, and optimized cost.
  • A private university - A private university implemented a privacy-preserving chatbot using LiteLLM's proxy server to manage multi-model access, cost control, and governance for students and employees. This OpenAI-compatible gateway enabled unified API access, automatic cost tracking, budgeting, and load balancing across various LLM providers. However, the implementation faced limitations with complex budgeting requirements, delayed support for new provider features, and stability issues with LiteLLM's rapidly evolving features, necessitating custom workarounds.
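One reason this gateway pattern is popular is that client code stays vendor-neutral: because the LiteLLM proxy speaks the OpenAI API, applications use the standard SDK while the institution swaps providers, budgets, and models server-side. The base URL, key, and model alias below are illustrative:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://llm-gateway.example.edu/v1",  # illustrative LiteLLM proxy endpoint
    api_key="sk-student-team-key",                  # illustrative virtual key with its own budget
)

response = client.chat.completions.create(
    model="campus-default",  # an alias the proxy maps to any underlying provider
    messages=[{"role": "user", "content": "Explain eigenvalues in two sentences."}],
)
print(response.choices[0].message.content)
```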
  • Product Talk - Teresa Torres developed an AI interview coach to provide automated, multi-dimensional feedback on student interview transcripts for product discovery training, evolving from simple prototypes to a production system using Anthropic API, Zapier, and AWS Lambda. A core technical achievement was implementing a rigorous evaluation methodology, including systematic error analysis, code-based evals, and LLM-as-judge evals, to continuously improve feedback quality and ensure alignment with course teachings. This system, integrated with an LMS and leveraging Python/Jupyter for evals, now processes interviews with ongoing monitoring and is scaling to handle real customer data through a partnership for SOC 2 compliance.
  • Product Talk - Teresa Torres developed an AI interview coach using a serverless AWS architecture with multiple LLM calls (Anthropic, OpenAI) to analyze student interview transcripts and provide detailed feedback on specific interviewing techniques. A key aspect was her comprehensive evaluation framework, which combined code-based assertions, LLM-as-judge evals, and human annotation, to iteratively refine prompt engineering and ensure the coach's feedback aligned with her pedagogical standards. This iterative evaluation loop, including prompt caching for cost efficiency, enabled her to deploy a reliable educational tool despite starting with limited AI engineering experience.
  • Prosus - Prosus developed Toqan, an internal AI productivity platform that evolved from a Slack bot to a comprehensive system serving over 30,000 employees across 100+ portfolio companies. Its architecture supports agent-based systems with tool calling for flexible capability expansion, integrates multiple LLMs via various interfaces (web, API), and focuses on enterprise system integrations to automate complex, multi-step workflows. The platform employs sophisticated LLMOps for continuous model evaluation, versioning, and a memory system for operational efficiency in task execution.
  • PromptLayer - PromptLayer developed a multi-agent AI system for hyper-personalized email campaign automation to overcome low engagement in outbound marketing. This system employs three specialized agents: one for lead research and scoring using GPT-4o-mini, another for subject line generation with tiered model usage, and a third for crafting tailored multi-touch email sequences, all managed via PromptLayer's LLMOps platform enabling non-technical prompt iteration. Integrating with sales tools like Apollo and HubSpot, it achieves 50-60% open rates and 7-10% positive reply rates, generating 4-5 qualified demos daily.
  • Propel - Propel implemented two AI-powered systems to help 200,000 monthly SNAP users resolve benefit interruptions: an AI-generated code system for structured triage flows in California and a nationwide conversational AI assistant using Decagon. These systems, deployed with robust LLMOps practices including careful monitoring and human escalation, achieved a 53% user uptake and significantly faster benefit restoration. The initiative successfully reduced program churn and administrative burden by leveraging LLMs for both development acceleration and direct user interaction, including dynamic multilingual support.
  • Propel Holdings / Xanterra Travel Collection - Propel Holdings (fintech) and Xanterra Travel Collection (travel) deployed Cresta's LLM-powered AI agents to scale contact center operations and manage high volumes of routine inquiries. They implemented a phased approach, starting with FAQ-based autonomous agents and agent assist, then integrating APIs for transactional capabilities, achieving chat containment rates up to 90% and voice containment up to 30%. This enabled 24/7 coverage, rapid deployment of multiple agents, and redeployment of human agents to complex tasks, demonstrating mature LLMOps practices like data-driven tuning and continuous monitoring.
  • PropHero - PropHero developed a multi-agent conversational AI system on Amazon Bedrock to provide scalable, multilingual property investment advice, addressing complex multi-turn conversations in Spanish. This system employs strategic model selection across specialized agents (e.g., Claude 3.5 Haiku, Amazon Nova Pro/Lite) and an integrated RAG with Bedrock Knowledge Bases and Cohere Rerank for optimal cost-performance and accuracy. It features a continuous evaluation system using an LLM-as-a-judge pattern, achieving 90% goal accuracy and a 60% reduction in AI costs while reducing customer service workload by 30%.
  • Prosus - Prosus, a global e-commerce company, is deploying 30,000 internal AI agents by March 2025 using its proprietary Toqan platform, enabling non-technical employees to build agents with varying levels of system access and tool integration. These agents automate tasks from natural language data querying (integrating with Databricks/Tableau) to orchestrating complex workflows (e.g., a Restaurant Account Executive agent performing the work of 30 FTEs), aiming to enhance employee productivity, quality, and independence across the organization.
  • Providence Health System - Providence Health System automated the processing of 40 million annual healthcare referral faxes, which previously caused multi-month backlogs and delayed patient care due to manual transcription into Epic EHR. Their LLMOps solution, built on Databricks with MLflow, uses Azure AI Document Intelligence for OCR and OpenAI's GPT-4 for information extraction, systematically experimenting with models and prompts to handle diverse document types. This system now provides real-time processing, eliminating backlogs and freeing clinical staff for direct patient care across their extensive network.
  • Prudential Financial - Prudential Financial, in partnership with AWS, developed a microservices-based multi-agent platform to streamline workflows for 100,000+ financial advisors, replacing interactions with dozens of disparate IT systems. This platform features an orchestration agent that dynamically routes natural language queries to specialized sub-agents (e.g., Quick Quote, Forms, Product) while maintaining context and enforcing governance. The architecture emphasizes modularity, A2A/MCP protocols, and a centralized LLM gateway, reducing time-to-value for new AI solutions from 6-8 weeks to 3-4 weeks while addressing complex context management and performance challenges at enterprise scale.
  • PwC / AWS - PwC and AWS integrated Automated Reasoning checks into Amazon Bedrock Guardrails to enable mathematically verifiable LLM outputs for responsible AI deployment in regulated industries. This system uses formal mathematical verification, encoding domain knowledge into logic rules, as a secondary validation layer to ensure compliance and auditability, moving beyond traditional probabilistic methods. It has been applied to use cases like EU AI Act compliance in financial services, pharmaceutical content review, and utility outage management, demonstrating enhanced accuracy and traceable reasoning paths for critical AI decisions.
  • PyConDE / PyData - A volunteer team for PyConDE/PyData conferences used AI coding agents like Claude and Gemini over three months to automate operational tasks such as ticketing, marketing, and video production for up to 1,500 attendees. Agents successfully handled well-documented API integrations, LinkedIn scraping, and automated video cutting using computer vision for rapid content turnaround. However, they struggled with multi-step workflows, data normalization, and maintaining code quality, requiring tight human oversight and frequent commits, ultimately providing value but not 10x productivity gains.
  • Quotient AI - Quotient AI automates the improvement of production AI agents by transforming real-world telemetry (agent traces) into reinforcement learning signals. Their platform ingests these traces, uses specialized models to analyze trajectory quality, and then trains open-source models, providing an OpenAI-compatible API endpoint for the improved, specialized agent. This process dramatically reduces the agent improvement cycle from weeks or months to approximately one hour, enabling continuous, automated optimization without manual overhead.
  • Radian Group - Radian Group deployed the Radian Virtual Assistant (RVA), an enterprise GenAI solution, to resolve inefficient knowledge access for operations and underwriting teams, who previously spent excessive time manually searching extensive documentation. Leveraging AWS Bedrock Knowledge Base for retrieval-augmented generation (RAG), the RVA provides natural language querying across multiple enterprise data sources like SharePoint and Confluence, while ensuring robust security, compliance, and traceability through features like role-based access control and citation tracking. This implementation achieved a 70% reduction in guideline triage time, a 30% faster training ramp-up for new employees, and a 96% positive user feedback rate, demonstrating significant operational efficiency and user satisfaction.
  • Ragas - This case study details Ragas' methodology for systematically improving LLM applications, addressing the common challenge where AI engineers lack objective evaluation frameworks and rely on slow, subjective human review. Their solution involves an evaluation-driven development approach encompassing dataset curation, human annotation, scaling with LLM-as-judge systems, error analysis, and structured experimentation. This enables teams to move from subjective "vibe checks" to data-driven improvements, enhancing AI application performance and user satisfaction through continuous feedback loops.
  • Railway - This case study demonstrates an AI-powered autonomous infrastructure monitoring and self-healing system that detects production issues like memory leaks or slow queries. It leverages durable workflows to gather comprehensive observability data (metrics, logs), which LLMs then analyze to generate diagnostic plans. An OpenCode agent subsequently uses these plans to automatically create pull requests with proposed code fixes, aiming to automate incident remediation rather than just alerting.
  • Ramp - Ramp developed an MCP server to enable natural language querying of business spend data, initially exposing their API directly to Claude. Facing scaling issues with large datasets due to context window limits and high token usage, they pivoted to a SQL-based architecture. This involved a local in-memory SQLite database and an ETL pipeline to transform API data, allowing Claude to efficiently query tens of thousands of transactions by generating SQL queries, significantly reducing token usage and improving performance.
  • Ramp - Ramp addressed a data bottleneck, where analyst-mediated data questions caused significant delays, by deploying an agentic AI system called Ramp Research. This system, integrated into Slack, leverages programmatic tools to explore data across dbt, Looker, and Snowflake, combining structured metadata with domain documentation to provide self-service analytics. It processed over 1,800 questions from 300 users in six weeks, increasing data question volume by 10-20x and enabling faster, democratized data access.
  • Ramp - Ramp, a financial services company, replaced its fragmented, inconsistent industry classification system with an in-house Retrieval-Augmented Generation (RAG) model to standardize business categorization using NAICS codes. This RAG system leverages embeddings for initial retrieval and an LLM for final prediction, constrained to valid NAICS outputs, and employs a sophisticated two-prompt approach to manage context and improve accuracy. The solution achieved significant improvements in classification accuracy (up to 60% in retrieval, 5-15% in fuzzy accuracy), enhanced auditability, and provided consistent, high-quality industry data crucial for compliance, risk assessment, and sales targeting.
  • Ramp - Ramp developed an LLM-based agent to automate correction of misclassified credit card transactions, a process previously requiring hours of manual intervention due to ambiguous payment processor data. This agent leverages multimodal RAG, combining transaction details, receipt images, and user input with dual-strategy embedding retrieval, to provide rich context for the LLM to create, update, or reassign merchant classifications under strict guardrails. The system now handles nearly 100% of requests in under 10 seconds, achieving a 99% classification improvement rate and drastically reducing operational costs.
  • Ramp - Ramp developed LLM agents for automated expense management, particularly for expense approvals, prioritizing user trust in a high-stakes financial domain. Their technical approach emphasizes explainable reasoning with verifiable citations, categorical uncertainty handling via a "Needs Review" state, and user-configurable autonomy levels through a workflow builder. This system, supported by collaborative context management and robust evaluation, now autonomously handles over 65% of expense approvals.
  • Ramp - Ramp developed an MCP server using FastMCP to enable natural language querying of business financial data, initially exposing RESTful API endpoints as tools. To overcome scalability and token usage issues, they evolved the system to an in-memory SQLite database, transforming API data into SQL rows and having the LLM generate SQL queries for efficient data analysis (sketched below). This architecture significantly improved performance, allowing analysis of tens of thousands of spend events, though the case study highlights ongoing challenges with latency and reliability in production LLM systems.
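Here is a minimal sketch of that SQL-over-SQLite pattern, assuming the official `mcp` Python SDK's FastMCP helper. The table schema, loader, and tool are invented stand-ins, not Ramp's actual code.

```python
# Sketch: flatten API data into in-memory SQLite, then expose one SQL tool so
# the model queries rows instead of paging raw JSON through its context window.
import sqlite3
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("spend-analytics")
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute(
    "CREATE TABLE transactions ("
    " id TEXT PRIMARY KEY, merchant TEXT, amount_cents INTEGER,"
    " category TEXT, occurred_at TEXT)"
)

def load_transactions(rows: list[tuple]) -> None:
    """ETL step: insert flattened API responses as SQL rows at startup/refresh."""
    db.executemany("INSERT OR REPLACE INTO transactions VALUES (?,?,?,?,?)", rows)
    db.commit()

@mcp.tool()
def query_spend(sql: str) -> list[dict]:
    """Run a read-only SQL query over the transactions table."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT queries are permitted")
    cur = db.execute(sql)
    cols = [c[0] for c in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchmany(200)]  # cap token cost

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```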
  • Ramp - Ramp deployed LLM-powered agents for automated expense management, achieving over 65% automated approval rates by focusing on building user trust in a high-stakes financial environment. This was accomplished through transparent decision explanations linked to policy documents, a three-tier uncertainty handling framework (Approve/Reject/Needs Review) instead of confidence scores, and user-controlled autonomy levels. The system also incorporates collaborative context management with user-editable policies and a progressive trust-building deployment strategy.
  • Ramp - Ramp implemented an LLM-powered AI agent to automate merchant classification and transaction matching, addressing issues caused by cryptic payment processor data that previously required hours of manual intervention. This agent leverages multimodal RAG, combining transaction details, receipt image data via computer vision, user memos, and vector embeddings for similar merchants, enabling sub-10-second processing with 99% classification improvement. Robust guardrails and an "LLM as judge" evaluation system ensure accuracy and reliability, reducing operational costs from hundreds of dollars to cents per request.
  • Ramp - Ramp resolved inconsistent and inaccurate customer industry classifications by developing an in-house RAG system to standardize classifications using NAICS codes. This system leverages embedding-based retrieval to generate relevant NAICS recommendations, followed by a two-stage LLM prompting process to select the most accurate classification (illustrated below). The implementation significantly improved data quality, consistency, and auditability, providing granular and flexible industry insights crucial for various business functions.
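A compressed illustration of the retrieve-then-select flow both Ramp NAICS entries describe. The `embed` and `complete` callables stand in for whatever embedding and chat models are in use, the three-code catalog is a toy, and none of this is Ramp's actual code.

```python
# Stage 1: embedding retrieval over NAICS code descriptions.
# Stage 2: LLM selection, with the output constrained to retrieved codes.
from typing import Callable
import numpy as np

NAICS = {
    "541511": "Custom Computer Programming Services",
    "722511": "Full-Service Restaurants",
    "236115": "New Single-Family Housing Construction",
}

def classify(description: str,
             embed: Callable[[str], np.ndarray],  # assumed unit-norm vectors
             complete: Callable[[str], str],      # chat-model call
             k: int = 5) -> str:
    q = embed(description)
    sims = {code: float(q @ embed(desc)) for code, desc in NAICS.items()}
    candidates = sorted(sims, key=sims.get, reverse=True)[:k]

    menu = "\n".join(f"{c}: {NAICS[c]}" for c in candidates)
    answer = complete(
        f"Business: {description}\n"
        f"Pick the single best NAICS code from this list:\n{menu}\n"
        "Reply with the 6-digit code only."
    ).strip()
    # Constrain to valid outputs: fall back to top retrieval on a bad reply.
    return answer if answer in candidates else candidates[0]
```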
  • Ref - Ref is a commercial Model Context Protocol (MCP) server providing precise documentation search for AI coding agents, utilizing a RAG pipeline with Turbopuffer for indexing and an expensive web crawling layer for continuous updates. Its credit-based pricing ($0.009/search, $9/month minimum for 1,000 credits) covers both variable search costs and substantial fixed indexing expenses, catering to diverse user volumes from individual developers to high-volume agents. This model has successfully scaled to thousands of weekly users and hundreds of paying subscribers, validating its production and economic viability in a new market.
  • Relevance AI - Relevance AI implemented DSPy-powered self-improving agents for automated outbound sales email generation, integrating a human-in-the-loop feedback mechanism where human-approved outputs continuously update DSPy's training data (see the sketch below). This system leverages DSPy optimizers to dynamically refine prompts, achieving 80% human-quality emails and a 50% reduction in agent development time. The architecture employs caching and parallel processing for 1-2 second response times, with performance evaluated using SemanticF1 scores.
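A minimal sketch of that human-in-the-loop loop in DSPy. It assumes a recent DSPy release (`dspy.LM`, `dspy.BootstrapFewShot`, `dspy.evaluate.SemanticF1`); the signature, model name, and optimizer choice are illustrative, not Relevance AI's configuration.

```python
# Human-approved outputs become training examples; an optimizer periodically
# recompiles the email generator against them.
import dspy
from dspy.evaluate import SemanticF1

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model

class DraftEmail(dspy.Signature):
    """Write a short, personalized outbound sales email."""
    prospect_info: str = dspy.InputField()
    response: str = dspy.OutputField(desc="the email body")  # named for SemanticF1

generate = dspy.ChainOfThought(DraftEmail)

approved: list[dspy.Example] = []  # grows as humans approve outputs

def record_approval(prospect_info: str, email: str) -> None:
    approved.append(
        dspy.Example(prospect_info=prospect_info, response=email)
        .with_inputs("prospect_info")
    )

def recompile():
    """Re-optimize prompts/few-shot examples against accumulated approvals."""
    optimizer = dspy.BootstrapFewShot(metric=SemanticF1())
    return optimizer.compile(generate, trainset=approved)
```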
  • Rest - Rest developed an AI-powered sleep coach to deliver the Cognitive Behavioral Therapy for Insomnia (CBTI) protocol, addressing the inaccessibility of traditional CBTI due to high costs and long waitlists. The solution leverages OpenAI's GPT-4, primarily through a voice-first interface via Vapi, incorporating Retrieval-Augmented Generation (RAG), a multi-layered memory system for personalization, and dynamic agenda generation to guide users through an 8-week program. This iterative approach, driven by rigorous error analysis and domain expert feedback, has led to significant user adoption, with voice interactions becoming the preferred modality for intimate and effective coaching.
  • Rio Tinto - Rio Tinto developed a GenAI knowledge assistant for technical training in mining operations to provide quick, accurate access to specialized institutional knowledge. This assistant uses a hybrid RAG architecture on Amazon Bedrock, integrating vector search with knowledge graphs to overcome limitations of traditional vector-only RAG (a toy sketch of the idea follows below). The hybrid system demonstrated superior performance, retrieving significantly more relevant documents and achieving higher context quality and entity recall with greater consistency compared to vector-only approaches.
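To make the hybrid idea concrete, here is a toy sketch: vector-search hits are expanded through entity links in a knowledge graph, pulling in documents that pure similarity would miss. The documents, entities, and co-mention graph are all invented.

```python
# Expand vector hits one hop through a toy knowledge graph of entity links.
from collections import defaultdict

DOC_ENTITIES = {
    "doc_haul_truck": {"haul truck", "brake system"},
    "doc_brakes": {"brake system", "hydraulic fluid"},
    "doc_fluids": {"hydraulic fluid"},
}
GRAPH: dict[str, set[str]] = defaultdict(set)
for ents in DOC_ENTITIES.values():
    for e in ents:
        GRAPH[e] |= ents - {e}  # co-mention edges stand in for a curated KG

def hybrid_retrieve(vector_hits: list[str], hops: int = 1) -> list[str]:
    entities = set().union(*(DOC_ENTITIES[d] for d in vector_hits))
    for _ in range(hops):
        entities |= set().union(*(GRAPH[e] for e in entities))
    graph_docs = [d for d, ents in DOC_ENTITIES.items()
                  if ents & entities and d not in vector_hits]
    return vector_hits + graph_docs  # vector hits first, then KG discoveries

print(hybrid_retrieve(["doc_haul_truck"]))
# ['doc_haul_truck', 'doc_brakes', 'doc_fluids']
```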
  • Ripple - Ripple developed an AI-powered multi-agent platform on AWS to automate monitoring and troubleshooting of its decentralized XRP Ledger, addressing the challenge of C++ experts manually analyzing petabytes of logs over 2-3 days per incident. This solution leverages Amazon Bedrock, Neptune Analytics for graph-based RAG on C++ code, and CloudWatch, orchestrating specialized agents to correlate code and logs. The platform transforms multi-day manual investigations into conversational insights delivered in minutes, significantly improving operational efficiency and removing critical expert dependencies.
  • Rippling - Rippling is deploying production AI agents across its enterprise HR, IT, and finance platform to assist administrators and employees with complex workflows like payroll troubleshooting and sales briefing. The company evolved its AI strategy from simple summarization to more flexible deep agent architectures, leveraging LangChain and LangSmith for development, orchestration, and tracing. This approach enables agents to handle nuanced, context-dependent queries and enhance productivity across their integrated platform while maintaining enterprise-grade reliability.
  • Riskspan - Riskspan developed a GenAI solution on AWS, leveraging Claude LLM and RAG, to automate the analysis of complex private credit deals from diverse unstructured documents. This system dynamically generates executable code for investment waterfall modeling, replacing a 3-4 week manual process. The implementation reduced deal processing time to 3-5 days, cut per-deal costs 90-fold to under $50, and achieved 10x scalability, unlocking a $9 trillion market opportunity.
  • Robinhood Markets - Robinhood developed an LLMOps platform for financial AI agents, employing a hierarchical tuning strategy (prompt, trajectory, and LoRA fine-tuning) to balance cost, quality, and latency for use cases like customer support and content generation. This approach, supported by rigorous stratified data creation and a multi-layer evaluation system, enabled their CX AI agent to achieve over 50% latency reduction (from 3-6s to under 1s) while maintaining quality parity with larger frontier models.
  • Rocket Companies - Rocket Companies built a unified data foundation on AWS, consolidating 10+ petabytes from 12+ OLTP systems into an S3-based data lake with Apache Iceberg to overcome data fragmentation and enable its AI strategy. This foundation supports 210+ production ML models and powers agentic AI applications, allowing executives to query business intelligence data via natural language, converting queries to SQL against governed data products. This transformation significantly reduced mortgage approval times and delivered measurable business impacts, including a 20% increase in refinance pipeline and a 3x recapture rate.
  • Rocket Companies - Rocket Companies developed Rocket AI Agent, a conversational assistant built on Amazon Bedrock Agents, to streamline the complex home buying process by providing 24/7 personalized guidance and actionable self-service. This system leverages a modular architecture with domain-specific agents, Bedrock Knowledge Bases for proprietary data, and Guardrails for responsible AI, achieving a threefold increase in loan conversion rates and an 85% reduction in customer care transfers.
  • Roots - Roots, an insurance AI company, deployed fine-tuned Mistral 7B Instruct v0.2 models using the vLLM framework for high-accuracy insurance document processing, outperforming generic models like GPT-4 on specialized tasks (see the serving sketch below). vLLM's optimizations, including PagedAttention and continuous batching, enabled a 25x speed improvement over Hugging Face, achieving up to 130 tokens/second throughput on A100 GPUs with 32 concurrent requests. This self-hosted solution provides a cost-effective alternative to third-party APIs, processing 20-30 million documents annually for approximately $30,000.
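A minimal offline-batching sketch of serving a 7B model with vLLM. The model path and settings are placeholders; a real deployment would point at the fine-tuned checkpoint and likely use vLLM's OpenAI-compatible server instead.

```python
# vLLM handles these 32 requests with continuous batching and PagedAttention,
# which is where the throughput gains over naive serving come from.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # stand-in for the fine-tune
    gpu_memory_utilization=0.90,
    max_model_len=8192,
)
params = SamplingParams(temperature=0.0, max_tokens=512)

prompts = [f"Extract the policy number from document {i}: ..." for i in range(32)]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text[:80])
```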
  • Rovio - Rovio addressed game art asset creation bottlenecks by developing "Beacon Picasso," a generative AI system leveraging fine-tuned diffusion models on Amazon SageMaker for training and Amazon EC2 G6e instances for inference. This system, which includes LLM-augmented interfaces via Amazon Bedrock, uses proprietary data and artist-in-the-loop workflows to generate non-brand-essential assets like backgrounds. It resulted in an 80% reduction in production time for specific assets and doubled content capacity, allowing artists to focus on higher-value creative work while maintaining brand quality.
  • Rox - Rox developed an AI-powered revenue operating system to unify fragmented sales data from various sources (CRM, marketing, finance) into a governed knowledge graph, addressing information silos that hinder sales teams. This system leverages Amazon Bedrock with Anthropic's Claude Sonnet 4 to power specialized AI agent swarms that orchestrate complex, multi-step workflows like account research, outreach, and opportunity management across web, Slack, and mobile interfaces. The multi-agent orchestration system, Command, decomposes requests into subtasks, sequences external tool invocations, and integrates results, while a multi-layer guardrail system ensures safety and compliance.
  • Salesforce - Salesforce's Hyperforce team, managing over 1,400 Kubernetes clusters, developed a multi-agent AI-powered self-remediation loop to address 1,000+ monthly hours of operational toil and complex root cause analysis challenges. This system, built on AWS Bedrock, uses a manager agent to orchestrate specialized worker agents that gather telemetry, perform RAG-augmented root cause analysis from runbooks, and execute "safe operations" with human-in-the-loop approval via Slack. The implementation reduced troubleshooting time by 30% and saved 150 hours monthly, with future plans to leverage knowledge graphs for more sophisticated problem-solving.
  • Salesforce - Salesforce's engineering team rapidly built "Ask Astro Agent," an AI-powered event assistant for Dreamforce, in five days by migrating to their Agentforce platform and Data Cloud RAG. This agent provided attendees with FAQ answers, schedule management, and session recommendations, leveraging vector and hybrid search, Mulesoft for streaming data updates, and integrated knowledge articles. The project showcased Salesforce's enterprise AI stack for real-time event query handling.
  • Salesforce - Salesforce reduced operational overhead and costs for deploying custom fine-tuned LLMs (Llama, Qwen, Mistral) by migrating from Amazon SageMaker to a serverless architecture using Amazon Bedrock Custom Model Import. This transition, which maintained backward compatibility via a SageMaker CPU proxy layer, resulted in a 30% reduction in model deployment time and up to 40% cost savings through pay-per-use pricing, while demonstrating scalable performance with automatic model scaling.
  • Salesforce - Salesforce resolved critical performance and reliability issues in its AI Metadata Service (AIMS), which caused 400ms P90 latency for metadata retrieval and system outages during database failures, impacting AI inference workflows. They implemented a multi-layered caching architecture, featuring L1 client-side caching for sub-millisecond access and L2 service-level caching for resilience, with configurable TTLs. This reduced metadata fetch latency by over 98%, improved end-to-end P90 latency by 27% (from 15s to 11s), and maintained 65% system availability during complete backend outages.
  • Salesforce - Salesforce optimized GPU resource utilization and costs for its diverse LLM deployments, including CodeGen, which previously suffered from underutilized high-performance instances for large models and over-provisioning for high-traffic medium models. They implemented Amazon SageMaker AI inference components to deploy multiple models on shared endpoints, allowing granular resource allocation, dynamic scaling, and intelligent model packing to maximize GPU efficiency. This approach achieved up to an eight-fold reduction in infrastructure costs while maintaining high performance and reducing operational complexity across their LLM portfolio.
  • Sentry - Sentry implemented a Model Context Protocol (MCP) server to directly integrate its error monitoring platform with AI coding assistants, eliminating the manual copy-paste workflow for debugging. This server exposes tools like get_issue_details and begin_sentry_issue_fix (triggering Sentry's internal AI agent), scaling to 60 million monthly requests for over 5,000 organizations. Key technical lessons included implementing comprehensive observability, using prompt engineering with examples and chaining for tool descriptions, and employing AI-filtered responses to manage context pollution in production.
  • Sentry - Sentry developed a hosted Model Context Protocol (MCP) server to provide Large Language Models (LLMs) with real-time application monitoring data, addressing their inherent limitation in accessing current operational context. This production-ready solution, built on Cloudflare Workers and Durable Objects with OAuth authentication, exposes 16 tool calls enabling AI assistants to retrieve project information, analyze errors, and trigger advanced root cause analysis via Sentry's Seer agent. The integration allows for seamless, context-aware debugging and issue resolution directly within AI-powered development workflows.
  • ServiceNow - ServiceNow developed a multi-agent system to unify fragmented sales and customer success operations, orchestrating complex customer lifecycle workflows from lead qualification to advocacy. This system utilizes LangGraph for modular agent orchestration with a supervisor pattern and specialized subagents, while LangSmith provides granular tracing, debugging, and an LLM-as-a-judge evaluation framework with custom metrics. Currently in the testing phase, it focuses on human-in-the-loop development and automated golden dataset creation for continuous quality assurance.
  • ServiceNow / SLB - ServiceNow and SLB utilized Nvidia DGX Cloud on AWS to develop and deploy distinct foundation models. ServiceNow focused on building efficient 5-15B parameter LLMs for enterprise automation, achieving frontier-level reasoning performance on single GPUs with high GPU utilization via Run:ai orchestration. SLB developed domain-specific multi-modal foundation models for seismic and petrophysical data in the energy sector, accelerating scientific data synthesis and interpretation, often requiring customer-specific fine-tuning for trust.
  • Shopify - Shopify deployed Sidekick, an AI assistant for millions of merchants, addressing context window limitations, cost, and latency challenges in its agentic architecture with over 20 tools. Their "context engineering" approach involved aggressive token management, a three-tier memory system (explicit preferences, implicit profiles, episodic RAG), and just-in-time instruction injection (sketched below). These methods reportedly improved instruction adherence by 5-10%, reduced jailbreak attempts, and maintained performance for complex multi-step workflows.
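A small sketch of what just-in-time instruction injection can look like: tool-specific guidance enters the system prompt only when that tool is active, instead of living in one monolithic prompt. The tool names and rules are invented, not Shopify's.

```python
# Keep the base system prompt small; attach tool guidance only when needed.
BASE_SYSTEM = "You are a commerce assistant. Be concise and cite sources."

JIT_INSTRUCTIONS = {
    "discount_tool": "Never create discounts above 30% without confirmation.",
    "analytics_tool": "Report exact date ranges; never extrapolate trends.",
}

def build_messages(history: list[dict], active_tools: list[str]) -> list[dict]:
    jit = "\n".join(JIT_INSTRUCTIONS[t] for t in active_tools
                    if t in JIT_INSTRUCTIONS)
    system = BASE_SYSTEM + (f"\n\n# Active tool guidance\n{jit}" if jit else "")
    return [{"role": "system", "content": system}, *history]

msgs = build_messages(
    [{"role": "user", "content": "Set up a sale for sneakers"}],
    active_tools=["discount_tool"],
)
```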
  • Shopify - Shopify developed Sidekick, an agentic AI assistant, addressing scaling challenges like the "tool complexity problem" by implementing Just-in-Time (JIT) instructions for dynamic, localized guidance instead of monolithic system prompts. They built a sophisticated LLMOps evaluation infrastructure using human-correlated LLM-as-a-Judge on Ground Truth Sets and an LLM-powered merchant simulator for pre-production testing. During GRPO fine-tuning, they encountered and mitigated reward hacking by iteratively refining procedural validators and LLM judges, ensuring robust and reliable agent behavior.
  • Shopify - Shopify developed a Global Catalogue using multimodal LLMs to standardize billions of fragmented product listings from millions of merchants, transforming unstructured data into a coherent, machine-readable format for AI-driven commerce. Their four-layer architecture processes over 10 million daily updates, leveraging multi-task vision language models and a novel selective field extraction technique during fine-tuning to achieve 40 million daily inferences with 500ms median latency and 40% reduced GPU usage. This system powers enhanced search, recommendations, and conversational commerce experiences across Shopify's ecosystem.
  • Sicoob / Holland Casino - Sicoob, a Brazilian financial institution, deployed LLMs on Amazon EKS with GPU instances, leveraging open-source tools like Karpenter, KEDA, and vLLM to run models such as Llama and Mistral for cost-efficient inference in a highly regulated environment. In contrast, Holland Casino, a Dutch gaming operator, utilized Amazon Bedrock and Agent Core with Anthropic Claude and Strands agents for rapid development of management insight tools. Both organizations successfully implemented compliant GenAI solutions in strict regulatory frameworks by prioritizing security, governance, and responsible AI practices.
  • Siteimprove - Siteimprove scaled its digital accessibility and content intelligence platform by evolving from generative AI to production-grade agentic AI, processing tens of millions of monthly requests with enterprise security and cost efficiency. They built an AWS Bedrock-based AI accelerator architecture, leveraging Amazon Nova models for a 75% cost reduction on certain workloads, to support batch processing, conversational remediation, and multi-agent orchestration across various domains. This systematic approach, progressing through human-in-the-loop stages to autonomous operations, enabled the company to deliver measurable business value and achieve market leadership.
  • Sixt - Sixt transformed its global customer service using generative AI (Project AIR) on Amazon Bedrock with Anthropic Claude models, automating email classification, response generation, and chatbot interactions across 100+ countries. This implementation achieved over 90% classification accuracy (up from 70%) and a 70% cost reduction for classification, integrating with backend systems to handle multi-language inquiries. The solution rapidly moved from ideation to production in five months, expanding from email automation to comprehensive messaging and chatbot capabilities.
  • Skai - Skai developed Celeste, a generative AI assistant, leveraging Amazon Bedrock Agents and Anthropic Claude 3.5 Sonnet V2 to enable natural language analytics for complex advertising data. The architecture includes a custom Tool API layer ensuring strict data isolation and privacy, while addressing LLMOps challenges like reducing latency from 136s to 44s and managing token limits with dynamic session chunking and RAG. This implementation significantly improved data analysis efficiency, reducing report generation time by 50% and case study creation by 75%, transforming weeks-long processes into minutes.
  • Slack - Slack scaled its generative AI features (Slack AI) to millions of users by migrating from a costly, inflexible SageMaker provisioned throughput architecture to Amazon Bedrock's on-demand infrastructure, meeting FedRAMP Moderate compliance and enabling access to diverse LLMs. This infrastructure optimization, coupled with a rigorous quality evaluation framework utilizing automated metrics, LLM judges, and guardrails, resulted in over 90% infrastructure cost savings (exceeding $20M), a 5x increase in operational scale, and 15-30% user satisfaction improvements while maintaining quality.
  • Slack - Slack's DevEx team integrated generative AI into internal developer workflows, evolving from SageMaker experimentation to production-grade systems on Amazon Bedrock, achieving a 98% cost reduction. They deployed AI coding assistants, seeing 99% adoption and a 25% increase in pull request throughput, and developed an agentic escalation bot (Buddybot) handling over 5,000 monthly requests. This multi-agent system uses AWS Strands for orchestration, Claude Code sub-agents, Temporal for workflow durability, and MCP servers for tool access, demonstrating a pragmatic LLMOps approach with focus on security, observability, and model agnosticism.
  • Slack - Slack migrated 15,500 Enzyme test cases to React Testing Library for a React 18 upgrade, a task estimated at over 10,000 engineering hours. They implemented a hybrid pipeline combining AST transformations with Anthropic's Claude 2.1: the pipeline gathered DOM trees and performed partial AST conversions with annotations before handing complex cases to the LLM. This approach achieved an 80% correct conversion rate, with 22% of test cases passing immediately, significantly reducing manual effort and enabling the large-scale migration.
  • Smartling - Smartling operates an enterprise-scale AI-first translation platform for major corporations, employing agentic workflows with multi-step validation and automated post-editing to ensure high-quality, consistent translations across diverse content types. The platform utilizes a hybrid approach, integrating various LLMs, NMT, RAG for contextual grounding, and sophisticated prompting to address enterprise challenges like compliance, brand voice, and automation. This results in significant improvements in translation throughput, cost reduction, and quality approaching human parity for suitable language pairs and content.
  • Snorkel - Snorkel developed an AI agent platform and benchmark for commercial insurance underwriting, leveraging LangGraph and ReAct agents with Model Context Protocol to simulate complex enterprise environments requiring multi-tool coordination and domain expertise. Evaluation across frontier models revealed significant challenges, including a 36% tool use error rate, hallucination of generic domain knowledge over proprietary guidelines, and wide performance variance (single digits to 80% accuracy) across different underwriting tasks. This underscores the need for robust error handling and careful domain knowledge integration in enterprise LLM deployments.
  • Snorkel - Snorkel developed an agentic AI copilot benchmark for commercial insurance underwriting, simulating real-world scenarios requiring multi-tool integration, complex reasoning over proprietary knowledge, and multi-turn conversations to assist junior underwriters. Built with LangGraph and ReAct agents, the evaluation revealed significant performance variations (single digits to ~80% accuracy) across models, with common errors including tool use failures (36% of conversations) and domain-specific hallucinations (15-45% for some models).
  • Snowflake - Snowflake developed a multi-step AI agent workflow to integrate structured database content with unstructured document repositories for enterprise use cases, addressing the limitations of traditional RAG for structured data. This architecture employs semantic models to map business terminology to database schemas, abstracts complex data models, and classifies query patterns before generating SQL. A prototype with Whoop demonstrated the agent's ability to combine diverse data sources, like sales figures and Slack conversations, to provide real-time business intelligence and recommendations for complex queries.
  • Snowflake - Snowflake optimized vLLM for high-throughput embedding inference by addressing CPU-bound bottlenecks in tokenization and serialization that caused poor GPU utilization. They implemented three key technical solutions: encoding embedding vectors as little-endian bytes with NumPy for faster serialization (see the sketch below), disaggregating tokenization into a parallel pipeline, and running multiple model replicas on a single GPU. These optimizations resulted in a 3x production throughput increase and up to 16x throughput gains in benchmarks, significantly reducing operational costs.
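The serialization fix is easy to show in miniature: ship embedding vectors as raw little-endian float32 buffers instead of JSON lists. The shapes below are arbitrary; this is the general NumPy technique, not Snowflake's code.

```python
# Compare JSON-encoding nested lists vs shipping the raw float32 buffer.
import json
import numpy as np

embeddings = np.random.rand(64, 1024).astype("<f4")  # little-endian float32

# Slow path: JSON-encode nested Python lists (lots of CPU, lots of bytes).
json_payload = json.dumps(embeddings.tolist()).encode()

# Fast path: ship the raw buffer; the client rebuilds it from dtype + shape.
raw_payload = embeddings.tobytes()
restored = np.frombuffer(raw_payload, dtype="<f4").reshape(64, 1024)

assert np.array_equal(embeddings, restored)
print(len(json_payload), len(raw_payload))  # raw is several times smaller
```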
  • So Energy - So Energy, a UK energy retailer, transformed its fragmented contact center operations by implementing Amazon Connect, a unified AI-powered platform that integrated voice, chat, and email with features like automatic identity verification, contact summarization, and intelligent routing. This deployment reduced call wait times by 33%, increased chat channel adoption from under 1% to 15% of contacts, and improved customer satisfaction, while also laying the groundwork for future agentic AI capabilities.
  • Sorcero - Sorcero developed a generative AI system to accelerate secondary manuscript generation in life sciences, specifically for patient-reported outcomes, reducing turnaround times from months to hours. The system ingests clinical study data and protocols to produce foundational drafts, focusing on rigorous LLMOps, traceability to source, regulatory compliance (e.g., CONSORT), and human-in-the-loop validation. It ensures scientific accuracy, controls for hallucinations, and provides audit trails to meet high-stakes, regulated industry standards.
  • Sourcegraph - Sourcegraph evolved from code search to enterprise AI coding assistants (Cody) and then to agentic systems (AMP), delivering 30-60% developer productivity gains for Fortune 500 clients. Their technical approach shifted from sophisticated RAG for chat-based LLMs to multi-model agentic architectures that leverage tool-calling loops, enabling rapid iteration and a re-evaluation of traditional development processes. This strategy emphasizes building application scaffolds to generate new training data for future models, while navigating enterprise data privacy constraints that prevent using proprietary code for general model training.
  • SpeakEasy - SpeakEasy developed an automated generator for Model Context Protocol (MCP) servers from OpenAPI specifications, enabling AI agents to interact with existing APIs by addressing critical production challenges like tool explosion and complex data formats. Their three-layer architecture uses custom OpenAPI extensions for aggressive tool pruning and LLM-optimized descriptions, an intelligent generator for automatic complex data transformation (e.g., Base64 encoding, stream buffering), and custom function files for precise control and scope-based access, ensuring secure and efficient LLM consumption.
  • Splunk - Splunk implemented end-to-end observability for its RAG-powered AI Assistant, which answers .conf24 FAQs, by instrumenting structured logs across the prompt, retrieval, and generation pipeline. They leveraged Splunk Observability Cloud for unified dashboards to monitor response quality, latency, source reliability, and cost, alongside proactive alerts for quality degradation and potential hallucinations. This comprehensive approach enabled efficient root cause analysis, reduced mean time to resolution, and established robust governance for the production LLM system.
  • Spotify - Spotify addressed extensive codebase maintenance and the limitations of deterministic migration scripts by implementing an LLM-powered agentic system. This system autonomously generates and iteratively refines code transformations based on natural language prompts, leveraging existing CI/CD for multi-dimensional automated verification (build, test, lint, "LLM as judge") within a continuous feedback loop (sketched below). This approach enabled over 1,000 merged production PRs in three months, democratizing complex refactors and allowing non-experts to perform large-scale codebase standardization.
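A pseudocode-level sketch of that verify-and-retry loop. The `propose_fix` and `llm_judge` callables are placeholders for real model calls, and the check commands are invented; the point is the shape of the loop: propose, run existing CI checks, feed failures back, and only then ask an LLM judge.

```python
# Propose a transformation, verify with existing CI, retry with the failure
# log in context; an LLM judge reviews diffs that pass deterministic checks.
import subprocess

def run(cmd: list[str]) -> tuple[bool, str]:
    p = subprocess.run(cmd, capture_output=True, text=True)
    return p.returncode == 0, p.stdout + p.stderr

CHECKS = [["make", "build"], ["make", "test"], ["make", "lint"]]  # placeholders

def migrate(task: str, propose_fix, llm_judge, max_rounds: int = 5) -> bool:
    feedback = ""
    for _ in range(max_rounds):
        propose_fix(task + feedback)  # agent edits the working tree
        for check in CHECKS:
            ok, log = run(check)
            if not ok:
                feedback = f"\n\nA previous attempt failed:\n{log[-2000:]}"
                break
        else:  # every deterministic check passed
            _, diff = run(["git", "diff"])
            if llm_judge(diff):
                return True
            feedback = "\n\nThe reviewing judge rejected the diff; revise it."
    return False
```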
  • Spotted Zebra - Spotted Zebra, an HR tech company, implemented a robust LLM evaluation system to scale its AI-powered interview intelligence product from research to production, addressing rapid iteration and quality assurance challenges. Their framework includes codifying human judgment with golden examples for skill extraction, systematically versioning prompts via a custom YAML format and model gateway, and using LLM-as-a-judge for open-ended tasks like question generation. This comprehensive approach, complemented by adversarial testing, detailed API logging (migrating from LangSmith to S3), and treating evaluation as a strategic capability, resulted in faster development, improved product quality, enhanced client trust, and ISO 42001 certification.
  • Stack Overflow - Stack Overflow responded to ChatGPT's disruption by forming an AI team, which first developed a conversational search feature evolving through keyword, semantic, and RAG implementations, but ultimately rolled it back due to insufficient accuracy (below 70%) for developer expectations. Simultaneously, they built a data licensing business by creating technical benchmarks to demonstrate how fine-tuning LLMs with their high-quality, community-validated Q&A data significantly improved model performance, establishing a new revenue stream.
  • Stripe - Stripe developed an AI agent system to automate compliance review investigations, addressing challenges of manual data navigation and jurisdictional complexity in financial services. Utilizing ReAct agents within a dedicated Agent Service, orchestrated via a DAG of bite-sized tasks and powered by Amazon Bedrock with an internal LLM proxy, the system performs tool-calling across diverse data sources. This human-in-the-loop solution achieved a 96% helpfulness rating and a 26% reduction in average handling time for reviews, ensuring auditability and scalable operations without increasing headcount.
  • Stripe - Stripe developed a domain-specific, transformer-based foundation model for payments, processing tens of billions of transactions in under 100ms to improve card-testing fraud detection from 59% to 97%. They also launched the Agentic Commerce Protocol with OpenAI to standardize agent-driven commerce and achieved massive internal AI adoption, with 8,500 employees using LLM tools daily and engineers reducing payment integrations from two months to two weeks.
  • SuperLinked - SuperLinked shares production insights from deploying large-scale vector search systems, emphasizing challenges in relevance, latency, and cost for indexes up to 2 terabytes. Key solutions include avoiding vector pooling, fine-tuning embeddings with triplet loss for domain-specific relevance, combining sparse and dense representations, and leveraging graph embeddings. Their multi-encoder framework integrates text, graph, and metadata signals directly into retrieval to improve precision and efficiency, avoiding common pitfalls like excessive reranking or unconstrained query generation.
  • Swedish Government Offices - The Swedish Government Offices deployed AI assistants across departments, focusing on cognitive enhancement for civil servants through a business-led, rapid experimentation approach. Utilizing a RAG architecture on the Intric platform with multiple LLMs, they achieved significant efficiency gains, such as reducing company analysis from 24 to 6 weeks, while navigating challenges like information governance, cost management, and the need for advanced prompting skills. This initiative prioritized human accountability and transparent sharing of both successes and failures to scale GenAI adoption in a highly regulated environment.
  • Swedish Tax Authority - The Swedish Tax Authority systematically adopted LLMs and AI, focusing on NLP applications like text categorization, OCR, and RAG-based Q&A systems. They benchmarked open-source models (Llama 3.1, Mixtral 8x7B) against commercial ones (GPT-3.5), finding open-source suitable for simpler queries and commercial for complex ones. Due to sensitive data and regulatory needs, they prioritize on-premise deployment and are building shared AI infrastructure for the public sector.
  • Swiggy - Swiggy transformed Hermes from a basic text-to-SQL assistant into a sophisticated conversational AI data analyst, democratizing data access for employees. This involved implementing a vector-based prompt retrieval system for few-shot learning (see the sketch below), conversational memory for context retention, and an agentic workflow for complex query resolution, significantly boosting accuracy from 54% to 93%. An explanation layer was also added to provide transparency and build user trust in the generated SQL and insights.
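A toy version of the vector-based few-shot retrieval step: embed past question-to-SQL pairs, pull the nearest ones for a new question, and splice them into the prompt. The hashed bag-of-words `embed` is a deliberately crude stand-in for a real embedding model.

```python
# Retrieve the most similar past (question, SQL) pairs as few-shot examples.
import numpy as np

EXAMPLES = [
    ("orders per city last week", "SELECT city, COUNT(*) FROM orders ..."),
    ("top restaurants by rating", "SELECT name FROM restaurants ORDER BY ..."),
]

def embed(text: str, dim: int = 128) -> np.ndarray:
    """Toy hashed bag-of-words embedding standing in for a real model."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

EX_VECS = np.stack([embed(q) for q, _ in EXAMPLES])

def few_shot_prompt(question: str, k: int = 2) -> str:
    sims = EX_VECS @ embed(question)
    shots = [EXAMPLES[i] for i in np.argsort(-sims)[:k]]
    shot_text = "\n\n".join(f"Q: {q}\nSQL: {s}" for q, s in shots)
    return f"{shot_text}\n\nQ: {question}\nSQL:"

print(few_shot_prompt("how many orders per city yesterday"))
```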
  • Swisscom - Swisscom implemented Amazon Bedrock AgentCore to scale enterprise AI agents for customer support and sales, addressing challenges in secure multi-agent orchestration, cross-departmental authentication, and strict data protection compliance. Leveraging AgentCore's Runtime, Identity, and Memory services with the Strands Agents framework, they deployed B2C agents that achieved rapid development (3-4 weeks to demo), handled thousands of monthly requests with low latency, and enabled secure agent-to-agent communication while maintaining regulatory compliance.
  • Swisscom - Swisscom deployed fine-tuned LLMs for high-volume, latency-sensitive customer service contact centers, addressing requirements for sub-second response times and scalability. They utilized AWS SageMaker to fine-tune a Llama 3.1 8B model with LoRA and synthetic data, deploying it via infrastructure-as-code using AWS CDK for reproducible operations. This achieved a median production latency under 250ms and accuracy comparable to larger models, efficiently handling 50% of voice channel traffic with full model lifecycle control.
  • Swisscom - Swisscom implemented an AI-powered Network Operations Assistant using a multi-agent RAG architecture on Amazon Bedrock to reduce the 10% of time network engineers spent on manual data gathering and analysis. This system features specialized agents for documentation and precise calculations, translating natural language queries into SQL against an AWS-based ETL pipeline for accurate numerical insights. The solution projects a 10% reduction in engineer time while maintaining stringent data security and compliance for the telecommunications sector.
  • Syngenta - Syngenta implemented "Wingman," an intelligent document processing system, to automate the analysis of over one million invoices annually. Leveraging Amazon Bedrock Data Automation for document parsing and Anthropic Claude via Amazon Bedrock for policy comparison, the system initially focused on automating complex tax compliance checks for 4,000 monthly invoices in Argentina. Wingman extracts invoice data, compares it against tax policies, identifies discrepancies with explanations, and is now scaling to other use cases like spend reduction and vendor data accuracy.
  • Tellius - Tellius developed a production-grade agentic AI analytics platform by strictly separating LLM-based language understanding from deterministic execution to address enterprise challenges like large schemas, security, and consistency. Their architecture validates LLM proposals into a typed Abstract Syntax Tree (plan artifact) against semantic models and policies before compiling to SQL, ensuring determinism, auditable explanations, and multi-step consistency (a minimal sketch follows below). This approach enables sub-second interactive latency, manages ambiguity, and provides transparent, trustworthy insights over complex enterprise data, avoiding the pitfalls of generic LLM orchestration frameworks.
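A minimal Pydantic rendering of the "validate the proposal into a typed plan before compiling SQL" idea. The plan schema, allow-list, and compiler are invented for illustration; Tellius's actual plan artifact is richer.

```python
# Reject malformed or out-of-policy LLM proposals before any SQL is built.
from typing import Literal
from pydantic import BaseModel, ValidationError, field_validator

ALLOWED_TABLES = {"sales", "customers"}

class Filter(BaseModel):
    column: str
    op: Literal["=", "<", ">", "between"]
    value: str

class Plan(BaseModel):
    table: str
    metrics: list[str]
    group_by: list[str] = []
    filters: list[Filter] = []

    @field_validator("table")
    @classmethod
    def table_allowed(cls, v: str) -> str:
        if v not in ALLOWED_TABLES:
            raise ValueError(f"table {v!r} not in the semantic model")
        return v

def compile_sql(plan: Plan) -> str:
    cols = ", ".join(plan.group_by + plan.metrics)
    sql = f"SELECT {cols} FROM {plan.table}"
    if plan.group_by:
        sql += " GROUP BY " + ", ".join(plan.group_by)
    return sql

def execute(llm_json: str) -> str:
    try:
        plan = Plan.model_validate_json(llm_json)  # typed validation gate
    except ValidationError as e:
        raise RuntimeError(f"LLM plan rejected: {e}") from e
    return compile_sql(plan)
```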
  • The Australian Epilepsy Project - The Australian Epilepsy Project (AEP) deployed an AWS-based platform integrating multimodal patient data to enhance epilepsy diagnosis and treatment. It utilizes LLMs for converting free-text patient data to structured formats, enabling natural language querying of medical histories via RAG with a specialized medical LLM and PGVector, and generating patient summaries. This comprehensive AI system, including automated fMRI analysis, achieved a 70% reduction in diagnosis time for language laterality mapping, a 10% higher lesion detection rate, and improved patient outcomes like an 8% reduction in seizures.
  • Enterprise infrastructure challenges for agents (no specific company) - Deploying agentic AI systems in production necessitates a fundamental re-architecture of enterprise infrastructure, as their autonomous, non-deterministic nature breaks traditional networking, load balancing, and health checking paradigms. Key challenges include managing massive, unpredictable loads, new security vulnerabilities like prompt injection, and the need for semantic caching, comprehensive observability of every agent step, and robust evaluation frameworks for systems with high unknown unknowns. This requires integrating new AI-specific components like prompt management and agent orchestration into the data stack, often with human oversight for critical actions.
  • No specific company named - This case study details a production multi-agent LLM system designed for detecting and correcting misinformation at scale on social media platforms. It employs a centralized orchestrator coordinating five specialized agents: an Indexer for authentic data sourcing, an Extractor for adaptive RAG, a Classifier for misinformation type categorization, a Corrector for reasoning and correction generation, and a Verifier for final validation. The system prioritizes high precision and recall over cost and latency, utilizing comprehensive evaluation, continuous monitoring, and optimization techniques like model distillation and semantic caching to combat evolving misinformation threats.
  • The customer's document processing platform - Deepsense AI built a multi-agent system using Pydantic AI and Anthropic's Claude models for a client's large-scale document processing platform. The system standardized platform capabilities via custom MCP servers and extracted structured data from unstructured documents on demand, dynamically generating database schemas and integrating with Databricks. Key technical lessons included LLM-first API design, token optimization, comprehensive observability with Logfire, and robust testing with Pydantic Evals for production readiness.
  • Thomson Reuters - Thomson Reuters modernized over 400 legacy .NET Framework applications (500M+ lines of code) to modern .NET Core/8/10, driven by high Windows licensing costs and slow manual processes. They adopted AWS Transform for .NET, an agentic AI system powered by Amazon Bedrock LLMs, which automated analysis, dependency mapping, code transformation, and validation. This enabled processing over 1.5 million lines of code per month across 10 parallel projects, significantly reducing modernization timelines and freeing developers for innovation.
  • Thomson Reuters - Thomson Reuters evolved from basic AI assistants to sophisticated agentic systems for legal, tax, and compliance, where accuracy is paramount. They manage agency through a "dial" framework (autonomy, context, memory, coordination) and integrate agents with legacy systems by decomposing existing applications into tools. This enables production use cases like end-to-end tax return generation and multi-source legal research, with evaluation challenges addressed by accounting for human expert variability.
  • Thoughtly - Thoughtly builds and scales conversational voice AI agents for enterprise go-to-market, orchestrating real-time speech-to-text, large language models (GPT-4), and text-to-speech to achieve sub-second latency for natural interactions. The platform employs sophisticated optimization techniques like speculative computing, selective LLM bypass via vector similarity (sketched below), and parallel vendor calls, supported by robust evaluation frameworks and infrastructure designed for HIPAA/SOC 2 compliance and millions of calls. This enables accurate conditional navigation and integration with enterprise CRMs, delivering significant business outcomes.
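The selective-bypass trick in a few lines: if an utterance embeds close enough to a cached one, return the cached reply and skip the LLM round-trip. The threshold is invented, and `embed` is assumed to return unit-normalized vectors.

```python
# Semantic cache: reuse replies for near-duplicate utterances to cut latency.
import numpy as np

CACHE: list[tuple[np.ndarray, str]] = []  # (unit-norm embedding, reply)
THRESHOLD = 0.92  # tune against real traffic; too low risks wrong reuse

def respond(utterance: str, embed, llm_call) -> str:
    q = embed(utterance)
    for vec, reply in CACHE:
        if float(q @ vec) >= THRESHOLD:
            return reply  # bypass the LLM entirely
    reply = llm_call(utterance)
    CACHE.append((q, reply))
    return reply
```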
  • Thumbtack - Thumbtack automated its manual, generic Search Engine Marketing (SEM) ad creation process, which previously resulted in 80% generic assets, by implementing a multi-stage LLM pipeline. This pipeline generates, reviews using an "LLM as judge" approach, and groups personalized Google Responsive Search Ad (RSA) headlines and descriptions, adhering to strict character limits and incorporating specific keywords (see the sketch below). The solution led to statistically significant improvements in impressions, click-through rates, and conversion value, with early phases showing a 20% traffic and 10% conversion increase, while maintaining return on ad spend.
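A sketch of the deterministic gate in front of the LLM judge. The 30- and 90-character caps are Google's published RSA limits for headlines and descriptions; the rubric and helper names are invented.

```python
# Cheap rule checks run first; only passing assets reach the LLM judge.
HEADLINE_MAX, DESCRIPTION_MAX = 30, 90  # Google RSA character limits

def rule_check(headlines: list[str], descriptions: list[str],
               required_keyword: str) -> list[str]:
    """Deterministic gate before any LLM judging."""
    errors = []
    errors += [f"headline too long: {h!r}" for h in headlines
               if len(h) > HEADLINE_MAX]
    errors += [f"description too long: {d!r}" for d in descriptions
               if len(d) > DESCRIPTION_MAX]
    if not any(required_keyword.lower() in h.lower() for h in headlines):
        errors.append(f"no headline mentions {required_keyword!r}")
    return errors

def review(headlines, descriptions, keyword, llm_judge) -> bool:
    """llm_judge is a placeholder callable returning 'PASS' or 'FAIL'."""
    if rule_check(headlines, descriptions, keyword):
        return False
    verdict = llm_judge(
        "Grade these ad assets PASS/FAIL for clarity, relevance to "
        f"{keyword!r}, and brand tone:\n" + "\n".join(headlines + descriptions)
    )
    return verdict.strip().upper().startswith("PASS")
```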
  • Tinder - Tinder implemented two production GenAI systems: a username detection feature using a fine-tuned Mistral 7B with LoRA to achieve near-perfect recall for identifying off-platform handles in user bios, and a personalized match explanation system leveraging a fine-tuned Llama 3.1 8B with SFT/DPO to generate coherent, low-hallucination match rationales. These self-hosted solutions rely on optimized GPU infrastructure and multi-model serving with LoRAX for dynamic adapter loading, demonstrating that fine-tuning is crucial for domain-specific performance and addressing complex LLMOps challenges at scale.
  • Together AI - Hassan El Mghari rapidly prototypes and scales AI applications to millions of users by employing a simplified LLMOps approach centered on single API calls to open source models via Together AI. His serverless stack, including Next.js, Neon, and Vercel, facilitates rapid iteration and integration of new models, enabling applications like text-to-app builders and image generators to process millions of requests and outputs. This strategy demonstrates that high-scale production can be achieved with streamlined architectures and strategic use of open source models, often leveraging partnerships for cost efficiency.
  • Toqan - Toqan deployed a natural language data analyst agent that translates user questions into SQL queries and visualizes results, but faced significant production challenges like unreliable outputs, infinite loops, and hallucinated SQL. They addressed these by implementing deterministic pre-validation steps for questions and SQL, integrating domain experts for continuous system curation, building resilient systems with hard/soft limits for unexpected inputs, and optimizing agent tools for focused context and error handling. This hybrid approach, combining LLM capabilities with traditional engineering, enabled the agent to scale reliably to hundreds of users.
  • Toyota - Toyota, in partnership with IBM and AWS, developed an AI-powered system to enhance supply chain visibility and ETA prediction for vehicles. This solution leverages machine learning models like XGBoost and random forest on Amazon SageMaker for time series forecasting and regression, processing real-time events via Kafka and performing batch inference every four hours. It also incorporates a generative AI chatbot for natural language queries, providing customers and dealers with accurate, transparent vehicle delivery status.
  • Toyota Motor North America / Toyota Connected - Toyota Motor North America and Toyota Connected developed an enterprise generative AI platform to provide dealership sales staff and customers with immediate, authoritative vehicle information, addressing the challenge of highly informed buyers. The initial production system (v1) is a RAG-based architecture leveraging Amazon Bedrock, SageMaker, and OpenSearch, processing over 7,000 monthly interactions with robust security, legal compliance via stream splitting, and a sophisticated evaluation pipeline. A planned transition to an agentic platform (v2) using Amazon Bedrock AgentCore aims to eliminate data staleness, enable action-oriented capabilities like inventory checks, and streamline the complex ETL process.
  • TP ICAP - TP ICAP developed ClientIQ, an AI-powered CRM assistant, to efficiently extract insights from vast unstructured meeting notes and structured data within their Salesforce CRM, addressing the challenge of manual, time-consuming data retrieval. This solution utilizes intelligent query routing to direct user requests to either a RAG workflow for unstructured data (leveraging Amazon Bedrock Knowledge Bases with hybrid search and custom chunking) or a text-to-SQL workflow for structured data. ClientIQ integrates enterprise security with permission-based access and employs automated evaluation pipelines, leading to a 75% reduction in research time and improved insight quality.
  • TPConnects - TPConnects transformed its legacy travel booking system into a production-ready AI agent platform on Amazon Bedrock, utilizing a supervised multi-agent architecture to manage the entire travel journey from shopping to customer service. The system employs Claude 3.5 Sonnet, extensive prompt engineering, and a knowledge base for domain-specific terminology like IATA codes, while addressing challenges such as high-volume API response latency through chunking and orchestrating complex multi-API transactions. This solution extends beyond web chat to WhatsApp for proactive disruption management, providing a rich, conversational user experience.
  • Traeger Grills - Traeger Grills transformed its underperforming contact center (35% CSAT) into an AI-powered system using Amazon Connect, implementing generative AI for automated case note generation, customer email composition, and a conversational IVR for administrative tasks. This technical overhaul, which included a "self-healing contact center" for autonomous load management and a unified agent interface, resulted in 92-93% CSAT, a 40% reduction in new hire training, and improved agent satisfaction by augmenting human capabilities for relationship-focused interactions.
  • Trainline - Trainline implemented an AI-powered agentic travel assistant to address post-purchase customer needs during rail journeys, such as real-time disruptions and practical questions. This system uses a central orchestrator with tools like RAG over 700,000 pages of content, real-time train APIs, and refund processing, enabling it to handle diverse queries and seamlessly hand off to human agents. Launched in five months, it now serves 300,000 monthly active users, revealing latent customer demand for mid-journey support.
  • Treater - Treater developed a multi-layered LLM evaluation pipeline for production content generation, integrating deterministic rule-based checks, LLM-based evaluations requiring step-by-step explanations, an automatic rewriting system, and human edit analysis for continuous feedback. This pipeline prioritizes observability, uses binary pass/fail evaluations for actionable insights, and systematically leverages human feedback to reduce the gap between LLM-generated and human-quality outputs. The system ensures high-quality content at scale by continuously improving generation prompts and settings based on real-world performance and human edits.
  • Trellix - Trellix and AWS implemented an agentic AI-powered Security Operations Center (SOC) to automate threat detection and response, addressing the overwhelming volume of security alerts that exceed human analyst capacity. This multi-agent system, built on AWS Bedrock, employs a tiered model strategy (e.g., Nova Micro for classification, Claude Sonnet for analysis) to dynamically investigate alerts, correlate data across diverse security tools, and generate comprehensive, auditable incident reports.
  • Trunk - Trunk engineered an AI DevOps agent for root cause analysis of CI test failures, addressing LLM nondeterminism by focusing on a narrow scope and pragmatically switching models (Claude to Gemini) for improved tool calling. They ensured reliability through comprehensive testing with mocked LLM responses, strict input/output validation, LangSmith observability, and continuous feedback loops, resulting in an agent that reliably provides actionable insights to developers via GitHub PRs.
  • Tyson Foods - Tyson Foodservice deployed an AI-powered conversational search assistant to improve product discovery and direct engagement with B2B operators, replacing inefficient keyword search. The solution integrates semantic search using Amazon OpenSearch Serverless with Amazon Titan embeddings and an agentic conversational interface built with Anthropic's Claude 3.5 Sonnet on Amazon Bedrock and LangGraph. This system allows foodservice professionals to find products using natural culinary terminology, even when it varies from catalog descriptions, and captures high-value customer interactions for business intelligence.
  • Uber - Uber enhanced its internal LLM-powered on-call copilot, Genie, by transitioning from traditional RAG to an Enhanced Agentic RAG (EAg-RAG) architecture to address significant response accuracy issues in the engineering security and privacy domain. The EAg-RAG system incorporates an enriched document processing pipeline with custom Google Docs loaders and LLM-powered content formatting, alongside agentic pre- and post-processing steps for query optimization, source identification, and context refinement. This technical upgrade resulted in a 27% relative increase in acceptable answers and a 60% relative reduction in incorrect advice, enabling production deployment and reducing subject matter expert support load.
  • Uber - Uber's uReview is an AI-augmented code review system employing a modular, multi-stage GenAI architecture with prompt-chaining to decompose code review into comment generation, filtering, validation, and deduplication. This system addresses reviewer overload and missed issues at scale by analyzing over 90% of Uber's 65,000 weekly code diffs. It achieves a 75% usefulness rating for its comments, with 65% of them being addressed by engineers.
  • Uber - Uber's PerfInsights system automates Go code performance optimization by integrating fleet-wide profiling data with LLM analysis. It identifies hotpath functions, feeds their source code and an antipattern catalog to LLMs, and employs advanced prompt engineering alongside a multi-layer validation pipeline (LLM juries, rule-based LLMCheck) to achieve high accuracy and reduce false positives from over 80% to the low teens (a minimal LLM-jury sketch appears after this list). This approach has reduced engineering time for performance issue resolution by 93% and led to hundreds of merged optimization diffs, improving codebase health and reducing compute costs.
  • Uber - Uber's Finch is a conversational AI data agent that enables financial analysts to retrieve data from multiple platforms using natural language queries, eliminating the need for complex SQL or manual navigation. Its architecture features multi-agent orchestration via LangChain's LangGraph, a semantic layer with an OpenSearch index for natural language aliases to enhance SQL generation accuracy, and curated single-table data marts for simplified LLM interaction. The system integrates through a modular Generative AI Gateway, incorporating comprehensive multi-layered evaluation and role-based access control for enterprise deployment.
  • Uber AI Solutions - Uber AI Solutions developed an LLM-powered system, Requirement Adherence, to improve data labeling quality by shifting validation leftward from post-labeling checks to real-time enforcement within their uLabel tool. This system extracts atomic validation rules from client SOPs using LLMs, categorizes them by complexity, and then intelligently routes them to different LLM models for parallel, in-tool validation, leveraging techniques like prefix caching. This approach significantly reduced audits by 80% and enhanced efficiency, all while ensuring data privacy through stateless LLM interactions.
  • UC Santa Barbara - UC Santa Barbara implemented an AI-powered chatbot, "Story," leveraging a RAG-based platform (Gravity) that daily crawls and indexes university websites to provide student support across 19 departments. This system uses an 85% confidence threshold for responses, employs generative AI as a fallback within a "closed AI" model with PII scrubbing, and has handled nearly 40,000 conversations, with 30% occurring outside business hours. The phased rollout included student testing for language optimization and a "shared brain" knowledge architecture to reduce staff workload and improve service availability.
  • UCLA Anderson School of Management - UCLA Anderson School of Management implemented an AI-driven student services application using an agentic framework to provide personalized, prescriptive career guidance for MBA students. This involved consolidating disparate data sources like student records, career placement, and course catalogs, while meticulously adhering to UC system security policies for sensitive student data. The multi-agent system recommends specific courses, internships, and clubs based on a student's stated career objectives, developed over an 8-month period.
  • University of California Los Angeles - UCLA deployed a real-time generative AI system for an immersive theater performance, enabling up to 80 concurrent users to sketch on phones and generate 2D images or 3D meshes displayed as digital scenery. The serverless-first architecture leveraged 24 Amazon SageMaker AI endpoints for custom models and Amazon Bedrock for foundation models, orchestrated by AWS Lambda, to achieve sub-2-minute round-trip processing with zero tolerance for failure during live shows. This hybrid approach successfully supported 7 performances, demonstrating production-grade generative AI for interactive entertainment while highlighting challenges in cost management and infrastructure-as-code.
  • US Bank - US Bank implemented a generative AI solution in their contact centers, leveraging Amazon Connect, Contact Lens for real-time transcription, and Amazon Q in Connect for intent detection and orchestration. The system uses Amazon Bedrock with Anthropic's Claude for retrieval-augmented generation against tagged knowledge bases, providing agents with real-time, on-demand recommendations to reduce manual searches and improve call handling. This human-in-the-loop approach, currently in a production pilot, aims to enhance efficiency and automate post-call tasks.
  • Vellum - Vellum developed a natural language agent builder, an LLM that creates other LLM-based agents from conversational descriptions, democratizing agent development beyond traditional coding or visual interfaces. Key LLMOps practices include designing high-level tool abstractions to minimize agent errors, employing a pragmatic testing strategy combining qualitative and rigorous methods, and leveraging comprehensive execution monitoring to iteratively improve agent performance. The system also enhances user experience by parsing agent-generated structured text into interactive UI elements and offers flexible deployment options like SDK code, API endpoints, or one-click applications.
  • Veradigm - Veradigm integrated AWS HealthScribe and HealthLake into its Practice Fusion EHR to combat clinician burnout by automating clinical documentation. HealthScribe generates structured notes from patient-clinician conversations, enhanced by patient context from the FHIR-compliant HealthLake, processing 60 million annual visits. This integration saves clinicians approximately two hours daily on documentation, achieves a 65% no-training adoption rate, and improves patient focus, with a goal of zero-edit note generation.
  • Vercel - Vercel successfully deployed three production AI agents for internal workflow automation by systematically identifying "boring, repetitive" tasks employees disliked, countering the high failure rate of AI projects. These agents include a lead processing agent that automated sales qualification, an anti-abuse agent reducing content moderation time by 59%, and a data analyst agent for natural language SQL querying, all integrated with existing systems like Salesforce and Slack. Their core methodology involved asking employees "What do you hate most about your job?" to pinpoint low-cognitive-load tasks suitable for current LLMs, leveraging a specialized, human-in-the-loop architecture built with the Vercel AI SDK.
  • Volkswagen Group - Volkswagen Group Services developed an AI platform with AWS to automate automotive marketing content generation and compliance, addressing slow manual processes, pre-production vehicle confidentiality, and global regulatory bottlenecks. The solution employs LLMs for prompt enhancement, fine-tuned diffusion models (DreamBooth, LoRA) on proprietary vehicle imagery (including CAD digital twins) for brand-accurate image generation, and multi-stage evaluation using vision-language models for automated component accuracy and brand guideline compliance. This significantly reduced content production time from weeks to minutes, ensured automated compliance, and provided a reusable platform for various organizational use cases.
  • Vxceed - Vxceed developed an LLM-powered multi-agent system on Amazon Bedrock (Anthropic Claude 3.5 Sonnet) to generate personalized sales pitches for CPG loyalty programs, addressing low retailer adoption in emerging markets. This production-scale solution, leveraging RAG, serverless architecture, and guardrails, achieved 95% response accuracy and 90% query automation, leading to a 5-15% increase in program enrollment and significant reductions in processing and support times.
  • Wakam - Wakam, an insurance company, initially struggled with in-house RAG chatbot development due to maintenance burden and low adoption for addressing knowledge silos. They successfully pivoted to a commercial AI agent platform, integrating diverse enterprise data sources and implementing a dual-layer permission system for secure data access. This strategy, coupled with extensive change management and empowering employees to build 136 agents, led to 70% adoption and a 50% reduction in legal contract analysis time within two months.
  • Wayve - Wayve utilizes end-to-end foundation models for autonomous driving, replacing traditional modular systems with a single neural network that maps sensor inputs directly to driving actions. This architecture, trained on massive diverse data, enabled rapid global scaling to 500 cities within a year and supports multi-modal outputs like driving, simulation, and natural language explanations. The production model is highly compressed to operate within a 75-watt power budget in vehicles, demonstrating zero-shot transfer capabilities to new geographies.
  • Weaviate - Glowe developed a domain-specific agentic AI for personalized Korean skincare recommendations, addressing the limitations of generic systems in understanding ingredient interactions and user outcomes. It employs a dual embedding strategy using Weaviate's named vectors, separating product metadata from TF-IDF weighted effect embeddings derived from 94,500 user reviews processed by Gemma 3 12B. This system leverages Weaviate for vector search and Gemini 2.5 Flash for routine generation, all accessible via an Elysia-powered agentic chat interface for nuanced, context-aware guidance.
  • Wesco - Wesco, a B2B supply chain company, deployed enterprise-scale GenAI and agentic AI by building a composable platform with robust LLMOps, including prompt engineering, fine-tuning with LoRA, and multi-agent architectures for use cases like fraud detection. This involved comprehensive observability using tools like Langfuse and a strong governance framework, resulting in 50+ deployed applications that enhance productivity and optimize supply chain operations.
  • Western Union / Unum - Western Union and Unum Insurance utilized an agentic AI framework, built on AWS Transform and partner solutions (Accenture, Pega), to modernize their legacy COBOL mainframe systems into cloud-native applications. This composable architecture orchestrated specialized AI agents to automate end-to-end transformation, processing millions of lines of code and extracting business rules. Key results included Western Union converting 53,000 COBOL lines to Java in 1.5 hours and halving project timelines, while Unum achieved a 3-month COBOL-to-cloud migration (vs. 7 years) and eliminated 7,000 annual manual hours in claims management.
  • WEX - WEX developed "Chat GTS," a production agentic AI system, to automate over 40,000 annual IT support requests for its Global Technology Services team. Leveraging Amazon Bedrock, AgentCore Runtime, and Step Functions, the system employs specialized agents following SOA principles to handle tasks like network troubleshooting and autonomous, event-driven EBS volume management via SSM documents and MCP tools. This platform moved from pilot to production in under three months, now serving over 2,000 internal users by combining chat-initiated interactions with autonomous incident response workflows.
  • WhyHow.ai - WhyHow.ai developed a legal tech system to rapidly identify class action and mass tort cases, claiming to compress a process that traditionally takes months into minutes. Their architecture integrates knowledge graphs to structure scraped web data, multi-agent systems for automated workflows with extensive guardrails, and RAG for generating personalized legal reports from relevant subgraphs. This hybrid approach uses traditional ML for precise filtering and LLMs for system integration, providing law firms with early, targeted case intelligence.
  • Windsurf - Windsurf tackles the challenge of generating contextually relevant code by integrating with large codebases, adhering to organizational standards, and aligning with personal developer preferences. Their solution employs a sophisticated context management system that combines dynamic user behavioral heuristics (e.g., cursor position, file access) with static codebase state (e.g., code, documentation, rules). This system optimizes for context relevance and selection rather than volume, leveraging GPU optimization for efficient, real-time context discovery and processing at scale.
  • Wipro PARI - Wipro PARI implemented an LLMOps solution using Amazon Bedrock with Anthropic Claude models to automate the generation of Programmable Logic Controller (PLC) ladder text code from complex industrial process requirements. This system employs a multi-stage pipeline with advanced prompt engineering, iterative generation, and a hybrid rectification/validation framework to ensure compliance with IEC 61131-3 standards and handle complex logic. The solution reduced code generation time from 3-4 days to approximately 10 minutes per query, achieving an average validation completion rate of 85% and saving 5,000 work-hours by assisting industrial engineers.
  • Wobby - Wobby developed production analytics agents (Quick, Deep, Steward) that integrate with a custom-built semantic layer to provide business teams with data warehouse insights via Slack/Teams. This multi-agent system encodes business logic in the semantic layer, allowing analytics agents to query logical concepts, while a Steward agent maintains it. Their pragmatic approach prioritizes prompt-based logic, comprehensive testing over early evals, and latency optimization for real-world adoption.
  • Woowa Brothers - Woowa Brothers developed QueryAnswerBird (QAB), an LLM-powered AI data analyst utilizing GPT-4, RAG, and LangChain to enhance employee data literacy by converting natural language into SQL queries. Its multi-chain RAG architecture, featuring a Router Supervisor, integrates unstructured data pipelines with vector stores for company-specific knowledge, enabling query generation, interpretation, and data discovery. The system was built with robust LLMOps practices, including over 500 A/B tests, custom evaluation, and monitoring, delivering high-quality SQL responses via Slack within 30-60 seconds.
  • Writer - Writer evolved their enterprise RAG system from vector search, which struggled with accuracy due to chunking and disambiguation in concentrated, specialized data, to a sophisticated graph-based approach. This involved using specialized models for graph conversion, storing graph data as JSON in a Lucene-based search engine for scalability, and implementing fusion-in-decoder techniques to leverage textual relationships for enhanced context. The resulting hybrid system demonstrated superior accuracy and faster response times compared to seven vector search systems in benchmarking, effectively addressing hallucination and performance issues for enterprise knowledge.
  • Xelix - Xelix developed an AI-powered help desk to automate responses for accounts payable teams overwhelmed by vendor inquiries. The system employs a multi-stage pipeline that classifies incoming emails, performs hierarchical vendor identification, and uses machine learning to match extracted invoice details against ERP data. A context-augmented LLM then synthesizes this validated information into pre-generated responses, complete with confidence scores, enabling AP professionals to efficiently review and send accurate replies.
  • Xomnia - Xomnia's GenAI governance framework for production LLM systems addresses access control, unstructured data quality, and LLMOps monitoring. For access control, they implement self-service prototyping with Open WebUI, integrate LLMs into workflows via extensions, and use API gateways for centralized policy enforcement like PII redaction. Data quality involves detecting contradictions and redundancies in knowledge bases using similarity search and LLM-based classification, while LLMOps monitoring utilizes tracing platforms like Langfuse and dynamic golden datasets for continuous testing.
  • Yahoo! Finance - Yahoo! Finance, in collaboration with AWS, developed a multi-agent financial research and question answering system to democratize access to financial insights for retail investors. This system employs a supervisor-subagent architecture, where specialized agents utilize RAG and tool calling to process vast, heterogeneous financial data, including SEC filings and market data. Built on Amazon Bedrock AgentCore, it features asynchronous execution, Bedrock Guardrails for safety, and a hybrid human/AI evaluation strategy, achieving production scale with query costs of 2-5 cents and latencies of 5-50 seconds.
  • YouTube - YouTube adapted Google's Gemini LLM for video recommendations by creating "Semantic IDs" to tokenize videos and continuously pre-training the model to understand both natural language and this new video language. This "bilingual LLM" uses generative retrieval to personalize recommendations, yielding significant quality improvements, especially for cold-start scenarios and fresh content. Production deployment at YouTube's scale necessitated over 95% cost reductions through optimizations like offline inference to overcome the prohibitive serving costs of transformer models.
  • Zalando - Zalando deployed a Multimodal LLM-as-a-Judge framework to scale product retrieval evaluation for its e-commerce platform, replacing slow and expensive human annotation. This system automatically generates context-specific guidelines and assesses query-product relevance using both textual and visual data, achieving human-comparable accuracy. It processes 20,000 query-product pairs in 20 minutes, dramatically reducing evaluation time and cost, enabling continuous search quality monitoring in production.
  • Zalando - Zalando's Partner Tech division used an LLM-powered Python tool with GPT-4o to migrate 15 B2B applications from two legacy UI component libraries, addressing significant technical debt. Through iterative prompt engineering and example-driven learning, the tool achieved over 90% migration accuracy at under $40 per repository, substantially reducing manual effort. Despite its effectiveness, human oversight remained crucial for visual verification and handling complex design system differences and LLM limitations like occasional hallucinations.
  • Zalando - Zalando deployed an LLM-based Content Creation Copilot using OpenAI's GPT models to automate product attribute extraction from images, addressing manual content enrichment bottlenecks and quality issues in their e-commerce workflow. The system leverages custom prompt engineering and a translation layer to map LLM output to internal attribute codes, enriching approximately 50,000 attributes weekly with 75% accuracy. This human-in-the-loop solution improves efficiency and data coverage while allowing copywriters to maintain final decision authority.
  • Zalando - Zalando developed a multi-stage LLM pipeline to automate the analysis of thousands of incident postmortems, specifically for datastore technologies, to overcome the scalability limitations of manual review and extract strategic insights on recurring failure patterns. This pipeline, utilizing models like Claude Sonnet 4 and a human-in-the-loop approach, summarizes, classifies, and identifies common failure themes, reducing analysis time from days to hours and delivering a 3x productivity gain. The system uncovered critical patterns, such as misconfigurations and capacity issues, leading to actionable infrastructure investments like automated change validation that prevented 25% of subsequent datastore incidents.
  • Zapier - Zapier developed an AI Agents platform for non-technical users, facing significant challenges in production due to the non-deterministic nature of AI and user behavior. They implemented a "data flywheel" for continuous improvement, featuring comprehensive instrumentation, sophisticated explicit and implicit feedback collection, and a hierarchical evaluation framework. This framework includes unit tests, trajectory evaluations, and A/B tests, prioritizing real user satisfaction over laboratory metrics for iterative refinement of their AI agent system.
  • Zectonal - Zectonal developed a Rust-based AI agentic framework for multimodal data quality monitoring, replacing traditional rules with LLM function tool calling to dynamically detect defects and anomalies. This high-performance framework, deployed as a single binary, supports multiple LLM providers (OpenAI, Anthropic, Ollama) and includes "Agent Provenance" for comprehensive audit trails of agent decisions. It enables flexible cloud or on-premise operations, addressing enterprise needs for security, performance, and transparency.
  • ZenCity - ZenCity developed an AI platform to help local governments synthesize diverse community feedback from sources like social media, surveys, and 311 requests, addressing the challenge of making this data actionable for non-technical officials. Their multi-layered architecture uses custom ML for sentiment and topic modeling, pre-computes data for LLM efficiency, and employs LLM-driven agents with MCP servers to generate on-demand queries and automated, personalized briefs for workflows like budget planning. This system processes millions of data points daily, ensuring accurate, cited insights inform government decisions while maintaining multi-tenancy security.
  • Zillow - Zillow's StreetEasy deployed two LLM-powered features: "Instant Answers" provides pre-generated responses to common property FAQs using BERTopic for topic modeling and chain-of-thought prompting to ensure accuracy and cost-efficiency. The "Easy as PIE" feature generates personalized agent bio summaries based on agent history and user preferences to facilitate better agent-buyer matching. Both implementations prioritize data quality, ethical AI with Fair Housing guardrails, and scalable, cost-effective pre-computation strategies for production use.
  • Zillow - Zillow developed an AI-driven user memory system to dynamically personalize real estate discovery, addressing the challenge of evolving user preferences over long shopping journeys. This system employs a dual-pipeline architecture, combining batch processing for stable long-term preferences with real-time streaming for transient behavioral signals, to create a comprehensive understanding of user intent. It integrates components like preference profiles, recency weighting (sketched after this list), affordability models, and embeddings to power personalized search, recommendations, and notifications across the platform.
  • Zoom - Zoom's AI Companion 3.0 is an agentic AI system designed for meeting intelligence and productivity automation, transforming conversations into actionable outcomes through automated planning and execution. It employs a federated AI architecture, leveraging proprietary SLMs for prompt enrichment and cost/latency optimization, alongside third-party frontier LLMs via AWS Bedrock, all built on AWS microservices with OpenSearch for RAG. This enables multi-agent workflows for tasks like cross-meeting analysis and intelligent scheduling, significantly reducing administrative overhead at scale.
  • Zoro UK - Zoro UK, an e-commerce platform with 3.5 million products, implemented DSPy to normalize and sort inconsistent product attributes from 300+ suppliers, a task intractable to handle manually across 75,000 attribute types. Their production system employs a two-tier architecture: Mistral 8B classifies each attribute's sorting type, routing complex semantic sorting to GPT-4, with a Python fallback for simple cases (a DSPy sketch appears after this list). DSPy automated prompt optimization and provided LLM agnosticism, enhancing product discoverability and user experience by logically ordering attributes.
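
A few of the patterns above recur often enough to be worth sketching in code. The snippets that follow are minimal, illustrative sketches under stated assumptions; they reconstruct the general pattern, not any company's actual implementation.

First, the intelligent query routing described in the TP ICAP entry: classify each incoming question, then dispatch it to a RAG workflow for unstructured data or a text-to-SQL workflow for structured data. The OpenAI client below is only a stand-in for whichever hosted model you use; the prompt, model name, and stub workflows are assumptions.

```python
# Query-routing sketch: RAG for free-text sources, text-to-SQL for tables.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ROUTER_PROMPT = """Classify the user question as exactly one word:
RAG - answerable from meeting notes and other free-text documents
SQL - answerable from structured CRM tables (accounts, contacts, deals)

Question: {question}"""

def route(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable model works
        temperature=0,
        messages=[{"role": "user",
                   "content": ROUTER_PROMPT.format(question=question)}],
    )
    label = resp.choices[0].message.content.strip().upper()
    return label if label in {"RAG", "SQL"} else "RAG"  # fail safe to RAG

def run_rag(question: str) -> str:          # stub for the unstructured path
    return f"[RAG answer for: {question}]"

def run_text_to_sql(question: str) -> str:  # stub for the structured path
    return f"[SQL answer for: {question}]"

def answer(question: str) -> str:
    workflow = run_text_to_sql if route(question) == "SQL" else run_rag
    return workflow(question)
```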
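The tiered model strategy in the Trellix entry is also a routing decision, but one driven by cost: a small, cheap model labels every alert, and only the suspicious minority reaches the expensive analysis model. Below is a sketch against the Bedrock Converse API; the model IDs are assumptions and may need an inference-profile prefix in your region.

```python
# Tiered-model triage sketch: cheap model screens, strong model investigates.
import boto3

bedrock = boto3.client("bedrock-runtime")

CHEAP_MODEL = "amazon.nova-micro-v1:0"                      # assumed triage tier
STRONG_MODEL = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # assumed analysis tier

def ask(model_id: str, prompt: str) -> str:
    resp = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]

def triage(alert: str) -> str:
    label = ask(CHEAP_MODEL,
                f"Label this security alert as BENIGN or SUSPICIOUS:\n{alert}")
    if "SUSPICIOUS" in label.upper():
        # Escalate only the interesting minority to the expensive model.
        return ask(STRONG_MODEL,
                   f"Investigate this alert and draft an auditable incident report:\n{alert}")
    return "auto-closed: benign"
```

The economics only work because of the skew: if the vast majority of alerts are benign, the strong model sees a small fraction of total traffic.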
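The Treater entry layers deterministic checks under LLM judges that must explain themselves before voting, and keeps verdicts binary so they stay actionable. A minimal version of that layering, with toy rules and an assumed prompt and model:

```python
# Binary pass/fail evaluator sketch: cheap rules first, explaining judge second.
import json
from openai import OpenAI

client = OpenAI()

BANNED_PHRASES = ["lorem ipsum", "as an ai language model"]  # toy rule set

JUDGE_PROMPT = """Assess the content against the guideline. Write your
step-by-step reasoning in "explanation" BEFORE deciding "pass".
Guideline: {guideline}
Content: {content}
Respond as JSON: {{"explanation": "...", "pass": true or false}}"""

def deterministic_checks(content: str) -> bool:
    lowered = content.lower()
    return not any(phrase in lowered for phrase in BANNED_PHRASES)

def llm_judge(content: str, guideline: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content":
                   JUDGE_PROMPT.format(guideline=guideline, content=content)}],
    )
    return json.loads(resp.choices[0].message.content)

def evaluate(content: str, guideline: str) -> dict:
    if not deterministic_checks(content):
        return {"pass": False, "explanation": "failed rule-based check"}
    return llm_judge(content, guideline)
```

Putting "explanation" before "pass" in the output schema forces the model to generate its reasoning before committing to a verdict, which tends to improve judge reliability.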
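The LLM juries mentioned in the Uber PerfInsights entry can be approximated by sampling the same judge several times at nonzero temperature and taking a majority vote, which damps single-sample noise. The prompt, model, and jury size below are assumptions:

```python
# LLM-jury sketch: majority vote over independent judge samples.
import json
from openai import OpenAI

client = OpenAI()

def one_vote(finding: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        temperature=0.7,      # deliberate: we want varied, independent-ish samples
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content":
                   'Is this a real performance antipattern? '
                   'Respond as JSON: {"pass": true or false}.\n' + finding}],
    )
    return bool(json.loads(resp.choices[0].message.content).get("pass"))

def jury_verdict(finding: str, size: int = 5) -> bool:
    votes = [one_vote(finding) for _ in range(size)]
    return sum(votes) > size // 2  # strict majority passes the finding
```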
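The recency weighting in the second Zillow entry can be captured with plain exponential decay: every behavioral signal contributes a weight that halves after a fixed number of days, so recent behavior dominates without old signals vanishing outright. The half-life and the (timestamp, feature) event shape are assumptions for illustration:

```python
# Recency-weighted preference profile sketch using exponential decay.
import time
from collections import defaultdict

HALF_LIFE_DAYS = 14  # assumed: a signal loses half its weight every two weeks

def decay_weight(event_ts: float, now: float | None = None) -> float:
    now = now if now is not None else time.time()
    age_days = (now - event_ts) / 86_400
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def preference_profile(events):
    """events: iterable of (timestamp, feature), e.g. (ts, "has_parking")."""
    scores = defaultdict(float)
    for ts, feature in events:
        scores[feature] += decay_weight(ts)
    total = sum(scores.values()) or 1.0
    return {feature: s / total for feature, s in scores.items()}

# Example: a three-day-old signal outweighs a forty-day-old one.
events = [(time.time() - 3 * 86_400, "has_parking"),
          (time.time() - 40 * 86_400, "downtown")]
print(preference_profile(events))
```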
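Finally, the two-tier sorting from the Zoro UK entry maps naturally onto a DSPy signature: classify an attribute's sorting type, keep simple cases in plain Python, and route semantic orderings (think S < M < L < XL) to a stronger model. The signature fields, labels, and model below are assumptions, not Zoro UK's actual code:

```python
# DSPy sketch: classify an attribute's sorting type, then sort accordingly.
import re
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # stand-in model

class SortType(dspy.Signature):
    """Decide how the values of a product attribute should be ordered."""
    attribute_name: str = dspy.InputField()
    sample_values: str = dspy.InputField(desc="comma-separated example values")
    sort_type: str = dspy.OutputField(desc="one of: numeric, alphabetical, semantic")

classify = dspy.Predict(SortType)

def first_number(value: str) -> float:
    match = re.search(r"\d+(\.\d+)?", value)
    return float(match.group()) if match else float("inf")

def sort_values(name: str, values: list[str]) -> list[str]:
    kind = classify(attribute_name=name,
                    sample_values=", ".join(values)).sort_type
    if kind == "numeric":
        return sorted(values, key=first_number)   # "M2" < "M4" < "M10"
    if kind == "alphabetical":
        return sorted(values)
    return values  # semantic cases would be routed to a stronger model

# Example: sort_values("thread size", ["M10", "M2", "M4"]) -> ["M2", "M4", "M10"]
```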
