Madrigal Pharmaceuticals built an enterprise multi-agent platform to integrate, search, and synthesize information from diverse pharmaceutical datasets scattered across structured systems, unstructured documents, and external sources. Using LangChain's DeepAgents framework and LangSmith for observability, evaluation, and deployment, they created a modular skills-based architecture where specialized agents work in parallel under an orchestrator, with all data normalized through consistent tool interfaces. The system reduced development time for new use cases from weeks to hours, achieved production deployment in weeks rather than months, and enabled domain experts to contribute directly to agent skill development while maintaining pharmaceutical-grade accuracy and governance.
Madrigal Pharmaceuticals, a biopharmaceutical company focused on MASH (metabolic dysfunction-associated steatohepatitis) treatments, developed an enterprise-grade multi-agent research and intelligence platform to address a fundamental challenge in pharmaceutical operations: integrating, searching, and synthesizing information from diverse datasets at scale. This case study, published in April 2026, provides detailed insights into how a small team at a pharmaceutical company built and deployed a production LLM system that handles complex research workflows across multiple data sources while maintaining the rigor and governance required in a highly regulated industry.
The platform was designed to enable Madrigal employees to search, analyze, and synthesize relevant data across the enterprise within appropriate access controls, with every response clearly cited. The system incorporates role-based permissions and data governance guardrails to ensure users only access authorized information. What began as a solution to a single use case evolved into a general-purpose agentic platform capable of handling multiple pharmaceutical research workflows. The authors emphasize that the system’s value lies not just in solving individual problems but in creating a reusable pattern that compounds over time.
The core architectural decision was to move away from building a single monolithic system that attempts to handle all complexity in one place. Instead, Madrigal adopted a multi-agent approach where multiple specialized agents work in concert, each focused on doing one thing well while working toward an overarching goal. The system includes agents specialized for different functions: some handle search, others perform analysis, and others synthesize results. Coordinating these agents is an orchestrator that receives tasks and decides what needs to happen next, including which capabilities are required, which agents should run, what should happen in parallel, and when to bring everything back together.
The orchestrator is the key architectural component where intent meets execution. Critically, it doesn’t need to know the details of every domain; it only needs to know how to match each problem to the capabilities that can solve it. This separation keeps the system flexible: the orchestrator is built around modular workflows that reflect the varying research methods and requirements across the enterprise, so the platform grows by adding new workflows rather than by adding complexity to the orchestrator itself.
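To make the routing idea concrete, the sketch below is illustrative Python only, not Madrigal’s code or the DeepAgents API; the Orchestrator and AgentFn names are hypothetical. It shows an orchestrator that maps a task to registered capabilities without knowing anything about the domains behind them.

```python
# Illustrative sketch only -- not Madrigal's code or the DeepAgents API.
# Hypothetical names (Orchestrator, AgentFn) show how an orchestrator can map an
# incoming task to registered capabilities without knowing domain details.
from dataclasses import dataclass, field
from typing import Callable

AgentFn = Callable[[str], str]  # a specialized agent: task text in, result text out

@dataclass
class Orchestrator:
    # capability name -> agent that provides it
    registry: dict[str, AgentFn] = field(default_factory=dict)

    def register(self, capability: str, agent: AgentFn) -> None:
        self.registry[capability] = agent

    def route(self, task: str, required_capabilities: list[str]) -> dict[str, str]:
        """Dispatch the task to every agent whose capability is required."""
        return {cap: self.registry[cap](task) for cap in required_capabilities}

orch = Orchestrator()
orch.register("search", lambda task: f"search results for: {task}")
orch.register("synthesis", lambda task: f"synthesis of: {task}")
print(orch.route("MASH trial endpoints", ["search", "synthesis"]))
```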
LangChain’s DeepAgents framework provided the foundational agentic harness that enabled this architecture. The framework includes features such as a virtual filesystem, context management, modular skill capabilities, and checkpointing, all out of the box. This allowed the Madrigal team to focus on building pharmaceutical-specific intelligence rather than the mechanics of agent coordination. The harness implements a “divide and conquer” approach where intelligence is distributed across specialized agents and then coordinated, which the authors compare to how human brains work.
A critical innovation in the platform’s design is the adoption of a skills-based architecture inspired by Anthropic’s approach to modular capabilities. Instead of hardcoding workflows for specific use cases, Madrigal built each capability as a swappable skill that defines how to approach a type of problem: what to look for, how to reason about it, and what good output looks like. The orchestrator simply loads the right skill at the right time based on the task requirements.
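A minimal sketch of what a swappable skill might look like, assuming a simple registry pattern; the Skill and SkillLibrary names and fields are illustrative, not the actual DeepAgents or Anthropic skill format.

```python
# Hedged sketch of a skills-style registry; names and fields are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Skill:
    name: str
    approach: str         # how to reason about this class of problem
    output_criteria: str  # what good output looks like

class SkillLibrary:
    def __init__(self) -> None:
        self._skills: dict[str, Skill] = {}

    def add(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def load(self, name: str) -> Skill:
        # The orchestrator loads the right skill at runtime instead of
        # hardcoding a workflow per use case.
        return self._skills[name]

library = SkillLibrary()
library.add(Skill(
    name="evidence_review",
    approach="Prefer peer-reviewed sources; rank evidence by study design.",
    output_criteria="Every claim cited; conflicts between sources flagged.",
))
skill = library.load("evidence_review")
prompt = f"{skill.approach}\nOutput requirements: {skill.output_criteria}"
```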
This modular approach had profound implications for development velocity and platform scalability. When a new use case emerged, the team could define a new skill rather than building a new system or rewriting orchestration logic. This transformed development timelines from weeks to hours for new use cases. The skills-based architecture also enabled domain experts at Madrigal—pharmaceutical experts who understand evidence hierarchies, database trustworthiness, and what constitutes good output—to contribute directly to platform development. The feedback loop between domain experts and the system became significantly faster than previous approaches, with users able to identify logic issues and have them addressed quickly due to the modularity of the DeepAgents harness.
One of the most significant challenges in enterprise pharmaceutical environments is that information is scattered across diverse systems: structured databases, unstructured documents, external sources, and real-time APIs. Each data source behaves differently with different formats, access patterns, and expectations. Madrigal addressed this by making these differences invisible to the agents through a suite of parallelizable tool calls.
The solution involved normalizing every data source into a consistent tool interface. All data, regardless of origin, is normalized, stored in the same secure data warehouse, and made accessible through a uniform interface. From the agent’s perspective, it’s all just information that can be used. The authors identify this abstraction of data sources as the key enabler for scaling multi-agent systems in enterprise environments: it allows the system to operate across domains without rewriting logic each time a new data source is added.
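A minimal sketch of this normalization pattern, assuming a shared search signature; the DataSource protocol and the example sources are hypothetical stand-ins for the warehouse and document systems described above.

```python
# Minimal sketch (not Madrigal's implementation) of normalizing heterogeneous
# sources behind one tool signature, so agents see "just information".
from typing import Protocol

class DataSource(Protocol):
    def search(self, query: str) -> list[dict]:
        """Return records as {'content': str, 'source': str} regardless of origin."""
        ...

class WarehouseTable:
    def search(self, query: str) -> list[dict]:
        # In practice: a SQL query against the secure warehouse.
        return [{"content": f"row matching {query}", "source": "warehouse"}]

class DocumentStore:
    def search(self, query: str) -> list[dict]:
        # In practice: vector or keyword retrieval over unstructured documents.
        return [{"content": f"passage about {query}", "source": "documents"}]

def search_all(sources: list[DataSource], query: str) -> list[dict]:
    # One tool interface over every source; agents never branch on source type.
    return [hit for src in sources for hit in src.search(query)]

hits = search_all([WarehouseTable(), DocumentStore()], "MASH biomarkers")
```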
The platform leverages parallelism extensively to improve both latency and the scope of analysis. The team recognized early that most complex research work doesn’t need to happen sequentially. When a task involves exploring multiple angles, querying multiple sources, or analyzing different dimensions, there’s no inherent reason to process these serially. Multiple agents run in parallel, each focusing on a different slice of the problem, and by the time the system reconvenes, it combines fully formed pieces of work. The orchestration agent can divide a research question into multiple parts distributed to sub-agents, which can themselves parallelize the analysis of individual datasets within their own work. This approach increases both accuracy and scope while decreasing latency.
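The fan-out/fan-in pattern might look something like the following asyncio sketch; the sub-agent function and the angle names are placeholders for the platform’s specialized agents, not its actual implementation.

```python
# Illustrative fan-out/fan-in sketch with asyncio; hypothetical sub-agent
# functions stand in for the platform's specialized agents.
import asyncio

async def run_subagent(angle: str, question: str) -> dict:
    # Each sub-agent works on one slice of the problem (one source, one dimension).
    await asyncio.sleep(0)  # placeholder for retrieval + LLM calls
    return {"angle": angle, "finding": f"analysis of {question} via {angle}"}

async def research(question: str, angles: list[str]) -> list[dict]:
    # Fan out: all slices run concurrently, then results are recombined.
    return await asyncio.gather(*(run_subagent(a, question) for a in angles))

results = asyncio.run(
    research("liver biopsy endpoints", ["trials_db", "literature", "safety_reports"])
)
```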
To make parallel execution work effectively, the system requires a mechanism to bring pieces back together cleanly. Madrigal heavily leveraged the virtual filesystem built into DeepAgents, which serves as a shared workspace where every agent can write what it finds and read what others have already done. This filesystem becomes the system’s collaborative memory—every result is stored, every source is tracked, and every intermediate step is available for reuse. Instead of passing information directly between agents, everything flows through this shared layer, keeping coordination simple even as the system scales. The authors provide a concrete example: if a user asks an agent to check all sources, even if hundreds of sources across many datasets have been analyzed and synthesized, the system knows exactly where to find every data point that went into the response.
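A rough sketch of the shared-workspace pattern, using an ordinary directory as a stand-in for the DeepAgents virtual filesystem; the Workspace class and its provenance fields are illustrative choices, not the framework’s actual interface.

```python
# Sketch of a shared-workspace pattern: agents write findings with provenance,
# and later steps read them back instead of passing messages agent-to-agent.
import json
from pathlib import Path

class Workspace:
    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def write(self, agent: str, name: str, content: str, sources: list[str]) -> Path:
        # Every result carries the sources it came from, so any data point in the
        # final answer can be traced back to a file in the workspace.
        path = self.root / f"{agent}__{name}.json"
        path.write_text(json.dumps({"content": content, "sources": sources}))
        return path

    def read_all(self) -> list[dict]:
        return [json.loads(p.read_text()) for p in sorted(self.root.glob("*.json"))]

ws = Workspace(Path("/tmp/agent_workspace"))
ws.write("search_agent", "trials", "3 relevant phase 3 trials found", ["ClinicalTrials.gov"])
notes = ws.read_all()  # the synthesis agent reads fully formed pieces of work
```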
The authors emphasize that the hardest part of building across multiple data sources isn’t retrieval itself but knowing whether retrieval is working correctly. When a response is incorrect, the challenge is diagnosing whether the problem stems from data not being in the index, the query not surfacing the right information, or the agent trusting a source it shouldn’t have. This diagnostic challenge is particularly acute in pharmaceutical contexts where accuracy is non-negotiable.
LangSmith’s tracing capabilities provided what the authors describe as a transformative level of visibility into agent behavior. They use a striking metaphor: before tracing, they could only observe stimulus-response behavior, not actual cognition. With LangSmith tracing, they describe the experience as “going from basic psychology to neuroimaging” for their agentic AI system—they could finally see what was actually happening inside the brain of the system. Every tool call, every retrieved chunk, and every agent decision became visible and tagged by session ID.
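In the LangSmith Python SDK, per-call tracing of this kind can be sketched with the @traceable decorator; the function names and the session_id metadata key below are illustrative, and environment variable names vary across SDK versions, so treat this as a hedged sketch rather than Madrigal’s setup.

```python
# Hedged sketch of per-run tracing with the LangSmith SDK's @traceable decorator.
import os
from langsmith import traceable

os.environ.setdefault("LANGSMITH_TRACING", "true")  # plus LANGSMITH_API_KEY in the environment

@traceable(run_type="tool", name="retrieve_guidelines")
def retrieve_guidelines(query: str) -> list[str]:
    # Each call becomes a traced run: inputs, outputs, latency, and errors are recorded.
    return [f"guideline passage about {query}"]

@traceable(name="answer_question")
def answer_question(question: str) -> str:
    chunks = retrieve_guidelines(question)  # nested run appears under the parent trace
    return f"Answer drawn from {len(chunks)} retrieved chunks."

# Tag the trace with a session identifier so runs can be grouped per user session.
answer_question(
    "fibrosis staging criteria",
    langsmith_extra={"metadata": {"session_id": "demo-session-001"}},
)
```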
The tracing capability proved essential not just for debugging but for understanding system behavior at a fundamental level. If tracing is neuroimaging for an agent system, the authors explain, LangSmith is the fMRI that shows which regions activated, in what order, and whether the result made sense. This level of visibility enabled the team to diagnose issues precisely and understand how different components of the multi-agent system were interacting in practice.
Madrigal implemented LangSmith’s agent evaluation framework with trace-level evaluations on full agent runs. They designed LLM-as-judge graders to mirror real end-user business feedback forms, scoring outcomes rather than exact paths. This approach reflects a pragmatic understanding that in complex multi-agent systems, there may be multiple valid paths to a correct answer, and what matters most is the quality of the final output.
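An outcome-focused LLM-as-judge grader might look like the following sketch; the rubric, scoring keys, and judge_model callable are illustrative and not Madrigal’s actual feedback form.

```python
# Sketch of an LLM-as-judge grader that scores outcomes against a rubric modeled
# on a business feedback form, rather than checking the agent's exact path.
import json

RUBRIC = """Score the response 1-5 on each criterion, returning JSON with keys
accuracy (claims supported by cited sources), completeness (every part of the
question answered), and citations (every statement traceable to a source)."""

def judge(question: str, response: str, judge_model) -> dict:
    # judge_model: any callable taking a prompt string and returning the model's text
    prompt = f"{RUBRIC}\n\nQuestion:\n{question}\n\nResponse:\n{response}"
    scores = json.loads(judge_model(prompt))
    # Collapse rubric scores into a single pass/fail outcome signal for dashboards.
    scores["passed"] = all(v >= 4 for v in scores.values())
    return scores
```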
What the authors identify as the most impactful aspect of their evaluation approach is the feedback loop from production to evaluation: production failures feed back into their LangSmith datasets automatically. Every meaningful error becomes a new test case, causing the evaluation suite to grow from real failures rather than synthetic scenarios. This creates a continuous improvement cycle where the system learns from actual deployment challenges rather than hypothetical test cases constructed during development. This approach is particularly valuable in pharmaceutical contexts where edge cases and failure modes may be difficult to anticipate during initial development but become clear in production use.
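A hedged sketch of that feedback loop using the LangSmith Client; the dataset name and example fields are hypothetical, and exact method arguments may differ across langsmith SDK versions.

```python
# Sketch of turning a production failure into a regression test case.
from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment
dataset = client.read_dataset(dataset_name="production-failures")

def capture_failure(question: str, bad_answer: str, expected_behavior: str) -> None:
    # Each meaningful production error becomes a new evaluation example, so the
    # suite grows from real failures rather than synthetic scenarios.
    client.create_example(
        inputs={"question": question},
        outputs={"expected": expected_behavior, "observed_failure": bad_answer},
        dataset_id=dataset.id,
    )

capture_failure(
    question="Which guideline version applies to pediatric dosing?",
    bad_answer="Cited a superseded 2019 guideline.",
    expected_behavior="Cite the current guideline version and flag superseded ones.",
)
```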
For a small team building for enterprise use, Madrigal faced inherent tension between the need for reliability and scalability and the lack of a large engineering team to run complex infrastructure. LangSmith Deploy resolved this tension by allowing them to deploy their agent graph as a managed service with state persistence, concurrent sessions, and real-time streaming to the UI without rewriting core logic.
The deployment infrastructure includes a CI/CD pipeline in GitHub that triggers automatic redeployment, enabling skill updates to ship without manual steps. The authors describe the acceleration as a “welcome surprise”—going from prototype to enterprise use took weeks rather than the months they had budgeted. This rapid deployment capability proved critical for a small team that needed to deliver enterprise-grade reliability without enterprise-scale infrastructure resources.
An unexpected benefit was the ready-to-deploy UI integrations available through the well-defined API endpoints LangSmith provides. This removed another common bottleneck for small teams trying to make prototypes accessible to enterprise users. Combined with LangSmith Observability for monitoring real-world usage, this enabled the team to identify common use cases and friction points in actual deployment.
The authors articulate that LangChain and LangSmith represent more than just a toolset for their team—they describe it as a “platform philosophy that automates our process.” The integrated developer experience combines common agentic patterns and capabilities (like deep agents, middleware, and skills) so the team isn’t building the agent framework itself but rather building pharmaceutical intelligence on top of it.
Several aspects of this platform philosophy proved particularly valuable. Full pipeline visibility, a fast iterative development cycle, and built-in monitoring gave the team the community-driven innovation of open source without sacrificing what enterprise-grade production systems demand. Native integration with deployment and scale meant that agents built with LangChain became immediately accessible across the company through LangSmith deployment, addressing a common bottleneck for small teams.
The authors note that LangChain’s approach represents a rare ability to absorb the best thinking from the open-source community into the framework while staying deeply responsive to the realities of enterprise teams. This continuous feedback loop between developers in the field and enterprise customers using the platform is, in their view, exactly what AI development demands. They specifically highlight stellar enterprise support, active solicitation of platform feedback, and how consistently that feedback surfaces in framework and platform evolution.
The case study emphasizes that the real value of the platform shows up in compounding effects over time. The authors identify several concrete manifestations of this compounding value: development for new use cases that would typically take weeks is reduced to hours; systems that required manual effort become self-sustaining; new use cases don’t require new infrastructure; and most importantly, subsequent deployments happen progressively faster.
The platform is designed to scale across Madrigal’s search, analysis, and research needs, with the modularity of the architecture enabling rapid expansion. The authors note that domain experts can now contribute directly to platform development through the skills-based architecture, accelerating the feedback loop between user needs and system capabilities. When users notice something about the logic of the system, the modularity allows the team to easily address shortcomings without redesigning core components.
From a strategic perspective, the authors emphasize that confidence in their tooling choices freed them to focus on what actually matters: building a system that transforms data into meaningful impact for MASH patients. The executive quotes reinforce this theme—the CIO emphasizes that LangSmith provides end-to-end visibility across the agent lifecycle from development through continuous improvement, which is critical trust for high-stakes pharmaceutical AI systems. The Global Head of AI and Data Science emphasizes that the real ROI of agents shows up when you can reuse patterns rather than just shipping single use cases, and that LangSmith makes this real through managed deployment, tracing, and evals all in one place.
While this case study provides valuable technical detail about building multi-agent systems in pharmaceutical contexts, it’s important to note that it’s published by LangChain as a customer success story and naturally emphasizes the positive aspects of using their platform. The authors are employees of Madrigal and describe LangChain as a “genuine thought partner,” which suggests a close collaborative relationship that may influence the presentation.
Several claims warrant balanced consideration. The assertion that development time for new use cases dropped from weeks to hours is impressive but likely represents best-case scenarios for relatively straightforward new skills rather than fundamentally novel capabilities. The claim that production deployment took weeks rather than months is notable, but the case study doesn’t provide detailed metrics on what “production-ready” means in this context—for instance, what volume of queries the system handles, what accuracy benchmarks were achieved, or how many users actively rely on the system.
The emphasis on abstraction and modularity is architecturally sound, but the case study doesn’t address potential tradeoffs. For example, normalizing all data sources to a consistent interface may involve losing source-specific optimizations or metadata that could improve retrieval quality. The parallel execution approach is powerful but can also introduce complexity in debugging when different agents produce conflicting information or when coordination failures occur.
The evaluation approach of using LLM-as-judge to score outcomes is pragmatic but introduces its own challenges—the judges themselves may have biases or blind spots, and the case study doesn’t discuss how they validate that the automated grading aligns with human expert judgment in pharmaceutical contexts where accuracy is critical. The automatic incorporation of production failures into the evaluation dataset is valuable but could potentially bias the system toward avoiding past failures rather than generalizing well to novel situations.
Despite these caveats, the case study provides genuine value in demonstrating how a small team in a regulated industry successfully deployed multi-agent LLM systems to production. The architectural patterns—particularly data source abstraction, skills-based modularity, and collaborative memory through shared filesystems—represent transferable insights for other organizations facing similar challenges with diverse data sources and complex research workflows. The emphasis on observability and evaluation as core components rather than afterthoughts reflects mature LLMOps thinking that goes beyond simple prototype development.