This case study explores the architectural challenges of deploying multi-agent AI systems in production, primarily drawing from a financial services credit decisioning system that experienced critical failures due to race conditions and cache invalidation issues. The speaker, a Databricks engineer with experience at AWS, presents distributed systems patterns adapted for multi-agent coordination, including orchestration versus choreography patterns, immutable state management with versioning, circuit breakers for failure recovery, and saga patterns for compensation. The solution involves implementing production-grade architecture using Databricks components including LangGraph for orchestration, Unity Catalog for governance, Delta Lake for state management, and MLflow for observability, resulting in systems capable of running 24/7 across billions of transactions with proper failure handling and rollback capabilities.
This case study presents a comprehensive examination of the challenges involved in scaling from single-agent to multi-agent AI systems in production environments, focusing on lessons learned from deploying credit decisioning systems in financial services. The presenter, Sandy, brings 18 years of experience building distributed data systems at AWS and Databricks, with two years specifically focused on deploying multi-agent AI systems in production. The core thesis is that most multi-agent system failures stem not from poor AI implementation but from inadequate distributed systems architecture. The case study is particularly valuable because it draws direct parallels between traditional distributed systems engineering patterns and multi-agent AI deployment, offering a framework that treats agent coordination as a systems architecture problem rather than purely an AI problem.
The primary case study involves a credit decisioning system deployed for a financial services company. Initially, the system consisted of a single agent performing credit score calculations, which worked flawlessly in demos and ran for two weeks in production without a single issue. However, when the team scaled to five agents encompassing credit score calculation, income verification, risk assessment, fraud detection, and final approval, critical failures emerged within three days of deployment. Twenty percent of credit decisions showed incorrect risk ratings, and customers who should have been flagged for review were being approved, causing significant alarm within the business.
The root cause analysis, which took two days to identify, revealed a classic distributed systems race condition involving cache invalidation. The credit score agent would calculate a score of 750 and write it to the PostgreSQL database, but when the risk assessment agent read the same customer’s score 500 milliseconds later, it retrieved a value of 680. This discrepancy occurred because a caching layer sat between the agents and the database. While writes to PostgreSQL succeeded, the cache was not being invalidated, causing the risk agent to read stale data from the cache rather than the updated value from the database. This led to incorrect risk assessments based on outdated credit scores.
This failure illustrates a fundamental misunderstanding about multi-agent systems: the problem was not with the model quality, prompt engineering, or AI capabilities, but rather with the architecture failing to account for distributed systems concerns like cache coherency, state synchronization, and race conditions. The team had built what appeared to be five independent agents without considering that they had actually created a distributed system with complex coordination requirements.
A critical insight from the case study is that coordination complexity in multi-agent systems grows exponentially, not linearly. Moving from one agent to five agents doesn’t result in five times the complexity but rather approximately 25 times more complexity. With a single agent, there are zero coordination problems. Two agents create at least one connection requiring coordination. Five agents create at least ten potential connections, and each connection represents a potential failure point, race condition, or state synchronization problem. This exponential growth in coordination complexity explains why teams often experience shocking failures when scaling from working single-agent demos to multi-agent production systems.
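The connection counts quoted above follow the pairwise formula n(n-1)/2. A minimal sketch (the helper name is illustrative, and this counts only direct agent-to-agent links, not higher-order interactions):

```python
def coordination_links(n_agents: int) -> int:
    """Potential agent-to-agent connections: every unordered pair, n*(n-1)/2."""
    return n_agents * (n_agents - 1) // 2

# 1 agent -> 0 links, 2 agents -> 1, 5 agents -> 10; each link is a
# potential failure point, race condition, or state synchronization problem.
links_by_count = {n: coordination_links(n) for n in (1, 2, 5, 10)}
```

Even this conservative pairwise count makes the scaling cliff visible: ten agents already imply 45 coordination surfaces to reason about.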
The case study presents two fundamental patterns for coordinating multiple agents, each suited to different use cases and organizational requirements.
Choreography represents an event-driven, decentralized approach where agents coordinate through events published to a message bus. In this pattern, a research agent completes its work and publishes a “research completed” event. An analysis agent subscribes to this event type, processes the data when the event arrives, and publishes an “analysis ready” event. A report agent then picks up that event and generates the final report. There is no central coordinator managing the workflow; each agent operates autonomously, listening for events it cares about and publishing events when it completes its work.
The advantages of choreography include loose coupling between agents, ease of adding new agents by simply subscribing them to relevant events, high autonomy for individual agents, and excellent scalability. However, the significant disadvantage is debugging complexity. When something fails in a choreographed system, determining which agent failed to publish, whether events were consumed correctly, or if events were consumed multiple times becomes extremely challenging. The case study emphasizes that choreography requires bulletproof observability with strong event tracing capabilities. Without the ability to trace events through the system, debugging becomes nearly impossible.
Choreography is appropriate when workflows are naturally event-driven, when agents need to operate independently, when new agents are being added frequently without wanting to update a central coordinator, and critically, only when strong observability infrastructure is in place. The presenter warns against teams choosing choreography simply because it feels more “agentic” or autonomous, only to spend months firefighting because they cannot effectively debug distributed event flows.
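The research/analysis/report flow described above can be sketched as an in-process publish-subscribe loop. This is a minimal illustration, not the talk's implementation: event names like `research_completed` are taken from the example, the `EventBus` class is hypothetical, and a production system would use a durable message broker rather than in-memory dispatch:

```python
from collections import defaultdict
from typing import Callable

class EventBus:
    """Minimal in-process message bus: agents subscribe to event types."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        for handler in self._subscribers[event_type]:
            handler(payload)

bus = EventBus()
pipeline_log = []

# Each agent listens only for the events it cares about and publishes
# its own event on completion -- there is no central coordinator.
def research_agent():
    pipeline_log.append("research")
    bus.publish("research_completed", {"findings": ["f1"]})

def analysis_agent(event):
    pipeline_log.append("analysis")
    bus.publish("analysis_ready", {"recommendation": "approve"})

def report_agent(event):
    pipeline_log.append("report")

bus.subscribe("research_completed", analysis_agent)
bus.subscribe("analysis_ready", report_agent)
research_agent()  # the chain is driven entirely by events
```

Note how the debugging problem the text warns about is already visible here: nothing in `research_agent` records whether its event was consumed once, twice, or at all.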
Orchestration provides a centralized alternative where a workflow orchestrator calls each agent directly and manages all coordination. The orchestrator maintains the execution graph, calling Agent A first, waiting for the result, then calling Agents B and C in parallel if appropriate, managing the parallelism itself rather than relying on agents to coordinate. The orchestrator handles state management, retry logic, and logging for every step. Agents in this pattern are relatively simple: they receive input, perform their specific work, and return output without needing to know about other agents or coordination logic.
The advantages of orchestration include easier debugging since the orchestrator provides a single source of truth for execution state, ability to implement rollback and compensation for failures, centralized dashboards showing the entire system state, and greater control over execution flow. Orchestration is particularly valuable for workflows with complex dependencies requiring central management, situations requiring rollback capabilities, and when complete visibility into system state is more important than agent autonomy.
In the financial services context, the presenter notes that orchestration is used almost exclusively because the ability to debug and roll back matters more than agent autonomy. When something goes wrong with a credit decision, the business needs to know exactly which agent made what call, in what order, and with what data. Orchestration provides this visibility while choreography does not.
The decision framework for choosing between these patterns considers two axes: workflow complexity and autonomy requirements. Simple workflows with high autonomy requirements suit choreography. Complex workflows with low autonomy tolerance suit orchestration. The most challenging quadrant involves complex workflows requiring high autonomy, which may require hybrid patterns combining choreography with saga patterns for compensation.
The case study identifies state management as a critical challenge that often breaks multi-agent systems at scale. The common but problematic approach involves shared mutable state where multiple agents read and write to the same database records concurrently. For example, Agent A reads a credit score of 680, calculates a new value, and writes 750. Agent B simultaneously reads 680, calculates a different value, and writes 720. With last-write-wins semantics, Agent A’s update disappears, creating a lost update problem.
While modern databases provide protections like row locks and isolation levels, these mechanisms must be used correctly with explicit transactions, serializable isolation, and select-for-update statements. Many teams use default isolation levels without explicit locks and inadvertently ship race conditions to production. The presenter acknowledges making this exact mistake, assuming the database would handle concurrency automatically when in fact explicit handling was required.
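The lost-update interleaving can be made concrete in a few lines. The deltas are illustrative, and the `threading.Lock` in the second half stands in for what an explicit database transaction with `SELECT ... FOR UPDATE` provides:

```python
import threading

# Shared record, standing in for a database row.
record = {"credit_score": 680}

# Interleaved read-modify-write, shown step by step (no locking):
a_read = record["credit_score"]       # Agent A reads 680
b_read = record["credit_score"]       # Agent B reads 680 before A writes
record["credit_score"] = a_read + 70  # Agent A writes 750
record["credit_score"] = b_read + 40  # Agent B writes 720 -- A's update is lost
lost_update_result = record["credit_score"]

# With a lock, each read-modify-write is atomic, so both updates survive.
lock = threading.Lock()
record2 = {"credit_score": 680}

def adjust(delta):
    with lock:                        # serialize the whole read-modify-write
        current = record2["credit_score"]
        record2["credit_score"] = current + delta

adjust(70)
adjust(40)
serialized_result = record2["credit_score"]
```

The first half ends at 720 with Agent A's write silently gone; the locked version accumulates both updates, which is the behavior teams wrongly assume default isolation levels give them.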
The recommended solution involves immutable state snapshots with versioning. In this pattern, Agent A produces state version one, which is then sealed and immutable. State is stored in the orchestrator database as an append-only log using insert operations rather than updates. Agent A hands state version one to Agent B, which validates that the schema matches its expected data contract before processing. Agent B then produces state version two, also immutable, and inserts it as a new row rather than updating version one. When Agent B hands this to Agent C, the same process repeats with schema validation and version tracking at each handoff.
If Agent C fails, the system can roll back to version two. For debugging, developers can replay state evolution from version one through version N, seeing exactly what each agent received and produced. This approach eliminates race conditions by preventing concurrent modification of the same record, with each agent appending a new version instead of updating shared state.
The implementation uses frozen data classes in Python to ensure immutability, with each state object containing a version number, data payload, and creator information. The handoff function performs three critical operations: schema validation to enforce contracts between agents, version incrementing to create a new immutable state object, and execution of the next agent with that immutable state. Because agents cannot modify input state and can only produce new state, an entire class of bugs is prevented.
This immutability provides clear lineage where every state has a version and known creator. When problems occur, developers can trace back through state evolution. If version seven produces bad output, they can examine version six that went into the agent and version five before that, binary searching through state history to identify where things went wrong.
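The frozen-dataclass approach described above can be sketched as follows; the `REQUIRED_FIELDS` contract and field names are illustrative, and the append-only `state_log` list stands in for insert-only rows in the orchestrator database:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentState:
    """Immutable state snapshot: version number, data payload, creator."""
    version: int
    data: dict
    created_by: str

REQUIRED_FIELDS = {"credit_score"}    # illustrative contract for the next agent

def handoff(state, next_agent_name, next_agent, log):
    # 1. Schema validation: enforce the contract at the boundary.
    missing = REQUIRED_FIELDS - state.data.keys()
    if missing:
        raise ValueError(f"contract violation, missing fields: {missing}")
    # 2. Run the next agent; it produces new data, never mutates its input.
    new_data = next_agent(state.data)
    # 3. Version increment: append a new immutable snapshot (never update v_n).
    new_state = AgentState(version=state.version + 1, data=new_data,
                           created_by=next_agent_name)
    log.append(new_state)             # append-only, like an INSERT
    return new_state

state_log = []
v1 = AgentState(version=1, data={"credit_score": 750}, created_by="credit_agent")
state_log.append(v1)
v2 = handoff(v1, "risk_agent",
             lambda d: {**d, "risk_rating": "low"}, state_log)
```

Replaying `state_log` from version one through version N reproduces exactly what each agent received and produced, which is what makes the binary-search debugging described above possible.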
Beyond state management, the case study emphasizes data contracts as essential for reliable multi-agent systems. Agents cannot simply throw arbitrary data at each other and hope it works. Instead, they need explicit contracts defining expected inputs and outputs.
In the example provided, a research agent promises to output specific fields including findings, confidence score, sources, and timestamp. An analysis agent declares that it requires research agent output with specific types and validates the contract at runtime. If the confidence score is below 0.7, the analysis agent rejects the handoff. This contract enforcement catches quality issues at the boundary between agents rather than three agents downstream when garbage output appears in a final report.
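A runtime check of this contract might look like the following sketch; the field names and the 0.7 threshold come from the example above, while the contract dictionary and function names are hypothetical:

```python
# Hypothetical declaration of what the research agent promises to output.
RESEARCH_OUTPUT_CONTRACT = {
    "findings": list,
    "confidence_score": float,
    "sources": list,
    "timestamp": str,
}

def validate_handoff(payload: dict) -> dict:
    # Type-check every promised field at the boundary between agents.
    for field_name, expected_type in RESEARCH_OUTPUT_CONTRACT.items():
        if not isinstance(payload.get(field_name), expected_type):
            raise TypeError(f"contract violation on field '{field_name}'")
    # Quality gate: reject low-confidence research before it propagates.
    if payload["confidence_score"] < 0.7:
        raise ValueError("handoff rejected: confidence below 0.7")
    return payload

ok = validate_handoff({"findings": ["f1"], "confidence_score": 0.9,
                       "sources": ["s1"], "timestamp": "2024-01-01T00:00:00Z"})
```

Rejecting a low-confidence payload here, at the handoff, is what prevents the "garbage three agents downstream" failure mode the text describes.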
The Databricks implementation registers these input-output schemas in Unity Catalog, ensuring every agent’s contract is versioned and governed in a central location, providing discoverability and governance across the organization.
The case study presents two critical patterns for handling inevitable failures in multi-agent systems.
Circuit Breaker Pattern comes directly from distributed systems engineering. When Agent A calls Agent B, the call is wrapped in a circuit breaker that tracks failure counts. If Agent B fails repeatedly (for example, five times in a row), the circuit breaker opens. Instead of waiting for timeouts on every subsequent call, the system fails fast, immediately returning an error without attempting the call. This prevents bombarding a failing agent with requests and protects the overall system from cascading failures.
After a timeout period (such as 60 seconds), the circuit transitions to half-open state, testing Agent B with a single request. If the request succeeds, the circuit closes and normal operation resumes. If it fails, the circuit reopens and resets the timer. This pattern prevents one agent’s failure from bringing down the entire workflow, enabling graceful degradation where the system might skip the failing agent and continue with reduced functionality, use cached results, or alert a human operator.
The implementation tracks failure count and state (open, closed, half-open), checking state before attempting calls. When the circuit is open, calls fail immediately without attempting execution. When closed, calls proceed normally, with success resetting the failure count and each failure incrementing it. Reaching the failure threshold opens the circuit, and after the timeout period the system transitions to half-open for testing. On Databricks, circuit breaker policies are enforced at the serving layer through Model Serving or AI Gateway, with open/closed transitions logged in MLflow so operators can see when agents start experiencing problems.
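The state machine just described can be sketched directly; this is a generic illustration of the pattern, not the serving-layer enforcement the talk uses, and the injectable `clock` parameter is an assumption added here to make the timeout testable:

```python
import time

class CircuitBreaker:
    """Tracks failures; fails fast when open, probes with one call when half-open."""
    def __init__(self, failure_threshold=5, reset_timeout=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock            # injectable for testing
        self.failures = 0
        self.state = "closed"
        self.opened_at = None

    def call(self, agent, *args):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half_open"   # test with a single request
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = agent(*args)
        except Exception:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"        # reopen (or open) and reset the timer
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "closed"              # success closes the circuit
        return result

# Illustrative usage with a fake clock and a threshold of 2.
now = [0.0]
cb = CircuitBreaker(failure_threshold=2, reset_timeout=60.0,
                    clock=lambda: now[0])

def flaky_agent():
    raise ValueError("agent B down")

for _ in range(2):
    try:
        cb.call(flaky_agent)
    except ValueError:
        pass

# The circuit is now open: the next call fails fast without touching the agent.
try:
    cb.call(flaky_agent)
    failed_fast = False
except RuntimeError:
    failed_fast = True
```

Advancing the clock past the reset timeout moves the breaker to half-open, and one successful probe closes it again, exactly the recovery cycle described above.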
Saga Pattern (Compensation Pattern) provides distributed transaction capabilities for multi-agent systems. Every agent implements two methods: execute (which performs the work) and compensate (which rolls it back). The orchestrator tracks which agents have executed successfully. If an agent fails mid-workflow, the orchestrator walks backward through the list of executed agents and calls compensate on each one in reverse order.
For example, if a research agent, analysis agent, and report agent execute successfully but a delivery agent fails, the orchestrator compensates the report agent (which deletes the draft report), then compensates the analysis agent (which deletes the draft recommendation), then compensates the research agent (which clears cached research data). This returns the system to its initial state with no partial transactions and no stuck workflows.
Every operation in this pattern must be reversible, with agents contractually obligated to implement compensation logic. While not glamorous, this pattern is essential for how production systems handle partial failures. Financial services use cases particularly require this capability to ensure transactions can be properly rolled back.
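The backward walk through executed agents can be sketched as follows; the step names and compensation actions mirror the research/analysis/report/delivery example above, while the `SagaOrchestrator` class itself is an illustrative minimal implementation:

```python
class SagaOrchestrator:
    """Runs steps in order; on failure, compensates completed steps in reverse."""
    def __init__(self):
        self.completed = []           # compensations for successful steps

    def run(self, steps, state):
        try:
            for name, execute, compensate in steps:
                state = execute(state)            # do the work
                self.completed.append((name, compensate))
        except Exception:
            # Walk backward through executed agents and undo each one.
            for name, compensate in reversed(self.completed):
                compensate()
            self.completed.clear()    # system is back to its initial state
            raise
        return state

compensation_log = []

def failing_delivery(state):
    raise RuntimeError("delivery agent failed")

steps = [
    ("research", lambda s: s + ["researched"],
     lambda: compensation_log.append("clear research cache")),
    ("analysis", lambda s: s + ["analyzed"],
     lambda: compensation_log.append("delete draft recommendation")),
    ("report", lambda s: s + ["reported"],
     lambda: compensation_log.append("delete draft report")),
    ("delivery", failing_delivery, lambda: None),
]

saga = SagaOrchestrator()
try:
    saga.run(steps, [])
except RuntimeError:
    pass
```

After the delivery failure, the compensations fire in reverse order (report, then analysis, then research), leaving no partial transaction behind; the contract is that every `execute` an agent exposes must ship with a working `compensate`.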
The case study presents a concrete production architecture implemented on the Databricks Data Intelligence Platform, bringing together all the patterns discussed.
The orchestration layer uses LangGraph wired into Mosaic AI Agent Framework, handling multi-agent orchestration by managing the workflow graph and determining which agents to call in what order. Each agent is implemented as a Unity Catalog function, written in SQL or Python, or as a model registered in Unity Catalog. Registering these assets in Unity Catalog makes them centrally discoverable within the organization, governed in one place, and versioned for production operations.
Agents are exposed through Databricks Model Serving or Function Serving, where circuit breaker policies like retries, timeouts, and rate limits are enforced at the serving layer via AI Gateway configuration. This centralizes reliability patterns without requiring each agent to implement them individually.
The data layer uses Delta Lake to store everything including state versions from agents, customer data, and all other workflow data. State snapshots are stored in Delta tables that are immutable and versioned, with state versions simply represented as rows in a Delta table that are never updated in place. Each agent run is tied to a state version via MLflow Traces, enabling developers to step through state evolution when debugging.
Unity Catalog governs everything including access control, lineage, and audit trails for both data and agents. This provides centralized governance ensuring compliance and security across all multi-agent operations.
MLflow provides per-agent tracing and evaluation capabilities with out-of-the-box LLM-as-judge metrics on every call, tracking latency, inputs, outputs, and token usage. This observability is critical for debugging and optimization.
The platform also includes Agent Bricks, a higher-level abstraction that packages common orchestration patterns for multi-agent use cases, reducing the need to rebuild standard patterns for each new system.
The operational flow involves the LangGraph orchestrator calling Agent A (a Unity Catalog function or model), receiving a result, and writing version one state to Delta Lake. It then calls Agent B with state version one, writes version two, and continues. MLflow traces every call with full observability. Circuit breakers at the serving layer guard each call. If Agent C fails, LangGraph triggers compensation logic, walking backward and calling compensate functions for previous successful steps.
The presenter notes that these patterns run in production day in and day out, handling workloads 24/7 across billions of transactions because the orchestrator serves as the single source of truth with all coordination happening through this central point.
The case study concludes with several philosophical points about building production multi-agent systems. First, agent chaos is inevitable when scaling past one agent. Coordination problems, race conditions, and cascading failures are guaranteed, and the complexity curve’s exponential nature is unavoidable.
Second, architecture is a choice. Teams can build systems with proper patterns including orchestration or choreography, immutable state, circuit breakers, compensation patterns, and data contracts. Understanding and implementing these patterns is essential for building systems rather than merely demos.
Third, the distinction between demos and production systems is critical. Demos are easy to build with an LLM and some prompts, and anyone can create something that looks impressive. However, these demos typically do not work in production. Production requires building systems, which is significantly harder but creates actual business value.
The patterns presented—choreography versus orchestration, immutable state management, circuit breakers, saga patterns—represent unsexy infrastructure work that does not generate applause. However, implementing these patterns makes systems reliable, preventing 2:00 AM failures that damage business operations and customer trust. The presenter emphasizes that reliability is what people notice over time, even if it is not as exciting as cutting-edge AI capabilities.
While this case study provides valuable technical patterns, it is important to note that it functions partially as promotional content for Databricks products. The presenter is a Databricks employee, and the architecture discussion heavily features Databricks-specific components like Unity Catalog, Delta Lake, MLflow, and Agent Bricks. While these tools may effectively implement the described patterns, the patterns themselves are not Databricks-specific and can be implemented using alternative technologies.
The circuit breaker pattern, saga pattern, immutable state management, and orchestration versus choreography concepts all originate from general distributed systems engineering and can be implemented with various technology stacks. Teams evaluating these patterns should consider whether Databricks is the right platform for their specific needs or whether other orchestration engines, databases, and observability tools might be more appropriate.
Additionally, while the credit decisioning system failure provides a compelling narrative, the case study would benefit from more detailed quantitative results. How much did implementing these patterns improve system reliability? What were the performance implications of immutable state versioning? What operational overhead is involved in maintaining compensation functions for every agent? These questions remain partially unanswered.
That said, the core architectural insights are sound and reflect established distributed systems engineering principles adapted thoughtfully for multi-agent AI contexts. The emphasis on treating multi-agent systems as distributed systems first and AI systems second represents an important perspective shift for many AI practitioners who may lack distributed systems experience. The specific failure modes described—cache invalidation race conditions, lost updates from concurrent writes, cascading failures—are realistic problems that production teams will encounter, and the presented solutions are architecturally sound approaches to addressing them.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Two organizations operating in highly regulated industries—Sicoob, a Brazilian cooperative financial institution, and Holland Casino, a government-mandated Dutch gaming operator—share their approaches to deploying generative AI workloads while maintaining strict compliance requirements. Sicoob built a scalable infrastructure using Amazon EKS with GPU instances, leveraging open-source tools like Karpenter, KEDA, vLLM, and Open WebUI to run multiple open-source LLMs (Llama, Mistral, DeepSeek, Granite) for code generation, robotic process automation, investment advisory, and document interaction use cases, achieving cost efficiency through spot instances and auto-scaling. Holland Casino took a different path, using Anthropic's Claude models via Amazon Bedrock and developing lightweight AI agents using the Strands framework, later deploying them through Bedrock Agent Core to provide management stakeholders with self-service access to cost, security, and operational insights. Both organizations emphasized the importance of security, governance, compliance frameworks (including ISO 42001 for AI), and responsible AI practices while demonstrating that regulatory requirements need not inhibit AI adoption when proper architectural patterns and AWS services are employed.
Digits, a company providing automated accounting services for startups and small businesses, implemented production-scale LLM agents to handle complex workflows including vendor hydration, client onboarding, and natural language queries about financial books. The company evolved from a simple 200-line agent implementation to a sophisticated production system incorporating LLM proxies, memory services, guardrails, observability tooling (Phoenix from Arize), and API-based tool integration using Kotlin and Golang backends. Their agents achieve a 96% acceptance rate on classification tasks with only 3% requiring human review, handling approximately 90% of requests asynchronously and 10% synchronously through a chat interface.