Alan, a healthcare company supporting 1 million members, built AI agents to help members navigate complex healthcare questions and processes. The company transitioned from traditional workflows to playbook-based agent architectures, implementing a multi-agent system with classification and specialized agents (particularly for claims handling) that uses a ReAct loop for tool calling. The solution achieved 30-35% automation of customer service questions with quality comparable to human care experts, with 60% of reimbursements processed in under 5 minutes. Critical to their success was building custom orchestration frameworks and extensive internal tooling that empowered domain experts (customer service operators) to configure, debug, and maintain agents without engineering bottlenecks.
Alan is a European healthcare company that supports approximately 1 million members and aims to reach 10 million by 2030. Its mission centers on using cutting-edge technology to help members navigate an often complex healthcare system. The company began its generative AI journey in 2023, which it describes as a turning point: it made the strategic decision to “go all in” and reexamine every decision process and product through the lens of what AI could enable. The presentation, delivered by multiple speakers including Gab and Alex (leading engineering), shares learnings from a three-year GenAI journey as of 2025.
The core use case presented focuses on customer service automation through AI agents. Healthcare questions are inherently complex—not simple password resets but nuanced queries covering personal topics like reimbursements, coverage details, and care navigation that demand accurate, empathetic responses. Members need instant, reliable guidance at scale, which traditional approaches couldn’t deliver efficiently. The solution they built now handles 30-35% of member questions with quality comparable to human care experts.
One of the most significant technical learnings Alan shared was their architectural shift from workflows to playbooks. Initially, they started with highly deterministic workflow-based approaches where every agent followed strict rules. This seemed like a solid starting point when beginning with AI and worked reasonably well with older models. However, as models evolved tremendously over their journey, they found that playbooks offered a real advantage to harness the full power of AI.
They explicitly noted the ongoing industry debate between workflows and playbooks, mentioning that OpenAI’s Agent Builder is based entirely on workflows. Alan’s position is that playbooks are the better approach, though they still blend deterministic controls into playbooks where appropriate. This hybrid approach allows for flexibility while maintaining necessary guardrails. The team revisits this architectural decision periodically as they learn and as the AI landscape evolves.
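One way to picture a playbook blended with deterministic controls is sketched below. This is an illustration under stated assumptions, not Alan's implementation: the playbook text, the `issue_refund` rule, and the names `deterministic_guard` and `run_step` are all hypothetical, and `model_propose` stands in for an LLM call.

```python
# Hypothetical playbook: prose guidance the model interprets, not a fixed flowchart.
PLAYBOOK = """\
Goal: help the member understand the status of a reimbursement.
Guidance: look up recent care events, explain amounts in plain language,
and point the member to the right screen if a document is missing.
"""

def deterministic_guard(proposed_action: dict) -> dict:
    """Hard rule enforced in code, whatever the model proposed (made-up rule)."""
    if proposed_action["type"] == "issue_refund":
        return {"type": "escalate_to_human",
                "reason": "refunds are never issued by the agent"}
    return proposed_action

def run_step(model_propose, conversation: list) -> dict:
    # The model decides freely within the playbook...
    proposal = model_propose(PLAYBOOK, conversation)
    # ...and the deterministic control has the final word.
    return deterministic_guard(proposal)
```

The design point is that the guard runs outside the model, so it holds even when the playbook is edited or the model misreads it.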
Another critical decision point was whether to use an off-the-shelf agent orchestration framework such as LangGraph or Pydantic AI or to build their own. Alan ultimately decided to build custom orchestration to move faster, meet their specific needs, and learn as quickly as possible. This wasn’t a permanent decision: they revisit it every six months to evaluate whether off-the-shelf frameworks have evolved enough to let them move faster, or whether they should continue iterating on their own. This reveals a pragmatic approach to LLMOps: making decisions based on current needs and team velocity while remaining open to switching as the ecosystem matures.
The production system uses a multi-agent architecture with two primary layers:
Classification Agent: The first agent receives member questions and has a clear objective—determine if there’s enough detail to hand off to a specialized agent. This acts as a routing mechanism, analyzing the conversation history and classifying the intent.
Specialized Expert Agents: Once classified, questions route to specialized agents with dedicated playbooks and tool sets. The demonstration focused on their “claims agent” which handles reimbursement-related questions. Each expert agent is scoped to a particular domain of knowledge, allowing for focused optimization and clearer ownership.
This architecture started small with just one agent and grew incrementally. The modular design made it easy to communicate with stakeholders about which portion of customer contacts they were automating. Once they achieved satisfactory results on one topic area, they moved to another, progressively expanding coverage to reach their 30-35% automation rate.
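The two-layer routing described above can be sketched roughly as follows. The keyword matcher is only a stand-in for the LLM classifier, and what happens when the classifier lacks detail (a clarifying question or a human handoff) is an assumption; `classify`, `SPECIALISTS`, and `route` are hypothetical names.

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Classification:
    topic: Optional[str]   # None when no specialist matches
    enough_detail: bool    # whether a specialist could act on the conversation

def classify(history: list[str]) -> Classification:
    """Stand-in for the LLM classifier: keyword match over the conversation."""
    text = " ".join(history).lower()
    if "reimburse" in text or "claim" in text:
        return Classification("claims", enough_detail=len(text.split()) >= 5)
    return Classification(None, enough_detail=False)

def claims_agent(history: list[str]) -> str:
    return "claims agent: looking up your care events"

# One entry per specialized agent; coverage grows by adding entries.
SPECIALISTS: dict[str, Callable[[list[str]], str]] = {"claims": claims_agent}

def route(history: list[str]) -> str:
    result = classify(history)
    if result.topic is None:
        return "handoff: human care expert"             # assumed fallback
    if not result.enough_detail:
        return "clarify: could you share more detail?"  # assumed fallback
    return SPECIALISTS[result.topic](history)
```

Because each topic is a separate entry, the share of contacts covered by automation can be reported per specialist, which matches the incremental rollout described above.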
The specialized agents employ a ReAct (Reasoning and Acting) loop: the agent reasons about what information it still needs, calls a tool, observes the result, and repeats until it can answer.
In the demo scenario shown, a member asked about reimbursement for multiple visits including one for their child. The claims agent performed approximately 10 successive tool calls, querying backend systems for care events related to the member and their child, then synthesized this raw data (JSONs, lists, rough structured data) into a clear, understandable natural language response. The agent also provided smart redirection—sending members directly to the appropriate part of the application to complete their request, such as uploading documents.
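A minimal sketch of such a loop is shown below, modeled loosely on the demo scenario. The `scripted_policy` function stands in for the model's reasoning step, and the tool name `list_care_events`, the data, and the amounts are made up for illustration.

```python
from __future__ import annotations

def list_care_events(person: str) -> list[dict]:
    """Hypothetical backend lookup returning raw structured data."""
    data = {
        "member": [{"visit": "GP", "reimbursed": 25.0}],
        "child": [{"visit": "pediatrician", "reimbursed": 30.0}],
    }
    return data.get(person, [])

TOOLS = {"list_care_events": list_care_events}

def scripted_policy(question: str, trace: list[tuple]) -> dict:
    """Stand-in for the LLM: pick the next action from what has been observed."""
    seen = {args["person"] for _, args, _ in trace}
    for person in ("member", "child"):
        if person not in seen:
            return {"action": "list_care_events", "args": {"person": person}}
    total = sum(e["reimbursed"] for _, _, events in trace for e in events)
    return {"action": "finish", "answer": f"Total reimbursed: {total:.2f} EUR"}

def react_loop(question: str, policy, tools: dict, max_steps: int = 10) -> str:
    trace: list[tuple] = []  # (tool name, arguments, result) per step
    for _ in range(max_steps):
        step = policy(question, trace)                        # reason
        if step["action"] == "finish":
            return step["answer"]
        result = tools[step["action"]](**step["args"])        # act
        trace.append((step["action"], step["args"], result))  # observe
    return "handoff: step budget exhausted"
```

The step budget is an assumption added here so the loop always terminates; the trace doubles as the record a debugging interface could display.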
Alan shared important learnings about tool calling as they scaled the number of tools available to agents:
Parameter Minimization: As they added more tools, agents began to struggle with making the right calls with correct arguments. A key best practice was removing as many parameters as possible from function calls. The example given was the challenge of providing UUIDs correctly—simplifying parameter requirements improved reliability.
Tool Combination: When tools were frequently used together, they combined them to reduce the decision complexity for the agent.
Specification and Error Handling: They specify parameters as precisely as possible and provide robust error handling so agents can learn when they’ve called a tool with incorrect arguments.
Model Improvements: Comparing their current system to the one from six months earlier, they observed that models have become significantly more efficient and reliable at tool calling. This improvement aligns with the industry trend toward the Model Context Protocol (MCP) and giving agents access to more tools.
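The first three practices above (fewer parameters, combined tools, precise specification with readable errors) might look like the following sketch. Everything here is assumed for illustration: the `Session` type, the tool name, the year bounds, and the in-memory data are not from the presentation.

```python
from __future__ import annotations
import uuid
from dataclasses import dataclass

@dataclass(frozen=True)
class Session:
    member_id: uuid.UUID  # injected by the backend, never typed by the model

# Made-up in-memory data standing in for backend systems.
MEMBER = uuid.UUID("00000000-0000-0000-0000-000000000001")
CLAIMS = {MEMBER: [{"claim": "c1", "status": "paid", "year": 2025}]}
DOCUMENTS = {"c1": ["invoice.pdf"]}

def make_claims_tool(session: Session):
    """Build a tool whose only model-facing parameter is `year`."""
    def get_claims_with_documents(year: int) -> dict:
        # Precise validation with a message the agent can read and act on.
        if isinstance(year, bool) or not isinstance(year, int) or not 2000 <= year <= 2100:
            return {"ok": False,
                    "error": "year must be an integer between 2000 and 2100, e.g. 2025"}
        claims = [c for c in CLAIMS.get(session.member_id, []) if c["year"] == year]
        # Combined tool: claims and their documents come back in one call.
        return {"ok": True,
                "claims": [{**c, "documents": DOCUMENTS.get(c["claim"], [])}
                           for c in claims]}
    return get_claims_with_documents
```

Binding the member's UUID in a closure removes the parameter the model is most likely to get wrong, and returning errors as data rather than raising lets the agent see what went wrong on its next step.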
A critical success factor that Alan emphasized repeatedly was building internal tooling that enabled domain experts—specifically customer service operators—to configure, debug, and maintain agents without creating engineering bottlenecks. This represents a mature LLMOps perspective: AI systems in production require continuous maintenance and iteration, and the people with domain knowledge must be able to contribute directly.
Debug Tool: The first tool demonstrated allows customer service operators to answer “What is wrong with my agent? Why did it answer this way?” The interface shows:
This transparency is essential for debugging, understanding agent behavior, and identifying opportunities for improvement. Customer service operators can examine 100-200 conversations, understand strengths and weaknesses, and then move to improvement.
Agent Configuration Tool: This is described as a “CI-like system” that allows tracing different changes made to agents and enables customer service operators to test changes in a safe environment before pushing to production. Features include:
The team noted that all this internal tooling was built entirely by engineers without designer involvement, a humorous acknowledgment that they prioritized functionality over polished UX, though they were open to feedback.
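The “CI-like” flow described for the configuration tool can be pictured as a gate that promotes a changed playbook only if it passes saved test conversations. This is a rough sketch under assumptions: the `Registry` and `AgentConfig` types, the pass-rate threshold, and the shape of a test case are all hypothetical, and `run_agent` stands in for executing the agent in a safe environment.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass(frozen=True)
class AgentConfig:
    version: int
    playbook: str

@dataclass
class Registry:
    # Every promoted config is kept, so changes stay traceable.
    history: list[AgentConfig] = field(default_factory=list)

    @property
    def production(self) -> AgentConfig:
        return self.history[-1]

    def propose(self, playbook: str, run_agent, test_cases, min_pass_rate: float = 1.0) -> float:
        """Run the candidate on saved cases; promote only if it passes the gate."""
        if not test_cases:
            raise ValueError("at least one test case is required")
        candidate = AgentConfig(self.production.version + 1, playbook)
        passed = sum(bool(check(run_agent(candidate, question)))
                     for question, check in test_cases)
        rate = passed / len(test_cases)
        if rate >= min_pass_rate:
            self.history.append(candidate)  # push to production
        return rate
```

Each test case pairs a question with a check on the agent's output, so a domain expert could add cases without touching the promotion logic.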
While not exhaustively detailed, evaluation appears deeply integrated into their workflow:
Alan was refreshingly transparent about experiments that didn’t yet make it to production. They explored a more complex orchestrator-manager design to handle multi-topic questions (where a member asks about multiple unrelated topics in one conversation). Their current classification-then-specialist approach doesn’t handle this scenario well. The experimentation with a manager-orchestrator architecture that could call different agents solved the technical challenge successfully, but introduced significant complexity in tooling management and evaluation. When they analyzed the impact, they found this scenario only represented 4-5% of conversations. The team decided the added complexity wasn’t justified for that small percentage—a pragmatic example of choosing not to deploy a technically working solution because the operational overhead outweighed the benefit.
Beyond customer service automation (30-35% of questions), Alan shared broader AI impacts:
The company emphasized that AI is now a “natural extension” for their teams, embedded in every layer of decision-making, daily operations, and services.
Alan’s presentation concluded with three main takeaways that encapsulate their LLMOps philosophy:
Problem-First Approach: Focus on what problem you’re solving with AI agents. There’s significant discussion about using AI for various applications, but without a real problem, the solution doesn’t matter. This critique of “AI for AI’s sake” shows maturity in their deployment strategy.
Team Effort: Success requires combining AI experts, domain experts, and robust evaluation working together. They invested heavily in internal tooling specifically to ensure engineers and data scientists wouldn’t become bottlenecks or spend all their time on prompt engineering. Domain experts must be empowered to directly debug and configure agents for the solution to scale.
Investment in Tooling: Both third-party solutions and custom-built internal tools are essential. Building their own tooling allowed them to learn as fast as possible, though they remain open to mature external solutions as the ecosystem develops.
Several aspects of Alan’s case study deserve balanced consideration:
Strengths: The case demonstrates mature LLMOps thinking—pragmatic architectural decisions, empowerment of non-technical domain experts, iterative experimentation with willingness to not deploy complex solutions, and continuous reevaluation of technical choices. The transparency about their three-year journey including mistakes and architectural pivots is valuable.
Limitations and Unknowns: The presentation doesn’t deeply detail their evaluation methodologies beyond mentioning offline evaluation capabilities. The claim of “quality comparable to care experts” for 30-35% of questions needs more context—what quality metrics, how measured, what types of questions does the 30-35% represent (likely simpler queries)? The privacy/security implications of LLMs accessing sensitive health data are mentioned only in passing. The custom orchestration decision, while justified, creates maintenance burden and potential technical debt that may not pay off long-term as frameworks like LangGraph mature. The “always supervised by doctors” note for medical chatbots suggests human-in-the-loop requirements that may limit the scalability claims.
Vendor Positioning: While Alan states they “didn’t start as an AI company” and “are not selling AI,” this presentation serves recruiting purposes (“we’re hiring a lot”) and positions the company as an AI leader in healthcare. The achievements are presented somewhat selectively—emphasizing successes while treating challenges as learning experiences.
Overall, this represents a substantive case study of production AI agent deployment in a regulated, high-stakes industry with genuine complexity in the problem domain and thoughtful approaches to LLMOps challenges around tooling, evaluation, and scaling.