## Overview
Clarus Care is a rapidly growing healthcare technology company that provides call management systems for medical practices, serving over 16,000 users across more than 40 medical specialties and processing 15 million patient calls annually with a 99% client retention rate. The company partnered with the AWS Generative AI Innovation Center (GenAIIC) to develop a production-ready, AI-powered conversational contact center that replaces traditional menu-driven IVR systems with natural language interactions. This case study demonstrates a sophisticated LLMOps implementation that addresses real-world constraints including strict latency requirements (under 3 seconds), high availability demands (99.99% SLA), multi-channel support, and the need for scalable customization across different healthcare facilities.
The solution represents a comprehensive production deployment of LLMs in a mission-critical healthcare communication environment, where the stakes are high—delayed or misrouted communications can impact patient care outcomes. The team needed to balance conversational quality with operational requirements including compliance, reliability, and performance at scale.
## Technical Architecture and Infrastructure
The production architecture is built on a foundation of AWS services designed for reliability and scalability. Amazon Connect serves as the core contact center platform, chosen specifically for its ability to maintain 99.99% availability while providing comprehensive capabilities across both voice and chat channels. Amazon Lex handles the transcription and conversation state management, acting as the interface layer between Connect and the backend AI logic. AWS Lambda functions orchestrate the conversation flow and execute the business logic, making calls to Amazon Bedrock for LLM-powered natural language understanding and generation.
The architecture follows a clear separation of concerns. Connect manages the communication channels and routing, Lex maintains session state and handles speech-to-text/text-to-speech conversion, Lambda provides the orchestration layer where conversation logic resides, and Bedrock serves as the AI inference layer. This modular design enables the team to update AI components without disrupting the broader contact center infrastructure, a critical consideration for maintaining high availability in production.
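The case study doesn't publish the Lambda code itself, but a minimal sketch of what this orchestration layer might look like is shown below. The handler shape follows the Amazon Lex V2 code-hook contract; the dialog action, model ID, and `run_conversation_turn` helper are illustrative assumptions.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def lambda_handler(event, context):
    # Lex V2 passes the caller's transcribed utterance and the session state.
    utterance = event.get("inputTranscript", "")
    session_attributes = event["sessionState"].get("sessionAttributes", {}) or {}

    # The orchestration logic decides which prompt and model to use for this
    # turn (intent extraction, information collection, response generation, ...).
    reply_text = run_conversation_turn(utterance, session_attributes)

    # Return control to Lex, carrying updated conversation state in session attributes.
    return {
        "sessionState": {
            "sessionAttributes": session_attributes,
            "dialogAction": {"type": "ElicitIntent"},
        },
        "messages": [{"contentType": "PlainText", "content": reply_text}],
    }

def run_conversation_turn(utterance, state):
    # Placeholder for the conversation logic described in this section; a single
    # Bedrock Converse call is shown purely for illustration.
    response = bedrock.converse(
        modelId="amazon.nova-lite-v1:0",
        messages=[{"role": "user", "content": [{"text": utterance}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```

Because the AI logic lives entirely in this Lambda layer, prompt or model changes can be deployed without touching the Connect contact flows or Lex bot configuration.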
For the web interface, the solution uses Amazon CloudFront and Amazon S3 for static website hosting, demonstrating the system's multichannel capabilities. Patients can engage through either voice calls or a chat widget, with both interfaces using the same underlying contact flow logic (though the team notes this can be customized for chat-specific language patterns). An analytics pipeline processes conversation logs to provide performance insights, built using previously published reusable GenAI contact center assets. A customizable dashboard offers visualization of this data, making insights accessible to both technical and non-technical stakeholders.
## Multi-Model Strategy and Model Selection
One of the most sophisticated aspects of this LLMOps implementation is the strategic use of multiple foundation models through Amazon Bedrock, with task-specific model selection based on accuracy, cost, and latency requirements. This demonstrates a mature understanding that no single model is optimal for all tasks in a production system, and that different stages of a conversation flow have different performance requirements.
The team employs Anthropic's Claude 3.5 Sonnet for intent extraction, where the task demands high-quality natural language understanding capable of identifying multiple intents from a single patient utterance. This is the most complex NLU task in the system—patients might say something like "I need to refill my prescription and also ask about my bill," requiring the model to parse multiple distinct intents and maintain awareness of what constitutes a truly new request versus a clarification of an existing one.
For structured information collection during conversations, the system uses Amazon Nova Pro through Bedrock. This represents a balanced choice where the team needs reliable extraction of specific data fields (patient names, dates, times, provider names) while maintaining a conversational tone. Nova Pro offers faster inference than Claude 3.5 Sonnet while still providing strong natural language understanding capabilities.
For response generation, the solution employs Amazon Nova Lite, the smallest and fastest model in the Nova family. This is a strategic optimization—once the system knows what it needs to say (based on conversation state and collected information), generating the actual response text is a simpler task that doesn't require the full capabilities of larger models. Using Nova Lite here helps the system meet its aggressive sub-3-second latency requirement while maintaining natural and empathetic language.
The team also uses Nova Micro for simple binary classification tasks, such as determining whether a patient is confirming or declining a proposed appointment time. This demonstrates careful task decomposition and appropriate model sizing—using a massive model for simple yes/no classification would be wasteful and slow.
All model interactions leverage the Bedrock unified Converse API, which provides several production benefits. The consistent interface across different models simplifies development and maintenance, as the team doesn't need to manage different API patterns for each foundation model. Model version control through Bedrock facilitates stable behavior across deployments, a critical requirement for a system handling healthcare communications. The architecture is future-proof, allowing the team to adopt new models as they become available in Bedrock without major code changes—they can swap in improved models by updating configuration rather than rewriting conversation logic.
It's worth noting that while the blog post presents these model choices as successful, the actual performance comparison data isn't provided. We don't see metrics showing how much faster Nova Lite is compared to Claude 3.5 Sonnet for response generation, or whether the quality difference is noticeable to patients. This is a common pattern in vendor-produced case studies—the architectural decisions are presented as optimal without detailed benchmarking data that would allow independent assessment.
## Conversation Management and State Handling
The conversation management system demonstrates sophisticated orchestration of multiple LLM calls to create a cohesive patient experience. The flow begins with greeting and urgency assessment, where the system immediately evaluates whether the patient's situation requires urgent attention. This is accomplished through a focused prompt that analyzes the patient's initial statement against predefined urgent intent categories, returning either "urgent" or "non_urgent" to guide routing decisions. In healthcare contexts, getting this right is critical—missing an urgent case could have serious consequences.
A key innovation in this system is the ability to handle multiple intents within a single interaction. Rather than forcing patients through rigid menu trees where they must complete one task before moving to another, the system can understand when a patient mentions multiple needs in a single utterance. The team extracts both the intent and a quote from the user input that led to the extraction. This serves two important purposes: it grounds each identified intent in the patient's own words, which improves identification accuracy, and it gives the system a conversation-history reference so it doesn't repeatedly extract the same intent unless the patient explicitly raises it again.
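The sketch below illustrates one way the intent-plus-quote extraction could be structured, again reusing the `invoke` helper from earlier. The JSON output format and prompt wording are assumptions (the actual prompts reportedly use XML-tagged outputs elsewhere in the system), and production code would need fallback handling for malformed model output.

```python
import json

INTENT_EXTRACTION_PROMPT = """Identify every request in the patient's message.
Possible intents: {intent_list}
Already handled this call (do not repeat unless explicitly raised again): {seen}

Patient message: "{utterance}"

Return JSON: [{{"intent": "<label>", "quote": "<exact words that express it>"}}]"""

def extract_intents(utterance, intent_list, seen_intents, invoke):
    prompt = INTENT_EXTRACTION_PROMPT.format(
        intent_list=", ".join(intent_list),
        seen=", ".join(seen_intents) or "none",
        utterance=utterance,
    )
    raw = invoke("intent_extraction", prompt)
    extracted = json.loads(raw)  # a real system would parse defensively here
    # Keep only intents not already in the queue; the quote ties each intent
    # back to the patient's own words.
    return [item for item in extracted if item["intent"] not in seen_intents]
```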
Once intents are identified, they're added to a processing queue and handled sequentially. For each intent, the system enters an information collection loop consisting of two interdependent stages: checking for missing information fields and generating natural language prompts to ask for that information, then parsing user responses to extract collected fields and identify what's still missing. This loop continues until required information is gathered.
The system maintains comprehensive state through Amazon Lex session attributes, which serve as the conversation's memory. These attributes store everything from user-provided information and conversation history to detected intents and collected data fields. This state management is what enables the system to maintain context across multiple Bedrock API calls, creating natural dialogue flow where the AI appears to "remember" what was discussed earlier.
The solution includes smart handoff capabilities that analyze conversation context to determine when patients should speak with human providers. When a transfer is needed—either because the system detects urgency or the patient explicitly requests it—the conversation context and collected information are preserved, facilitating a seamless experience. The human who picks up the call can see what the patient has already communicated, avoiding the frustrating experience of having to repeat information.
One challenge the team tackled was transcription and understanding of names, which is notoriously difficult in healthcare due to spelling variations, nicknames, and multiple providers with similar names. The solution automatically matches provider names mentioned by patients to the correct provider in the system, handling variations like "Dr. Smith" matching to "Dr. Jennifer Smith" or "Jenny Smith." This removes the rigid name matching or extension requirements of traditional IVR systems, though the case study doesn't detail exactly how this matching works—whether it uses fuzzy string matching, embeddings-based similarity, or some other approach.
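For illustration only, one plausible approach (explicitly not confirmed by the case study) combines token overlap with fuzzy string similarity:

```python
from difflib import SequenceMatcher

PROVIDERS = ["Dr. Jennifer Smith", "Dr. Robert Chen", "Dr. Maria Alvarez"]  # illustrative

def _normalize(name):
    return name.lower().replace("dr.", "").strip()

def match_provider(spoken_name, providers=PROVIDERS, threshold=0.5):
    """Match a spoken provider name against the practice's provider list."""
    spoken = _normalize(spoken_name)
    spoken_tokens = set(spoken.split())
    best, best_score = None, 0.0
    for candidate in providers:
        cand = _normalize(candidate)
        # Exact token overlap ("smith" in "jennifer smith") scores highest;
        # fuzzy similarity covers misspellings and nicknames.
        token_score = len(spoken_tokens & set(cand.split())) / max(len(spoken_tokens), 1)
        fuzzy_score = SequenceMatcher(None, spoken, cand).ratio()
        score = max(token_score, fuzzy_score)
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None

# match_provider("Dr. Smith")   -> "Dr. Jennifer Smith"
# match_provider("Jenny Smith") -> "Dr. Jennifer Smith"
```

An embeddings-based or LLM-assisted matcher would handle phonetic transcription errors better, but the underlying idea of resolving free-form speech to a known provider list is the same.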
## Intent Management and Hierarchical Service Model
The intent management system uses a hierarchical service model structure that reflects how patients naturally express their needs. This model organizes intents into three primary levels: urgency level (separating urgent from non-urgent services for appropriate handling), service level (grouping related services like appointments, prescriptions, and billing into logical categories), and provider-specific level (accommodating provider-specific requirements and sub-services).
This hierarchical structure enables efficient navigation through possible intents while maintaining flexibility for customization across different healthcare facilities. Each intent in the model includes custom instructions that can be dynamically injected into Bedrock prompts, allowing for highly configurable behavior without code changes. This is a practical production consideration—Clarus serves many different healthcare practices with varying workflows, and building a system that can be customized through configuration rather than code changes makes it much more maintainable and scalable.
The intent extraction process uses a comprehensive prompt that instructs the model (Claude 3.5 Sonnet) to identify all intents present in a patient's natural language input. The prompt includes detailed instructions about what constitutes a new intent, the complete list of possible intents, and formatting requirements for the response. Importantly, the system doesn't force classification into a single intent—it's designed to detect multiple needs expressed simultaneously, which aligns with how people actually communicate in healthcare contexts.
One of the practical challenges the team mentions is formatting the contact center call flow and service model in a way that is interchangeable for different customers with minimal code and configuration changes. The hierarchical intent model with dynamic prompt injection appears to be their solution to this challenge, though the case study doesn't provide extensive detail about how different facilities customize their deployments or what the typical customization process looks like.
## Appointment Scheduling Implementation
The scheduling component represents a sophisticated application of LLMs in a structured task with real-world constraints. Rather than being part of the main conversation handler, scheduling is implemented as a separate, purpose-built module that operates as a state machine with distinct conversation states and transitions.
The flow begins in an initial state where the system mentions office hours and asks for scheduling preferences before moving to a GATHERING_PREFERENCES state. In this state, the system uses an LLM (Nova Lite, chosen for low latency and preference understanding) to extract time preferences from the conversation. The extraction prompt is quite detailed, asking the model to output both reasoning (explaining what type of preferences were expressed, how relative dates were interpreted, why preferences were structured as they were, and what assumptions were made) and structured JSON output with specific or range-based preferences.
The system handles multiple types of preferences. A patient might request a specific time ("next Tuesday at 2pm") or a range ("sometime next week in the afternoon"). The prompt converts relative dates to specific dates, interprets time keywords (morning means 09:00-12:00, afternoon means 12:00-17:00), and handles constraints like "morning before 11" by converting it to 09:00-11:00. If no end time is specified, the system assumes a 30-minute appointment duration.
Once preferences are extracted, the system checks them against an existing scheduling database. There are three possible outcomes: if a specific requested time is available, the system presents it for confirmation and moves to a CONFIRMATION state; if the patient expressed a range preference, the system finds the earliest available time in that range and presents it for confirmation; if there's no availability matching the request, the system finds alternative times (within plus or minus one day of the requested time) and presents available blocks while asking for the patient's preference, staying in GATHERING_PREFERENCES and incrementing an attempt counter.
In the CONFIRMATION state, the system uses a simple yes/no classification (with Nova Micro for very low latency) to determine if the patient is confirming or declining the proposed time. If confirmed, the appointment is booked and a confirmation message is sent. If declined, the system asks for new preferences and returns to GATHERING_PREFERENCES.
The system tracks a maximum number of attempts (default is 3), and when this threshold is reached, it apologizes and escalates to office staff rather than continuing to loop. This is an important safety mechanism—the system recognizes when it's not making progress and hands off to humans rather than frustrating patients with an endless loop.
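The skeleton below sketches how the state machine, attempt counter, and escalation path fit together. The helper functions are placeholder stand-ins for the real Bedrock calls and scheduling-database lookups described above, and the state keys are assumptions.

```python
MAX_ATTEMPTS = 3  # default from the case study

# Placeholder stand-ins for the real Bedrock and scheduling-database calls.
def extract_preferences(utterance): return {"requested": utterance}   # Nova Lite in the real system
def find_available_slot(pref): return None                            # scheduling DB lookup
def find_alternatives(pref): return ["Tue 10:00", "Wed 14:30"]        # +/- one day search
def is_confirmation(utterance): return utterance.strip().lower() in ("yes", "yeah", "sure")
def book_appointment(slot): pass
def respond(step, **details): return f"[{step}] {details}"            # Nova Lite response generation

def scheduling_turn(state, utterance):
    """Advance the scheduling state machine by one conversational turn."""
    if state["step"] == "GATHERING_PREFERENCES":
        pref = extract_preferences(utterance)
        slot = find_available_slot(pref)
        if slot:
            state.update(step="CONFIRMATION", proposed_slot=slot)
            return respond("confirm_time", slot=slot)
        state["attempts"] += 1
        if state["attempts"] >= MAX_ATTEMPTS:
            state["step"] = "ESCALATED"
            return respond("escalate_to_staff")
        return respond("offer_alternatives", options=find_alternatives(pref))
    if state["step"] == "CONFIRMATION":
        if is_confirmation(utterance):
            book_appointment(state["proposed_slot"])
            state["step"] = "BOOKED"
            return respond("booking_confirmation", slot=state["proposed_slot"])
        state["step"] = "GATHERING_PREFERENCES"
        return respond("ask_new_preferences")
    return respond("escalate_to_staff")

# Initial state: {"step": "GATHERING_PREFERENCES", "attempts": 0}
```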
The scheduling flow uses three main LLM calls: extracting time preferences (Nova Lite), determining confirmation/denial (Nova Micro), and generating natural responses based on the current state and next step (Nova Lite). Each response generation call receives the conversation history and a specific next-step prompt that describes what the system needs to communicate, such as asking for initial preferences, confirming a time, showing alternatives, asking for new preferences, escalating to staff, or providing booking confirmation.
This scheduling implementation demonstrates thoughtful state machine design combined with LLMs for flexible natural language interaction. The structured state management ensures the conversation makes progress toward a goal (booking an appointment), while the LLM components provide natural interaction that can handle the varied ways patients express scheduling preferences. The fallback to human escalation shows appropriate recognition of the system's limitations.
## Prompt Engineering and Task Decomposition
The case study provides several examples of the prompts used throughout the system, revealing a sophisticated approach to prompt engineering for production LLM applications. The prompts generally follow a pattern of providing clear instructions, specifying output format requirements (often with XML tags for structure), including relevant guidelines and constraints, and sometimes providing examples.
For intent extraction, the prompt instructs Claude 3.5 Sonnet to identify multiple intents, includes comprehensive instructions about what constitutes a new intent, provides the complete list of possible intents, and specifies formatting requirements. The system extracts both the intent label and the specific quote from user input that triggered the extraction, enabling the model to show its reasoning and preventing duplicate intent extraction.
For information collection, the prompts guide models to check for missing information fields, generate natural language requests for missing information, parse user utterances to extract provided information, maintain conversational tone and empathy, ask only for specific missing information (not re-asking for what's already provided), acknowledge information already given, and handle special cases like spelling out names.
The scheduling preference extraction prompt is particularly detailed, asking for structured reasoning followed by JSON output, providing specific format requirements for different preference types (specific slots versus date ranges), giving conversion rules for relative dates and time descriptions, specifying constraints like office hours (weekdays 9-5), and handling cases where end times aren't specified.
The confirmation detection prompt is intentionally simple, asking Nova Micro to determine if the user is confirming or declining with a simple true/false output wrapped in XML tags. This demonstrates appropriate task sizing—not every task requires a complex prompt or large model.
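A minimal sketch of that confirmation check is shown below, reusing the `invoke` helper from the model-routing sketch; the tag name and prompt wording are assumptions.

```python
import re

CONFIRMATION_PROMPT = """The assistant proposed this appointment time: {slot}
The patient replied: "{utterance}"

Is the patient confirming the proposed time?
Answer with only <confirmed>true</confirmed> or <confirmed>false</confirmed>."""

def patient_confirmed(utterance, slot, invoke):
    """Binary yes/no classification of the patient's reply using a small, fast model."""
    raw = invoke("binary_classification",
                 CONFIRMATION_PROMPT.format(slot=slot, utterance=utterance))
    match = re.search(r"<confirmed>\s*(true|false)\s*</confirmed>", raw, re.IGNORECASE)
    return bool(match) and match.group(1).lower() == "true"
```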
For response generation throughout the system, the prompts provide conversation history and specific instructions for what needs to be communicated in the current context. The system uses templates for common next steps (asking for initial preferences, confirming times, showing alternatives, requesting new preferences, escalating to staff, providing booking confirmation), with placeholders filled in with specific details like provider names, times, and office hours.
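A sketch of the template-plus-placeholder pattern might look like the following; the template set and system instruction are illustrative, not the team's actual prompts.

```python
NEXT_STEP_TEMPLATES = {
    "ask_initial_preferences": "Ask when the patient would like to come in. Office hours are {office_hours}.",
    "confirm_time": "Tell the patient that {provider} is available on {slot} and ask if that works.",
    "offer_alternatives": "Explain the requested time is unavailable and offer these options: {options}.",
    "escalate_to_staff": "Apologize and explain that office staff will follow up to finish scheduling.",
    "booking_confirmation": "Confirm the appointment with {provider} on {slot} and ask if anything else is needed.",
}

def generate_response(next_step, history, invoke, **details):
    """Generate a natural-sounding reply from the conversation history and a next-step instruction."""
    instruction = NEXT_STEP_TEMPLATES[next_step].format(**details)
    prompt = (
        "You are a friendly medical office assistant.\n"
        f"Conversation so far:\n{history}\n\n"
        f"Next step: {instruction}\n"
        "Reply in one or two warm, natural sentences."
    )
    return invoke("response_generation", prompt)
```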
One notable aspect of the prompt engineering approach is the emphasis on output formatting with XML tags. This provides clear structure for parsing model outputs programmatically, though it's worth noting that the case study doesn't discuss error handling—what happens when a model fails to follow the specified format, or how often this occurs in production.
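Since the case study is silent on this point, the snippet below simply illustrates the kind of defensive parsing that production systems commonly layer on top of tag-formatted outputs; it is not drawn from the Clarus implementation.

```python
import re

def parse_tag(raw_output, tag, default=None):
    """Pull the content of an XML-style tag from a model response, tolerating extra text."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", raw_output, re.DOTALL)
    if match:
        return match.group(1).strip()
    # Malformed output: fall back to a safe default so the conversation can
    # re-prompt or escalate instead of failing the call outright.
    return default
```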
## Latency Optimization and Performance Requirements
Meeting the strict sub-3-second latency requirement represents one of the key LLMOps challenges in this implementation. In conversational AI, latency is critical to user experience—delays of more than a few seconds feel unnatural and frustrating. The team employed several strategies to achieve acceptable performance.
The multi-model approach is fundamentally a latency optimization strategy. By using faster, smaller models (Nova Lite and Nova Micro) for simpler tasks like response generation and binary classification, the system avoids the overhead of calling larger models when they're not needed. Only the most complex task—multi-intent extraction from natural language—uses the larger Claude 3.5 Sonnet model.
The architecture's use of Amazon Lex for transcription and state management also contributes to latency management. Lex handles the real-time speech-to-text conversion upstream, so the Lambda functions and Bedrock calls operate on text transcripts rather than raw audio, which would add processing overhead to every turn.
The case study mentions managing latency requirements as one of the challenges tackled during development, but it doesn't provide detailed metrics showing actual latency distributions in production. We don't see P50, P95, or P99 latency numbers, or breakdowns of where time is spent in the conversation flow. This makes it difficult to assess how consistently the system meets the sub-3-second target or how much headroom exists.
The Bedrock Converse API's consistent interface likely helps with latency predictability by abstracting away model-specific implementation details. However, the case study doesn't discuss other common latency optimization techniques like prompt caching, batch processing of multiple model calls where possible, or streaming responses for longer-form generations.
## Scalability and Multi-Tenancy Considerations
Clarus serves over 16,000 users across 40+ medical specialties and handles 15 million patient calls annually, with the expectation of continued growth. The solution needs to support multiple healthcare facilities with different workflows, providers, and service offerings—essentially a multi-tenant architecture where each tenant has customized intent hierarchies and business logic.
The hierarchical intent model with dynamically injected custom instructions per intent appears to be the primary mechanism for tenant customization. Each healthcare facility can have its own service model configuration that defines available intents, required information fields for each intent, urgent intent categories, provider lists, and custom instructions for handling specific intents. This configuration-driven approach means new facilities can be onboarded without code changes, though the case study doesn't detail what the configuration format looks like or how it's managed.
The infrastructure uses managed AWS services (Connect, Lex, Lambda, Bedrock) that provide automatic scaling, which should help the system handle growing call volumes. However, the case study doesn't discuss capacity planning, cost management at scale, or how the system handles traffic spikes (for example, when appointment slots open up or after a facility sends out patient communications).
The 99.99% availability SLA requirement is mentioned as a key factor in selecting Amazon Connect as the core platform. This translates to less than an hour of downtime per year, which is appropriate for healthcare communications where reliability is critical. The multi-region, highly available architecture of the managed AWS services presumably helps achieve this, though the case study doesn't discuss specific reliability engineering practices like circuit breakers, fallbacks when Bedrock is unavailable, or how the system degrades gracefully under failure conditions.
## Analytics and Monitoring
The solution incorporates an analytics pipeline that processes conversation logs to provide insights into system performance and patient interactions. This is built using previously published reusable GenAI contact center assets, suggesting the team leveraged existing frameworks rather than building from scratch.
A customizable dashboard provides visualization of this data, making insights accessible to both technical and non-technical staff. The case study doesn't specify what metrics are tracked, but typical contact center analytics might include call volumes, intent distribution, resolution rates, transfer rates, average handling time, patient satisfaction scores, and conversation success metrics.
For a production LLM system, additional metrics would be valuable: intent classification accuracy, information extraction accuracy, false urgency detections, inappropriate transfers, conversation abandonment rates, and latency distributions. The case study doesn't detail whether these are tracked or how the team measures the quality of LLM-generated responses over time.
Monitoring and observability are critical for LLMOps but aren't extensively covered in this case study. We don't see discussion of how the team detects when models are producing poor-quality responses, how they identify conversation flows that lead to patient frustration, or what alerting is in place for system degradation.
## Safety, Compliance, and Guardrails
Healthcare is a highly regulated domain with strict requirements around patient privacy (HIPAA in the US), data handling, and communication quality. Surprisingly, the case study provides limited detail on how these concerns are addressed in the production system.
The case study mentions that Amazon Bedrock has services that "help with scaling the solution and deploying it to production," specifically calling out Bedrock Guardrails for implementing content and PII/PHI safeguards. However, there's no detail about what guardrails are actually implemented, how they're configured, or what happens when a guardrail is triggered during a patient conversation.
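For reference, applying a guardrail to a Converse call is a configuration-level change, roughly as sketched below. The guardrail identifier, version, and fallback message are assumptions; which policies Clarus actually configures (content filters, PII/PHI anonymization or blocking) is not stated.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def guarded_invoke(model_id, prompt, guardrail_id, guardrail_version):
    """Converse call with a Bedrock Guardrail applied to model inputs and outputs."""
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        guardrailConfig={
            "guardrailIdentifier": guardrail_id,
            "guardrailVersion": guardrail_version,
        },
    )
    if response.get("stopReason") == "guardrail_intervened":
        # A content or PII/PHI policy blocked the exchange; hand off rather than improvise.
        return "I'm sorry, I can't help with that here. Let me connect you with our office staff."
    return response["output"]["message"]["content"][0]["text"]
```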
Patient health information (PHI) handling is particularly sensitive. The system is presumably collecting and storing patient names, dates of birth, medical conditions, prescription information, and billing details through conversations. The case study doesn't discuss where this data is stored, how long it's retained, how it's encrypted, or how access is controlled.
The smart transfer capability for urgent cases is a safety feature—the system recognizes certain situations require human intervention. However, we don't see discussion of false negative rates (urgent cases that aren't flagged) or false positive rates (non-urgent cases incorrectly flagged as urgent), both of which would be important metrics for a healthcare application.
The case study also doesn't address conversation quality issues that could arise with LLM-based systems: hallucinations (making up information about appointments or prescriptions), inconsistent information across multiple turns, or inappropriate responses to sensitive patient disclosures. These are known challenges in production LLM applications, and understanding how the team mitigates them would strengthen confidence in the system.
## Future Extensions and Roadmap
The case study mentions that Clarus can integrate the contact center's voicebot with Amazon Nova Sonic in the future. Nova Sonic is described as a speech-to-speech LLM that delivers real-time, human-like voice conversations with low latency, now directly integrated with Connect. This suggests the current implementation still uses a text-based pipeline (speech-to-text, text processing, text-to-speech) rather than end-to-end speech processing.
Moving to speech-to-speech models could potentially improve latency by eliminating intermediate transcription steps and produce more natural-sounding conversations with better prosody. However, it would also represent a significant architectural change and might complicate the multi-model strategy—it's less clear how you would combine a speech-to-speech model for some tasks with text-based models for others.
The case study also mentions that Bedrock supports Retrieval Augmented Generation (RAG) through Bedrock Knowledge Bases and structured data retrieval, suggesting this could be a future enhancement. RAG could be valuable for answering patient questions about specific medical information, insurance coverage, or facility policies by grounding responses in authoritative documents rather than relying solely on model knowledge.
Bedrock Evaluations for automated and human-based conversation evaluation is mentioned as another capability that could support scaling to production. This suggests the current implementation may not have sophisticated automated evaluation in place, or at least that this is an area for future enhancement. Automated evaluation would be valuable for catching quality regressions when updating models or prompts.
## Critical Assessment and Balanced Perspective
This case study demonstrates a sophisticated production LLMOps implementation with thoughtful engineering decisions around multi-model selection, conversation state management, and task decomposition. The system addresses real complexity in healthcare communication and appears to achieve important operational requirements like high availability and low latency.
However, as with many vendor-produced case studies, there are important limitations to what we can assess. The case study provides architectural details and design rationale but lacks quantitative performance data. We don't see metrics on actual production performance: What percentage of calls are successfully resolved without human transfer? How accurate is the intent extraction? What's the patient satisfaction score? What are the actual latency distributions? How often do models produce incorrectly formatted outputs that require fallback handling?
The cost implications of running multiple LLM calls per conversation at scale (15 million calls annually) aren't discussed. Even with optimized model selection, the inference costs could be substantial, and understanding the unit economics would help assess whether this approach is sustainable as the company grows.
The multi-model strategy is presented as optimal, but we don't have detailed comparison data. How much worse would response quality be if everything used Nova Lite? How much faster is Nova Lite than Claude 3.5 Sonnet for response generation? These quantitative comparisons would strengthen confidence in the design choices.
The safety and compliance discussion is notably thin for a healthcare application. While Bedrock Guardrails are mentioned as available, we don't know what specific safeguards are implemented or how effective they are. The absence of discussion about model failure modes, hallucination mitigation, or PHI handling is concerning given the healthcare context.
The configuration-driven multi-tenancy approach sounds promising for scaling to many healthcare facilities, but we lack detail about what the configuration experience looks like, how complex it is to onboard a new facility, or what limitations exist in terms of customization.
That said, the core architecture demonstrates solid LLMOps principles: appropriate task decomposition, strategic model selection based on requirements, comprehensive state management, graceful degradation with human handoff, and use of managed services for reliability and scalability. The appointment scheduling implementation in particular shows thoughtful state machine design combined with LLM flexibility. The conversation management approach of extracting multiple intents, queuing them, and processing sequentially while maintaining context is more sophisticated than simpler single-intent systems.
This case study illustrates that production LLM applications in regulated, high-stakes domains like healthcare require careful engineering beyond just prompt writing. The team had to balance conversational quality with latency, reliability, compliance, cost, and customization requirements—all while building a system that can scale to millions of interactions. While we'd benefit from more detailed performance data and safety discussions, the architectural patterns demonstrated here provide valuable reference points for teams building similar production LLM systems.