Company: Memorial Sloan Kettering / McLeod Health / UCLA
Title: Enterprise-Scale Deployment of AI Ambient Scribes Across Multiple Healthcare Systems
Industry: Healthcare
Year: 2025
Summary: This panel discussion features three major healthcare systems—McLeod Health, Memorial Sloan Kettering Cancer Center, and UCLA Health—discussing their experiences deploying generative AI-powered ambient clinical documentation (AI scribes) at scale. The organizations faced challenges in vendor evaluation, clinician adoption, and demonstrating ROI while addressing physician burnout and documentation burden. Through rigorous evaluation processes including randomized controlled trials, head-to-head vendor comparisons, and structured pilots, these systems successfully deployed AI scribes at scales ranging from hundreds to thousands of physicians. Results included significant reductions in burnout (20% at UCLA), improved patient satisfaction scores (5-6% increases at McLeod), time savings of 1.5-2 hours per day, and positive financial ROI through improved coding and RVU capture. Key learnings emphasized the importance of robust training, encounter-based pricing models, workflow integration, and managing expectations that AI scribes are not a universal solution for all specialties and clinicians.
## Overview

This case study presents a comprehensive view of deploying generative AI-powered ambient clinical documentation systems (commonly called "AI scribes") across three major healthcare organizations: McLeod Health (a 7-hospital system in South Carolina), Memorial Sloan Kettering Cancer Center (MSK) in New York, and UCLA Health (with approximately 5,000 ambulatory physicians). The panel discussion, hosted by healthcare technology research company Elian, provides deep operational insights into the complete lifecycle of LLM deployment in production healthcare settings—from vendor evaluation and pilot design through full-scale rollout and measurement of clinical and financial outcomes.

The organizations deployed different vendors: McLeod Health selected Suki, MSK chose Abridge, and UCLA Health implemented Nabla. This diversity provides valuable comparative insights into LLMOps practices across different platforms and organizational contexts. The panel reveals both the technical and operational complexities of deploying LLM-based systems where accuracy, reliability, and integration into clinical workflows are mission-critical.

## Vendor Evaluation and Model Selection

McLeod Health employed a particularly rigorous and innovative evaluation methodology designed explicitly to reduce cognitive bias. Brian Frost, their Chief Medical Information Officer, described a multi-phase evaluation process conducted approximately one year prior to the discussion. They filtered vendors based on data security and scalability concerns, narrowing to four top candidates. The evaluation involved creating 15 detailed patient encounter scripts performed by professional actors and three physicians from different specialties (primary care, cardiology, and vascular surgery). These scripts intentionally tested edge cases including various accents (notably Southern dialects, which proved particularly challenging for the models), patient interruptions, difficult patient behaviors, and clinically complex scenarios. The vendors were required to process these interactions in real time and submit unedited notes immediately following each encounter.

Three separate evaluation groups—physicians, revenue cycle staff, and patients—reviewed the notes for readability, coding quality, and clinical relevance. This multi-stakeholder approach to model evaluation represents a sophisticated LLMOps practice that goes beyond simple accuracy metrics to encompass usability, clinical utility, and business value. The top two vendors from this phase were then invited back for a second evaluation phase focused on Epic EHR integration and workflow impact with a broader physician audience. Importantly, the vendor ultimately selected (Suki) was one the evaluation lead had initially been skeptical about, demonstrating the value of structured, bias-reducing evaluation processes.

UCLA Health took a different but equally rigorous approach by conducting a randomized controlled trial (RCT) comparing two vendors head-to-head. Working with their Values and Analytics Solution Group, they designed a gold-standard clinical trial with three groups: two intervention groups (each using a different AI scribe) and a control group that initially had no access to the technology. This methodologically sophisticated approach to pilot design represents best practices in LLMOps evaluation, treating the deployment as a scientific experiment rather than simply a technology rollout. The pilot ran from fall through March and included approximately 200 physicians.
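The blinded, multi-stakeholder review McLeod describes lends itself to simple structured scoring. The sketch below is purely illustrative: the rating dimensions, scales, and group weights are assumptions rather than McLeod's actual instrument, but it shows how ratings from physicians, revenue cycle staff, and patients could be combined per vendor before unblinding.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical blinded ratings: each reviewer scores an unedited note 1-5 on
# the dimension relevant to their stakeholder group. Vendor identity stays
# masked behind anonymous labels until scoring is complete.
ratings = [
    # (vendor_label, stakeholder_group, dimension, score)
    ("vendor_A", "physician",     "clinical_relevance", 4),
    ("vendor_A", "revenue_cycle", "coding_quality",     3),
    ("vendor_A", "patient",       "readability",        5),
    ("vendor_B", "physician",     "clinical_relevance", 2),
    ("vendor_B", "revenue_cycle", "coding_quality",     4),
    ("vendor_B", "patient",       "readability",        3),
]

# Assumed weights for how much each stakeholder group's view counts; a real
# evaluation would fix these deliberately, and in advance.
weights = {"physician": 0.5, "revenue_cycle": 0.3, "patient": 0.2}

def vendor_scores(ratings, weights):
    """Average each group's scores per vendor, then combine with the weights."""
    by_vendor_group = defaultdict(list)
    for vendor, group, _dimension, score in ratings:
        by_vendor_group[(vendor, group)].append(score)

    totals = defaultdict(float)
    for (vendor, group), scores in by_vendor_group.items():
        totals[vendor] += weights[group] * mean(scores)
    return {vendor: round(total, 2) for vendor, total in totals.items()}

print(vendor_scores(ratings, weights))
# e.g. {'vendor_A': 3.9, 'vendor_B': 2.8} -> compared before unblinding
```

Keeping vendor identities masked until the weighted scores are compared is what protects a process like this against the cognitive bias the evaluation lead set out to reduce.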
Memorial Sloan Kettering conducted sequential head-to-head pilots, starting with one vendor for approximately three months, then introducing a second vendor while allowing clinicians access to both simultaneously for comparison. A significant challenge for MSK was conducting these pilots during their Epic EHR implementation, meaning clinicians initially used the AI scribes without EHR integration—a major barrier to adoption, but one that provided insights into core model performance independent of integration factors. Despite this limitation, MSK received strong engagement from clinicians and vendors, with monthly product releases incorporating user feedback, demonstrating agile development practices.

## Evaluation Metrics and Methodology

The evaluation frameworks employed across these organizations represent sophisticated approaches to measuring LLM performance in production healthcare settings. UCLA Health utilized validated psychometric instruments including the Mini Z 2.0 survey for physician burnout, the Professional Fulfillment Index (specifically the work exhaustion subscale), and the NASA Task Load Index adapted for healthcare to measure EHR-related stress. These validated instruments were administered pre-pilot and post-pilot to capture quantitative changes in clinician experience. Even within the relatively short pilot timeframe, UCLA observed measurable improvements in burnout metrics and work exhaustion across both intervention groups compared to controls.

McLeod Health focused heavily on operational KPIs including "pajama time" (time spent on documentation outside clinical hours), weekend documentation time, and coding level shifts. They also unexpectedly discovered significant impacts on patient satisfaction, with NRC survey scores showing increases of 6.3% for "provider listened carefully" and 5.9% for "trust the provider with your care" among physicians using the AI scribe. This finding suggests that the technology enabled behavioral changes—physicians making more eye contact and engaging in better shared decision-making—that improved the patient experience beyond simply reducing documentation burden.

MSK developed a comprehensive evaluation plan incorporating both qualitative and quantitative data collection methodologies. Their metrics framework included financial impact measures (work RVUs, time to document, average level of service), clinician burden and burnout assessments (Mini Z, NASA TLX), clinician experience metrics, EHR time, technical model monitoring including hallucination tracking, and detailed utilization and adoption metrics. They specifically defined utilization numerators and denominators customized for different clinical settings (urgent care, inpatient, outpatient) and specialties. This represents a mature approach to LLMOps monitoring that considers both model performance and operational impact across diverse use cases.

A critical aspect of evaluation across all organizations was the recognition that adoption metrics don't simply measure usage frequency but reflect genuine workflow integration. Initial targets had to be adjusted based on real-world usage patterns, with MSK specifically moving away from rigid utilization targets after observing how different clinician types and specialties naturally incorporated the technology into their workflows.

## Technical Performance and Model Challenges

The panel discussion revealed significant technical challenges related to LLM performance in specialized clinical contexts.
MSK, being an oncology-focused cancer center, found particularly mixed results regarding model accuracy and utility for complex oncology documentation. Clinicians reported concerns about insufficient detail in documentation of treatment risks and benefits, associated toxicities, and mitigation strategies—critical elements of oncology care that the general-purpose models struggled to capture with appropriate nuance and specificity. One particularly illustrative example was radiation oncologists finding that the models incorrectly transcribed clinical tumor staging, which involves specific combinations of numbers and letters that are critical to treatment planning. The experience varied significantly not just across specialties but even within the same service line, with some hematologists finding the output 80% usable while others rated it only 20% usable.

This variability in perceived model quality highlights a fundamental challenge in LLMOps: the same underlying model architecture can perform very differently depending on the specific use case, user expectations, and the complexity and specificity of the domain language. MSK leadership acknowledged that all models in this space still require refinement for oncology applications and emphasized the importance of partnering with vendors willing to invest in specialty-specific improvements.

The concept of "human in the loop" was universally emphasized as essential given current model limitations. All organizations stressed that clinicians must review and edit AI-generated notes, as models can produce omissions, inaccuracies, and other errors. This represents a critical LLMOps principle: deploying LLMs in high-stakes healthcare settings requires maintaining human oversight and final accountability. The training programs all organizations developed specifically addressed recognizing and correcting model errors, treating this as a core competency for users rather than an unfortunate limitation.

Several technical observations emerged about model performance across different scenarios. Southern accents proved particularly challenging for speech recognition components. Interruptions and complex multi-party conversations (common in clinical encounters) tested the models' ability to maintain context and attribute statements correctly. Models initially struggled with situations involving physician behavior that deviated from expected norms (such as the deliberately dismissive surgeon in McLeod's evaluation scripts), suggesting that training data likely emphasized more standard professional interactions.

## Deployment Architecture and Integration

EHR integration emerged as absolutely critical to successful deployment. MSK's experience piloting without Epic integration during their EHR transition demonstrated that requiring clinicians to use separate, unintegrated systems creates significant adoption barriers even when the core model performance is strong. All panelists emphasized that "extra clicks"—even one or two—generate clinician complaints and reduce adoption. Seamless workflow integration isn't merely convenient; it's essential for production deployment.

The preferred integration approach across organizations was deep Epic Haiku (mobile EHR) integration, allowing clinicians to initiate recordings, access notes, and complete documentation within their existing EHR workflow. However, McLeod Health also emphasized the importance of maintaining standalone app functionality for business continuity purposes.
They noted that during EHR downtime events (which they framed as "when, not if"), organizations lose the ability to document if they're entirely dependent on EHR-integrated functionality. This represents thoughtful LLMOps architecture that considers failure modes and maintains operational resilience.

The technical architecture also needed to accommodate different device preferences and workflows. Some clinicians preferred using iPads with Epic Canto, others used desktop workstations, and others primarily worked from mobile devices. The deployed solutions needed to function across this heterogeneous technical environment while maintaining consistent performance and user experience.

McLeod Health's shift to encounter-based pricing rather than per-user-per-month licensing represented a significant operational and technical architecture decision. This pricing model aligned vendor incentives with actual usage and scaled costs more appropriately for clinicians with variable practice patterns (such as OB-GYNs and oncologists who might only use the tool for specific visit types). From an LLMOps perspective, encounter-based pricing requires robust usage tracking and billing integration but eliminates the operational overhead of license management and reduces risk for organizations piloting the technology.

## Training and Change Management

All organizations emphasized that training couldn't be treated as a one-time onboarding event but required ongoing support and skill development. UCLA Health made training mandatory, incorporating not just tool functionality but broader education about AI in healthcare, LLM limitations, and the importance of reviewing generated content for errors. They specifically implemented "Advanced Features with Super Users" sessions where experienced users demonstrated their workflows and customizations to colleagues, leveraging peer learning as a change management strategy.

The training emphasized that effective use of AI scribes requires behavioral adaptation from clinicians. Users needed to learn how to structure their encounters and verbal communication to optimize model performance, particularly by providing concise summaries at the end of encounters that the AI could parse effectively for assessment and plan sections. McLeod Health found that this adaptation period typically required 2 weeks or 100 encounters before clinicians felt comfortable and saw the full benefits, and they actively discouraged early abandonment, asking clinicians to commit to this learning period before concluding the technology didn't work for them.

Vendor-provided "elbow-to-elbow" support proved valuable, with vendors setting up on-site presence in clinical locations to provide in-context assistance to physicians in their actual work environments. This hands-on support model recognizes that clinical workflows are complex and situationally dependent, making generic training less effective than contextualized assistance.

The change management approach also required careful stakeholder communication. McLeod Health's CEO explicitly told physicians the organization was not implementing AI scribes to increase patient volume but to reduce burnout and improve work-life balance. This messaging was critical to physician buy-in and represented thoughtful change management that aligned organizational goals with clinician values. The clinical informatics team worked directly in clinical settings rather than from offices, observing actual workflows and providing situated support.
## Adoption Patterns and User Segmentation

A consistent finding across organizations was that adoption patterns defied initial predictions. Champions and early adopters couldn't be reliably identified in advance based on perceived tech-savviness or enthusiasm. Some physicians expected to embrace the technology resisted it, while others initially skeptical became the strongest advocates. This unpredictability has important implications for LLMOps rollout strategies—organizations can't simply target "tech-forward" physicians and expect smooth adoption.

UCLA Health observed that approximately 10-20% of physicians simply won't use the technology regardless of training and support, for various legitimate reasons including incompatibility with highly templated note structures, specialty-specific needs, or personal preferences. Another 10-20% became very high users, employing the tool for virtually every encounter. The middle 60-80% showed variable usage patterns, with overall utilization rates around 30-40% of encounters. This distribution suggests that organizations should plan for segmented adoption rather than universal usage.

McLeod Health made a significant strategic pivot during their pilot. They initially restricted access to physicians at the 70th percentile or higher for productivity, based on concerns about cost and ROI. This proved counterproductive—the most efficient physicians, who already had low documentation burden, benefited least from the technology. When they expanded access to physicians at the 30-60th percentile for productivity, these clinicians showed the greatest gains. This finding has important implications for LLMOps deployment strategy: the users who might benefit most from AI assistance may not be the highest performers but rather those struggling with current workflows.

The concept of flexible usage patterns also emerged as important. Some clinicians only used the tools for specific visit types (new patient visits, annual exams, or specialty-specific encounters like gynecology visits). Rather than treating this as incomplete adoption, organizations recognized this as appropriate customization. MSK specifically moved away from rigid utilization targets after observing these natural usage patterns, acknowledging that the technology serves as a support tool that clinicians should deploy when it adds value to their specific workflow.

## Outcomes and Impact Measurement

The documented outcomes across these organizations demonstrate measurable impact across multiple dimensions. UCLA Health observed approximately 20% reduction in burnout prevalence from their RCT, which they translated into estimated cost savings of approximately $2 million annually based on research suggesting physician burnout costs health systems around $8,000 per physician per year through decreased productivity and turnover. They also saw efficiency gains in time spent writing notes and improvements across psychometric measures of work exhaustion and task load within the relatively short pilot timeframe.

McLeod Health documented time savings of 1.5-2 hours per day for many physicians and achieved a hard ROI of $1,000 per provider per month net after subscription costs. This return came primarily through a 9% shift in CPT coding levels, with level 3 visits decreasing and being replaced by level 4 and 5 visits. The AI's ability to capture problem complexity and suggest appropriate ICD-10 codes improved coding accuracy and HCC (Hierarchical Condition Category) capture.
Importantly, these gains were achieved while explicitly instructing physicians not to increase patient volume, addressing concerns about AI-driven productivity pressure exacerbating burnout.

The patient satisfaction improvements at McLeod were unexpected and particularly significant. A 5-6% improvement in key NRC survey questions (provider listening carefully, trust in provider) substantially exceeded typical improvement from dedicated patient experience initiatives. The panelists attributed this to behavioral changes enabled by the technology—physicians making more eye contact, engaging patients more directly, and practicing better shared decision-making when freed from documentation burden during encounters. Some physicians adopted a practice of providing brief verbal summaries at encounter end that both optimized AI performance and enhanced patient engagement through shared understanding.

MSK's comprehensive evaluation plan includes prospective measurement of work RVUs, documentation time, service levels, and financial impact alongside burnout metrics and patient perspectives. They plan to survey patients who experienced ambient documentation to understand patient attitudes and concerns, and are piloting patient-facing features like visit summaries written at appropriate reading levels. This multidimensional measurement approach represents mature LLMOps practice that considers technical performance, clinical outcomes, user experience, and business value simultaneously.

## Clinical Documentation Improvement and Revenue Cycle

An emerging area of focus discussed extensively was clinical documentation improvement (CDI) and revenue cycle optimization. Brian Frost from McLeod Health expressed frustration that current models essentially produce "a blob of text" in the medical decision-making section, without CDI-aware formatting and phrasing that would optimize coding and billing. He emphasized the need for prompt engineering improvements that teach models the specific language and structure that coders and billing systems expect, noting this has both financial and medical-legal implications.

The challenge extends beyond simple accuracy to understanding the downstream workflow that notes feed into. Effective clinical documentation must support multiple purposes simultaneously: clinical communication, legal documentation, billing justification, and quality measurement. Current LLMs trained primarily on clinical language don't necessarily optimize for these multiple objectives without specific fine-tuning or prompt engineering.

Several vendors are developing or have beta features for real-time CPT code suggestions and CDI recommendations, but organizations expressed caution about deploying these capabilities without extensive validation. UCLA Health specifically noted they validate all new features before production use and must be comfortable with both the output quality and the risk profile. This represents responsible LLMOps practice—just because a vendor offers a feature doesn't mean it's ready for production deployment without institutional validation.

The encounter-based pricing model McLeod negotiated with Suki aligned vendor incentives with organizational adoption success, as the vendor only generates revenue when the tool is actually used. This commercial model structure encourages vendors to focus on features and improvements that drive sustained usage rather than simply maximizing license sales.
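To make the pricing discussion concrete, the sketch below compares a flat per-user-per-month license against encounter-based pricing for clinicians with very different usage levels. All dollar figures and encounter volumes are hypothetical assumptions for illustration; they are not the terms any of these organizations negotiated.

```python
# Hypothetical cost comparison: flat per-user-per-month licensing vs.
# encounter-based pricing. Prices and volumes are illustrative assumptions.
FLAT_LICENSE_PER_MONTH = 400.0   # assumed per-user-per-month fee
PRICE_PER_ENCOUNTER = 3.0        # assumed per-encounter fee

def monthly_cost(encounters_per_month: int) -> dict:
    """Return the monthly cost of each pricing model for one clinician."""
    return {
        "flat_license": FLAT_LICENSE_PER_MONTH,
        "encounter_based": PRICE_PER_ENCOUNTER * encounters_per_month,
    }

# A clinician using the scribe on most visits vs. one who only uses it for
# selected visit types (e.g. new-patient consults).
for label, encounters in [("high utilizer", 300), ("occasional user", 30)]:
    print(label, monthly_cost(encounters))
# high utilizer {'flat_license': 400.0, 'encounter_based': 900.0}
# occasional user {'flat_license': 400.0, 'encounter_based': 90.0}
```

Under assumptions like these, the flat license only pays off for consistently heavy users, which mirrors the pressure the panelists describe: per-user licensing pushes organizations to chase utilization targets, while per-encounter pricing lets occasional or specialty-specific users keep access at low cost.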
## Future Roadmap and Evolution

The panelists discussed different philosophies regarding vendor roadmap importance. Paul Lakak from UCLA Health expressed less concern about roadmap because he viewed all players (excluding Microsoft and Solventum) as health tech startups with inherently uncertain futures and overlapping planned capabilities. He predicted commoditization of the core note generation functionality and felt comfortable with a wait-and-see approach, remaining open to switching vendors as the market matures.

Brian Frost from McLeod took the opposite view, emphasizing that he selected Suki specifically for their broader technical platform vision beyond note generation. He anticipates the note itself becoming commoditized and is most interested in vendors positioning as comprehensive digital assistants that address multiple sources of physician cognitive burden. Key capabilities he's tracking include clinical decision support integration (he highlighted OpenEvidence as significantly faster than traditional resources like UpToDate), context-aware chart summarization that adapts to specialty-specific needs, conversational AI for real-time clinical queries during patient care, and integration of downstream workflow tasks like order entry.

MSK expressed particular interest in nursing ambient documentation solutions just reaching general availability, which could impact both nursing workflow and patient experience. They're exploring "ambient in the room" or "ambient as a service" approaches where ambient capture becomes a built-in facility capability in new buildings rather than requiring individual clinician devices. They're also investigating clinical trial-specific applications, recognizing that cancer center workflows often involve complex research protocols requiring specialized documentation.

This diversity of roadmap priorities reflects different organizational strategies for LLM deployment maturity. Organizations further along the curve are thinking beyond point solutions toward integrated AI platforms that address physician workflow comprehensively, while those earlier in adoption are appropriately focused on core functionality and proven capabilities.

## Risk Management and Governance

Data privacy and security emerged as critical considerations throughout the discussion. Organizations filtered vendors based on data security concerns before detailed evaluation, and questions arose during deployment about exactly how vendors use audio and text data for model training and improvement. MSK's Abby Baldwin emphasized the importance of understanding vendor data policies and potentially re-evaluating institutional policies around AI-generated content.

California's specific requirements around patient consent for audio recording created operational challenges that UCLA Health hadn't fully anticipated. Requiring individual consent for each encounter proved cumbersome, and they recommended building recording consent into annual patient consent-to-treat paperwork rather than requiring per-encounter consent. This represents the type of operational friction that can emerge between healthcare regulations and AI deployment, requiring thoughtful policy solutions.

Union considerations also arose, particularly for potential inpatient deployment where nursing unions might have concerns about AI's impact on work experience. UCLA Health emphasized the importance of proactively addressing these concerns early to avoid roadblocks during expansion.
The universal emphasis on human review of AI-generated content represents the core governance principle across all organizations: despite significant advances in LLM capabilities, the clinician retains ultimate responsibility for documentation accuracy and completeness. Training specifically addresses how to identify and correct model errors, omissions, and inaccuracies. This human-in-the-loop approach is essential for maintaining safety and quality in high-stakes healthcare documentation.

## LLMOps Maturity and Lessons Learned

Several meta-lessons about LLMOps practice emerged from the discussion. First, engagement and enthusiasm don't predict adoption—actual usage patterns can only be determined through deployment and measurement, not predicted from user attitudes. Second, current-state workflow mapping before deployment would have helped MSK better understand where ambient AI would and wouldn't provide value (such as in shared MD/APP visits where the APP does most documentation). Third, vendor responsiveness and willingness to incorporate feedback matters more than being an "established" player in what remains a nascent market.

The importance of cluster-based deployment rather than dispersed individual adoption was highlighted—physicians benefit from having colleagues in their clinical location who are also using the technology for peer support and shared learning. Organizations also learned not to give up on users too quickly, as the behavioral adaptation period takes time and some initially unsuccessful users became strong advocates after committing to the learning curve.

The panel emphasized that AI scribes are not a "silver bullet" or universal solution. They work exceptionally well for some clinicians, specialties, and visit types while providing minimal value for others. Acceptance of this heterogeneity represents maturity in LLMOps thinking—success doesn't require 100% adoption but rather enabling those who benefit most while respecting that templated workflows, certain specialties, or personal preferences may make traditional documentation methods more appropriate for some users.

Finally, the financial model matters tremendously. Traditional per-user-per-month licensing creates pressure to maintain high utilization rates to justify costs and generates administrative overhead managing license assignments. Encounter-based pricing better aligns with variable usage patterns and reduces organizational risk, though it requires different technical infrastructure for usage tracking and billing.

## Synthesis and Production LLM Deployment Principles

This panel discussion provides rich insights into production LLM deployment in healthcare settings where stakes are high, workflows are complex, and users are highly trained professionals with domain expertise exceeding the model's capabilities. Several principles emerge that likely generalize beyond healthcare to other enterprise LLMOps contexts:

- Rigorous, multi-stakeholder evaluation processes that reduce cognitive bias and test edge cases provide better vendor selection than following market trends or perceived leaders.
- Validated measurement instruments and experimental design (including RCTs where feasible) enable confident decision-making and demonstrate value to stakeholders.
- Deep workflow integration isn't optional—it's essential for adoption in environments where users have high cognitive load and low tolerance for friction.
- Training must be ongoing, mandatory, and include not just tool functionality but the conceptual frameworks for working effectively with AI systems and recognizing their limitations.
- User segmentation and flexible deployment models that accommodate heterogeneous usage patterns generate better outcomes than expecting universal adoption. Organizations should explicitly plan for 10-20% non-adoption rates rather than treating this as failure.
- Starting with users who have the most to gain rather than those perceived as most tech-savvy improves both outcomes and ROI.
- Comprehensive measurement frameworks that capture technical performance, user experience, operational outcomes, and business value provide the data needed for iterative improvement and informed decisions about scaling, modifying, or replacing deployed systems.

Perhaps most importantly, the discussion revealed that successful LLM deployment at scale requires treating it as organizational change management rather than simply technology implementation. The socio-technical system—including training, support, communication, workflow redesign, pricing models, governance, and culture—matters as much as the underlying model quality in determining whether AI systems deliver value in production healthcare settings.
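As a closing illustration of the measurement-framework principle above, the sketch below shows one way per-encounter usage events could be rolled up into the setting-specific utilization rates MSK describes (scribe-documented encounters over eligible encounters). The record fields and example values are assumptions for illustration, not any organization's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class EncounterRecord:
    """Minimal per-encounter usage event; field names are illustrative."""
    clinician_id: str
    specialty: str
    setting: str            # e.g. "outpatient", "urgent_care", "inpatient"
    scribe_used: bool       # did the clinician document with the ambient scribe?
    note_edit_seconds: int  # time spent reviewing/editing the draft note

def utilization_by_setting(records):
    """Utilization = scribe-documented encounters / eligible encounters,
    computed separately per care setting (the denominator varies by setting)."""
    used, eligible = defaultdict(int), defaultdict(int)
    for r in records:
        eligible[r.setting] += 1
        if r.scribe_used:
            used[r.setting] += 1
    return {setting: used[setting] / eligible[setting] for setting in eligible}

records = [
    EncounterRecord("dr_a", "primary_care", "outpatient", True, 90),
    EncounterRecord("dr_a", "primary_care", "outpatient", False, 0),
    EncounterRecord("dr_b", "hematology", "outpatient", True, 240),
    EncounterRecord("dr_c", "emergency_medicine", "urgent_care", True, 60),
]
print(utilization_by_setting(records))
# {'outpatient': 0.6666666666666666, 'urgent_care': 1.0}
```

The same per-encounter events would also feed encounter-based billing reconciliation and note-edit-time trends, which is why several panelists treated usage tracking as core infrastructure rather than an afterthought.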
