WellSky, serving over 2,000 hospitals and handling 100 million forms annually, partnered with Google Cloud to address clinical documentation burden and clinician burnout. They developed an AI-powered solution focused on form automation, implementing a comprehensive responsible AI framework with emphasis on evidence citation, governance, and technical foundations. The project aimed to reduce "pajama time" (75% of surveyed nurses complete documentation after hours) while ensuring patient safety through careful AI deployment.
WellSky is a healthcare technology company that bridges the gap across the care continuum—from acute to post-acute to community care. The company serves more than 2,000 hospitals and 130,000 providers, processing over 100 million forms annually through their systems. Their core focus is on home-based healthcare, including home health, hospice, and personal care services, where clinicians visit patients directly at their homes.
The case study was presented as a panel discussion featuring Joel Doy (CTO of WellSky), Balky (Chief Architect of WellSky), and a product manager from Google’s MedLM team, highlighting the collaborative nature of their partnership with Google Cloud.
WellSky identified a critical pain point in their user base that they termed “pajama time.” A survey conducted approximately six months prior to the presentation revealed that about 75% of their home health nurses were spending time after work hours—when they should be with their families—completing clinical assessments and documentation from patient visits during the day.
This documentation burden, which stems from several factors, is a significant driver of burnout and turnover in the healthcare industry.
The challenge was clear: how could WellSky leverage generative AI to allow clinicians to spend more time with patients and less time with administrative systems, while maintaining or improving the quality of care?
WellSky’s partnership with Google evolved from an initial focus on data center migration (approximately four years prior) to exploring generative AI capabilities. The interest in GenAI was sparked by the emergence of ChatGPT about 18 months before the presentation, prompting WellSky’s leadership—including their CEO Bill Miller—to explore how this fundamental technology shift could help patients and providers.
The choice of Google as a partner was driven by several factors, not least the existing cloud relationship established during the earlier data center migration.
WellSky’s approach to building their AI-powered solution was methodical and focused on establishing sustainable foundations for future AI development. They created a small cross-functional incubation team comprising senior developers, a product manager, a clinician, and legal, security, and compliance personnel.
The team established a comprehensive governance framework of policies, practices, and guidelines for the use, development, and deployment of AI across WellSky applications, spanning legal, security, and compliance considerations.
The technical workstream focused on learning the AI capabilities available in Google's Vertex AI platform, including document AI services, with particular emphasis on mitigating responsible AI risks.
A particularly notable aspect of WellSky's implementation is their approach to grounding, which addresses the risk of hallucinated outputs, especially in ambient conversation settings. They made evidence citations a hard requirement, refusing to launch any feature without them. For document extraction or transcript-based content, every piece of AI-generated output must be accompanied by evidence linking back to the source material. This approach mirrors Google's own health search functionality, which provides evidence links to search results alongside generative answers.
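The citation requirement described above can be enforced mechanically before any AI output reaches a clinician. The sketch below is illustrative, not WellSky's actual implementation: the `ExtractedField` type and `passes_citation_gate` function are hypothetical names, and the check assumed here is that every cited snippet appears verbatim in the source document or transcript.

```python
from dataclasses import dataclass, field

@dataclass
class ExtractedField:
    """One AI-generated answer plus the source evidence that supports it."""
    name: str
    value: str
    evidence: list = field(default_factory=list)  # verbatim source snippets

def passes_citation_gate(fields, source_text):
    """Reject the whole output unless every field cites evidence that
    actually appears verbatim in the source material."""
    for f in fields:
        if not f.evidence:
            return False  # no citation at all
        if not all(snippet in source_text for snippet in f.evidence):
            return False  # cited text not found in the source
    return True

transcript = "Patient reports mild shortness of breath when climbing stairs."
ok = passes_citation_gate(
    [ExtractedField("dyspnea", "mild, exertional",
                    evidence=["mild shortness of breath when climbing stairs"])],
    transcript,
)
missing = passes_citation_gate(
    [ExtractedField("dyspnea", "severe", evidence=[])],
    transcript,
)
print(ok, missing)  # True False
```

A gate like this turns "no feature ships without citations" from a policy statement into a testable invariant at the output boundary.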
The productization workstream addressed the user experience implications of responsible AI, such as how AI-generated content and its supporting evidence are surfaced to end users.
The WellSky team shared several critical lessons from their implementation journey:
Responsible AI adoption can initially feel overwhelming. The most practical approach is to start with use cases that have fewer applicable risks and expand the scope over time as the organization develops confidence and capabilities.
Client readiness for generative AI adoption varies significantly. With thousands of customers ranging from progressive early adopters to more conservative organizations, WellSky needed to gather feedback from across this spectrum. The UI/UX design must account for these varying comfort levels and organizational readiness.
Successful adoption requires AI assistance to be optional. Forcing AI on end users creates disruptive experiences and can undermine adoption. Customers often prefer rolling out features to power users first before broader deployment.
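The opt-in and power-user-first rollout described above is typically implemented as layered feature flags. The following is a minimal sketch under assumed names (`ai_assist_enabled`, the `rollout_stage` setting), not a description of WellSky's product: the organization must enable the feature, early rollout stages restrict it to designated power users, and individual users can always opt out.

```python
def ai_assist_enabled(user, org_settings):
    """AI drafting stays optional: org-level availability, staged rollout
    to power users first, and a per-user opt-out on top."""
    if not org_settings.get("ai_assist_available", False):
        return False  # organization has not enabled the feature at all
    if org_settings.get("rollout_stage") == "power_users" and not user.get("is_power_user"):
        return False  # early rollout: only designated power users see it
    return user.get("ai_assist_opt_in", True)  # eligible, unless the user opted out

org = {"ai_assist_available": True, "rollout_stage": "power_users"}
a = ai_assist_enabled({"is_power_user": True}, org)
b = ai_assist_enabled({"is_power_user": False}, org)
c = ai_assist_enabled({"is_power_user": True, "ai_assist_opt_in": False}, org)
print(a, b, c)  # True False False
```

Keeping the per-user opt-out as the final check ensures the assistance is never forced on an end user, regardless of organizational settings.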
Perhaps most importantly, the team emphasized that there is no AI strategy without an API or data strategy. WellSky found significant value in Google Cloud’s native integration between the Vertex AI platform and database services like BigQuery, as well as healthcare-specific services like FHIR stores. This seamless integration between AI capabilities and underlying data infrastructure proved essential for their implementation.
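One concrete pattern behind "no AI strategy without a data strategy" is assembling model prompts directly from governed data stores, carrying source identifiers along so outputs remain traceable. The sketch below is hypothetical: the in-memory `VISIT_NOTES` dict stands in for rows that would come from a warehouse such as BigQuery or a FHIR store in a real deployment, and `build_grounded_prompt` is an assumed name.

```python
# Hypothetical stand-in for records fetched from BigQuery or a FHIR store.
VISIT_NOTES = {
    "visit-001": {"patient_id": "p-42",
                  "note": "Wound on left heel improving; dressing changed."},
}

def build_grounded_prompt(visit_id):
    """Assemble a prompt that carries a source identifier alongside the data,
    so the model's output can be traced back to a specific record."""
    record = VISIT_NOTES[visit_id]
    return (
        "Summarize the visit for the clinical assessment form.\n"
        f"[source: {visit_id}] {record['note']}\n"
        "Cite the [source: ...] tag for every statement."
    )

prompt = build_grounded_prompt("visit-001")
print(prompt)
```

The point of the pattern is that grounding and citation requirements become easy to satisfy when the data layer already hands the model labeled, addressable records rather than loose text.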
The case study reveals several mature LLMOps practices, including governance frameworks, grounding and citation requirements, and reusable platform components.
While specific metrics on production performance were not shared, the structured approach to building reusable components and establishing governance frameworks suggests WellSky is positioning themselves for scaled AI deployment across their product portfolio, not just a single use case.
While the presentation highlights the thoughtful approach WellSky took to implementing generative AI, a few caveats are worth noting. The presentation was made in a partnership context with Google, so there is inherent promotional motivation. Specific quantitative outcomes (reduction in documentation time, error rates, user satisfaction) were not provided—the solution appears to still be in early stages or limited rollout.
The emphasis on responsible AI governance and evidence citation requirements suggests strong awareness of the risks inherent in healthcare AI, though the actual effectiveness of these controls in preventing harmful outputs remains to be demonstrated through broader deployment. The admission that client readiness varies widely also suggests potential challenges in achieving widespread adoption.
Overall, this case study represents a thoughtful approach to deploying LLMs in a high-stakes healthcare environment, with particular attention to governance, grounding, and operational considerations that are essential for responsible production deployment.
Snorkel developed a specialized benchmark dataset and evaluation framework for AI agents in commercial insurance underwriting, working with their expert network of Chartered Property and Casualty Underwriters (CPCUs) to create realistic small business insurance scenarios. The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations; the agents were built as ReAct agents using LangGraph and the Model Context Protocol. Evaluation across frontier models revealed wide performance variation, with accuracy ranging from single digits to roughly 80% depending on the model and task complexity. Notable error modes included tool use failures in 36% of conversations and hallucinations drawn from pretrained domain knowledge rather than the provided guidelines; OpenAI models in particular hallucinated non-existent insurance products 15-45% of the time.
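The headline numbers in an evaluation like this (task accuracy, share of conversations with tool-use errors) reduce to simple aggregation over conversation logs. The sketch below uses a hypothetical log schema (`correct`, `tool_errors` per conversation), not Snorkel's actual benchmark format.

```python
def summarize_eval(conversations):
    """Aggregate agent-evaluation results: overall task accuracy and the
    share of conversations containing at least one tool-use error."""
    n = len(conversations)
    accuracy = sum(c["correct"] for c in conversations) / n
    tool_error_rate = sum(c["tool_errors"] > 0 for c in conversations) / n
    return {"accuracy": accuracy, "tool_error_rate": tool_error_rate}

logs = [
    {"correct": True,  "tool_errors": 0},
    {"correct": False, "tool_errors": 2},  # e.g. a malformed database query
    {"correct": True,  "tool_errors": 1},  # recovered despite a tool failure
    {"correct": False, "tool_errors": 0},  # e.g. hallucinated a product line
]
metrics = summarize_eval(logs)
print(metrics)  # {'accuracy': 0.5, 'tool_error_rate': 0.5}
```

Counting tool errors per conversation rather than per call matches how the 36% figure above is stated, and separating it from accuracy makes it visible when a model answers correctly despite failed tool calls.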
A panel discussion featuring leaders from Google Cloud AI, Symbol AI, Chain ML, and Deloitte discussing the adoption, scaling, and implementation challenges of generative AI across different industries. The panel explores key considerations around model selection, evaluation frameworks, infrastructure requirements, and organizational readiness while highlighting practical approaches to successful GenAI deployment in production.