The presentation discusses implementing LLMs in high-stakes use cases, particularly in healthcare and therapy contexts. It addresses key challenges including robustness, controllability, bias, and fairness, while providing practical solutions such as human-in-the-loop processes, task decomposition, prompt engineering, and comprehensive evaluation strategies. The speaker emphasizes the importance of careful consideration when implementing LLMs in sensitive applications and provides a framework for assessment and implementation.
This case study is derived from a conference talk by Yada, an ML engineer and researcher working at the intersection of NLP and healthcare/mental health, currently at Moonhub (techster.com). The presentation addresses the critical question of how to responsibly incorporate large language models into high-stakes environments, particularly in healthcare and mental health applications.
The talk is deliberately cautionary: while headlines tout impressive LLM capabilities in medicine and law, there are significant caveats and failure modes that practitioners must consider before deploying these models in production environments where errors have serious consequences.
The speaker uses a hypothetical therapy bot as a running example throughout the talk to illustrate the unique challenges of deploying LLMs in sensitive domains. This is an effective pedagogical choice because therapy bots represent one of the most challenging applications for LLMs, requiring consideration of:
Therapeutic Framework Adherence: A therapy bot needs to operate within established therapeutic frameworks such as Cognitive Behavioral Therapy (CBT) or family dynamics approaches. This requires a level of controllability that general-purpose LLMs may struggle to maintain consistently.
Bias and Fairness Concerns: The speaker highlights a concrete example where speech-to-text components in a call-in therapy service could have accuracy issues with accented speakers, leading to downstream degradations in the experience for users with non-standard accents. This cascading effect of bias through the ML pipeline is a critical consideration for any production LLM system.
State Management: Healthcare applications require careful tracking of patient information including social history, emergency contacts, and other critical data during intake flows. The system must reliably maintain dialogue state throughout interactions.
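One way to make that reliability concrete is to track critical state in a typed structure outside the LLM, rather than trusting the model's context window to remember it. A minimal sketch, with hypothetical field names (`social_history`, `emergency_contact`) that are illustrative rather than taken from the talk:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class IntakeState:
    """Critical patient-intake fields tracked outside the LLM context.
    Field names are illustrative, not from the talk."""
    social_history: Optional[str] = None
    emergency_contact: Optional[str] = None

    REQUIRED = ("social_history", "emergency_contact")  # class var, not a dataclass field

    def missing_fields(self) -> List[str]:
        """Fields the intake flow still needs to collect before proceeding."""
        return [f for f in self.REQUIRED if getattr(self, f) is None]

# The dialogue loop keeps prompting until nothing is missing.
state = IntakeState(social_history="lives alone, works remotely")
```

The dialogue manager can then drive the conversation from `missing_fields()`, so no required field is silently skipped.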
The presentation acknowledges several categories of LLM failure that are particularly concerning in high-stakes environments:
Distribution Shift Robustness: Models may perform differently when encountering inputs that differ from their training distribution, which is almost guaranteed in real-world healthcare deployments where patient populations are diverse.
Semantically Equivalent Perturbations: Small, meaning-preserving changes to inputs that should not affect outputs can nevertheless cause significant changes in model behavior.
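A cheap way to probe this failure mode is a paraphrase-invariance check: ask the same thing several ways and verify the output does not change. A toy sketch, with `classify` standing in for a call to the deployed model:

```python
def classify(text: str) -> str:
    """Toy stand-in for the deployed model; a real check calls the live system."""
    return "urgent" if "chest pain" in text.lower() else "routine"

# Meaning-preserving rewrites of the same report should get the same label.
paraphrases = [
    "I have been having chest pain since this morning.",
    "Since this morning I've had chest pain.",
]
labels = {classify(p) for p in paraphrases}
is_robust = len(labels) == 1  # a single label across all paraphrases => invariant
```

Disagreement across paraphrases is a signal to tighten prompts, add training data, or route those inputs to a human.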
Low Resource Settings: Performance degradation in domains or languages with limited training data.
Factuality Issues: As context windows and grounding documents increase in size, factuality becomes increasingly challenging to maintain. The speaker notes this remains an issue even with more capable models like GPT-4.
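A rough lexical groundedness filter can flag answers that drift from their source documents before a human ever reviews them. The heuristic below is illustrative only, a cheap first filter rather than a real factuality checker:

```python
from typing import List

def grounded_fraction(answer: str, sources: List[str], threshold: float = 0.5) -> float:
    """Fraction of answer sentences whose content words mostly appear in the
    sources -- a crude lexical proxy for groundedness, not a factuality model."""
    def norm(w: str) -> str:
        return w.lower().strip(".,!?")
    source_words = {norm(w) for s in sources for w in s.split()}
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    def supported(sentence: str) -> bool:
        words = [norm(w) for w in sentence.split() if len(norm(w)) > 3]
        return not words or sum(w in source_words for w in words) / len(words) >= threshold
    return sum(supported(s) for s in sentences) / max(len(sentences), 1)
```

Answers scoring below a chosen cutoff can be blocked or escalated; a production system would use an NLI or citation-checking model instead of word overlap.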
The speaker recommends looking back at pre-LLM approaches for lessons in controllability, specifically dialogue systems built with platforms like DialogFlow, which relied on explicitly defined intents, slots, and conversation flows.
These structured approaches offer insights into how to maintain controllability even when incorporating LLMs into the system.
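The controllability of those older systems came from making the flow graph explicit. A minimal sketch of that pattern, with illustrative intent names, where a transition proposed by an LLM-based classifier is only accepted if the designed graph allows it:

```python
# Allowed transitions of a hand-designed intake flow; intent names are illustrative.
TRANSITIONS = {
    "greeting": {"collect_contact", "goodbye"},
    "collect_contact": {"schedule_session", "goodbye"},
    "schedule_session": {"goodbye"},
}

def next_state(current: str, proposed: str) -> str:
    """Accept a proposed transition (e.g. from an LLM intent classifier)
    only if the designed flow graph allows it; otherwise stay put."""
    return proposed if proposed in TRANSITIONS.get(current, set()) else current
```

The LLM can still generate the wording of each turn, but it can never move the conversation outside the designed flow.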
A central theme of the talk is the importance of human oversight in high-stakes LLM applications, and the speaker outlines several implementation patterns for keeping a human in the loop.
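One common pattern is confidence-gated review: route low-confidence or safety-flagged outputs to a human instead of straight to the user. A minimal sketch, where the threshold and the shape of `prediction` are assumptions rather than details from the talk:

```python
REVIEW_THRESHOLD = 0.9  # assumed cutoff; tune per task and risk tolerance

def route(prediction: dict) -> str:
    """Return 'auto' to send the model's reply directly, 'human' to escalate.
    `prediction` is a hypothetical shape: a confidence score plus safety flags."""
    if prediction["confidence"] >= REVIEW_THRESHOLD and not prediction["flags"]:
        return "auto"
    return "human"

decision = route({"confidence": 0.95, "flags": ["self-harm mention"]})  # escalates
```

Note that a safety flag overrides even a high-confidence score, which is the right default in a therapy setting.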
Breaking complex tasks into smaller, more manageable subtasks is recommended as a risk mitigation strategy. For example, in an information retrieval product, rather than attempting to match a question against an entire document, the system can first narrow the search to the most relevant passages and then answer from those smaller units.
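The retrieval step of such a decomposition can be sketched with a toy overlap scorer standing in for an embedding retriever; the point is that answering from one short passage is a smaller, more checkable task than reasoning over the whole document:

```python
from typing import List

def best_passage(question: str, passages: List[str]) -> str:
    """Return the passage with the highest word overlap with the question --
    a toy stand-in for an embedding retriever. The LLM then answers from
    this one short passage instead of the full document."""
    q = set(question.lower().split())
    return max(passages, key=lambda p: len(q & set(p.lower().split())))

passages = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Standard shipping takes five business days.",
]
top = best_passage("what is the refund policy", passages)
```

Each stage can then be evaluated and monitored independently, which is much harder when one prompt does everything at once.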
The talk offers several guidelines for making LLM tasks more tractable.
For organizations using off-the-shelf models through APIs, the speaker emphasizes several additional precautions.
Drawing from traditional data science practices, the speaker advocates for ensemble approaches.
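A simple form of this is majority voting over several model runs (or several different models), escalating to a human when agreement is too low. A minimal sketch:

```python
from collections import Counter
from typing import List, Optional

def majority_vote(labels: List[str], min_agreement: float = 0.6) -> Optional[str]:
    """Majority label across model runs, or None (escalate to a human) when
    agreement is too low. The ensemble can be one model sampled several
    times or several different models queried once each."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) >= min_agreement else None
```

Agreement doubles as a cheap uncertainty signal, which pairs naturally with the human-in-the-loop routing discussed earlier.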
A refreshingly practical recommendation: in high-stakes environments, avoid LLMs altogether when simpler approaches suffice.
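For instance, deterministic rules can handle the unambiguous cases in a fully auditable way, with only the ambiguous remainder ever reaching a model. A sketch with illustrative patterns (not a real, clinically reviewed triage list):

```python
import re

# Illustrative patterns only -- a real crisis-triage list would be clinically reviewed.
CRISIS_PATTERNS = [r"\bsuicid", r"\bhurt myself\b"]

def triage(message: str) -> str:
    """Deterministic rules catch the unambiguous cases; only the ambiguous
    remainder would ever reach an LLM."""
    if any(re.search(p, message.lower()) for p in CRISIS_PATTERNS):
        return "escalate_to_human"  # auditable, zero-variance path
    return "send_to_llm"
```

The rule-based path has zero variance and is trivially testable, which matters when the cost of a miss is high.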
The speaker suggests that fine-tuning your own LLM offers several advantages for high-stakes applications.
The talk emphasizes rigorous evaluation practices.
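At minimum this means a fixed regression set that is re-run on every prompt or model change. A toy harness, with `answer` standing in for the deployed pipeline:

```python
def answer(question: str) -> str:
    """Stand-in for the deployed pipeline under evaluation."""
    return {"capital of france?": "Paris"}.get(question.lower(), "unknown")

# A fixed regression set, re-run on every prompt or model change.
EVAL_SET = [("Capital of France?", "Paris"), ("Capital of Spain?", "Madrid")]

def accuracy(system, eval_set) -> float:
    """Exact-match accuracy over the regression set."""
    return sum(system(q) == gold for q, gold in eval_set) / len(eval_set)

score = accuracy(answer, EVAL_SET)
```

Real evaluations would add domain-specific metrics and human review, but even a small exact-match suite catches silent regressions between releases.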
The speaker provides a framework for evaluating external benchmark results and headlines about LLM capabilities: if there is significant distance between the benchmark setting and your own on either dimension, impressive benchmark numbers may not translate to production performance.
The talk acknowledges several unresolved challenges in deploying LLMs in high-stakes settings.
This talk provides a sobering counterbalance to the hype around LLM capabilities. For practitioners working on high-stakes applications, the key messages center on caution, oversight, and deliberate risk mitigation.
The emphasis throughout is on responsible deployment and risk mitigation rather than capability maximization, which is appropriate guidance for anyone deploying LLMs in environments where errors have significant consequences.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Yahoo! Finance built a production-scale financial question answering system using multi-agent architecture to address the information asymmetry between retail and institutional investors. The system leverages Amazon Bedrock AgentCore and employs a supervisor-subagent pattern where specialized agents handle structured data (stock prices, financials), unstructured data (SEC filings, news), and various APIs. The solution processes heterogeneous financial data from multiple sources, handles temporal complexities of fiscal years, and maintains context across sessions. Through a hybrid evaluation approach combining human and AI judges, the system achieves strong accuracy and coverage metrics while processing queries in 5-50 seconds at costs of 2-5 cents per query, demonstrating production viability at scale with support for 100+ concurrent users.
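The supervisor-subagent pattern described above can be sketched as a router that dispatches each question to a specialist; the routing keywords and agent names below are illustrative, not Yahoo! Finance's actual implementation:

```python
def structured_data_agent(q: str) -> str:
    return "[structured-data agent] " + q   # would query price/financials stores

def filings_agent(q: str) -> str:
    return "[filings agent] " + q           # would search SEC filings and news

def supervisor(question: str) -> str:
    """Route each question to a specialist sub-agent. Keyword routing is a
    toy stand-in for the LLM-based routing a real supervisor would use."""
    if any(k in question.lower() for k in ("price", "revenue", "eps")):
        return structured_data_agent(question)
    return filings_agent(question)
```

Keeping each sub-agent narrow makes its tool use and failure modes easier to evaluate in isolation, which is part of why the pattern scales.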
Digits, a company providing automated accounting services for startups and small businesses, implemented production-scale LLM agents to handle complex workflows including vendor hydration, client onboarding, and natural language queries about financial books. The company evolved from a simple 200-line agent implementation to a sophisticated production system incorporating LLM proxies, memory services, guardrails, observability tooling (Phoenix from Arize), and API-based tool integration using Kotlin and Golang backends. Their agents achieve a 96% acceptance rate on classification tasks with only 3% requiring human review, handling approximately 90% of requests asynchronously and 10% synchronously through a chat interface.