
Best Practices for Implementing LLMs in High-Stakes Applications

Moonhub 2023

The presentation discusses implementing LLMs in high-stakes use cases, particularly in healthcare and therapy contexts. It addresses key challenges including robustness, controllability, bias, and fairness, while providing practical solutions such as human-in-the-loop processes, task decomposition, prompt engineering, and comprehensive evaluation strategies. The speaker emphasizes the importance of careful consideration when implementing LLMs in sensitive applications and provides a framework for assessment and implementation.

Industry

Healthcare

Overview

This case study is derived from a conference talk by Yada, a professional with a background in ML engineering and research who works at the intersection of NLP and healthcare/mental health, currently at Moonhub (techster.com). The presentation addresses the critical question of how to responsibly incorporate large language models into high-stakes environments, particularly in healthcare and mental health applications.

The talk is notably cautionary in nature, emphasizing that while LLMs show impressive capabilities in medicine and law according to various headlines, there are significant caveats and failure modes that practitioners must consider before deploying these models in production environments where errors can have serious consequences.

The High-Stakes Context: Therapy Bot Example

The speaker uses a hypothetical therapy bot as a running example throughout the talk to illustrate the unique challenges of deploying LLMs in sensitive domains. This is an effective pedagogical choice because therapy bots represent one of the most challenging applications for LLMs, requiring consideration of the failure modes and mitigation strategies discussed in the sections that follow.

Failure Modes and Robustness Concerns

The presentation acknowledges several categories of LLM failure that are particularly concerning in high-stakes environments.

Best Practices for Production LLM Deployment

Learning from Previous Paradigms

The speaker recommends looking back at pre-LLM approaches for lessons in controllability, specifically dialogue systems built with platforms like DialogFlow, which relied on explicitly defined intents and conversation flows rather than open-ended generation.

These structured approaches offer insights into how to maintain controllability even when incorporating LLMs into the system.
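As a rough illustration (not code from the talk), a DialogFlow-style conversation can be modeled as an explicit state machine: every state and transition is enumerable and auditable, and an LLM, if used at all, would only rephrase the canned prompt for the current state. The states and intents below are hypothetical.

```python
# Hypothetical therapy-bot flow as an explicit, auditable state machine.
# Each state has a canned prompt and an enumerated set of legal transitions.
STATES = {
    "greeting": {"prompt": "Hello, how are you feeling today?",
                 "next": {"ok": "check_in", "crisis": "escalate"}},
    "check_in": {"prompt": "Would you like to talk about it?",
                 "next": {"yes": "session", "no": "goodbye"}},
    "escalate": {"prompt": "Connecting you with a human counselor.", "next": {}},
    "session":  {"prompt": "I'm listening.", "next": {"done": "goodbye"}},
    "goodbye":  {"prompt": "Take care.", "next": {}},
}

def step(state: str, intent: str) -> str:
    """Advance to the next state for a classified intent; stay put otherwise."""
    return STATES[state]["next"].get(intent, state)
```

Because transitions are a closed set, a reviewer can verify that a "crisis" intent always reaches escalation, something that is hard to guarantee with free-form generation.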

Human-in-the-Loop Design Patterns

A central theme of the talk is the importance of human oversight in high-stakes LLM applications. The speaker outlines several implementation patterns for keeping humans in the review loop.
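One common human-in-the-loop pattern, sketched here as an assumption rather than taken from the talk, is a gate that routes model outputs to human review whenever confidence is low or a risk signal fires. The threshold and keyword list are illustrative placeholders.

```python
# Illustrative human-in-the-loop gate: low-confidence or risk-flagged replies
# are queued for a human reviewer instead of being sent automatically.
RISK_TERMS = {"self-harm", "suicide", "overdose"}

def route(reply: str, confidence: float, threshold: float = 0.85):
    """Return ('human_review', reply) or ('auto_send', reply)."""
    flagged = confidence < threshold or any(
        term in reply.lower() for term in RISK_TERMS
    )
    return ("human_review", reply) if flagged else ("auto_send", reply)
```

In practice the risk check would be a trained classifier rather than keywords, but the routing structure is the same: the model proposes, and a human disposes of the uncertain cases.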

Task Decomposition Strategies

Breaking complex tasks into smaller, more manageable subtasks is recommended as a risk mitigation strategy. For example, in an information retrieval product, rather than attempting to match a question against an entire document, the system should decompose the problem into smaller retrieval and matching steps, first narrowing down to the most relevant passages.
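The decomposition above can be sketched as follows; the passage splitting and the crude lexical-overlap scorer are stand-ins for whatever chunking and retrieval scoring a real system would use.

```python
# Sketch of task decomposition for retrieval: split the document into
# passages, score each against the question, and hand only the best passage
# to the downstream answering step. Scoring here is a toy lexical overlap.
def split_passages(document: str, size: int = 200) -> list[str]:
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score(question: str, passage: str) -> float:
    q = set(question.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def best_passage(question: str, document: str, size: int = 200) -> str:
    return max(split_passages(document, size), key=lambda p: score(question, p))
```

Each subtask (splitting, scoring, answering) can then be evaluated and debugged in isolation, which is the point of the decomposition.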

Simplifying the Problem Space

The talk offers several guidelines for making LLM tasks more tractable.

Prompt Management and Retrieval

For organizations using off-the-shelf models through APIs, the speaker emphasizes the importance of disciplined prompt management and retrieval practices.
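As one possible shape for such discipline (an assumption, not the speaker's implementation), prompts can be stored in an explicitly versioned registry so that every output is traceable to the exact template wording that produced it. The template names and versions below are hypothetical.

```python
# Illustrative versioned prompt registry: templates are keyed by
# (name, version), so a change in wording is a new version, never an
# in-place edit, and logged outputs can cite the version used.
from string import Template

PROMPTS = {
    ("summarize_intake", "v1"): Template(
        "Summarize the patient intake note:\n$note"),
    ("summarize_intake", "v2"): Template(
        "Summarize the intake note in 3 bullet points:\n$note"),
}

def render(name: str, version: str, **fields) -> str:
    return PROMPTS[(name, version)].substitute(**fields)
```

Pinning the version in logs makes it possible to reproduce and audit any past model interaction, which matters when an output is later questioned.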

Ensemble Methods

Drawing from traditional data science practices, the speaker advocates for ensemble approaches that combine the outputs of multiple models or prompts.
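A minimal version of this idea, sketched under the assumption of a classification-style task, is majority voting across several model runs, abstaining (and deferring to a human) when there is no agreement.

```python
# Sketch of an ensemble pattern: query several models or prompts on the same
# input, keep the majority answer, and abstain when agreement is too weak.
from collections import Counter

def majority_vote(answers: list[str], min_agreement: int = 2):
    """Return the majority answer, or None to signal 'route to a human'."""
    label, count = Counter(answers).most_common(1)[0]
    return label if count >= min_agreement else None
```

The abstention path is what makes ensembling useful in high-stakes settings: disagreement between models becomes an explicit signal of uncertainty rather than a silent coin flip.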

When Not to Use LLMs

A refreshingly practical recommendation is to avoid LLMs when simpler approaches suffice. In high-stakes environments, this might mean preferring deterministic, auditable methods wherever they can do the job.
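One way to apply this, sketched here with illustrative patterns rather than anything from the talk, is a deterministic rule layer that handles the unambiguous cases itself, so that only residual inputs would ever reach an LLM at all.

```python
# Sketch of "simpler approach first": a deterministic keyword rule handles
# unambiguous crisis messages on a fixed, auditable path; only everything
# else would be passed to an LLM pipeline. Patterns are illustrative.
import re

CRISIS_PATTERN = re.compile(r"\b(suicide|kill myself|end it all)\b", re.I)

def triage(message: str) -> str:
    if CRISIS_PATTERN.search(message):
        return "crisis_protocol"   # deterministic, testable, auditable
    return "llm_pipeline"          # residual traffic may use the LLM
```

The rule layer is trivial to test exhaustively, which is precisely the property an LLM cannot offer for its most safety-critical behavior.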

Fine-Tuning Considerations

The speaker suggests that fine-tuning your own LLM offers several advantages for high-stakes applications.

Evaluation Frameworks for High-Stakes Deployment

The talk emphasizes rigorous evaluation practices.
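As one hedged illustration of what rigor can mean here (not an evaluation harness from the talk), high-stakes evaluation typically tracks a safety metric separately from accuracy, since a single unsafe output can matter more than a small accuracy delta. The function names and checks below are hypothetical.

```python
# Illustrative evaluation sketch: report accuracy and an unsafe-output rate
# as separate metrics, so safety regressions are never averaged away.
def evaluate(examples, predict, is_unsafe):
    """examples: list of (input, expected); predict/is_unsafe: callables."""
    correct = sum(predict(x) == y for x, y in examples)
    unsafe = sum(is_unsafe(predict(x)) for x, _ in examples)
    n = len(examples)
    return {"accuracy": correct / n, "unsafe_rate": unsafe / n}
```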

Critical Assessment of External Benchmarks

The speaker provides a framework for evaluating external benchmark results and headlines about LLM capabilities: assess how closely the benchmark task matches your production task, and how closely the benchmark data matches your production data.

If there is significant distance on either dimension, impressive benchmark numbers may not translate to production performance.

Open Questions Highlighted

The talk acknowledges several unresolved challenges in deploying LLMs in high-stakes settings.

Key Takeaways for LLMOps Practitioners

This talk provides a sobering counterbalance to the hype around LLM capabilities. For practitioners working on high-stakes applications, the key messages center on caution, human oversight, and rigorous evaluation.

The emphasis throughout is on responsible deployment and risk mitigation rather than capability maximization, which is appropriate guidance for anyone deploying LLMs in environments where errors have significant consequences.
