Lubu Labs: Clinical-Grade Patient Education Agent with LangGraph and LangSmith

Company

Lubu Labs

Title

Clinical-Grade Patient Education Agent with LangGraph and LangSmith

Industry

Healthcare

Link

https://www.lubulabs.com/ai-blog/healthcare-ai-patient-education

Year

2026

Summary (short)

Lubu Labs built a production AI agent for a digital health platform that helps patients understand their health test results from camera-based scans measuring 30+ vital signs. The system needed to provide plain-language medical explanations, answer follow-up questions conversationally, and route uncertain cases to clinicians—all while meeting healthcare regulatory requirements. The solution used LangGraph for explicit control flow with confidence-based routing decisions, RAG over a versioned medical knowledge base, and LangSmith for audit-grade observability. Key results included approximately 15% of conversations appropriately triggering human review, an 80% accuracy rate in routing decisions validated by clinicians, a 40% reduction in false positive reviews after threshold tuning, and very low rates of inappropriate clinical advice in production validated through weekly audits.

## Overview Lubu Labs developed and deployed a clinical-grade patient education AI agent for a digital health platform that converts smartphone cameras into health monitoring devices capable of measuring 30+ vital signs through facial scanning. The system specifically focuses on helping patients understand dermatology test results in conversational natural language. This case study is particularly instructive for LLMOps practitioners because it demonstrates how the stakes of production AI deployment escalate dramatically in regulated, safety-critical domains like healthcare, where hallucinations aren't just user experience problems but potential patient harm scenarios. The case study explicitly positions itself as a follow-up to their previous SDR chatbot deployment, but with significantly higher stakes. Where the SDR agent touched revenue operations with tolerable error margins, this patient education agent operates in a domain where regulatory compliance (HIPAA, medical device regulations) and clinical safety are non-negotiable. The authors make clear that the difference between a healthcare demo and a shippable production system wasn't about finding better prompts or more capable models—it was about implementing **explicit control flow** and **audit-grade observability** as foundational architectural principles. ## The Production Challenge The system requirements reveal the complexity of healthcare AI deployment. The agent needed to interpret dermatology test results from camera-based health scans, provide plain-language summaries accessible to non-medical users, handle follow-up questions conversationally, deliver tailored educational content based on findings, and guide patients toward appropriate next steps ranging from self-care to booking clinical appointments. What distinguishes this from typical chatbot implementations are the hard constraints imposed by the healthcare context. Clinical accuracy is non-negotiable—hallucinated information can directly harm patients. Regulatory audit trails are mandatory under HIPAA and medical device regulations, requiring immutable logs with complete traceability of what the system said, when, and based on what information. Uncertainty handling becomes critical because the system must know when to route to human clinicians rather than attempting answers that could be unsafe. Perhaps most importantly, there's a bright line boundary between education and diagnosis that must be enforced in code architecture, not merely through prompt engineering. The team identified two immediate technical challenges that threatened deployment viability. First, **invisible hallucinations** where the model would sound confident even when knowledge base retrieval was weak or uncertain. Second, **unclear routing logic** where it was difficult to determine programmatically when it was safe to provide educational content versus when cases required handoff to clinical professionals. ## Why Standard Approaches Fail in Healthcare The case study criticizes common LLM deployment patterns that lack explicit control flow and comprehensive observability. Without these foundations, teams cannot answer basic but critical questions: Did a particular answer come from approved medical knowledge or from the model's training data? What exactly did the AI say to a patient three weeks ago and what was its reasoning? Are iterative prompt changes causing quality drift in clinical accuracy? When should the model explicitly say "I don't know" versus attempting to formulate an answer? The authors describe becoming reactive rather than proactive in one early test where they discovered a hallucination three days after the conversation through manual review. In healthcare AI, this detection latency is simply unacceptable from both a patient safety and liability perspective. This experience drove their architectural focus on making system behavior visible and auditable by design. ## Technical Architecture: LangGraph State Machine The solution models the patient education agent as a **LangGraph state machine** where each stage of conversation handling is an explicit node, and conversation state carries medical findings, confidence scores, and routing decisions forward through the graph. The case study provides concrete code examples showing their `PatientEducationState` class extending `MessagesState` with additional fields for test results, confidence scores, review flags, and thread IDs for checkpointing and audit trails. The architecture's critical insight is making safety boundaries explicit in code rather than implicit in prompts. Their `should_route_to_human` function is explicitly not an LLM decision—it's threshold-based routing logic that examines confidence scores and review flags to determine whether cases proceed to the education module or get routed to clinician review. The confidence threshold is set at 0.7, meaning cases with lower confidence scores automatically trigger human review regardless of what the LLM might say. The graph structure includes nodes for extracting findings from test results, retrieving from the medical knowledge base, delivering educational content, and triggering clinician review. Conditional edges implement the routing logic. The entire graph compiles with a PostgreSQL-backed checkpointer that enables session resumability—allowing patients to leave conversations and return hours or days later without losing context. This checkpointing also serves the audit trail requirement by creating persistent records of conversation state transitions. The key architectural principles emphasized are: explicit routing based on confidence scores rather than LLM decisions, LangGraph checkpointing for session resumability, and human-in-the-loop as a first-class designed path rather than an afterthought fallback mechanism. ## Medical Knowledge Base with Confidence Scoring The knowledge architecture uses RAG (Retrieval-Augmented Generation) over an approved, version-controlled medical knowledge base. Critically, every retrieval operation returns not just content but a confidence score based on evidence quality. This dual return allows the system to make programmatic decisions about whether retrieved information is sufficient to ground a safe answer. In practice, knowledge base entries are tagged by source type—peer-reviewed literature, clinical guidelines, or patient education materials—which affects their authority weighting. Retrieval scores combine semantic similarity (how well the query matches the content) with source quality (the authority level of the source material). When retrieval produces low-confidence results, this automatically triggers human review rather than allowing the system to proceed with weak grounding. The case study emphasizes treating the medical knowledge base like production code with version control, change tracking, and rollback capability. This isn't just good practice—it's a regulatory requirement. In healthcare, the question "what knowledge did the AI have access to at the moment it provided this answer?" must be answerable months or years later during audits or incident investigations. The knowledge base version is part of the metadata tagged on every LangSmith trace for exactly this reason. ## Explicit Confidence-Based Routing Rather than relying on the model to determine when it should be uncertain, the team made uncertainty a state machine decision based on quantifiable metrics. Every answer receives a combined confidence score that incorporates both retrieval quality (how good were the knowledge base matches) and LLM certainty (how confident is the model in its formulation). Threshold-based routing uses 0.7 as the cutoff: below 0.7 routes to human review, at or above 0.7 proceeds with automated response. Beyond numerical thresholds, specific conversation triggers also mandate review. When patients ask about symptoms, medications, or anything resembling diagnosis, the system routes to human review regardless of confidence scores. This implements the education versus diagnosis boundary as code-level guardrails. The philosophy here is crucial for healthcare AI deployment: don't try to make the model "smarter" at edge cases through better prompting or fine-tuning. Instead, route edge cases to humans explicitly through architectural decisions. The goal is not to answer every question autonomously, but to answer every question safely—which often means knowing when not to answer. The results after threshold tuning demonstrate this approach's viability. Approximately 15% of conversations triggered human review, mostly for edge cases or off-topic questions. Of those routed cases, about 80% were determined to be appropriately routed based on clinician feedback loops, indicating the routing logic was largely correct. The false positive rate (cases unnecessarily routed to humans) decreased by 40% after tuning the threshold from an initial 0.8 down to 0.7, finding a better balance between over-cautiousness and appropriate autonomy. Production monitoring showed very low rates of inappropriate clinical advice, validated through weekly audit sampling protocols. ## LangSmith for Audit-Grade Observability LangSmith tracing provides end-to-end timelines for every conversation, capturing LLM calls, retrieval results, confidence scores, routing decisions, and associated metadata. The case study positions this not as a "nice to have" debugging tool but as fundamental liability protection infrastructure. The regulatory compliance requirement for immutable audit trails means that months after a conversation, the system must be able to reconstruct exactly what the AI said, when it said it, based on what knowledge, using what model version, with what confidence level. This level of traceability isn't achievable with standard application logging—it requires specialized observability tooling that understands the structure of LLM applications. Beyond compliance, LangSmith traces enable debugging medical accuracy issues by exposing the exact retrieval results and model reasoning that led to particular outputs. Without these traces, debugging becomes guesswork—trying to reproduce problems without visibility into what actually happened in production. The case study also describes using LangSmith for continuous monitoring to detect hallucination patterns, specifically cases where answers weren't properly grounded in the knowledge base despite sounding plausible. The PostgreSQL-backed checkpointing integration with LangSmith provides session resumability from an operational perspective while also creating the persistent state history needed for audit purposes. This dual function—operational capability and compliance requirement—exemplifies how infrastructure choices in healthcare AI serve multiple critical purposes. The practical advice offered is to tag every LangSmith run with metadata required for compliance: environment identifiers, patient IDs (properly hashed for privacy), session IDs, knowledge base versions, and model versions. The emphasis is that in healthcare, the question "what happened?" must remain answerable months or years later, not just during active development. ## Production Failure Modes and Resolutions ### The Invisible Hallucination Problem During testing, the team discovered instances where the AI confidently explained medical conditions not actually present in their approved knowledge base. The root cause analysis revealed that when retrieval returned low-quality matches, the model filled knowledge gaps with plausible-sounding but unverified information from its training data. This is particularly dangerous in healthcare because confident but false medical statements can directly harm patients by encouraging inappropriate self-treatment or discouraging necessary clinical consultation. LangSmith traces exposed the failure mechanism: retrieval was producing weak matches with low confidence scores, but the model's response didn't reflect that uncertainty. Instead of acknowledging insufficient information, the model "bridged the gap" with training data rather than restricting itself to approved knowledge sources. The fix involved enforcing grounding constraints at the architectural level. If retrieval confidence falls below 0.6, the system now explicitly returns "I don't have enough information to answer safely" and routes the case to human review. Only high-confidence documents are passed into the LLM context window. Additionally, the team built a LangSmith evaluation dataset containing "known unknowns"—questions they intentionally don't have approved answers for—to catch this hallucination pattern during testing before deployment. The key lesson articulated is that in healthcare, "I don't know" is a feature, not a bug. Building explicit guardrails that prevent the model from bridging low-confidence gaps is essential for safe deployment. ### The Compliance Gap: No Audit Trail During a test audit simulation, the team discovered they couldn't reconstruct what the AI had told a patient three weeks earlier. Early logging implementations had captured final messages but not the full reasoning chain—what was retrieved, what routing decisions were made, what confidence scores were calculated. This was identified as a deployment blocker because healthcare regulations require immutable audit trails where "the AI said X" isn't sufficient documentation. The complete record must show "the AI said X because it retrieved Y with confidence Z at timestamp T using knowledge base version V and model version M." The fix involved comprehensive LangSmith metadata tagging with all compliance-relevant identifiers, structured trace export capabilities for compliance teams, PostgreSQL checkpointing for session continuity, and PII redaction at the trace level (redacted for monitoring dashboards, full detail preserved for audit exports with appropriate access controls). The lesson emphasized is that observability in healthcare isn't primarily about debugging—it's about liability protection. This infrastructure must be built from day one, not retrofitted after discovering compliance gaps. ## Production Deployment Philosophy The case study concludes by arguing that in healthcare AI, observability and explicit control flow are not optional optimizations—they're liability protection mechanisms that determine whether a system is legally and ethically shippable. The value of LangSmith in this deployment wasn't about improving model intelligence or capability. It was about creating visibility into system behavior and audit trails that satisfy regulatory requirements. The core principles for shipping healthcare AI in production are identified as: medical grounding is mandatory through RAG over approved knowledge bases with confidence scoring, preventing the model from bridging gaps with training data; uncertainty is a feature requiring explicit confidence thresholds and human routing for low-confidence cases; audit trails equal liability protection through comprehensive LangSmith trace metadata; and boundaries must be implemented in code, not prompts, using explicit routing logic rather than hoping the model will understand safety constraints. The single most important practice recommended is making clinical accuracy measurable through LangSmith evaluations with known-good and known-bad examples, then monitoring it continuously through production traces with alerting on anomalous patterns. ## Critical Assessment While this case study provides valuable technical patterns for healthcare AI deployment, several aspects warrant balanced consideration. The case study is published by Lubu Labs as marketing content demonstrating their capabilities, which naturally emphasizes successes over struggles. The metrics provided (15% human review rate, 80% appropriate routing, 40% false positive reduction) lack denominators and baseline comparisons that would allow assessment of absolute performance. The claimed "very low rate of inappropriate clinical advice in production" is vague and unquantified—in healthcare contexts, even "very low" rates of harmful advice may be unacceptable depending on severity. The validation methodology through "weekly audit sampling" suggests manual review rather than comprehensive automated testing, raising questions about coverage and what might be missed between samples. The emphasis on LangGraph and LangSmith as the enabling technologies may overstate their uniqueness—the underlying patterns (state machines, confidence thresholds, structured logging) are implementable with other tools, though LangSmith's healthcare-oriented features may reduce implementation friction. The "approved knowledge base" is mentioned but not detailed—the challenge of creating, maintaining, and validating medical knowledge bases for AI consumption is substantial and understated here. The deployment is positioned as "clinical-grade" and "shippable," but actual regulatory approval status is unclear. Is this FDA-cleared as a medical device? What specific regulatory pathway was followed? The compliance discussion focuses on audit trails and traceability but doesn't address the broader regulatory requirements for medical AI systems. Nevertheless, the case study articulates genuinely important LLMOps principles that extend beyond healthcare. The emphasis on making safety boundaries explicit in code rather than implicit in prompts, treating "I don't know" as a feature rather than failure, building observability for auditability not just debugging, and using confidence-based routing to humans for edge cases—these patterns apply to many high-stakes LLM deployments in finance, legal, and other regulated domains. The technical architecture with LangGraph state machines, confidence-scored retrieval, and structured LangSmith tracing demonstrates a mature approach to production LLM systems that moves well beyond the "chatbot with RAG" baseline many teams start with. The explicit treatment of human-in-the-loop as a first-class architectural component rather than an afterthought is particularly valuable for practitioners building systems where 100% autonomous accuracy is neither achievable nor acceptable.

Start deploying reproducible AI workflows today