## Overview
John Snow Labs, a nine-year-old company specializing in Healthcare NLP, has developed an enterprise platform for integrating multimodal medical data into unified patient journeys. The platform represents an evolution from their earlier work on single-document extraction (pathology reports, radiology reports, adverse events, clinical codes) to comprehensive patient-level analysis across years of medical history. This webinar presentation by David Talby, CTO of John Snow Labs, details the technical architecture, LLMOps considerations, and production deployment requirements for this system.
The core problem being addressed is that real patient data is fragmented across multiple sources, modalities, and time periods. A typical cancer patient in the US generates over a thousand pages of text per year, including chemotherapy records, radiology reports, psychological screenings, dietary assessments, genomic results, and more. This data exists in structured EHR systems, unstructured clinical notes, semi-structured FHIR resources, and sometimes as PDFs emailed between providers. The goal is to unify all this "garbage data" into a queryable knowledge base that clinicians, nurses, administrators, and researchers can interact with using natural language.
## Enterprise and Production Requirements
The platform is explicitly designed as an enterprise-grade production system rather than a proof-of-concept or pilot. Several key LLMOps requirements are emphasized:
**Infrastructure and Security**: The system is deployed entirely within the customer's infrastructure—on-premises or within their cloud security perimeter. It does not call external LLMs and never sends data to John Snow Labs, cloud providers, or services like OpenAI. This is critical for handling Protected Health Information (PHI) in HIPAA-compliant environments. The deployment is Kubernetes-based, with different containers for different processing components, allowing independent scaling of GPU-intensive tasks (like OCR for PDFs) versus CPU-only tasks (like tabular data processing).
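As a rough illustration of that scaling pattern, the sketch below uses the official Kubernetes Python client to declare a GPU-bound OCR worker and a CPU-only tabular worker as separate Deployments that scale independently. All names, images, and resource figures are hypothetical, not John Snow Labs' actual configuration.

```python
# Hypothetical sketch: GPU-heavy OCR and CPU-only tabular processing as
# separate Deployments, so each can be scaled independently.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

def make_deployment(name: str, image: str, replicas: int, gpu: bool) -> client.V1Deployment:
    limits = {"nvidia.com/gpu": "1"} if gpu else {"cpu": "2", "memory": "4Gi"}
    container = client.V1Container(
        name=name, image=image,
        resources=client.V1ResourceRequirements(limits=limits),
    )
    return client.V1Deployment(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1DeploymentSpec(
            replicas=replicas,
            selector=client.V1LabelSelector(match_labels={"app": name}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": name}),
                spec=client.V1PodSpec(containers=[container]),
            ),
        ),
    )

apps = client.AppsV1Api()
apps.create_namespaced_deployment("nlp", make_deployment("pdf-ocr", "registry.local/ocr:latest", replicas=4, gpu=True))
apps.create_namespaced_deployment("nlp", make_deployment("tabular-etl", "registry.local/etl:latest", replicas=8, gpu=False))
```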
**Scalability**: The platform is designed for organizations serving 10-30 million patients per year, with potentially half a million data points per patient over a 5-year period. This translates to billions of documents, making long-context LLMs "completely meaningless" for this use case—the scale requires a different architectural approach. The system supports ongoing data updates so that queries always reflect the latest patient information (e.g., "What are the latest lab results?").
**Open Standards**: After deliberation between proprietary solutions that would "do everything we want" versus open solutions that would "do almost everything we want," John Snow Labs chose to adopt the OMOP (Observational Medical Outcomes Partnership) common data model. This relational database model has an ecosystem of open-source tools, enabling interoperability with existing healthcare analytics infrastructure, Tableau/PowerBI dashboards, Python notebooks for data scientists, and standard SQL queries. Some custom extensions were added for traceability and confidence tracking, but the core remains standards-compliant.
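One practical payoff of the standards choice is that any SQL-capable tool can query the store directly. A minimal sketch, assuming a standard OMOP v5 schema (the `condition_occurrence` and `concept` tables and their columns come from the public spec; the connection string is a placeholder):

```python
# Counting patients per condition straight off the OMOP tables.
import sqlalchemy as sa

engine = sa.create_engine("postgresql://analyst@omop-db/cdm")  # placeholder DSN

query = sa.text("""
    SELECT c.concept_name, COUNT(DISTINCT co.person_id) AS n_patients
    FROM condition_occurrence co
    JOIN concept c ON c.concept_id = co.condition_concept_id
    WHERE co.condition_start_date >= :since
    GROUP BY c.concept_name
    ORDER BY n_patients DESC
""")

with engine.connect() as conn:
    for name, n in conn.execute(query, {"since": "2023-01-01"}):
        print(f"{name}: {n}")
```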
## Data Processing Pipeline
The system employs a sophisticated multi-stage pipeline for processing incoming data:
**Information Extraction**: Healthcare-specific LLMs extract over 400 types of entities from free text, including medications, anatomical parts, treatment paths, social determinants, family history, suspected events, negated diagnoses, biomarkers, histological values, and radiology findings. Critically, these models also perform assertion status detection (present, absent, historical, family-related) and relation extraction. For example, extracting "breast cancer" twice from a sentence requires understanding that the first instance refers to the patient's current condition while the second refers to family history. These models are trained and validated by medical doctors on the John Snow Labs team.
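To make the assertion-status point concrete, here is an illustrative sketch of the kind of record such a pipeline might emit per entity; the dataclass and field names are invented for this example, not John Snow Labs' actual schema:

```python
# Each extracted entity carries an assertion status, so two identical
# surface forms can mean different things clinically.
from dataclasses import dataclass
from enum import Enum

class Assertion(Enum):
    PRESENT = "present"
    ABSENT = "absent"
    HISTORICAL = "historical"
    FAMILY = "family"

@dataclass
class ExtractedEntity:
    text: str                    # surface form in the note
    label: str                   # one of the 400+ entity types
    assertion: Assertion         # output of assertion status detection
    char_span: tuple[int, int]   # offsets back into the source document

sentence = "Patient diagnosed with breast cancer; mother also had breast cancer."
entities = [
    ExtractedEntity("breast cancer", "ONCOLOGY_DIAGNOSIS", Assertion.PRESENT, (23, 36)),
    ExtractedEntity("breast cancer", "ONCOLOGY_DIAGNOSIS", Assertion.FAMILY, (54, 67)),
]
```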
**Terminology Normalization**: Extracted entities and codes from various sources (FHIR, structured EHR data, claims) are mapped to standardized terminologies, primarily SNOMED, with support for 12 other terminology systems. This addresses the real-world problem that different data sources use different coding systems (RxNorm, NDC codes, ICD-10, custom pathology codes).
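A hedged sketch of that normalization step, with a toy in-memory table standing in for a real terminology service (the SNOMED concept 44054006, type 2 diabetes mellitus, is real; the rest of the plumbing is hypothetical):

```python
# Codes and free-text terms from different systems collapse onto one
# SNOMED concept, so downstream queries see a single vocabulary.
SOURCE_TO_SNOMED = {
    ("ICD10", "E11.9"): "44054006",               # Type 2 DM without complications
    ("ICD10", "E11.65"): "44054006",              # rolls up to the same concept here
    ("FREETEXT", "type ii diabetes"): "44054006",
}

def normalize(system: str, code_or_term: str) -> str | None:
    """Map a (coding system, code or term) pair to a SNOMED concept ID."""
    key = code_or_term.strip().lower() if system == "FREETEXT" else code_or_term
    return SOURCE_TO_SNOMED.get((system, key))

# The same clinical fact arriving as an ICD-10 code or as free text
# normalizes to one concept:
assert normalize("ICD10", "E11.9") == normalize("FREETEXT", "Type II Diabetes")
```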
**Temporal Normalization**: Relative dates in clinical text (e.g., "last year," "three days ago") are normalized to absolute dates based on the document's timestamp. This enables temporal queries like "find patients who received a positive diagnosis within the last 18 months" or "what drugs are prescribed after patients are diagnosed with atrial fibrillation."
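A minimal sketch of that anchoring logic, covering just two patterns (a production resolver handles far more):

```python
# Resolve relative date phrases against the document's timestamp.
import re
from datetime import date, timedelta

WORD_NUMS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def normalize_date(phrase: str, doc_date: date) -> date | None:
    phrase = phrase.lower().strip()
    if phrase == "last year":
        return doc_date.replace(year=doc_date.year - 1)
    m = re.fullmatch(r"(\w+) days? ago", phrase)
    if m:
        token = m.group(1)
        n = WORD_NUMS.get(token, int(token) if token.isdigit() else None)
        if n is not None:
            return doc_date - timedelta(days=n)
    return None  # unrecognized pattern: defer to a richer resolver

doc_date = date(2024, 3, 15)  # timestamp of the clinical note
print(normalize_date("three days ago", doc_date))  # 2024-03-12
print(normalize_date("last year", doc_date))       # 2023-03-15
```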
**Deduplication and Reasoning**: Raw extracted data contains massive redundancy—a three-day hospital stay might generate 30-40 progress notes with repeated symptom mentions. The system uses LLM-based reasoning to merge, deduplicate, and resolve conflicting information. This includes selecting the most specific diagnosis from multiple mentions, averaging repeated measurements, and carrying forward uncertainty appropriately. The result is a patient knowledge graph with the actual 4-6 diagnoses rather than 40-60 mentions.
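The deterministic half of that merge step can be sketched as follows; the LLM-based conflict resolution described above would sit on top of logic like this. The SNOMED IDs shown (44054006 for type 2 diabetes, 271649006 for systolic blood pressure) are real, everything else is illustrative:

```python
# Group mentions by normalized concept, keep the most specific surface
# form, and average repeated numeric measurements.
from collections import defaultdict
from statistics import mean

mentions = [  # (snomed_id, surface_text, optional_numeric_value)
    ("44054006", "diabetes", None),
    ("44054006", "type 2 diabetes mellitus", None),   # more specific mention
    ("271649006", "systolic blood pressure", 142.0),
    ("271649006", "systolic blood pressure", 138.0),  # repeat -> average
]

grouped = defaultdict(list)
for concept_id, text, value in mentions:
    grouped[concept_id].append((text, value))

merged = {}
for concept_id, items in grouped.items():
    values = [v for _, v in items if v is not None]
    # heuristic: longest surface form as a proxy for "most specific"
    best_text = max((t for t, _ in items), key=len)
    merged[concept_id] = (best_text, mean(values) if values else None)

print(merged)
# {'44054006': ('type 2 diabetes mellitus', None),
#  '271649006': ('systolic blood pressure', 140.0)}
```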
## LLM Architecture and Performance
John Snow Labs uses their own fine-tuned medical LLMs throughout the pipeline, with three main application areas:
- Information extraction and understanding of unstructured text
- Reasoning for merging, deduplication, and summarization
- Query understanding and SQL generation
These models have been benchmarked against GPT-4 in blind, randomized evaluations by practicing medical doctors across clinical text summarization, clinical information extraction, and medical question answering. The results showed John Snow Labs models outperforming GPT-4 at nearly a 2:1 ratio across factuality, clinical relevancy, and conciseness dimensions. This performance advantage is attributed to fine-tuning on domain-specific data by medical professionals for specific clinical tasks.
## Natural Language Query System
The query interface is designed for clinical end users (doctors, nurses, administrators) rather than data analysts, enabling questions like "show patients who were diagnosed with back pain and had spinal fusion." The system understands that "back pain" maps to multiple ICD-10 and SNOMED codes, and "spinal fusion" encompasses several procedural codes.
However, the architecture explicitly avoids treating this as a simple text-to-SQL problem. Key learnings include:
**Accuracy Challenges**: Healthcare queries are far more complex than typical text-to-SQL benchmarks. A simple query like "find all diabetic patients" can translate to multi-page SQL statements requiring joins across patient tables, clinical event tables, diagnosis tables, and terminology hierarchies.
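A condensed sketch of just the hierarchy-expansion portion, assuming standard OMOP tables (`concept_ancestor` is part of the public spec; the concept ID is a parameter resolved from the vocabulary). Real production queries layer visit, date, and source-system joins on top of this:

```python
# Why "find all diabetic patients" is not a one-liner: the diagnosis must
# be expanded through the terminology hierarchy before touching patients.
FIND_DIABETIC_PATIENTS = """
SELECT DISTINCT p.person_id
FROM person p
JOIN condition_occurrence co
  ON co.person_id = p.person_id
JOIN concept_ancestor ca
  ON ca.descendant_concept_id = co.condition_concept_id
WHERE ca.ancestor_concept_id = :diabetes_concept_id
  -- descendants of 'diabetes mellitus' cover type 1, type 2, gestational,
  -- and every more specific subtype recorded in the data
"""
```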
**Consistency Requirements**: The system must return identical results for identical queries. Inconsistent responses would quickly destroy user trust—doctors cannot be told to "just ask again" if they don't like an answer.
**Performance Optimization**: Naively translating natural language to SQL could generate catastrophic queries (e.g., `SELECT *` scans over 200-million-row tables). The system uses indices, materialized views, and caching, with the LLM fine-tuned to generate optimized query patterns rather than arbitrary SQL.
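A hedged sketch of that pattern, assuming a Postgres-style OMOP store (all names are illustrative): precompute a materialized view once, then serve repeated template queries from it with a result cache so identical questions never fall back to raw table scans.

```python
# Materialized view + cached template query instead of ad hoc SQL.
import functools
import sqlalchemy as sa

engine = sa.create_engine("postgresql://analyst@omop-db/cdm")  # placeholder DSN

CREATE_VIEW = sa.text("""
CREATE MATERIALIZED VIEW IF NOT EXISTS patient_condition_summary AS
SELECT person_id, condition_concept_id,
       MIN(condition_start_date) AS first_seen,
       MAX(condition_start_date) AS last_seen
FROM condition_occurrence
GROUP BY person_id, condition_concept_id
""")

with engine.begin() as conn:
    conn.execute(CREATE_VIEW)  # refresh scheduling omitted in this sketch

@functools.lru_cache(maxsize=1024)
def patients_with_concept(concept_id: int) -> tuple[int, ...]:
    """Identical query, identical cached answer; never a raw table scan."""
    with engine.connect() as conn:
        rows = conn.execute(
            sa.text("SELECT person_id FROM patient_condition_summary "
                    "WHERE condition_concept_id = :cid"),
            {"cid": concept_id},
        )
        return tuple(r[0] for r in rows)
```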
The solution is an AI agent architecture with multiple steps (a simplified sketch follows the list):
- A RAG component that maps incoming queries to pre-built, optimized query templates
- A terminology service that resolves medical terms to appropriate codes (SNOMED, NDC, etc.)
- A schema adjustment tool that modifies queries based on the specific request (e.g., replacing "by age" with "by gender")
- Result aggregation and user-friendly presentation
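A simplified, hypothetical sketch of that pipeline; every component is a toy stand-in (keyword overlap instead of a vector index, a hardcoded terminology table), and the schema-adjustment and presentation steps are elided:

```python
# Toy agent loop: template retrieval (RAG stand-in) + terminology resolution.
TEMPLATES = {
    "patients diagnosed with {condition}":
        "SELECT person_id FROM condition_occurrence "
        "WHERE condition_concept_id IN :codes",
}

TERMINOLOGY = {"back pain": [999001, 999002]}  # placeholder code set

def retrieve_template(question: str) -> str:
    """Stand-in for RAG: pick the template sharing the most words with the question."""
    q_words = set(question.lower().split())
    best_key = max(TEMPLATES, key=lambda k: len(set(k.split()) & q_words))
    return TEMPLATES[best_key]

def resolve_terms(question: str) -> list[int]:
    """Stand-in for the terminology service."""
    q = question.lower()
    return [c for term, codes in TERMINOLOGY.items() if term in q for c in codes]

def answer(question: str) -> dict:
    sql = retrieve_template(question)    # step 1: map query to an optimized template
    codes = resolve_terms(question)      # step 2: resolve medical terms to codes
    return {"sql": sql, "codes": codes}  # steps 3-4 (adjustment, presentation) elided

print(answer("show patients who were diagnosed with back pain"))
```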
**Explainability**: Perhaps the most critical production requirement is explaining results to users. Different stakeholders have different definitions of the same terms—"diabetic patient" could mean having an ICD-10 code in the EHR, taking insulin, or having two abnormal HbA1c tests in six months. The system shows both the business logic used and the provenance of specific patient inclusions, enabling users to understand and challenge results.
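One way to picture this is the payload an answer might carry. The structure below is hypothetical, but it captures the two layers the presentation calls out: the business-logic definition applied, and per-patient provenance.

```python
# Hypothetical explanation payload attached to a query result.
explanation = {
    "question": "find all diabetic patients",
    "definition_used": "ICD-10 E11.* in EHR OR two abnormal HbA1c (> 6.5%) within 6 months",
    "patients": [
        {
            "person_id": 1042,
            "matched_rule": "two abnormal HbA1c within 6 months",
            "evidence": [
                {"source": "lab_feed", "date": "2024-01-10", "value": "HbA1c 7.1%"},
                {"source": "lab_feed", "date": "2024-04-02", "value": "HbA1c 6.9%"},
            ],
        },
    ],
}
```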
## Clinical Value Proposition
The presentation argues that while users initially adopt the system for convenience (enabling broader access to data, reducing dependency on data analyst teams), the sustained value comes from accuracy improvements. Studies show that relevant ICD-10 codes are missing for 30-50% of patients when only structured data is considered. By combining structured EHR data, unstructured clinical notes, and FHIR resources, the platform captures information that would otherwise be lost—medication history mentioned verbally but not recorded in the EMR, out-of-network procedures, screenings documented only in radiology reports.
This has direct implications for clinical decision support, risk adjustment, revenue cycle optimization (HCC coding), and population health analytics. Case studies from Rush (oncology patient timelines) and West Virginia University (HCC coding for revenue cycle) are referenced as real-world deployments.
## Deployment Model
The platform is licensed by server rather than by user, patient count, or modality. Implementation involves a 12-week project where John Snow Labs partners with the customer team to: install infrastructure, configure security and scaling, integrate data sources (from S3 folders to Epic Caboodle to FHIR APIs to DICOM images), customize terminology mappings and extraction models, and optimize queries for the organization's specific needs. The system runs on-premises or on AWS, Azure, Databricks, Snowflake, or Oracle, with all intellectual property (custom models, workflows, extracted data) belonging to the customer.
## Limitations and Considerations
While the presentation is from a vendor perspective, several practical challenges are acknowledged: the complexity of healthcare terminology and the impossibility of universal definitions, the need for ongoing customization based on organizational needs, and the inherent messiness of real-world clinical data. The emphasis on explainability reflects an understanding that AI systems in clinical settings must be transparent and auditable rather than black boxes.