Company
Dandelion Health
Title
Healthcare NLP Pipeline for HIPAA-Compliant Patient Data De-identification
Industry
Healthcare
Year
2023
Summary (short)
Dandelion Health developed a sophisticated de-identification pipeline for processing sensitive patient healthcare data while maintaining HIPAA compliance. The solution combines John Snow Labs' Healthcare NLP with custom pre- and post-processing steps to identify and transform protected health information (PHI) in free-text patient notes. Their approach includes risk categorization by medical specialty, context-aware processing, and innovative "hiding in plain sight" techniques to achieve high-quality de-identification while preserving data utility for medical research.
## Overview

Dandelion Health is a healthcare data company that partners with leading health systems to provide safe, ethical access to curated de-identified clinical data. Their mission is to catalyze healthcare innovation by building what they describe as the largest AI-ready training and validation dataset in the world, covering millions of patients with comprehensive longitudinal multimodal clinical data. Ross Brier, Head of Engineering at Dandelion Health, presented this case study at the NLP Summit, detailing their approach to de-identifying free-text patient notes, a critical and non-trivial challenge in healthcare data processing.

The core problem Dandelion faces is balancing two competing concerns: if de-identification is not aggressive enough, there are significant HIPAA privacy risks; if it is too aggressive, important research information is lost. The goal is to "have your cake and eat it too": maintain patient privacy while preserving the data's utility for medical research and clinical AI development.

## Regulatory Framework and Compliance

The work is governed by the HIPAA Privacy Rule, which protects all individually identifiable health information across 18 defined identifiers (names, dates, geographic data, Social Security numbers, and so on). When personally identifiable information (PII) is combined with healthcare information, it becomes protected health information (PHI). HIPAA provides two de-identification methods: formal determination by a qualified expert, or removal of the specified individual identifiers. Dandelion uses the expert determination approach, which allows them to explicitly measure risk and perform more nuanced data transformations rather than simple blanket removal.

## Infrastructure and Security Architecture

Dandelion's infrastructure demonstrates thoughtful production engineering for handling sensitive healthcare data. For each partner hospital system, they maintain a dedicated account within the hospital's AWS organization. The architecture is designed around zero-trust principles; critically, the AWS account has no internet access whatsoever. This air-gapped environment ensures that all processes either use AWS services within the AWS private network or run as containerized applications on AWS Elastic Container Service.

The de-identification pipeline itself operates as follows: de-identification requests are sent to AWS SQS (Simple Queue Service), which triggers tens or hundreds of Lambda jobs to perform the actual transformations. These transformations include redaction, hashing, or replacement of PHI. This serverless, queue-based architecture allows scalable parallel processing of large volumes of clinical notes.

Quality assurance is built into the process: QA team members from both Dandelion and the hospital system confirm that de-identification was performed correctly before any data moves to the de-identified data store for egress. If verification fails, the transformed data is discarded and the raw data is reprocessed.

## The Challenge of Free Text De-identification

While tabular data de-identification is relatively straightforward (a column either contains non-PHI and remains unchanged, or undergoes a specific transform such as redacting SSN columns or shifting dates), free-text patient notes present significantly greater challenges. The entire note must be verified as PHI-free, with PHI words and phrases redacted or masked.
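To make the contrast concrete, here is a minimal sketch of the kind of column-level transform that keeps the tabular case simple: blanket redaction of an SSN column, a stable per-patient date shift, and pseudonymized patient IDs. The column names, hashing scheme, and shift range are illustrative assumptions rather than Dandelion's actual implementation.

```python
import hashlib

import pandas as pd


def shift_days_for_patient(patient_id: str, salt: str, max_shift: int = 30) -> int:
    """Derive a stable, per-patient day shift so chronological ordering is preserved."""
    digest = hashlib.sha256((salt + patient_id).encode()).hexdigest()
    return int(digest, 16) % max_shift + 1


def deidentify_table(df: pd.DataFrame, salt: str) -> pd.DataFrame:
    """Column-level transforms: redact a known PHI column, jitter dates, pseudonymize IDs."""
    out = df.copy()
    out["ssn"] = "[REDACTED]"  # blanket redaction of a column known to contain PHI
    shifts = out["patient_id"].map(lambda pid: shift_days_for_patient(pid, salt))
    out["admit_date"] = pd.to_datetime(out["admit_date"]) + pd.to_timedelta(shifts, unit="D")
    out["patient_id"] = out["patient_id"].map(
        lambda pid: hashlib.sha256((salt + pid).encode()).hexdigest()[:16]  # stable pseudonym
    )
    return out
```

Because the sensitive columns are known ahead of time, the transform is mechanical; free text offers no such guarantees, which is where the rest of the pipeline comes in.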
The presentation honestly acknowledges that simply running Healthcare NLP and scrubbing detected PHI is insufficient. The NLP requires context for detecting PHI, and sometimes that context is problematic, not due to any fault of the NLP itself or the hospital system, but due to the messy nature of real-world clinical documentation. Several real-world edge cases were highlighted:

**PDF Conversion Issues**: Some notes stored line-by-line from PDF documents may include page footers containing patient last names and dates. When the lines are stitched together, a word like "DOE" in all caps could be a patient name but reads like an acronym (Date of Event, Date of Expiration, Department of Education).

**ASCII Table Formatting**: Some patient notes contain data stored as ASCII tables with plus signs, minuses, and pipes as delimiters. Identifying a date of birth becomes much harder when it appears as `Name | DOB | Weight` followed by `John Doe | 5/10/1950` rather than an explicit "Date of Birth: 5/10/1950" format.

**Helpful Structures**: Conversely, some note structures aid de-identification: notes with clear headings and subheadings make it easier to identify sections to keep or remove, and templated clinician forms tend to provide cleaner context for NLP processing.

## The Multi-Step De-identification Process

Dandelion's solution adds three critical preprocessing steps before running Healthcare NLP:

**Step 1 - Modality Categorization**: Notes are categorized by type (radiology reports, echocardiogram narratives, progress notes, etc.). Radiology work, for example, doesn't involve direct patient engagement, so those reports are broadly lower risk for PHI; they contain details like mass sizes on X-rays rather than personal patient backgrounds. Procedure documents requiring descriptions ("61-year-old male") inherently contain PHI details, and manually entered patient interviews are high risk by nature.

**Step 2 - Risk Level Determination**: Even with assumptions about modality risk, Dandelion performs both manual and automated confirmation. Just because patient names typically don't appear in radiology reports doesn't mean none of them have PHI-containing headers, footers, or embedded PDF data.

**Step 3 - Risk Reduction Strategies**: Category-specific preprocessing includes redacting headers and footers whose information is already available in the EHR (names, addresses, dates), extracting embedded tables into separate data objects, and setting aside particularly tricky notes.

Several interesting edge cases emerged in this work:

- **Obstetrics notes**: Most age-related information uses phrases like "61 years old," but obstetrics notes include phrases like "30 weeks" or "10 days" that imply fetal or newborn age, requiring extra care for non-standard age expressions.
- **CPT codes vs. ZIP codes**: Five-digit procedure codes for billing closely resemble ZIP codes. For California hospital systems, five-digit numbers starting with 9 are broadly assumed to be ZIP codes; outside California, they're assumed to be CPT codes. Additional context checking is still required; for example, 91311 might be COVID-related (a CPT code) or might follow an address (a ZIP code).

## Post-processing: Hiding in Plain Sight (HIPS)

After running Healthcare NLP, Dandelion applies a post-processing technique called "Hiding in Plain Sight" (HIPS). For each redacted token, the system replaces it with a similar token from a dictionary. For example, the name "Bob" might be replaced with "Aaron" 5% of the time, "Antonio" 6%, "Ben" 10%, and so on.
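A minimal sketch of this kind of weighted surrogate substitution follows; the dictionary reuses only the example weights quoted above, and everything else (function name, casing logic, seeding) is a simplifying assumption rather than Dandelion's implementation.

```python
import random

# Illustrative surrogate dictionary using the relative weights from the example above;
# the real dictionary and weighting scheme are not described in the source.
NAME_SURROGATES = {"Aaron": 0.05, "Antonio": 0.06, "Ben": 0.10}


def hips_replace(token: str, surrogates: dict[str, float], rng: random.Random) -> str:
    """Swap a detected PHI token for a weighted-random surrogate of the same entity type."""
    names = list(surrogates)
    weights = list(surrogates.values())
    replacement = rng.choices(names, weights=weights, k=1)[0]
    return replacement.upper() if token.isupper() else replacement  # keep simple casing


rng = random.Random(7)  # seeded only to make the illustration reproducible
print(hips_replace("Bob", NAME_SURROGATES, rng))  # e.g. "Ben"
```

A production system would presumably maintain a separate surrogate dictionary for each entity type (names, dates, locations) rather than the single name list shown here.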
The percentages intentionally don't sum to 100% because some instances will be missed. The key insight is that if a name is missed by the NLP, it won't be obvious to an attacker, because the real name now sits among the cycled replacement names. The attacker would need to know with certainty that the original was "Bob" and not one of the surrogates, a much harder identification task. Based on cited research from the past 10 years, this approach reduces re-identification risk by approximately 90%: if PHI detection recall is originally 95%, HIPS pushes effective recall upwards of 99%.

The presentation showed a concrete example: without HIPS, missed names like "Lynn" and "White" and the date "May 10th" are obvious; with HIPS, these values mix with other circulated names (Jones, How) and dates (May 28th), making the misses non-obvious.

## Validation and Quality Assurance

Success is measured primarily through recall: true positive PHI redacted divided by (true positives + false negatives). This measurement is performed manually by clinically trained human analysts who review sample result sets from the automated de-identification. If the process misses more than a couple of dates or leaves a Social Security number in the dataset, the team revamps the de-identification approach. Cross-validation ensures formatting consistency across years and subcategories to avoid overfitting to particular data subsets.

An interesting precision concern was also raised. Date jittering shifts all dates by a patient-specific number of days to maintain chronological event ordering. However, if a phrase like "normal value range of 1 to 2" is written as "1-2", mistakenly flagged as a date, and jittered to "1-5", an attacker who knows the original range could reverse-engineer the shift and back-calculate all other dates for that patient. Thus precision (avoiding false positives that reveal the masking mechanism) matters alongside recall.

The manual review process requires at least two analysts per document. Discrepancies are triaged by a third reviewer, which may result in analyst retraining or replacement for low-quality work, or in additional process clarification. If reviewer quality is good and the results meet thresholds, final reports are compiled with recall figures for each HIPAA identifier type (names found/missed, dates found/missed, locations, provider names, etc.) and shared with hospital system partners.

## Production Pipeline Summary

The complete end-to-end process consists of:

- Categorizing each note based on modality
- Determining the risk level for each category
- Performing text preprocessing for the modality
- Running John Snow Labs Healthcare NLP to flag identifiers
- Scrubbing all detected PHI instances
- Performing text post-processing with HIPS for additional masking
- Running validation to confirm successful de-identification through thorough QA/QC

Through this comprehensive approach, Dandelion maintains high data utility for research clients while ensuring patient privacy and regulatory compliance. The case study demonstrates how production NLP systems for sensitive healthcare data require significant engineering beyond the core model, including secure infrastructure, domain-specific preprocessing, post-processing techniques to handle model limitations, and rigorous human-in-the-loop validation processes.
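To connect the validation reporting described above to something concrete, the short sketch below tabulates per-identifier recall figures of the kind compiled into the final reports for hospital partners; the identifier types and counts are invented purely for illustration and are not Dandelion's numbers.

```python
def recall(found: int, missed: int) -> float:
    """Recall = true positives / (true positives + false negatives)."""
    total = found + missed
    return found / total if total else 1.0


# Hypothetical manual-review tallies per HIPAA identifier type, for illustration only.
review_counts = {
    "NAME": {"found": 412, "missed": 3},
    "DATE": {"found": 1190, "missed": 8},
    "LOCATION": {"found": 97, "missed": 1},
}

for identifier, counts in review_counts.items():
    print(f"{identifier}: recall = {recall(counts['found'], counts['missed']):.3f}")
```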
