## Overview and Business Context
Myriad Genetics is a provider of genetic testing and precision medicine solutions serving healthcare providers and patients globally. Its Revenue Engineering Department processes thousands of healthcare documents daily across three major divisions: Women's Health, Oncology, and Mental Health. The operational challenge centers on classifying incoming documents into specific categories, including Test Request Forms, Lab Results, Clinical Notes, and insurance documentation, to automate prior authorization workflows. Once classified, documents are routed to the appropriate external vendors based on their identified document class. The system must also extract key information, including insurance details, patient information, and test results, to determine Medicare eligibility and support downstream clinical and administrative processes.
The existing infrastructure combined Amazon Textract for optical character recognition (OCR) with Amazon Comprehend for document classification. While this solution achieved 94% classification accuracy, it suffered from significant operational constraints. At 3 cents per page, processing cost roughly $15,000 per month per business unit, a substantial operational burden at scale. Classification latency averaged 8.5 minutes per document, creating bottlenecks that delayed downstream prior authorization workflows. Most critically, information extraction remained entirely manual, requiring contextual understanding to differentiate nuanced clinical distinctions such as "is metastatic" versus "is not metastatic" and to locate information such as insurance numbers and patient data across varying document formats. In the Women's Health business unit alone, this manual burden required up to 10 full-time employees contributing 78 hours daily to extraction tasks.
## Solution Architecture and Production Implementation
Myriad Genetics partnered with the AWS Generative AI Innovation Center to deploy AWS's open-source GenAI Intelligent Document Processing Accelerator. This accelerator provides a scalable, serverless architecture designed to convert unstructured documents into structured data. The architecture processes multiple documents in parallel through configurable concurrency limits, preventing downstream service overload while maintaining throughput. A built-in evaluation framework allows users to provide expected outputs through the user interface and evaluate generated results, enabling iterative customization of configuration and accuracy improvement.
The accelerator offers three pre-built deployment patterns optimized for different workloads with varying configurability, cost, and accuracy requirements. Pattern 1 uses Amazon Bedrock Data Automation, a fully managed service offering rich out-of-the-box features with straightforward per-page pricing. Pattern 2 uses Amazon Textract and Amazon Bedrock with Amazon Nova, Anthropic's Claude, or custom fine-tuned Amazon Nova models, providing flexibility for complex documents requiring custom logic. Pattern 3 combines Amazon Textract, Amazon SageMaker with fine-tuned classification models, and Amazon Bedrock for extraction, ideal for documents requiring specialized classification capabilities.
For Myriad's use case, Pattern 2 proved most suitable, meeting the critical requirement of low cost while offering flexibility to optimize accuracy through prompt engineering and LLM selection. This pattern provides no-code configuration capabilities, allowing customization of document types, extraction fields, and processing logic through configuration files editable in the web UI. Myriad customized definitions of document classes, key attributes and their definitions per document class, LLM choices, LLM hyperparameters, and both classification and extraction LLM prompts via Pattern 2's configuration file.
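As a rough illustration of what that configuration covers, a Pattern 2 setup might look like the sketch below, shown as a Python dict for readability. The field names and model IDs are illustrative assumptions, not the accelerator's actual configuration schema.

```python
# Hypothetical sketch of Pattern 2 settings: document classes, per-class
# attribute definitions, and model choices for classification vs extraction.
# Key names and model IDs are assumptions for illustration only.
pattern2_config = {
    "classes": [
        {
            "name": "Test Request Form",
            "description": "Physician-submitted order form containing patient "
                           "medical history, NOT current lab measurements.",
            "attributes": [
                {"name": "patient_name", "description": "Full legal name of the patient"},
                {"name": "insurance_id", "description": "Member ID from the insurance section"},
            ],
        },
        {
            "name": "Lab Results",
            "description": "Numerical results organized in tables with "
                           "reference ranges and units.",
            "attributes": [
                {"name": "test_date", "description": "Collection or report date"},
            ],
        },
    ],
    "classification": {
        "model": "us.amazon.nova-pro-v1:0",       # model chosen after benchmarking
        "temperature": 0.0,
    },
    "extraction": {
        "model": "us.amazon.nova-premier-v1:0",   # heavier model for complex extraction
        "temperature": 0.0,
    },
}

class_names = [c["name"] for c in pattern2_config["classes"]]
```

The point of this shape is that accuracy tuning becomes a configuration change rather than a code change: adding a class, refining an attribute definition, or swapping the LLM is an edit in the web UI.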
In production, Myriad integrated this solution into their existing event-driven architecture. Document ingestion begins when incoming order events trigger document retrieval from source document management systems, with cache optimization for previously processed documents to reduce redundant processing. Concurrency management is handled through DynamoDB tracking of concurrent AWS Step Functions jobs, while Amazon Simple Queue Service (SQS) queues files that exceed concurrency limits for orderly document processing. Text extraction leverages Amazon Textract to extract text, layout information, tables, and forms from normalized documents. Classification follows, where the configured LLM analyzes extracted content based on customized document classification prompts provided in the config file and assigns documents to appropriate categories. Key information extraction then occurs, with the configured LLM extracting medical information using extraction prompts from the config file. Finally, the pipeline formats results in a structured manner and delivers them to Myriad's Authorization System via RESTful operations.
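The concurrency gate described above can be sketched as follows. The table name, key schema, and condition expression are hypothetical, not Myriad's actual implementation, but the pattern, an atomic conditional increment in DynamoDB with SQS as the overflow path, is the standard one.

```python
# Sketch of the admission decision: process a document directly while the
# DynamoDB-tracked count of running Step Functions jobs is under the limit,
# and queue it to SQS otherwise. All names here are illustrative.

def build_slot_claim(table: str, limit: int) -> dict:
    """DynamoDB UpdateItem request that atomically claims a concurrency slot.

    The ConditionExpression makes the increment fail once `limit` jobs are
    already running, so the caller falls back to the SQS path.
    """
    return {
        "TableName": table,
        "Key": {"pk": {"S": "concurrency#idp"}},
        "UpdateExpression": "ADD running_jobs :one",
        "ConditionExpression": "attribute_not_exists(running_jobs) OR running_jobs < :limit",
        "ExpressionAttributeValues": {
            ":one": {"N": "1"},
            ":limit": {"N": str(limit)},
        },
    }

def route_document(claim_succeeded: bool) -> str:
    """Start the Step Functions execution if we won a slot, else enqueue."""
    return "start_execution" if claim_succeeded else "send_to_sqs"
```

In practice the claim would be sent via `boto3.client("dynamodb").update_item(**build_slot_claim(...))`, with a `ConditionalCheckFailedException` triggering the SQS fallback.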
## Document Classification: Prompt Engineering and Model Selection
While Myriad's existing solution achieved 94% accuracy, misclassifications occurred due to structural similarities, overlapping content, and shared formatting patterns across document types. This semantic ambiguity made distinguishing between similar documents challenging. The team guided Myriad on prompt optimization techniques leveraging LLM contextual understanding capabilities. This approach moved beyond simple pattern matching to enable semantic analysis of document context and purpose, identifying distinguishing features that human experts recognize but previous automated systems missed.
AI-driven prompt engineering provided a systematic route to classification improvement. The team provided document samples from each class to Anthropic's Claude 3.7 Sonnet on Amazon Bedrock with model reasoning enabled, a feature allowing the model to demonstrate its step-by-step analysis process. The model identified distinguishing features between similar document classes, which Myriad's subject matter experts then refined and incorporated into the GenAI IDP Accelerator's Pattern 2 configuration file for document classification prompts. This approach demonstrates a practical application of LLM capabilities to improve prompt design through automated feature discovery.
Format-based classification strategies proved particularly effective for documents sharing comparable content but differing in structure. The team used document structure and formatting as key differentiators, enabling classification models to recognize format-specific characteristics such as layout structures, field arrangements, and visual elements. For example, lab reports and test results both contain patient information and medical data, but lab reports display numerical values in tabular format while test results follow a narrative format. The prompt instruction specified: "Lab reports contain numerical results organized in tables with reference ranges and units. Test results present findings in paragraph format with clinical interpretations." This explicit guidance on structural differences improved the model's ability to make accurate classifications based on formatting cues.
Negative prompting techniques addressed confusion between similar documents by explicitly instructing the model what classifications to avoid. This approach added exclusionary language to classification prompts, specifying characteristics that should not be associated with each document type. Initially, the system frequently misclassified Test Request Forms as Test Results due to confusion between patient medical history and lab measurements. Adding a negative prompt—"These forms contain patient medical history. DO NOT confuse them with test results which contain current/recent lab measurements"—to the TRF definition improved classification accuracy by 4%. This technique represents an important LLMOps practice for production systems: explicitly guiding models away from known error patterns improves reliability in operational environments.
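Taken together, the format cues and negative prompting amount to a block of class definitions embedded in the classification prompt. Below is a minimal sketch that assembles such a prompt; the class definitions follow the wording quoted above, while the surrounding template is an illustrative assumption.

```python
# Class definitions combining format-based cues and negative prompting,
# as described in the article. The prompt template itself is a sketch.
CLASS_DEFINITIONS = {
    "Test Request Form": (
        "These forms contain patient medical history. DO NOT confuse them "
        "with test results which contain current/recent lab measurements."
    ),
    "Lab Report": (
        "Lab reports contain numerical results organized in tables with "
        "reference ranges and units."
    ),
    "Test Results": (
        "Test results present findings in paragraph format with clinical "
        "interpretations."
    ),
}

def build_classification_prompt(document_text: str) -> str:
    """Assemble a single-label classification prompt over the extracted text."""
    guidance = "\n".join(f"- {name}: {rule}" for name, rule in CLASS_DEFINITIONS.items())
    return (
        "Classify the document below into exactly one of these classes.\n"
        f"{guidance}\n\n"
        f"Document:\n{document_text}\n\n"
        "Answer with the class name only."
    )
```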
Model selection represented a critical optimization decision for cost and performance at scale. The team conducted comprehensive benchmarking using the GenAI IDP Accelerator's evaluation framework, testing four foundation models: Amazon Nova Lite, Amazon Nova Pro, Amazon Nova Premier, and Anthropic's Claude 3.7 Sonnet. The evaluation used 1,200 healthcare documents across three document classes (Test Request Forms, Lab Results, and Insurance), assessing each model on three critical metrics: classification accuracy, processing latency, and cost per document. The accelerator's cost tracking enabled direct comparison of operational expenses across model configurations, ensuring performance improvements translated into measurable business value at scale.
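A benchmark of this shape reduces to computing three aggregates per model over the labeled set. A minimal sketch, assuming a hypothetical per-document record format:

```python
# Aggregate the three benchmark metrics per model: classification accuracy,
# average latency, and average cost per document. The record format
# (predicted/actual/latency_s/cost_usd) is an assumption for illustration.

def summarize(results: list[dict]) -> dict:
    """results: one record per document with predicted and actual class,
    latency in seconds, and cost in USD."""
    n = len(results)
    correct = sum(r["predicted"] == r["actual"] for r in results)
    return {
        "accuracy": correct / n,
        "avg_latency_s": sum(r["latency_s"] for r in results) / n,
        "avg_cost_usd": sum(r["cost_usd"] for r in results) / n,
    }
```

Running `summarize` once per candidate model over the same 1,200 labeled documents gives the side-by-side comparison the team used to pick Amazon Nova Pro.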
The evaluation demonstrated that Amazon Nova Pro achieved optimal balance for Myriad's use case. Transitioning from Amazon Comprehend to Amazon Nova Pro with optimized prompts for document classification yielded significant improvements: classification accuracy increased from 94% to 98%, processing costs decreased by 77%, and processing speed improved by 80%, reducing classification time from 8.5 minutes to 1.5 minutes per document. This result illustrates an important LLMOps principle: matching the right model to specific task requirements often delivers better outcomes than defaulting to the most powerful or expensive option.
## Key Information Extraction: Multimodal Approaches and Advanced Reasoning
Myriad's manual information extraction process created substantial operational bottlenecks and scalability constraints, requiring up to 10 full-time employees contributing 78 hours daily in the Women's Health unit alone. Automating healthcare key information extraction presented distinct challenges: checkbox fields required distinguishing between different marking styles (checkmarks, X's, handwritten marks); documents contained ambiguous visual elements like overlapping marks or content spanning multiple fields; extraction needed contextual understanding to differentiate clinical distinctions and locate information across varying document formats.
Enhanced OCR configuration addressed checkbox recognition challenges. The team enabled Amazon Textract's specialized TABLES and FORMS features on the GenAI IDP Accelerator portal to improve OCR discrimination between selected and unselected checkbox elements. These features enhanced the system's ability to detect and interpret marking styles found in medical forms. Beyond OCR configuration, the team incorporated visual cues into extraction prompts, updating prompts with instructions such as "look for visible marks in or around the small square boxes (✓, x, or handwritten marks)" to guide the language model in identifying checkbox selections. This combination of enhanced OCR capabilities and targeted prompting improved checkbox extraction performance in medical forms.
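Enabling these features is a small change to the Textract AnalyzeDocument request. The request shape below is Textract's real API; the helper and bucket/key names are illustrative. With FORMS enabled, checkbox state comes back as SELECTION_ELEMENT blocks carrying a SELECTED or NOT_SELECTED status.

```python
# Build an AnalyzeDocument request with the TABLES and FORMS feature types
# enabled, so table cells, key-value pairs, and checkbox selection elements
# are returned alongside the raw text. Helper name is illustrative.

def build_analyze_request(bucket: str, key: str) -> dict:
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["TABLES", "FORMS"],
    }

# Sent with: boto3.client("textract").analyze_document(**build_analyze_request(bucket, key))
# Checkbox blocks in the response have BlockType "SELECTION_ELEMENT" and a
# SelectionStatus of "SELECTED" or "NOT_SELECTED".
```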
However, configuring Textract and improving prompts alone proved insufficient for handling complex visual elements effectively. The team implemented a multimodal approach that sent both document images and extracted text from Textract to the foundation model, enabling simultaneous analysis of visual layout and textual content for accurate extraction decisions. This multimodal strategy represents a significant advancement over text-only approaches, allowing the model to resolve ambiguities by analyzing visual context alongside textual information.
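With Amazon Bedrock's Converse API, the page image and the Textract text can travel in a single message as separate content blocks. A sketch of the message construction follows; the content-block shape is Converse's real format, while the prompt wording is illustrative and the model invocation is omitted.

```python
# Build a single multimodal Converse message carrying both the page image
# and the OCR text, so the model can weigh visual layout against textual
# content. Prompt wording is illustrative.

def build_multimodal_message(page_png: bytes, extracted_text: str) -> dict:
    return {
        "role": "user",
        "content": [
            {"image": {"format": "png", "source": {"bytes": page_png}}},
            {"text": (
                "Using both the page image and the OCR text below, extract "
                "the requested fields. Look for visible marks in or around "
                "the small square boxes (checkmarks, x, or handwritten marks).\n\n"
                f"OCR text:\n{extracted_text}"
            )},
        ],
    }

# Sent with: bedrock_runtime.converse(modelId=..., messages=[build_multimodal_message(...)])
```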
Few-shot learning enhanced the multimodal approach by providing example document images paired with their expected extraction outputs to guide the model's understanding of various form layouts and marking styles. This technique presents a challenge in production LLM systems: multiple document image examples with their correct extraction patterns create lengthy LLM prompts, increasing both cost and latency. The team leveraged the GenAI IDP Accelerator's built-in integration with Amazon Bedrock's prompt caching feature to address this challenge. Prompt caching stores lengthy few-shot examples in memory for 5 minutes—when processing multiple similar documents within that timeframe, Bedrock reuses cached examples instead of reprocessing them, reducing both cost and processing time. This implementation demonstrates practical cost optimization in production LLM systems handling repetitive tasks.
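In the Converse API, a cache checkpoint is a content block placed after the static prefix: the identical few-shot examples before it are cached, and only the per-document content after it is reprocessed on each call. A minimal sketch, with illustrative block contents:

```python
# Place the static few-shot examples before a Bedrock prompt-cache checkpoint
# so documents processed within the cache window reuse them. The cachePoint
# block is Bedrock's prompt-caching marker; example contents are illustrative.

def build_cached_messages(few_shot_blocks: list[dict], document_block: dict) -> list[dict]:
    """Static examples go before the cachePoint; only the per-document
    content after it is reprocessed on each call."""
    return [
        {
            "role": "user",
            "content": [
                *few_shot_blocks,                      # identical on every call, so cacheable
                {"cachePoint": {"type": "default"}},   # everything above this is cached
                document_block,                        # the new document to extract
            ],
        }
    ]
```

The economics follow directly: when a batch of similar documents arrives within the cache lifetime, the lengthy example prefix is paid for once rather than per document.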
Despite improvements from the multimodal approach, challenges remained with overlapping and ambiguous tick marks in complex form layouts. To handle these cases, the team used Amazon Nova Premier and implemented chain-of-thought reasoning, having the model think through extraction decisions step by step using thinking tags. The prompt structure included:

> Analyze the checkbox marks in this form:
> 1. What checkboxes are present? [List all visible options]
> 2. Where are the marks positioned? [Describe mark locations]
> 3. Which marks are clear vs ambiguous? [Assess mark quality]
> 4. For overlapping marks: which checkbox contains most of the mark?
> 5. Are marks positioned in the center or touching edges? [Prioritize center positioning]

Additionally, reasoning explanations were included in the few-shot examples, demonstrating how conclusions were reached in ambiguous cases. This approach enabled the model to work through complex visual evidence and contextual clues before making final determinations, improving performance on ambiguous tick marks.
Testing across 32 document samples with varying complexity levels via the GenAI IDP Accelerator revealed that Amazon Textract with Layout, TABLES, and FORMS features enabled, paired with Amazon Nova Premier's advanced reasoning capabilities and the inclusion of few-shot examples, delivered the best results. The solution achieved 90% accuracy—matching human evaluator baseline accuracy—while processing documents in approximately 1.3 minutes each. This outcome is notable: the automated system matched human performance while providing consistent, scalable processing capabilities.
## Measurable Business Impact and Production Deployment
The solution delivered measurable improvements across multiple dimensions. For document classification, accuracy increased from 94% to 98% through prompt optimization for Amazon Nova Pro, including AI-driven prompt engineering, format-based classification strategies, and negative prompting. Classification costs fell by 77%, from 3.1 to 0.7 cents per page, by migrating from Amazon Comprehend to Amazon Nova Pro with optimized prompts. Classification time dropped by 80%, from 8.5 to 1.5 minutes per document, with Amazon Nova Pro providing a low-latency, cost-effective solution.
For the newly automated key information extraction, the system achieved 90% extraction accuracy, matching the baseline manual process. This accuracy resulted from combining Amazon Textract's document analysis capabilities, visual context learning through few-shot examples, and Amazon Nova Premier's reasoning for complex data interpretation. Processing costs of 9 cents per page and processing time of 1.3 minutes per document compare favorably to the manual baseline requiring up to 10 full-time employees working 78 hours daily per business unit.
Myriad planned a phased rollout beginning with document classification, launching the new classification solution in the Women's Health business unit, followed by the Oncology and Mental Health divisions. The solution is expected to realize up to $132K in annual savings in document classification costs. Beyond direct cost savings, it cuts each prior authorization submission by 2 minutes: specialists now complete orders in four minutes instead of six due to faster access to tagged documents. This improvement saves 300 hours monthly across 9,000 prior authorizations in Women's Health alone, equivalent to 50 hours per prior authorization specialist each month. These time savings translate to improved operational efficiency and the ability to handle increased document volumes without proportional increases in staffing.
## Critical Assessment and LLMOps Considerations
While the case study presents impressive results, several considerations warrant balanced assessment. The document originates from AWS marketing materials and naturally emphasizes positive outcomes. The 98% classification accuracy and 90% extraction accuracy represent significant improvements, though the extraction accuracy "matching human baseline" suggests that human performance on these tasks also sits at 90%—indicating inherent difficulty in the task rather than superhuman AI performance.
The cost comparison merits careful interpretation. The 77% cost reduction for classification compares Amazon Comprehend to Amazon Nova Pro, a transition between AWS services rather than a fundamental architectural change, and organizations using different baseline solutions might see different cost dynamics. The extraction cost of 9 cents per page, while substantially lower than manual processing, still represents a meaningful expense at scale: 9,000 documents monthly cost roughly $810 if documents are single pages, scaling linearly with page count (around $8,100 at an average of ten pages per document).
The solution's dependency on prompt engineering represents both a strength and potential operational risk. The extensive prompt optimization—including negative prompting, format-based classification, and chain-of-thought reasoning—produced excellent results but creates ongoing maintenance requirements. As document types evolve or new edge cases emerge, prompts may require updates. The case study doesn't address prompt versioning, monitoring for prompt drift, or governance processes for prompt updates in production.
Model selection strategy demonstrates sophisticated LLMOps practice: using Amazon Nova Pro for classification (where speed and cost matter most) and Amazon Nova Premier for complex extraction (where reasoning capability justifies higher costs). However, this multi-model approach increases operational complexity. The system requires managing two different model endpoints, potentially different prompt structures, and separate performance monitoring for each model's specific tasks.
The prompt caching strategy for few-shot examples represents intelligent cost optimization, but its effectiveness depends on document processing patterns. The 5-minute cache window works well for batch processing of similar documents but provides limited benefit for sporadic processing of diverse document types. Organizations with different processing patterns might see different cost savings from this technique.
The evaluation framework integrated into the GenAI IDP Accelerator deserves recognition as a critical LLMOps capability. The ability to provide expected outputs through the UI and iteratively evaluate results enabled the rapid optimization that produced these results. However, the case study provides limited detail on ongoing monitoring and evaluation in production. Initial evaluation on 1,200 documents for classification and 32 samples for extraction represents meaningful validation, but production monitoring over time will be essential to detect performance degradation or handling of novel document types.
The phased rollout strategy (starting with Women's Health, then Oncology and Mental Health) represents responsible production deployment, allowing validation in one business unit before broader expansion. However, the case study doesn't address how the system will handle domain-specific differences across these divisions, whether separate prompts or models will be required, or how the team will manage configuration drift across business units.
Integration with existing event-driven architecture through Step Functions, SQS, and DynamoDB demonstrates production-grade engineering, with proper concurrency management and caching optimization. This architecture provides scalability and reliability but also introduces operational dependencies—the system's performance depends on the entire pipeline, not just the LLM components.
Overall, this case study represents a sophisticated LLMOps implementation with measurable business value. The combination of strategic model selection, advanced prompt engineering, multimodal processing, and prompt caching demonstrates mature understanding of production LLM systems. The significant cost savings and processing time improvements validate the approach, though ongoing operational requirements for prompt maintenance, performance monitoring, and handling edge cases will determine long-term success.