Multilingual Document Processing Pipeline with Human-in-the-Loop Validation

A2I 2024

A case study on implementing a robust multilingual document processing system that combines Amazon Bedrock's Claude models with human review capabilities through Amazon A2I. The solution addresses the challenge of processing documents in multiple languages by using LLMs for initial extraction and human reviewers for validation, enabling organizations to efficiently process and validate documents across language barriers while maintaining high accuracy.

Industry

Tech

Overview

This case study, published by AWS in November 2024, presents a reference architecture for building production-ready multilingual document processing systems using large language models. The solution addresses a significant business challenge: multinational companies frequently receive invoices, contracts, and other documents from regions worldwide in languages such as Arabic, Chinese, Russian, or Hindi that traditional document extraction software cannot handle effectively. The global intelligent document processing (IDP) market context is cited as growing from $1,285 million in 2022 to a projected $7,874 million by 2028, indicating strong market demand for such solutions.

The architecture combines Amazon Bedrock (specifically Anthropic’s Claude 3 models) for multi-modal document understanding with Amazon Augmented AI (Amazon A2I) for human-in-the-loop validation, creating a robust pipeline that balances automation with accuracy requirements for sensitive business documents.

Architecture and Pipeline Design

The solution implements a six-stage document processing pipeline orchestrated by AWS Step Functions, demonstrating a well-structured approach to LLM-powered document workflows in production:

The Acquisition stage handles document ingestion from Amazon S3, with S3 event notifications triggering the pipeline. Initial document metadata is stored in Amazon DynamoDB to enable status tracking throughout the entire processing lifecycle. This state management approach is crucial for production systems where visibility into document processing status is essential for operations and debugging.
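This ingestion step can be sketched as a Lambda handler wired to the S3 event notification. This is an illustrative sketch, not code from the case study: the table name `doc-pipeline-status` and the attribute names are assumptions, and the real pipeline would also start the Step Functions execution here.

```python
import urllib.parse
from datetime import datetime, timezone

def build_status_item(bucket: str, key: str) -> dict:
    """Initial DynamoDB item for a newly ingested document, in the
    low-level attribute-value format the boto3 DynamoDB client expects."""
    return {
        "document_id": {"S": f"{bucket}/{key}"},
        "stage": {"S": "Acquisition#Running"},
        "ingested_at": {"S": datetime.now(timezone.utc).isoformat()},
    }

def handler(event, context):
    """Lambda entry point triggered by an S3 event notification."""
    import boto3  # deferred import so the pure helper above runs without AWS
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])
    boto3.client("dynamodb").put_item(
        TableName="doc-pipeline-status",  # illustrative table name
        Item=build_status_item(bucket, key),
    )
    # In the real pipeline, this is also where the Step Functions
    # execution for the document would be started.
```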

The Extraction stage represents the core LLM integration point. Documents are embedded into prompts alongside a JSON schema definition that specifies the expected output structure. The system uses Amazon Bedrock to invoke Anthropic’s Claude models for extraction. The results are stored as JSON in S3, providing an audit trail and enabling downstream processing stages.
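The case study performs this step through the Rhubarb framework, which wraps the model call. To show the shape of the request, the sketch below uses the raw boto3 Converse API instead; the schema contents, prompt wording, and choice of model ID are illustrative assumptions.

```python
import json

# Abridged stand-in schema; the case study's full invoice schema is richer.
EXAMPLE_SCHEMA = {
    "type": "object",
    "properties": {"invoice_number": {"type": "string"}},
}

def build_extraction_message(image_bytes: bytes, schema: dict) -> list:
    """Build a Converse-API message pairing the document image with an
    instruction to return JSON that matches the given schema."""
    return [{
        "role": "user",
        "content": [
            {"image": {"format": "png", "source": {"bytes": image_bytes}}},
            {"text": "Extract the fields defined by this JSON schema and "
                     "respond with JSON only:\n" + json.dumps(schema)},
        ],
    }]

# The actual invocation requires AWS credentials and Bedrock model access:
# import boto3
# client = boto3.client("bedrock-runtime")
# resp = client.converse(
#     modelId="anthropic.claude-3-sonnet-20240229-v1:0",
#     messages=build_extraction_message(png_bytes, EXAMPLE_SCHEMA),
# )
# extracted = json.loads(resp["output"]["message"]["content"][0]["text"])
```

The extracted JSON would then be written back to S3, as the pipeline's audit trail requires.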

The Custom Business Rules stage applies domain-specific validation logic to the extracted content. Examples include table format detection (such as identifying invoice transaction tables) or column validation (verifying that product codes contain valid values). This stage acknowledges that LLM extraction alone may not be sufficient for business compliance requirements.
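A minimal sketch of such rules, assuming an extracted-invoice shape with a `line_items` array (the field names and the set of valid product codes are invented for illustration):

```python
VALID_PRODUCT_CODES = {"SKU-100", "SKU-200", "SKU-300"}  # illustrative

def validate_line_items(extracted: dict) -> list:
    """Apply domain-specific checks to extracted invoice data; returns a
    list of human-readable violations (an empty list means it passed)."""
    errors = []
    items = extracted.get("line_items", [])
    if not items:
        errors.append("no transaction table detected")
    for i, item in enumerate(items):
        if item.get("product_id") not in VALID_PRODUCT_CODES:
            errors.append(f"line {i}: unknown product code {item.get('product_id')!r}")
        qty = item.get("quantity")
        price = item.get("unit_price")
        total = item.get("total")
        if None not in (qty, price, total) and abs(qty * price - total) > 0.01:
            errors.append(f"line {i}: quantity x unit_price does not match total")
    return errors
```

Documents that fail these checks could be flagged for mandatory human review rather than rejected outright.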

The Reshaping stage transforms the JSON output into a format compatible with Amazon A2I’s human review interface. This intermediate transformation step highlights the importance of data format standardization when integrating multiple services in an LLM pipeline.
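One plausible reshaping, assuming the custom A2I worker template iterates over a flat list of name/value fields (the target shape is an assumption, not the case study's actual format):

```python
def reshape_for_review(extracted: dict) -> dict:
    """Flatten nested extraction output into a field list that a custom
    A2I worker template can iterate over and render for correction."""
    fields = []

    def walk(prefix, value):
        if isinstance(value, dict):
            for k, v in value.items():
                walk(f"{prefix}.{k}" if prefix else k, v)
        elif isinstance(value, list):
            for i, v in enumerate(value):
                walk(f"{prefix}[{i}]", v)
        else:
            fields.append({"name": prefix, "value": value})

    walk("", extracted)
    return {"fields": fields}
```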

The Augmentation stage routes documents to human annotators via Amazon A2I for review and correction. Human reviewers use a custom ReactJS UI to efficiently review and validate extracted information against the original document.
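Routing a document into review uses the `sagemaker-a2i-runtime` client's `start_human_loop` call. A hedged sketch, where the flow-definition ARN and the keys inside `InputContent` are placeholders:

```python
import json
import uuid

def build_human_loop_request(document_s3_uri: str, review_payload: dict,
                             flow_definition_arn: str) -> dict:
    """Build the kwargs for start_human_loop. The payload keys
    ('sourceDocument', 'extractedFields') are illustrative and must match
    whatever the custom worker template expects."""
    return {
        "HumanLoopName": f"doc-review-{uuid.uuid4().hex[:12]}",
        "FlowDefinitionArn": flow_definition_arn,
        "HumanLoopInput": {
            "InputContent": json.dumps({
                "sourceDocument": document_s3_uri,
                "extractedFields": review_payload,
            })
        },
    }

# With AWS credentials configured:
# import boto3
# a2i = boto3.client("sagemaker-a2i-runtime")
# a2i.start_human_loop(**build_human_loop_request(
#     "s3://docs/inv.pdf", reshaped, flow_definition_arn))
```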

The Cataloging stage converts validated content into Excel workbooks for business team consumption, completing the end-to-end workflow.

LLM Integration Details

The architecture uses the Rhubarb Python framework for document understanding tasks with multi-modal LLMs. Rhubarb is described as a lightweight framework that simplifies interactions with Amazon Bedrock’s Claude V3 models. Several key technical decisions are noteworthy:

Document Format Handling: Since the Claude 3 models natively accept only image formats (JPEG, PNG, GIF), the framework internally converts PDF and TIFF documents into compatible formats. This abstraction simplifies application code and addresses a common production challenge when working with enterprise document formats.

JSON Schema-Based Extraction: The system uses a predefined JSON schema to control LLM output structure. The provided schema example for invoice extraction includes complex nested objects (issuer, recipient information), arrays (line items), and typed fields (strings, numbers). This structured approach enables reliable downstream processing and integration with business systems.

Built-in System Prompts: Rhubarb includes system prompts that ground model responses to produce output in the defined JSON format. This prompt engineering is encapsulated within the framework, reducing the complexity for application developers.

Re-prompting and Introspection: The framework implements automatic re-prompting logic to rephrase user prompts when initial extraction attempts fail, increasing the reliability of data extraction. This retry logic is important for production systems where consistent output quality is required.
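Rhubarb's internals are not shown in the case study, but the pattern can be illustrated with a simplified stand-in: call the model, and if the response fails to parse as JSON, retry with a rephrased prompt.

```python
import json

def extract_with_reprompt(invoke, prompt, max_attempts=3):
    """Simplified re-prompting loop: `invoke` is any callable that takes a
    prompt string and returns the model's raw text response."""
    attempt_prompt = prompt
    for _ in range(max_attempts):
        raw = invoke(attempt_prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Rephrase and retry; a production version might also lower
            # temperature or strip markdown fences before giving up.
            attempt_prompt = (
                "Your previous answer was not valid JSON. "
                "Respond again with JSON only, no commentary.\n" + prompt
            )
    raise ValueError(f"extraction failed after {max_attempts} attempts")
```

As the case study notes later, each retry is an additional model invocation, so this reliability comes at a cost in latency and API spend.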

The JSON schema used for invoice extraction is comprehensive, covering invoice metadata (number, dates), party information (issuer, recipient with name, address, identifier), line items (product ID, description, quantity, unit price, discounts, tax rates, totals), and aggregate totals (subtotal, discount, tax, total). All fields include descriptions that likely help guide the LLM’s extraction logic.
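A condensed sketch of what such a schema might look like, based on the field categories listed above; the exact schema in the original post differs, and every field name here is illustrative:

```python
# Condensed JSON-schema-style definition for invoice extraction.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice identifier"},
        "issue_date": {"type": "string", "description": "Date the invoice was issued"},
        "issuer": {
            "type": "object",
            "properties": {
                "name": {"type": "string", "description": "Issuing company name"},
                "address": {"type": "string", "description": "Issuer postal address"},
                "identifier": {"type": "string", "description": "Tax or registration ID"},
            },
        },
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "product_id": {"type": "string", "description": "Product code"},
                    "description": {"type": "string", "description": "Item description"},
                    "quantity": {"type": "number", "description": "Units billed"},
                    "unit_price": {"type": "number", "description": "Price per unit"},
                    "total": {"type": "number", "description": "Line total"},
                },
            },
        },
        "total": {"type": "number", "description": "Invoice grand total"},
    },
}
```

The per-field `description` strings double as inline prompt guidance for the model, which is why writing them carefully matters as much as getting the types right.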

Human-in-the-Loop Implementation

The human review component uses Amazon SageMaker labeling workforces and Amazon A2I to manage the review process. The setup requires creating a private workforce with two worker teams (“primary” and “quality”), suggesting a tiered review process for quality assurance.

The custom ReactJS UI displays the original document alongside extracted content, enabling reviewers to validate and correct extraction errors efficiently. This side-by-side comparison approach is essential for effective human review of document extraction results.

The integration with Amazon A2I handles workflow management, task distribution, and reviewer coordination—capabilities that would be complex to build from scratch and are described as “managing the heavy lifting associated with developing these systems.”

Infrastructure and Deployment

The solution is deployed using the AWS CDK (Cloud Development Kit), providing infrastructure-as-code for reproducible deployment of the pipeline's components.

The CDK approach enables version-controlled infrastructure and consistent deployments across environments, which is essential for production LLMOps practices.

Operational Considerations

The architecture includes several production-ready features:

State Tracking: DynamoDB stores document processing status, enabling monitoring and debugging. Documents move through stages like “Augment#Running” to indicate their current pipeline position.
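A sketch of how a pipeline stage might record that transition, using the DynamoDB `update_item` call (table and attribute names are illustrative; `#st` aliases the attribute name to stay clear of DynamoDB's reserved words):

```python
def build_stage_update(document_id: str, stage: str, state: str) -> dict:
    """Build update_item kwargs that move a document to a composite
    status value such as 'Augment#Running'."""
    return {
        "TableName": "doc-pipeline-status",  # illustrative table name
        "Key": {"document_id": {"S": document_id}},
        "UpdateExpression": "SET #st = :s",
        "ExpressionAttributeNames": {"#st": "stage"},
        "ExpressionAttributeValues": {":s": {"S": f"{stage}#{state}"}},
    }

# With AWS credentials configured:
# import boto3
# boto3.client("dynamodb").update_item(
#     **build_stage_update("invoices-bucket/in/doc1.pdf", "Augment", "Running"))
```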

Resilient Pipeline: The use of Step Functions for orchestration provides built-in retry logic, error handling, and visibility into workflow execution.
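In Step Functions, this retry and error handling is declared per state in the Amazon States Language workflow definition. A hypothetical fragment for the extraction step (state names, error codes, and timings are illustrative, not taken from the case study):

```json
{
  "Extract": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Retry": [
      {
        "ErrorEquals": ["Lambda.ServiceException", "States.Timeout"],
        "IntervalSeconds": 5,
        "MaxAttempts": 3,
        "BackoffRate": 2.0
      }
    ],
    "Catch": [
      { "ErrorEquals": ["States.ALL"], "Next": "MarkFailed" }
    ],
    "Next": "ApplyBusinessRules"
  }
}
```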

Audit Trail: Documents and extracted data are stored in S3 at each stage, providing traceability for compliance requirements.

Clean-up Guidance: The documentation includes explicit instructions for resource cleanup to avoid ongoing charges, acknowledging the cost management requirements for cloud-based LLM systems.

Limitations and Balanced Assessment

While the case study presents a comprehensive architecture, several considerations warrant attention:

The solution is a reference architecture and technical demonstration rather than a documented production deployment. No specific performance metrics, accuracy rates, or production statistics are provided. The human review step, while adding accuracy, also introduces latency and cost that may not be suitable for all use cases.

The system’s effectiveness depends on the quality of the JSON schema definitions and business rules, which require domain expertise to develop. The framework’s “re-prompting” logic for failed extractions may increase API costs and latency.

The architecture is tightly coupled to AWS services, creating vendor lock-in considerations. Organizations should evaluate whether this aligns with their multi-cloud or hybrid strategies.

Extensibility

The case study suggests the framework could be extended by connecting to a knowledge base for indexing extracted information and creating a Q&A assistant for information discovery. This indicates a path toward RAG (Retrieval-Augmented Generation) applications built on the extracted document content.

The use of Rhubarb framework features beyond extraction—including document classification, summarization, page-wise extraction, streaming chat, and named entity recognition—suggests opportunities for expanding the solution’s capabilities without significant architectural changes.
