Multilingual Document Processing Pipeline with Human-in-the-Loop Validation

A2I 2024

A case study on implementing a robust multilingual document processing system that combines Amazon Bedrock's Claude models with human review capabilities through Amazon A2I. The solution addresses the challenge of processing documents in multiple languages by using LLMs for initial extraction and human reviewers for validation, enabling organizations to efficiently process and validate documents across language barriers while maintaining high accuracy.

Industry

Tech

Overview

This case study, published by AWS in November 2024, presents a reference architecture for building production-ready multilingual document processing systems using large language models. The solution addresses a significant business challenge: multinational companies frequently receive invoices, contracts, and other documents from regions worldwide in languages such as Arabic, Chinese, Russian, or Hindi that traditional document extraction software cannot handle effectively. The global intelligent document processing (IDP) market context is cited as growing from $1,285 million in 2022 to a projected $7,874 million by 2028, indicating strong market demand for such solutions.

The architecture combines Amazon Bedrock (specifically Anthropic’s Claude 3 models) for multi-modal document understanding with Amazon Augmented AI (Amazon A2I) for human-in-the-loop validation, creating a robust pipeline that balances automation with accuracy requirements for sensitive business documents.

Architecture and Pipeline Design

The solution implements a six-stage document processing pipeline orchestrated by AWS Step Functions, demonstrating a well-structured approach to LLM-powered document workflows in production:

The Acquisition stage handles document ingestion from Amazon S3, with S3 event notifications triggering the pipeline. Initial document metadata is stored in Amazon DynamoDB to enable status tracking throughout the entire processing lifecycle. This state management approach is crucial for production systems where visibility into document processing status is essential for operations and debugging.
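This ingestion step can be sketched as a Lambda handler wired to the S3 event notification. This is an illustrative sketch, not code from the case study: the table name `doc-pipeline-status` and the attribute names are assumptions, and the real pipeline would also start the Step Functions execution here.

```python
import urllib.parse
from datetime import datetime, timezone

def build_status_item(bucket: str, key: str) -> dict:
    """Initial DynamoDB item for a newly ingested document, in the
    low-level attribute-value format the boto3 DynamoDB client expects."""
    return {
        "document_id": {"S": f"{bucket}/{key}"},
        "stage": {"S": "Acquisition#Running"},
        "ingested_at": {"S": datetime.now(timezone.utc).isoformat()},
    }

def handler(event, context):
    """Lambda entry point triggered by an S3 event notification."""
    import boto3  # deferred import so the pure helper above runs without AWS
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])
    boto3.client("dynamodb").put_item(
        TableName="doc-pipeline-status",  # illustrative table name
        Item=build_status_item(bucket, key),
    )
    # In the real pipeline, this is also where the Step Functions
    # execution for the document would be started.
```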

The Extraction stage represents the core LLM integration point. Documents are embedded into prompts alongside a JSON schema definition that specifies the expected output structure. The system uses Amazon Bedrock to invoke Anthropic’s Claude models for extraction. The results are stored as JSON in S3, providing an audit trail and enabling downstream processing stages.
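The case study performs this step through the Rhubarb framework, which wraps the model call. To show the shape of the request, the sketch below uses the raw boto3 Converse API instead; the schema contents, prompt wording, and choice of model ID are illustrative assumptions.

```python
import json

# Abridged stand-in schema; the case study's full invoice schema is richer.
EXAMPLE_SCHEMA = {
    "type": "object",
    "properties": {"invoice_number": {"type": "string"}},
}

def build_extraction_message(image_bytes: bytes, schema: dict) -> list:
    """Build a Converse-API message pairing the document image with an
    instruction to return JSON that matches the given schema."""
    return [{
        "role": "user",
        "content": [
            {"image": {"format": "png", "source": {"bytes": image_bytes}}},
            {"text": "Extract the fields defined by this JSON schema and "
                     "respond with JSON only:\n" + json.dumps(schema)},
        ],
    }]

# The actual invocation requires AWS credentials and Bedrock model access:
# import boto3
# client = boto3.client("bedrock-runtime")
# resp = client.converse(
#     modelId="anthropic.claude-3-sonnet-20240229-v1:0",
#     messages=build_extraction_message(png_bytes, EXAMPLE_SCHEMA),
# )
# extracted = json.loads(resp["output"]["message"]["content"][0]["text"])
```

The extracted JSON would then be written back to S3, as the pipeline's audit trail requires.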

The Custom Business Rules stage applies domain-specific validation logic to the extracted content. Examples include table format detection (such as identifying invoice transaction tables) or column validation (verifying that product codes contain valid values). This stage acknowledges that LLM extraction alone may not be sufficient for business compliance requirements.
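A minimal sketch of such rules, assuming an extracted-invoice shape with a `line_items` array (the field names and the set of valid product codes are invented for illustration):

```python
VALID_PRODUCT_CODES = {"SKU-100", "SKU-200", "SKU-300"}  # illustrative

def validate_line_items(extracted: dict) -> list:
    """Apply domain-specific checks to extracted invoice data; returns a
    list of human-readable violations (an empty list means it passed)."""
    errors = []
    items = extracted.get("line_items", [])
    if not items:
        errors.append("no transaction table detected")
    for i, item in enumerate(items):
        if item.get("product_id") not in VALID_PRODUCT_CODES:
            errors.append(f"line {i}: unknown product code {item.get('product_id')!r}")
        qty = item.get("quantity")
        price = item.get("unit_price")
        total = item.get("total")
        if None not in (qty, price, total) and abs(qty * price - total) > 0.01:
            errors.append(f"line {i}: quantity x unit_price does not match total")
    return errors
```

Documents that fail these checks could be flagged for mandatory human review rather than rejected outright.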

The Reshaping stage transforms the JSON output into a format compatible with Amazon A2I’s human review interface. This intermediate transformation step highlights the importance of data format standardization when integrating multiple services in an LLM pipeline.
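One plausible reshaping, assuming the custom A2I worker template iterates over a flat list of name/value fields (the target shape is an assumption, not the case study's actual format):

```python
def reshape_for_review(extracted: dict) -> dict:
    """Flatten nested extraction output into a field list that a custom
    A2I worker template can iterate over and render for correction."""
    fields = []

    def walk(prefix, value):
        if isinstance(value, dict):
            for k, v in value.items():
                walk(f"{prefix}.{k}" if prefix else k, v)
        elif isinstance(value, list):
            for i, v in enumerate(value):
                walk(f"{prefix}[{i}]", v)
        else:
            fields.append({"name": prefix, "value": value})

    walk("", extracted)
    return {"fields": fields}
```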

The Augmentation stage routes documents to human annotators via Amazon A2I for review and correction. Human reviewers use a custom ReactJS UI to efficiently review and validate extracted information against the original document.
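Routing a document into review uses the `sagemaker-a2i-runtime` client's `start_human_loop` call. A hedged sketch, where the flow-definition ARN and the keys inside `InputContent` are placeholders:

```python
import json
import uuid

def build_human_loop_request(document_s3_uri: str, review_payload: dict,
                             flow_definition_arn: str) -> dict:
    """Build the kwargs for start_human_loop. The payload keys
    ('sourceDocument', 'extractedFields') are illustrative and must match
    whatever the custom worker template expects."""
    return {
        "HumanLoopName": f"doc-review-{uuid.uuid4().hex[:12]}",
        "FlowDefinitionArn": flow_definition_arn,
        "HumanLoopInput": {
            "InputContent": json.dumps({
                "sourceDocument": document_s3_uri,
                "extractedFields": review_payload,
            })
        },
    }

# With AWS credentials configured:
# import boto3
# a2i = boto3.client("sagemaker-a2i-runtime")
# a2i.start_human_loop(**build_human_loop_request(
#     "s3://docs/inv.pdf", reshaped, flow_definition_arn))
```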

The Cataloging stage converts validated content into Excel workbooks for business team consumption, completing the end-to-end workflow.

LLM Integration Details

The architecture uses the Rhubarb Python framework for document understanding tasks with multi-modal LLMs. Rhubarb is described as a lightweight framework that simplifies interactions with Amazon Bedrock’s Claude V3 models. Several key technical decisions are noteworthy:

Document Format Handling: Since the Claude 3 models natively accept only image formats (JPEG, PNG, GIF), the framework internally converts PDF and TIFF documents into compatible formats. This abstraction simplifies application code and addresses a common production challenge when working with enterprise document formats.

JSON Schema-Based Extraction: The system uses a predefined JSON schema to control LLM output structure. The provided schema example for invoice extraction includes complex nested objects (issuer, recipient information), arrays (line items), and typed fields (strings, numbers). This structured approach enables reliable downstream processing and integration with business systems.

Built-in System Prompts: Rhubarb includes system prompts that ground model responses to produce output in the defined JSON format. This prompt engineering is encapsulated within the framework, reducing the complexity for application developers.

Re-prompting and Introspection: The framework implements automatic re-prompting logic to rephrase user prompts when initial extraction attempts fail, increasing the reliability of data extraction. This retry logic is important for production systems where consistent output quality is required.
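Rhubarb's internals are not shown in the case study, but the pattern can be illustrated with a simplified stand-in: call the model, and if the response fails to parse as JSON, retry with a rephrased prompt.

```python
import json

def extract_with_reprompt(invoke, prompt, max_attempts=3):
    """Simplified re-prompting loop: `invoke` is any callable that takes a
    prompt string and returns the model's raw text response."""
    attempt_prompt = prompt
    for _ in range(max_attempts):
        raw = invoke(attempt_prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Rephrase and retry; a production version might also lower
            # temperature or strip markdown fences before giving up.
            attempt_prompt = (
                "Your previous answer was not valid JSON. "
                "Respond again with JSON only, no commentary.\n" + prompt
            )
    raise ValueError(f"extraction failed after {max_attempts} attempts")
```

As the case study notes later, each retry is an additional model invocation, so this reliability comes at a cost in latency and API spend.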

The JSON schema used for invoice extraction is comprehensive, covering invoice metadata (number, dates), party information (issuer, recipient with name, address, identifier), line items (product ID, description, quantity, unit price, discounts, tax rates, totals), and aggregate totals (subtotal, discount, tax, total). All fields include descriptions that likely help guide the LLM’s extraction logic.
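A condensed sketch of what such a schema might look like, based on the field categories listed above; the exact schema in the original post differs, and every field name here is illustrative:

```python
# Condensed JSON-schema-style definition for invoice extraction.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice identifier"},
        "issue_date": {"type": "string", "description": "Date the invoice was issued"},
        "issuer": {
            "type": "object",
            "properties": {
                "name": {"type": "string", "description": "Issuing company name"},
                "address": {"type": "string", "description": "Issuer postal address"},
                "identifier": {"type": "string", "description": "Tax or registration ID"},
            },
        },
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "product_id": {"type": "string", "description": "Product code"},
                    "description": {"type": "string", "description": "Item description"},
                    "quantity": {"type": "number", "description": "Units billed"},
                    "unit_price": {"type": "number", "description": "Price per unit"},
                    "total": {"type": "number", "description": "Line total"},
                },
            },
        },
        "total": {"type": "number", "description": "Invoice grand total"},
    },
}
```

The per-field `description` strings double as inline prompt guidance for the model, which is why writing them carefully matters as much as getting the types right.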

Human-in-the-Loop Implementation

The human review component uses Amazon SageMaker labeling workforces and Amazon A2I to manage the review process. The setup requires creating a private workforce with two worker teams (“primary” and “quality”), suggesting a tiered review process for quality assurance.

The custom ReactJS UI displays the original document alongside extracted content, enabling reviewers to validate and correct extraction errors efficiently. This side-by-side comparison approach is essential for effective human review of document extraction results.

The integration with Amazon A2I handles workflow management, task distribution, and reviewer coordination—capabilities that would be complex to build from scratch and are described as “managing the heavy lifting associated with developing these systems.”

Infrastructure and Deployment

The solution is deployed using the AWS CDK (Cloud Development Kit), providing infrastructure-as-code for reproducible deployment of the pipeline's components.

The CDK approach enables version-controlled infrastructure and consistent deployments across environments, which is essential for production LLMOps practices.

Operational Considerations

The architecture includes several production-ready features:

State Tracking: DynamoDB stores document processing status, enabling monitoring and debugging. Documents move through stages like “Augment#Running” to indicate their current pipeline position.
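A sketch of how a pipeline stage might record that transition, using the DynamoDB `update_item` call (table and attribute names are illustrative; `#st` aliases the attribute name to stay clear of DynamoDB's reserved words):

```python
def build_stage_update(document_id: str, stage: str, state: str) -> dict:
    """Build update_item kwargs that move a document to a composite
    status value such as 'Augment#Running'."""
    return {
        "TableName": "doc-pipeline-status",  # illustrative table name
        "Key": {"document_id": {"S": document_id}},
        "UpdateExpression": "SET #st = :s",
        "ExpressionAttributeNames": {"#st": "stage"},
        "ExpressionAttributeValues": {":s": {"S": f"{stage}#{state}"}},
    }

# With AWS credentials configured:
# import boto3
# boto3.client("dynamodb").update_item(
#     **build_stage_update("invoices-bucket/in/doc1.pdf", "Augment", "Running"))
```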

Resilient Pipeline: The use of Step Functions for orchestration provides built-in retry logic, error handling, and visibility into workflow execution.
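In Step Functions, this retry and error handling is declared per state in the Amazon States Language workflow definition. A hypothetical fragment for the extraction step (state names, error codes, and timings are illustrative, not taken from the case study):

```json
{
  "Extract": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Retry": [
      {
        "ErrorEquals": ["Lambda.ServiceException", "States.Timeout"],
        "IntervalSeconds": 5,
        "MaxAttempts": 3,
        "BackoffRate": 2.0
      }
    ],
    "Catch": [
      { "ErrorEquals": ["States.ALL"], "Next": "MarkFailed" }
    ],
    "Next": "ApplyBusinessRules"
  }
}
```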

Audit Trail: Documents and extracted data are stored in S3 at each stage, providing traceability for compliance requirements.

Clean-up Guidance: The documentation includes explicit instructions for resource cleanup to avoid ongoing charges, acknowledging the cost management requirements for cloud-based LLM systems.

Limitations and Balanced Assessment

While the case study presents a comprehensive architecture, several considerations warrant attention:

The solution is a reference architecture and technical demonstration rather than a documented production deployment. No specific performance metrics, accuracy rates, or production statistics are provided. The human review step, while adding accuracy, also introduces latency and cost that may not be suitable for all use cases.

The system’s effectiveness depends on the quality of the JSON schema definitions and business rules, which require domain expertise to develop. The framework’s “re-prompting” logic for failed extractions may increase API costs and latency.

The architecture is tightly coupled to AWS services, creating vendor lock-in considerations. Organizations should evaluate whether this aligns with their multi-cloud or hybrid strategies.

Extensibility

The case study suggests the framework could be extended by connecting to a knowledge base for indexing extracted information and creating a Q&A assistant for information discovery. This indicates a path toward RAG (Retrieval-Augmented Generation) applications built on the extracted document content.

The use of Rhubarb framework features beyond extraction—including document classification, summarization, page-wise extraction, streaming chat, and named entity recognition—suggests opportunities for expanding the solution’s capabilities without significant architectural changes.
