Fine-tuning Multimodal Models for Banking Document Processing

Apoidea Group 2025
Apoidea Group tackled the challenge of efficiently processing banking documents by developing a solution built on multimodal large language models. They fine-tuned the Qwen2-VL-7B-Instruct model using LLaMA-Factory on Amazon SageMaker HyperPod to enhance visual information extraction from complex banking documents. Fine-tuning raised table structure recognition accuracy from a 23.4% to an 81.1% TEDS score, approaching the performance of more advanced models while remaining computationally efficient, and cut the financial spreading process from 4-6 hours to just 10 minutes.

Industry

Finance

Overview

Apoidea Group is a Hong Kong-based AI-focused FinTech independent software vendor (ISV) that develops document processing solutions for multinational banks. Their flagship product, SuperAcc, is a sophisticated document processing service that uses proprietary document understanding models to process diverse document types including bank statements, financial statements, and KYC documents. The company has deployed their solutions with over 10 financial services industry clients, demonstrating reliability in production banking environments with claimed ROI of over 80%.

This case study documents Apoidea’s collaboration with AWS to enhance their visual information extraction capabilities by fine-tuning large vision-language models (LVLMs) for table structure recognition on banking and financial documents. The work specifically focuses on fine-tuning the Qwen2-VL-7B-Instruct model using LLaMA-Factory on Amazon SageMaker HyperPod infrastructure.

Problem Statement

The banking industry faces significant challenges with repetitive document processing tasks including information extraction, document review, and auditing. These tasks require substantial human resources and slow down critical operations such as Know Your Customer (KYC) procedures, loan applications, and credit analysis, and traditional multi-stage extraction pipelines struggle with the varied and complex layouts of these documents.

The specific use case addressed in this case study is table structure recognition from complex financial documents, where models need to accurately identify table structures including merged cells, hierarchical headers, and varying layouts while maintaining content fidelity.

Technical Solution Architecture

Base Model Selection

The team selected Qwen2-VL-7B-Instruct as the base model for fine-tuning. This choice was significant because modern vision-language models use pre-trained vision encoders (such as vision transformers) as their backbone to extract visual features, which are then fused with textual embeddings in a multimodal transformer architecture. The pre-trained knowledge of Qwen2-VL provided a strong foundation for domain-specific adaptation. Notably, the model’s multilingual capabilities were preserved even when fine-tuning on English datasets alone, still yielding good performance on Chinese evaluation datasets.
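The fusion described above can be illustrated with a minimal sketch: visual features from the vision encoder are linearly projected into the text embedding space and joined with the token embeddings into a single sequence. The dimensions and token layout below are illustrative only, not Qwen2-VL's actual architecture.

```python
# Minimal sketch of vision-text fusion in a vision-language model.
# A vision encoder turns image patches into feature vectors; a linear
# projection maps them into the text embedding space; the projected
# "visual tokens" are prepended to the text token embeddings so one
# transformer can attend over both modalities.

def project(features, weight):
    # Plain matrix-vector products: map each visual feature (len d_v)
    # into the text embedding space (len d_t).
    return [[sum(f[i] * weight[i][j] for i in range(len(f)))
             for j in range(len(weight[0]))] for f in features]

def fuse(visual_features, text_embeddings, weight):
    visual_tokens = project(visual_features, weight)
    return visual_tokens + text_embeddings  # one interleaved sequence

# Toy numbers: 2 image patches with 3-dim features, 4 text tokens with
# 2-dim embeddings, and a 3x2 projection matrix.
patches = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]
tokens = [[0.1, 0.2]] * 4
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
seq = fuse(patches, tokens, W)
print(len(seq))  # 6 tokens: 2 visual + 4 textual
```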

Fine-Tuning Framework: LLaMA-Factory

LLaMA-Factory is an open-source framework designed for efficient training and fine-tuning of large language models. It supports over 100 popular models and integrates advanced techniques such as parameter-efficient fine-tuning (LoRA and QLoRA) and quantized training.

The framework provides efficiency advantages by significantly reducing computational and memory requirements for fine-tuning large models through quantization techniques. Its modular design integrates cutting-edge algorithms like FlashAttention-2 and GaLore for high performance and scalability.

Training Infrastructure: Amazon SageMaker HyperPod

The distributed training was conducted on Amazon SageMaker HyperPod, which provides purpose-built, resilient infrastructure for training large-scale models, with cluster-level job scheduling handled through Slurm.

The solution used QLoRA and data parallel distributed training orchestrated through Slurm sbatch scripts. By freezing most parameters through QLoRA during initial fine-tuning stages, the team achieved faster convergence and better utilization of pre-trained knowledge, especially with limited data.
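The parameter savings behind this approach can be sketched with LoRA's low-rank update, where the frozen base weight W (d x k) is augmented by two small trainable matrices. This is a toy illustration of the arithmetic, not the LLaMA-Factory implementation; QLoRA's 4-bit quantization of the frozen base weights is omitted.

```python
# LoRA freezes the base weight W (d x k) and trains two small matrices
# A (r x k) and B (d x r); the effective weight at inference is
# W + (alpha / r) * B @ A. Only A and B are updated, so the trainable
# parameter count drops from d*k to r*(d + k).

def lora_trainable_params(d, k, r):
    full = d * k          # full fine-tuning of this layer
    lora = r * (d + k)    # low-rank adapter only
    return full, lora

# Shapes loosely in the range of a 7B model's attention projection.
full, lora = lora_trainable_params(d=4096, k=4096, r=16)
print(full, lora, full // lora)
# 16777216 131072 128 -> ~128x fewer trainable parameters per layer
```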

Data Preprocessing

The training data was preprocessed to use image inputs with HTML structure as the output format. HTML was chosen because it naturally expresses table structure, including merged cells (via rowspan and colspan) and hierarchical headers, and generated tables can be compared directly against ground truth using the TEDS metric.

The preprocessing is critical for the model to learn the expected output format and adapt to the visual layout of tables. The team noted that model performance depends heavily on fine-tuning data quality: a domain-specific dataset only 10% the size of a general dataset achieved a 5-point higher TEDS score.
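A preprocessing step like the one described might emit training records that pair a table image with its HTML ground truth. The sketch below uses a sharegpt-style message format similar to what LLaMA-Factory accepts for multimodal data; the exact field names, paths, and prompt are assumptions, not Apoidea's actual pipeline.

```python
import json

def make_sample(image_path, html_ground_truth):
    # One image-to-HTML training record: the <image> placeholder marks
    # where the visual input is injected, and the assistant turn holds
    # the target table structure the model should learn to emit.
    return {
        "messages": [
            {"role": "user",
             "content": "<image>Convert this table to HTML."},
            {"role": "assistant", "content": html_ground_truth},
        ],
        "images": [image_path],
    }

# Hypothetical file name and annotation for illustration.
sample = make_sample(
    "tables/statement_001.png",
    "<table><tr><td>Revenue</td><td>1,200</td></tr></table>",
)
print(json.dumps(sample, indent=2))
```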

Inference and Deployment

For production inference, the team uses vLLM to host the quantized model. vLLM provides high-throughput serving through continuous batching and efficient GPU memory management via PagedAttention.

The deployment process involves quantizing the fine-tuned model and exposing it through a vLLM serving endpoint.
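Since vLLM exposes an OpenAI-compatible API, a client request for table extraction might look like the sketch below. The model name, prompt, and endpoint conventions are placeholders, not the team's actual deployment configuration.

```python
import base64

def build_request(image_bytes, model="qwen2-vl-7b-table"):
    # Build a chat-completions payload for vLLM's OpenAI-compatible
    # /v1/chat/completions endpoint; the image is passed inline as a
    # base64 data URL alongside a text instruction.
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text",
                 "text": "Extract this table as HTML."},
            ],
        }],
        "temperature": 0.0,  # deterministic structure extraction
    }

payload = build_request(b"\x89PNG...")  # stand-in bytes, not a real image
print(payload["model"], len(payload["messages"]))
```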

Evaluation Results

The evaluation focused on the FinTabNet dataset containing complex tables from S&P 500 annual reports. The team used Tree Edit Distance-based Similarity (TEDS) metric, which assesses both structural and content similarity between generated HTML tables and ground truth. TEDS-S specifically measures structural similarity.
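TEDS normalizes the tree edit distance between the two parsed HTML trees by the size of the larger tree. A minimal sketch of that normalization, assuming the edit distance has already been computed by a tree-matching algorithm:

```python
def teds(edit_distance, n_nodes_pred, n_nodes_true):
    # Tree Edit Distance-based Similarity:
    # 1.0 means identical trees, 0.0 means completely different.
    return 1.0 - edit_distance / max(n_nodes_pred, n_nodes_true)

# Example: 19 edits between a 95-node predicted table tree and a
# 100-node ground-truth tree.
score = teds(19, 95, 100)
print(round(score, 3))  # 0.81
```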

The results demonstrated significant improvements on both metrics.

The fine-tuned model improved dramatically over the base version (TEDS from 23.4 to 81.1), surpassed Claude 3 Haiku, and approached Claude 3.5 Sonnet's accuracy while requiring far less compute. Its structural understanding (TEDS-S of 89.7) actually exceeded Claude 3.5 Sonnet's 87.1.

Best Practices Identified

Through experimentation, the team identified several key insights for fine-tuning multimodal table structure recognition models:

Data Quality and Domain Specificity: Fine-tuning doesn’t require massive datasets; relatively good performance was achieved with just a few thousand samples. However, imbalanced datasets lacking sufficient examples of complex elements like long tables and forms with merged cells can lead to biased performance. Maintaining balanced distribution of document types during fine-tuning ensures consistent performance across various formats.
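One simple way to enforce the balanced distribution described above is to cap every document type at the size of the rarest class before fine-tuning. This is an illustrative sketch with invented class names, not the team's actual sampling strategy.

```python
import random
from collections import defaultdict

def balance(samples, key, seed=0):
    # Group samples by document type and downsample every group to the
    # size of the smallest one, so no single layout dominates training.
    groups = defaultdict(list)
    for s in samples:
        groups[s[key]].append(s)
    n = min(len(g) for g in groups.values())
    rng = random.Random(seed)
    balanced = []
    for g in groups.values():
        balanced.extend(rng.sample(g, n))
    return balanced

# Hypothetical imbalanced corpus: simple tables dominate.
data = ([{"type": "simple_table"}] * 900
        + [{"type": "merged_cells"}] * 60
        + [{"type": "long_table"}] * 40)
print(len(balance(data, "type")))  # 120: 40 samples per type
```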

Synthetic Data Generation: When real-world annotated data is limited, synthetic data generation using document data synthesizers can be effective. Combining real and synthetic data during fine-tuning helps mitigate out-of-domain issues, particularly for rare or domain-specific text types.
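A document data synthesizer for this task could generate HTML tables with the structural quirks the model must learn, such as merged header cells. The sketch below produces only the HTML annotation; a real synthesizer would also render each table to an image. It is a hypothetical illustration, not the synthesizer the team used.

```python
import random

def synth_table(rows, cols, seed=0):
    # Generate a synthetic HTML table annotation with an occasional
    # merged header cell (colspan), for augmenting scarce real data.
    rng = random.Random(seed)
    html = ["<table>"]
    if cols >= 2 and rng.random() < 0.5:
        # Header row with the first two columns merged.
        html.append('<tr><td colspan="2">H</td>'
                    + "<td>H</td>" * (cols - 2) + "</tr>")
    else:
        html.append("<tr>" + "<td>H</td>" * cols + "</tr>")
    for _ in range(rows):
        cells = "".join(f"<td>{rng.randint(0, 999)}</td>"
                        for _ in range(cols))
        html.append(f"<tr>{cells}</tr>")
    html.append("</table>")
    return "".join(html)

t = synth_table(2, 3)
print(t.count("<tr>"))  # 3 rows: 1 header + 2 body
```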

Base Model Selection: More powerful base models yield better results. The Qwen2-VL’s pre-trained visual and linguistic features provided a strong foundation, and multilingual capabilities were preserved even when fine-tuning on English datasets alone.

Security Considerations

The case study also addresses the security considerations essential for working with sensitive financial documents.

The solution complies with banking-grade standards, including ISO 9001 (quality management) and ISO 27001 (information security).

Business Impact Claims

It should be noted that some claims in this case study come from Apoidea's marketing materials and AWS partnership content. The claimed business impacts include the reduction of financial spreading time from 4-6 hours to roughly 10 minutes and an ROI of over 80% for deployed clients.

These figures should be evaluated with appropriate skepticism as they originate from the vendor and cloud partner rather than independent verification. However, the technical approach of using fine-tuned vision-language models to replace multi-stage document processing pipelines is a sound architectural decision that addresses real inefficiencies in traditional document extraction systems.

Technical Implementation Resources

The solution provides open-source implementation available through a GitHub repository with step-by-step guidance for fine-tuning Qwen2-VL-7B-Instruct on SageMaker HyperPod, enabling organizations to adapt the approach for their own document processing challenges using domain-specific data.
