Company: Apoidea Group
Title: Fine-tuning Multimodal Models for Banking Document Processing
Industry: Finance
Year: 2025
Summary (short)
Apoidea Group tackled the challenge of efficiently processing banking documents by building a solution on multimodal large language models. The team fine-tuned the Qwen2-VL-7B-Instruct model with LLaMA-Factory on Amazon SageMaker HyperPod to enhance visual information extraction from complex banking documents. The fine-tuned model raised table structure recognition from a 23.4% to an 81.1% TEDS score, approaching the performance of more advanced models while remaining computationally efficient, and cut the financial spreading process from 4-6 hours to just 10 minutes.
This case study showcases how Apoidea Group, a Hong Kong-based FinTech ISV, implemented and operationalized multimodal large language models to revolutionize document processing in the banking sector, and it offers valuable insights into the practical challenges of deploying LLMs in a highly regulated industry.

The company's core challenge was to improve the efficiency of processing banking documents such as bank statements, financial statements, and KYC documents. Traditional manual processing was time-consuming, taking 4-6 hours for financial spreading and tying up significant human resources, and the banking industry's strict security and regulatory requirements added further complexity to any AI solution.

Apoidea Group's solution, SuperAcc, leverages advanced generative AI and deep learning technologies. The system reduced processing time to 10 minutes, with just 30 minutes of staff review needed, delivering an ROI of over 80%; it has been deployed across more than 10 financial services industry clients.

The technical implementation focused on fine-tuning the Qwen2-VL-7B-Instruct multimodal model using several key technologies and approaches:

Infrastructure and Training:
* Utilized Amazon SageMaker HyperPod for distributed training
* Implemented the LLaMA-Factory framework for efficient fine-tuning
* Used QLoRA (Quantized Low-Rank Adaptation) to reduce computational requirements (see the sketch after these lists)
* Employed data-parallel distributed training
* Integrated with Slurm for cluster management and job scheduling

Model Development and Fine-tuning:
* Preprocessed data as image inputs paired with HTML outputs encoding table structure
* Applied specialized data preparation techniques for financial document formats
* Used domain-specific datasets to improve recognition of banking terminology
* Maintained multilingual capabilities through careful fine-tuning approaches
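The case study describes this training setup at the level of tools rather than code. As a rough illustration of what the QLoRA step involves, here is a minimal sketch using the Hugging Face transformers and peft libraries directly (LLaMA-Factory drives the same primitives from a YAML config); the rank, alpha, dropout, and target modules below are illustrative assumptions, not values reported by Apoidea.

```python
# pip install transformers peft bitsandbytes accelerate
import torch
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    Qwen2VLForConditionalGeneration,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "Qwen/Qwen2-VL-7B-Instruct"

# 4-bit NF4 quantization is the "Q" in QLoRA: the frozen base weights are
# stored in 4 bits, and only the small LoRA adapters are trained.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters to the attention projections. These
# hyperparameters are common defaults, not Apoidea's published values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% trainable
```

From here, training proceeds as ordinary supervised fine-tuning on (document image, HTML table) pairs, with data parallelism across the HyperPod cluster handled by the launcher (for example, torchrun under Slurm).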
Production Deployment:
* Deployed using vLLM for efficient memory management and optimized inference (sketched below)
* Implemented 4-bit quantization to reduce model size while maintaining accuracy
* Exposed the model through RESTful APIs for integration with existing systems
* Established comprehensive security measures, including encryption and access controls
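The serving code is likewise not shown in the article. The offline-batch sketch below illustrates how a Qwen2-VL checkpoint can be queried through vLLM; the prompt placeholders follow vLLM's documented Qwen2-VL format, the image filename is hypothetical, and a production deployment would load the fine-tuned, quantized checkpoint and sit behind a REST API rather than being called in-process.

```python
# pip install vllm
from PIL import Image
from vllm import LLM, SamplingParams

# Base checkpoint shown for illustration; in production this would be the
# fine-tuned (adapter-merged, quantized) model.
llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct", max_model_len=8192)

# Qwen2-VL chat format with an image placeholder, per vLLM's examples.
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>"
    "Extract the table on this page as HTML.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

image = Image.open("bank_statement_page.png")  # hypothetical input

# Deterministic decoding: table extraction should be reproducible.
params = SamplingParams(temperature=0.0, max_tokens=4096)

outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"image": image}}],
    sampling_params=params,
)
print(outputs[0].outputs[0].text)  # the HTML rendering of the table
```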
The production implementation included several security measures crucial for financial services:
* Data encryption at rest using AWS KMS
* TLS encryption for data in transit
* Strict S3 bucket policies with VPC endpoints
* Least-privilege access controls through IAM roles
* Private subnet deployment in dedicated VPCs
* API Gateway protections and token-based authentication

Key performance metrics demonstrated the effectiveness of the approach:
* Improved the TEDS (Tree Edit Distance-based Similarity) score from 23.4% to 81.1% (the metric is sketched below)
* Enhanced structural similarity (TEDS-S) from 25.3% to 89.7%
* Reached performance comparable to more advanced models like Claude 3.5 Sonnet
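For context on the headline metric: TEDS compares the predicted and ground-truth tables as trees, TEDS(T_pred, T_true) = 1 - EditDist(T_pred, T_true) / max(|T_pred|, |T_true|), where EditDist is a tree edit distance over the HTML structure. The sketch below, built on the zss tree-edit-distance library, is a simplified illustration: since it compares tags only and ignores cell text, it approximates the structural variant (TEDS-S) rather than full TEDS, which also weighs cell-content similarity.

```python
# pip install zss lxml
from lxml import html as lxml_html
from zss import Node, simple_distance


def build_tree(element) -> Node:
    """Convert an lxml element into a zss tree labeled by tag name."""
    node = Node(element.tag)
    for child in element:
        node.addkid(build_tree(child))
    return node


def count_nodes(node: Node) -> int:
    return 1 + sum(count_nodes(child) for child in node.children)


def teds_structure(pred_html: str, true_html: str) -> float:
    """Structure-only TEDS: 1 - edit_distance / size of the larger tree."""
    pred = build_tree(lxml_html.fragment_fromstring(pred_html))
    true = build_tree(lxml_html.fragment_fromstring(true_html))
    dist = simple_distance(pred, true)  # unit-cost insert/delete/relabel
    return 1.0 - dist / max(count_nodes(pred), count_nodes(true))


# A prediction that drops one cell from a 2x2 table:
gt = "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td><td>d</td></tr></table>"
pred = "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td></tr></table>"
print(f"simplified TEDS-S: {teds_structure(pred, gt):.3f}")  # 1 - 1/7 ~= 0.857
```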
The team identified several best practices for fine-tuning multimodal models in production:

Data Quality and Balance:
* Domain-specific data yielded better results than general datasets
* Relatively small but well-curated datasets achieved good performance
* A balanced distribution of document types kept performance consistent
* Synthetic data generation supplemented limited real-world samples

Model Selection and Training:
* The choice of base model significantly impacted final performance
* QLoRA helped achieve faster convergence while preserving pre-trained knowledge
* Multilingual capabilities were preserved despite English-focused fine-tuning

Production Optimization:
* Implemented efficient batch processing
* Optimized memory allocation for high-throughput inference
* Established monitoring and evaluation metrics for production performance

The success of this implementation demonstrates the practical viability of fine-tuned multimodal models for complex document processing in highly regulated environments, and it shows how careful attention to infrastructure, security, and optimization can carry advanced LLM capabilities into production. Particularly noteworthy is how the team balanced the competing demands of model performance, computational efficiency, and regulatory compliance; their approach to fine-tuning and deployment provides a blueprint for other organizations pursuing similar solutions in regulated industries.

The case study also highlights the importance of choosing appropriate tools for each stage of the MLOps pipeline: the combination of LLaMA-Factory for training, SageMaker HyperPod for infrastructure, and vLLM for deployment created an efficient, maintainable production system that met both technical and business requirements.
