Company: Uber
Title: GenAI-Powered Invoice Document Processing and Automation
Industry: Tech
Year: 2025

Summary (short)
Uber faced significant challenges processing a high volume of invoices daily from thousands of global suppliers, with diverse formats, 25+ languages, and varying templates requiring substantial manual intervention. The company developed TextSense, a GenAI-powered document processing platform that leverages OCR, computer vision, and large language models (specifically OpenAI GPT-4 after evaluating multiple options including fine-tuned Llama 2 and Flan T5) to automate invoice data extraction. The solution achieved 90% overall accuracy, reduced manual processing by 2x, cut average handling time by 70%, and delivered 25-30% cost savings compared to manual processes, while providing a scalable, configuration-driven platform adaptable to diverse document types.
## Overview

Uber developed a comprehensive GenAI-powered invoice automation system to address critical inefficiencies in processing invoices from thousands of global suppliers. The company handles massive invoice volumes daily, and the traditional approach relied heavily on manual data entry, RPA (Robotic Process Automation), Excel uploads, and rule-based systems. While Uber had existing automation in place, a significant portion of invoices still required manual handling, leading to high operational costs, long average handling times, and error-prone processes. The company recognized that its existing tools lacked the adaptability and intelligence needed to handle diverse invoice formats spanning multiple languages, templates, and structures.

The solution centers on TextSense, a modular and scalable document processing platform that abstracts OCR and LLM technologies behind a reusable interface. The platform was designed with LLMOps principles in mind, emphasizing accuracy, scalability, flexibility, and user experience. The implementation demonstrates sophisticated production use of LLMs, with careful model evaluation, human-in-the-loop review, accuracy tracking, and extensive post-processing validation layers.

## Business and Technical Context

Uber's invoice processing challenge was multifaceted. From a business perspective, the company faced high average handling times for operators processing invoices, significant operational costs from manual processing, and a heightened risk of errors leading to financial discrepancies and reconciliation challenges. The technical challenges were equally complex: invoices arrived from thousands of suppliers using varying templates and formats, in over 25 languages, often with handwritten text or as scanned copies. Each invoice contained 15-20 attributes plus line item information requiring accurate capture, and many invoices spanned multiple pages.

The existing tools, including rule-based systems and RPA, proved inadequate for several reasons. RPA automation worked for a limited set of formats, but it did not scale as Uber grew and onboarded new document formats. These systems required continual updates and manual intervention for error correction, lacked flexibility, and struggled to maintain performance at high invoice volumes. The company needed a solution that could adapt to new and diverse invoice formats without manual rule-setting for each variation.

## Architecture and Platform Design

TextSense was architected as a modular, pluggable platform designed to scale to diverse use cases beyond invoices, including entity extraction, summarization, and classification. The design philosophy emphasized configuration-driven integration with minimal coding, allowing new country-specific templates to be onboarded significantly faster (a hypothetical configuration sketch appears below). To manage nonlinear and verbose document processing workflows efficiently, Uber integrated the platform with Cadence, its workflow orchestration system, and built it from common, reusable components to facilitate future integrations and launches.

The document processing pipeline follows a systematic flow, starting with document ingestion from multiple sources including emails, PDFs, and ticketing systems. All files are saved to object storage, and the system supports both structured and unstructured data formats.
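The case study does not show TextSense's actual configuration format, so the following is a minimal, hypothetical sketch of how a configuration-driven pipeline definition for a new country-specific invoice template could look; every stage name and field name here is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FieldSpec:
    """One attribute to extract; `match` is used later when scoring accuracy."""
    name: str
    required: bool = True
    match: str = "exact"  # "exact" or "fuzzy"


@dataclass
class DocumentTemplate:
    """Declarative description of how one document type is processed."""
    template_id: str
    languages: list[str]
    pipeline: list[str]  # ordered processing stages
    header_fields: list[FieldSpec]
    line_item_fields: list[FieldSpec]


# Onboarding a new template becomes a configuration change, not new code.
invoice_de = DocumentTemplate(
    template_id="invoice-DE-v1",
    languages=["de", "en"],
    pipeline=["ingest", "preprocess", "ocr", "llm_extract", "postprocess", "review"],
    header_fields=[
        FieldSpec("invoice_number"),
        FieldSpec("invoice_date"),
        FieldSpec("total_amount"),
        FieldSpec("currency"),
        FieldSpec("supplier_name", match="fuzzy"),
    ],
    line_item_fields=[
        FieldSpec("description", match="fuzzy"),
        FieldSpec("quantity"),
        FieldSpec("unit_price"),
        FieldSpec("amount"),
    ],
)
```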
The pre-processing stage includes image augmentation to handle low-resolution scans and handwritten text, format standardization across PDFs, Word documents, and images, and multi-page document handling. Computer vision and OCR integration leverages Uber's Vision Gateway CV platform to extract text from document images.

The AI and ML model layer is where the core LLM functionality resides: the platform uses trained or pre-trained LLM models to extract specific data elements such as invoice numbers, dates, and amounts. Critically for LLMOps, the system improves continuously through periodic retraining and feedback loops that address accuracy issues and adapt to new document formats. Finally, post-processing and integration applies business rules and user-defined steps to refine the extracted data before final use, then hands off to client systems for further processing and payment actions, enabling end-to-end automation.

## Model Evaluation and Selection Process

Uber conducted a rigorous model evaluation that reflects mature LLMOps practice. The evaluation started with data preparation using past invoice data and associated attachments as ground truth. The company worked with two datasets: structured labeled data containing the invoice fields to be extracted (the data entered into its systems), and unstructured PDF data consisting of text extracted from the associated invoice documents. The last year of invoice data was used, split 90% for training and 10% for testing.

Multiple LLMs were fine-tuned and evaluated, including sequence-to-sequence models, Meta Llama 2, and Google Flan T5. The T5 model showed promise, with accuracy above 90% on invoice header fields, but it performed poorly on line items: first-line accuracy was good, but accuracy dropped considerably from the second line onward. Fine-tuning helped the models learn data patterns and business rules from existing invoices, but it also led to hallucinations, especially for line item information.

The evaluation then turned to OpenAI GPT-4 models, which demonstrated better accuracy and adaptability. While the fine-tuned open-source models were better at detecting existing invoice data patterns, GPT-4 was better at predicting what was actually present in the documents. Based on a cost-benefit analysis, GPT-4 was chosen: even though the fine-tuned LLM model had slightly higher header accuracy, GPT-4 was substantially better at line prediction. Uber notes that it plans to adopt an ensemble approach in the future, chaining more sophisticated models to further improve accuracy and adaptability for broader use cases.

## LLMOps Implementation Details

The invoicing workflow's integration with TextSense demonstrates production-grade LLMOps architecture. Documents enter the system through two pathways: manual PDF uploads through a front-end web app that sends requests to a common back-end endpoint, and automated ingestion from the ticketing system, where an ingestion service reads open tickets and extracts supplier emails with their attached PDFs. For ticket-based ingestion, the email text is passed to TextSense to parse key information that aids further processing, while the PDFs, along with details from the email text, are sent to the common back-end endpoint.
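TextSense's prompts and schemas are not published in the case study. As a rough illustration only, an extraction call against a GPT-4 class model could be shaped like the sketch below, where the prompt wording, the field names, and the use of the OpenAI Python SDK are all assumptions.

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative prompt; the real TextSense prompts and schemas are not public.
SYSTEM_PROMPT = (
    "You extract structured data from invoice text. "
    "Return only JSON with keys: invoice_number, invoice_date, total_amount, "
    "currency, supplier_name, and line_items (a list of objects with "
    "description, quantity, unit_price, amount). Use null for missing values; "
    "do not guess values that are not present in the document."
)


def extract_invoice_fields(ocr_text: str) -> dict:
    """Send OCR text from one invoice to the model and parse the JSON reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic output for extraction
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ocr_text},
        ],
    )
    return json.loads(response.choices[0].message.content)
```

Whatever such a call returns is not trusted directly; as described next, a post-processing layer validates and enriches it before any human review or downstream use.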
Once TextSense returns its response, a critical post-processing layer validates the extracted information, enriches it, and prepares it for human review. This layer was specifically designed to apply business logic before data is shown to users for human-in-the-loop review. Upon review and approval, documents are processed as invoices and sent to the ERP system for approval and vendor payment. The architecture reflects a mature understanding that raw LLM outputs require validation and business-rule application before they are production-ready.

## Data Profiling and Continuous Improvement

Data profiling plays a crucial role in the continuous improvement cycle, which is essential for production LLMOps. By analyzing its supplier base, Uber found that a large share of invoices comes from a small subset of suppliers, which informed the prioritization strategy for model development and deployment. High-volume suppliers are prioritized for profiling, particularly when their field-level accuracy falls below a set threshold; invoices from those suppliers are targeted for labeling so the model can learn and improve extraction precision. Key fields such as invoice number, date, and amount are labeled with care, since accurate and consistent labels improve the in-house-trained model's understanding of invoice structures and lead to more reliable extraction. This is a data-driven approach to model improvement rather than a one-time deployment.

## Accuracy Measurement and Monitoring

Measuring the performance of GenAI models in production requires well-designed metrics, and Uber developed a comprehensive approach. Accuracy is calculated at both the header level (overall invoice information) and the line level (individual line items within the invoice). Accuracy for each field depends on the type of match required: some fields demand exact matches (such as invoice number), while others allow fuzzy string matching (such as invoice description); a sketch of this matching logic appears below. The metrics provide granular insight into model performance, help identify areas for improvement, and guide retraining efforts. Uber tracks accuracy trends over time to ensure the models keep performing as the invoice workload evolves. This level of monitoring indicates mature production LLMOps practice, recognizing that model performance is not static and requires ongoing measurement and adjustment.

## Human-in-the-Loop Design

The UI for human-in-the-loop review shows thoughtful consideration of the operator experience. Users can compare the PDF side by side with the data extracted by the models, and the interface consolidates alerts and soft warnings in one place. Reviewers can check all details with simple eye movements rather than hand movements, which significantly accelerates review while maintaining accuracy through human oversight.

The HITL approach reflects a balanced understanding that full automation is not always appropriate or possible, especially in financial operations where accuracy is paramount. By designing intuitive review interfaces and consolidating validation information, Uber enables efficient human oversight while still capturing the efficiency gains of automation. Extracted data is additionally validated by cross-referencing against existing databases and predefined rules, with HITL validation reserved for critical reviews and corrections.
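To make the exact-versus-fuzzy matching concrete, here is a minimal sketch of per-invoice scoring at the header and line level. It is not Uber's implementation: the field lists, the similarity threshold, and the use of Python's difflib are assumptions.

```python
from difflib import SequenceMatcher

# Which fields require an exact match vs. fuzzy string matching, plus the
# similarity threshold, are illustrative assumptions.
EXACT_FIELDS = {"invoice_number", "invoice_date", "total_amount", "currency"}
FUZZY_FIELDS = {"description", "supplier_name"}
FUZZY_THRESHOLD = 0.9


def field_correct(field_name: str, predicted, expected) -> bool:
    """Return True if the predicted value counts as correct for this field."""
    pred, exp = str(predicted).strip().lower(), str(expected).strip().lower()
    if field_name in EXACT_FIELDS:
        return pred == exp
    # Fuzzy comparison for free-text fields such as the invoice description.
    return SequenceMatcher(None, pred, exp).ratio() >= FUZZY_THRESHOLD


def header_accuracy(predicted: dict, expected: dict) -> float:
    """Share of header fields extracted correctly for one invoice."""
    fields = EXACT_FIELDS | FUZZY_FIELDS
    correct = sum(field_correct(f, predicted.get(f, ""), expected.get(f, "")) for f in fields)
    return correct / len(fields)


def line_accuracy(predicted_lines: list[dict], expected_lines: list[dict]) -> float:
    """Share of line items whose fields all match, compared position by position."""
    if not expected_lines:
        return 1.0
    correct = 0
    for pred, exp in zip(predicted_lines, expected_lines):
        if all(field_correct(f, pred.get(f, ""), exp.get(f, "")) for f in exp):
            correct += 1
    return correct / len(expected_lines)
```

In production, scores like these would be aggregated per supplier and over time to feed the profiling and retraining loop described above.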
## Production Results and Impact

The implementation yielded substantial, measurable results that validate the LLMOps approach. Manual processing was reduced by half (a 2x reduction), a significant operational efficiency gain. Overall accuracy reached 90%, with 35% of submitted invoices achieving near-perfect accuracy of 99.5% and 65% achieving more than 80% accuracy. Average handling time for invoice processing dropped by 70%, translating directly into cost savings and faster processing cycles, and the solution delivered 25-30% cost savings compared to manual processing. Beyond the raw metrics, the solution improved the user experience through smarter data extraction from PDFs, effective post-processing rules, an intuitive UI, and robust ERP integration enabling seamless invoice creation and vendor payment. These results set a new benchmark for operational excellence within Uber's financial operations.

## Critical Assessment and Considerations

While the case study presents impressive results, some considerations warrant a balanced assessment. The comparison between fine-tuned open-source models and GPT-4 focused primarily on accuracy; the case study does not detail the ongoing operational cost of GPT-4 versus self-hosted fine-tuned models. For organizations processing high document volumes, API costs for proprietary models can be substantial and should factor into total cost of ownership.

The hallucination issues observed with the fine-tuned models, particularly for line item prediction, represent a common challenge in production LLM systems. The post-processing layer addresses this by applying business-logic validation, but that adds complexity and may not catch every error. The 90% overall accuracy, while impressive, still means 10% of extractions require correction, and even the "near-perfect" 99.5% accuracy for 35% of invoices leaves room for error in high-stakes financial operations.

Reliance on a proprietary model (GPT-4) also introduces vendor dependency and potential cost volatility as OpenAI's pricing changes. The planned ensemble approach could mitigate this by routing different invoice types to different models based on performance and cost characteristics, but it adds architectural complexity.

## Future Directions and Platform Evolution

Looking ahead, Uber plans several enhancements that demonstrate continued investment in its LLMOps capabilities. The team aims to further improve accuracy, expand the platform's capabilities, and build a document classification layer that routes documents by type. Uber also plans to enable fully automated end-to-end processing for cases where 100% accuracy has been met historically, reducing manual intervention and speeding up workflows for the most predictable invoice types (a hedged sketch of such a gate follows below). Through regular feedback loops and performance monitoring, TextSense will continue to evolve and incorporate new developments in AI technology. Future updates aim to extend the platform to additional document types beyond invoices and to integrate further with other enterprise systems, positioning TextSense as a versatile tool for comprehensive document management across Uber's operations.
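The case study describes this auto-approval idea only at a high level; the sketch below is a purely hypothetical illustration of how such a straight-through-processing gate could be expressed, with the history window, threshold, and data model all assumed.

```python
from dataclasses import dataclass

# Illustrative thresholds; the case study mentions skipping review when
# historical accuracy has been perfect, but publishes no concrete policy.
MIN_HISTORICAL_INVOICES = 50  # enough history to trust the supplier's track record
REQUIRED_ACCURACY = 1.0       # "100% accuracy met historically"


@dataclass
class SupplierStats:
    supplier_id: str
    invoices_reviewed: int    # invoices that went through human review
    field_accuracy: float     # aggregated field-level accuracy over a trailing window


def requires_human_review(stats: SupplierStats) -> bool:
    """Hypothetical straight-through-processing gate.

    Invoices from suppliers with a long, perfect extraction history skip the
    human-in-the-loop step; everything else is queued for review.
    """
    if stats.invoices_reviewed < MIN_HISTORICAL_INVOICES:
        return True
    return stats.field_accuracy < REQUIRED_ACCURACY
```

Any real gate would also depend on the accuracy monitoring described earlier, so that a supplier's eligibility is revoked as soon as its extraction accuracy slips.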
This evolution from invoice-specific to general-purpose document processing demonstrates the value of building reusable, modular platforms rather than point solutions. The commitment to configuration-driven onboarding of new document types, combined with the modular architecture, positions the platform well for expansion. However, maintaining model performance across increasingly diverse document types will require careful monitoring and potentially more sophisticated routing logic to ensure the right models handle the right documents. The LLMOps practices established for invoice processing—systematic evaluation, accuracy tracking, HITL review, continuous retraining, and post-processing validation—provide a solid foundation for expanding to other document processing use cases while maintaining production quality standards.
