Ramp, a financial technology company, developed a comprehensive benchmarking framework to evaluate large language models across six financial automation tasks: invoice OCR, financial statement extraction, policy compliance, accounting autocoding, partner restrictions compliance, and smart routing. The company systematically tested 13+ models from Anthropic, Google, and OpenAI, along with some open-weight models, across different reasoning configurations to optimize the tradeoffs of accuracy, cost, and latency for each use case. Results showed that no single model dominated across all tasks: Gemini 3 Flash excelled at visual extraction tasks with superior cost-efficiency, Claude Opus 4.6 achieved the highest overall intelligence, and different models proved optimal for different financial workflows depending on whether the priority was precision, cost, latency, or coverage.
Ramp, a financial technology company providing expense management and accounts payable software, has built a sophisticated LLMOps evaluation framework to continuously benchmark large language models across six production financial automation tasks. This case study exemplifies mature LLMOps practices where systematic evaluation drives model selection, deployment decisions, and continuous improvement. The company emphasizes that benchmarking is critical not just for quantifying customer value, but also for maintaining internal development velocity while ensuring trust and reliability—a key tension in production LLM systems where “shipping fast” must be balanced against “not breaking things.”
The case study presents detailed performance analyses across contextual invoice OCR, financial statement OCR, policy agent decision-making, accounting autocoding, partner restrictions compliance, and fund smart routing. What makes this particularly notable from an LLMOps perspective is the systematic approach to evaluation across multiple dimensions (accuracy, cost, latency) and the recognition that different tasks require different model choices based on their specific constraints and failure mode tolerances.
Ramp’s evaluation framework represents a production-grade approach to LLMOps where continuous testing enables rapid iteration without compromising reliability. The company tested 13+ models from major providers (Anthropic’s Claude family, Google’s Gemini family, OpenAI’s GPT family, and some open-weight models like Minimax 2.5 and GLM 5) across multiple reasoning configurations to understand how inference-time compute affects performance on real-world tasks.
A critical aspect of their methodology is the use of Pareto frontier analysis to visualize the tradeoffs between competing objectives. For most tasks, they plot models along dimensions like accuracy versus cost, or false positives versus false negatives, identifying which models sit on the Pareto frontier (the set of solutions where you cannot improve one metric without degrading another). This approach acknowledges that in production LLMOps, there is rarely a single “best” model—instead, the optimal choice depends on the specific constraints and priorities of each use case.
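The frontier idea can be made concrete with a minimal sketch. The model names and numbers below are invented for illustration and are not Ramp's results; the two axes (accuracy, where higher is better, and cost, where lower is better) are one of the pairings the case study mentions.

```python
from typing import NamedTuple


class ModelResult(NamedTuple):
    name: str
    accuracy: float  # higher is better
    cost: float      # lower is better


def dominates(a: ModelResult, b: ModelResult) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    at_least_as_good = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return at_least_as_good and strictly_better


def pareto_frontier(results: list[ModelResult]) -> list[ModelResult]:
    """Keep every result that no other result dominates, sorted by cost."""
    frontier = [r for r in results if not any(dominates(o, r) for o in results)]
    return sorted(frontier, key=lambda r: r.cost)


# Hypothetical benchmark results (illustrative only).
results = [
    ModelResult("model-a", accuracy=0.90, cost=1.00),
    ModelResult("model-b", accuracy=0.88, cost=0.30),
    ModelResult("model-c", accuracy=0.80, cost=0.50),  # dominated by model-b
    ModelResult("model-d", accuracy=0.70, cost=0.10),
]
frontier_names = [r.name for r in pareto_frontier(results)]
print(frontier_names)
```

Here model-c drops out because model-b is both cheaper and more accurate. The other three each represent a different tradeoff point, which is the situation the case study describes for most tasks.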
The framework uses real production data and ground truth from actual customer interactions, which provides more realistic evaluation than synthetic benchmarks. However, the case study also acknowledges challenges in establishing ground truth for certain tasks, particularly the policy agent where human decisions may themselves be inconsistent or subjective.
The first production use case involves extracting invoice-level fields and line items from uploaded invoices, which can be hundreds of pages long, for Ramp’s Accounts Payable software. The key innovation here is “contextual OCR” that learns patterns from a business’s historical submitted bills to improve extraction quality. For example, if a business consistently forms invoice numbers by concatenating “INV” with the invoice date, the system should learn this pattern and auto-populate correctly even when the invoice number isn’t explicitly found in the document.
This differs fundamentally from classic OCR benchmarks because the extraction must be contextual and adaptive to user preferences rather than simply focused on reading text from challenging images. The evaluation metric is Perfect Extraction Rate—the percentage of invoices where every extracted field matches the user’s final submitted bill with zero edits required. Common failure modes include ambiguity around dates and incorrect derivation of fields that depend on other extracted fields.
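As a rough sketch of how such a metric could be computed, assume each invoice is represented as a dictionary of field names to values. The field names and sample data below are hypothetical, and the exact comparison rules Ramp uses (for example, normalization of dates or amounts) are not described in the source.

```python
def perfect_extraction_rate(extracted: list[dict], submitted: list[dict]) -> float:
    """Fraction of invoices where every extracted field equals the submitted bill."""
    if len(extracted) != len(submitted):
        raise ValueError("extracted and submitted must have the same length")
    if not extracted:
        return 0.0
    # Dict equality requires the same set of fields with identical values.
    perfect = sum(1 for e, s in zip(extracted, submitted) if e == s)
    return perfect / len(extracted)


# Hypothetical invoices: the third one needed an edit to the invoice number.
extracted = [
    {"invoice_number": "INV-2024-01-05", "total": "120.00"},
    {"invoice_number": "INV-2024-02-11", "total": "89.50"},
    {"invoice_number": "INV-2024-03-02", "total": "40.00"},
]
submitted = [
    {"invoice_number": "INV-2024-01-05", "total": "120.00"},
    {"invoice_number": "INV-2024-02-11", "total": "89.50"},
    {"invoice_number": "INV-2024-03-20", "total": "40.00"},
]
rate = perfect_extraction_rate(extracted, submitted)
print(f"{rate:.2%}")
```

The metric is deliberately strict: a single wrong field makes the whole invoice count as a failure, which mirrors the "zero edits required" framing.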
In the results, Gemini 3 Flash performed exceptionally well on this task, with a perfect extraction rate nearly matching that of GPT-5.1 with high reasoning at less than one-third the cost. The LLMOps takeaway is that, on Ramp’s visual document understanding workload, the latest Gemini models offered superior cost-efficiency. Additionally, inference-time compute (higher reasoning effort) provided significant performance lifts across almost all models, suggesting that for this task, giving models more time to reason improves accuracy substantially.
Financial statement extraction presents different challenges than invoice OCR. As part of Ramp’s risk and underwriting process, customers upload financial statements from which the system extracts key metrics like revenue, net income, EBITDA, and total expenses. The challenge is that financial statements are far less standardized than invoices—line items vary in naming across companies, derived metrics like EBITDA may not appear explicitly, and even expert humans reviewing the same P&L statement with different accounting treatments may arrive at slightly different answers.
The evaluation used over 500 real P&L documents with ground truth from Ramp’s risk operators. The primary metric was match rate at a 1% relative error threshold per extracted value, with secondary analysis at 5% and 10% error thresholds to characterize the magnitude of errors when models are wrong. This multi-threshold approach provides nuanced insight into failure modes—some models might be precise when correct but make occasional large errors, while others might consistently produce small errors.
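The multi-threshold match rate can be sketched as follows. The (predicted, ground truth) pairs are invented for illustration, and the handling of a zero ground-truth value is an assumption, since the source does not say how that case is treated.

```python
def within_relative_error(predicted: float, truth: float, threshold: float) -> bool:
    """True if |predicted - truth| / |truth| is within the threshold."""
    if truth == 0:
        # Assumption: with a zero ground truth, only an exact zero counts as a match.
        return predicted == 0
    return abs(predicted - truth) / abs(truth) <= threshold


def match_rate(pairs: list[tuple[float, float]], threshold: float) -> float:
    """Fraction of (predicted, truth) pairs within the relative error threshold."""
    if not pairs:
        return 0.0
    matches = sum(1 for p, t in pairs if within_relative_error(p, t, threshold))
    return matches / len(pairs)


# Hypothetical extracted values versus operator ground truth.
pairs = [
    (1000.0, 1004.0),  # about 0.4% off
    (103.0, 100.0),    # 3% off
    (108.0, 100.0),    # 8% off
    (150.0, 100.0),    # 50% off
]
rates = {t: match_rate(pairs, t) for t in (0.01, 0.05, 0.10)}
print(rates)
```

Comparing the rate at 1% against the rates at 5% and 10% shows how many of a model's misses are near misses rather than large errors.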
The Pareto frontier analysis revealed that Gemini 2.5 Flash and 3 Flash dominated the cost-efficient end, while Claude Sonnet 4.6, Opus 4.5, and Opus 4.6 owned the high-accuracy end. Gemini 3 Flash emerged as the clear value leader, nearly matching Claude Sonnet 4.6’s accuracy at roughly a quarter of the cost. However, when examining looser error tolerances (10%), the rankings shifted—Gemini 3 Flash pulled ahead of every Claude model, suggesting different failure mode characteristics between providers. Claude models proved more precise (fewer small errors), while Gemini models were better at limiting large misses when wrong. This kind of nuanced failure mode analysis is essential for production LLMOps where understanding not just whether a model fails but how it fails informs risk management and user experience design.
The policy agent addresses the challenge of reviewing expenses to determine whether they comply with company expense policies—a painstakingly manual process that’s often overwhelming and error-prone. The LLM-powered agent suggests either approving or rejecting expenses, or outputs “unsure” when it lacks required context to make a confident decision.
This task presents a classic LLMOps evaluation challenge: the lack of clear ground truth. Human expense reviewers themselves may have blind spots and varying interpretations of policies, so simply comparing against human decisions is insufficient. Ramp’s solution is to decompose the overall objective into two metrics that pull in opposite directions: disagreement rate (percentage of transactions where the model disagrees with the human decision) and unsure rate (percentage of transactions forwarded to humans for review).
This creates a natural tension that’s common in production AI systems—the model could achieve very low disagreement by being extremely conservative and saying “unsure” most of the time, but that wouldn’t provide much value to customers. The Pareto frontier approach allows Ramp to visualize this tradeoff and choose the operating point that delivers the best product experience.
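A minimal sketch of the two opposing metrics is shown below. It assumes both rates use all transactions as the denominator and that an "unsure" output is not counted as a disagreement; the source does not spell out either convention, and the sample decisions are invented.

```python
def policy_agent_rates(
    model_decisions: list[str], human_decisions: list[str]
) -> tuple[float, float]:
    """Return (disagreement_rate, unsure_rate) over all transactions."""
    if len(model_decisions) != len(human_decisions):
        raise ValueError("decision lists must have the same length")
    n = len(model_decisions)
    if n == 0:
        return 0.0, 0.0
    unsure = sum(1 for m in model_decisions if m == "unsure")
    # Assumption: only confident outputs (approve/reject) can disagree.
    disagree = sum(
        1
        for m, h in zip(model_decisions, human_decisions)
        if m != "unsure" and m != h
    )
    return disagree / n, unsure / n


# Hypothetical decisions for five expenses.
model = ["approve", "reject", "unsure", "approve", "unsure"]
human = ["approve", "approve", "reject", "approve", "reject"]
disagreement_rate, unsure_rate = policy_agent_rates(model, human)
print(disagreement_rate, unsure_rate)
```

A model that answers "unsure" for everything would score a disagreement rate of zero and an unsure rate of one, which is why the two numbers need to be read together.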
An interesting finding was the significant behavioral variation among models from the same provider. GPT-5.1 and GPT-5.4, both from OpenAI, behaved in opposite ways—GPT-5.1 was often confident but wrong, while GPT-5.4 was uncertain but accurate. This highlights an important LLMOps consideration: model behavior can vary substantially even within a single provider’s model family, and newer models don’t always improve in the specific dimensions you care about. Most models sat at or near the Pareto frontier, meaning the choice comes down to selecting the right tradeoff point rather than finding a clearly superior model.
Accounting coding—categorizing company expenses under appropriate categories like Marketing Spend, Travel, or Research and Development—is a typical bookkeeping task that suffers from a knowledge gap. Bookkeepers understand the nuanced differences between semantically similar categories (e.g., “Travel: Client Billable” versus “General: Travel”) but lack context on specific expenses, while spenders have context but may not understand accounting category distinctions. LLMs with access to user and transactional context can bridge this gap.
The challenges include cold start problems (distinguishing between semantically similar categories without historical usage examples) and balancing pattern adherence versus handling one-off exceptions (new categories are often introduced as companies grow, requiring quick adaptation while distinguishing genuine pattern changes from exceptional cases).
The evaluation metric is Accuracy@K for K=1 (top-1 accuracy) and K=3 (whether the correct category appears in the top three suggestions). Success requires both adherence to historical patterns when appropriate and accurate semantic understanding plus reasoning when patterns are wrong or absent. Latency is also critical given the high throughput—autocoding runs across multiple categories for every transaction Ramp processes at different points in the transaction lifecycle.
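Accuracy@K can be sketched in a few lines, assuming each prediction is a list of categories ranked from most to least likely. The category names come from the examples in this write-up, but the rankings and labels below are invented.

```python
def accuracy_at_k(
    ranked_predictions: list[list[str]], true_labels: list[str], k: int
) -> float:
    """Fraction of transactions whose true category is in the top-k suggestions."""
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    if not true_labels:
        return 0.0
    hits = sum(
        1 for preds, truth in zip(ranked_predictions, true_labels) if truth in preds[:k]
    )
    return hits / len(true_labels)


# Hypothetical ranked suggestions for four transactions.
ranked = [
    ["Travel: Client Billable", "General: Travel", "Marketing Spend"],
    ["General: Travel", "Travel: Client Billable", "Marketing Spend"],
    ["Marketing Spend", "Research and Development", "General: Travel"],
    ["Marketing Spend", "General: Travel", "Travel: Client Billable"],
]
truth = [
    "Travel: Client Billable",
    "Travel: Client Billable",
    "Marketing Spend",
    "Research and Development",
]
top1 = accuracy_at_k(ranked, truth, k=1)
top3 = accuracy_at_k(ranked, truth, k=3)
print(top1, top3)
```

The second transaction illustrates the cold start difficulty: the model ranks the semantically similar "General: Travel" first, so it misses at K=1 but hits at K=3.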
Gemini 3 models led on both accuracy and latency, with Anthropic models matching their accuracy at higher latency. Switching from top-1 to top-3 accuracy produced a substantial jump for every model, suggesting that the correct category is frequently in the model’s consideration set even when it isn’t the first choice. This insight has UX implications: surfacing the top three predictions to users could significantly improve the experience even when top-1 accuracy is imperfect.
Partnerships with banks, lenders, and other financial institutions enable many of Ramp’s core offerings, requiring adherence to each partner’s compliance requirements and precise determination of customer eligibility for different products. Ramp built an extensible framework using LLMs to determine which restrictions apply to each customer and partner combination.
The evaluation uses a custom Compliance Adherence Score factoring in three real-world requirements: correctly identifying restricted businesses (avoiding false negatives, the most critical component since a miss means a compliance failure), covering the full breadth of restriction categories (protecting against blind spots in rare categories), and avoiding false positives (preventing excessive operational burden from over-flagging).
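The source does not give the formula or weights behind the Compliance Adherence Score, so the sketch below only computes the three underlying components separately. The definition of category coverage used here (the share of restriction categories in which at least one restricted business was caught) is an assumption, and the category names and cases are hypothetical.

```python
from typing import Optional


def compliance_components(cases: list[tuple[Optional[str], bool]]) -> dict[str, float]:
    """cases: (true_category, flagged); true_category is None if unrestricted."""
    restricted = [(c, f) for c, f in cases if c is not None]
    unrestricted = [f for c, f in cases if c is None]
    misses = sum(1 for _, f in restricted if not f)         # false negatives
    false_positives = sum(1 for f in unrestricted if f)     # over-flagging
    categories = {c for c, _ in restricted}
    caught = {c for c, f in restricted if f}
    return {
        "miss_rate": misses / len(restricted) if restricted else 0.0,
        "false_positive_rate": (
            false_positives / len(unrestricted) if unrestricted else 0.0
        ),
        # Assumed definition: categories with at least one correct detection.
        "category_coverage": len(caught) / len(categories) if categories else 1.0,
    }


# Hypothetical evaluation cases.
cases = [
    ("gambling", True),
    ("gambling", False),
    ("firearms", True),
    ("crypto", False),
    (None, False),
    (None, True),
    (None, False),
    (None, False),
]
components = compliance_components(cases)
print(components)
```

Keeping the components separate makes it visible when a model looks strong on the miss rate versus false positive tradeoff but has a blind spot in a rare category, as with "crypto" in this toy data.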
The Detection Error Tradeoff (DET) plot showed that the Pareto frontier ran through open-weight models (Minimax 2.5 and GLM 5) at the low false-positive end and Anthropic models at the low false-negative end. Google models sat just off the frontier—Gemini 3 Flash nearly matched Opus 4.6’s miss rate but at over three times the false positive cost. OpenAI’s GPT-5.4 showed improved false-positive discipline matching Sonnet 4.6’s rate, but its higher miss rate kept it off the frontier.
However, when incorporating the third component of the score (category coverage across restriction types), Anthropic models extended their lead, followed closely by Google whose strong category coverage boosted their ranking. The open-weight models fell significantly behind in coverage despite their strong performance on the false positive versus false negative tradeoff. This illustrates a sophisticated LLMOps principle: multi-dimensional evaluation that considers not just aggregate accuracy but performance distribution across different categories or edge cases is essential for production systems, particularly in compliance contexts where missing rare but critical cases can have serious consequences.
Smart routing uses an LLM agent to choose the most appropriate fund for routing transactions, augmented with information about the transaction, the funds, and previous examples of routing failures for that business if they exist. The evaluation challenge is that collecting reliable human feedback is difficult because incorrect routing is often not corrected by users.
Ramp’s solution was to hand-curate a dataset of non-trivial routing cases with reliable ground-truth funds. The metric is override rate (lower is better): the percentage of cases where the model routes incorrectly and a user would have to override the decision. Results showed fairly consistent performance across models, with no significant improvement from additional reasoning.
Given this relatively flat performance landscape, Ramp prioritized cost and latency optimization and focused on augmenting the model with proper tools. An important finding was that grounding the smart routing agent with web search produced considerable performance improvements across all models. This highlights a crucial LLMOps insight: when model selection yields minimal differentiation, effort should shift to augmentation techniques (tool use, retrieval, better context provision) that can improve all models rather than continuing to search for a better base model.
A central theme throughout this case study is that there is no single “best” model for all tasks. While Claude Opus 4.6 “holds the crown for overall intelligence,” different models have application-specific strengths. Gemini models dominate visual tasks (invoice OCR, financial statement extraction), while Anthropic models excel at compliance tasks requiring high precision and category coverage. When cost and latency enter the picture, the decision becomes even more nuanced.
The systematic use of Pareto frontier analysis enables Ramp to make informed tradeoffs. For invoice OCR, Gemini 3 Flash offers nearly the same accuracy as much more expensive models at a fraction of the cost, making it the clear choice. For financial statement extraction, the choice depends on whether precision or limiting large errors is more important. For compliance, the choice depends on the relative cost of false positives versus false negatives.
This sophisticated approach to model selection represents mature LLMOps practice where evaluation frameworks enable data-driven decisions rather than relying on general model leaderboards or provider marketing claims. The case study demonstrates healthy skepticism—while documenting impressive results, Ramp frames findings in terms of specific task performance rather than making sweeping claims about model superiority.
An interesting finding across multiple tasks is the impact of inference-time compute (allowing models more “reasoning” time). For invoice OCR, almost all models saw performance increases with higher reasoning efforts. This suggests that for complex extraction tasks with derivable patterns, allowing models to “think longer” provides meaningful accuracy improvements, though at higher cost and latency.
However, this pattern didn’t hold across all tasks. For smart routing, increasing reasoning didn’t improve performance, suggesting that the bottleneck was not reasoning depth but rather access to the right information and tools. This illustrates an important LLMOps principle: inference-time compute is not universally beneficial, and the decision to use reasoning models should be based on task-specific evaluation rather than assumptions about what types of problems benefit from more reasoning.
Ramp emphasizes that their benchmarking framework is crucial not just for current model selection but for enabling rapid iteration while maintaining reliability. The framework allows them to quickly evaluate new models as they’re released, probe their strengths and weaknesses, and make customer-experience-first decisions when choosing between them.
This represents a mature LLMOps philosophy where evaluation infrastructure is treated as a first-class concern enabling development velocity rather than a post-hoc validation step. The company explicitly states they “ship fast, but emphasize trust and reliability” and “can’t afford to break things,” with continuous testing and benchmarks ensuring “speed of iteration comes with no compromises.”
Looking forward, Ramp plans to make the framework more robust and expand from simply benchmarking current performance to “hill climbing on capabilities”—suggesting an evolution toward using the evaluation framework not just for model selection but for systematic prompt engineering, few-shot example selection, and other optimization techniques.
While this case study provides valuable insights into production LLMOps practices, several caveats should be noted. First, the benchmarks are necessarily specific to Ramp’s use cases and customer base—results may not generalize to other companies’ financial workflows or document types. The strong performance of Gemini models on visual tasks, for instance, may reflect characteristics of the specific types of invoices and financial statements in Ramp’s dataset.
Second, the case study is published by Ramp itself and naturally emphasizes its successes. It offers little detailed discussion of failures, edge cases, or ongoing challenges. For instance, what happens when models fail catastrophically? How are errors caught before reaching customers? What human-in-the-loop mechanisms exist? These production reliability questions are only briefly touched upon.
Third, model performance is measured at a single point in time, but production LLMOps must also consider model stability and consistency over time, especially as providers update their models. The case study doesn’t address monitoring for model drift or handling model version changes.
Finally, while the cost analysis is valuable, it focuses on raw inference costs without discussing the total cost of ownership including engineering time for integration, monitoring, debugging, and maintenance. The true LLMOps cost comparison between models may differ from the pure inference cost comparison presented.