Using RL to Make a 4B Parameter Model Outperform a 235B Parameter Model on Financial Analysis Tool Use

Snorkel 2026

Snorkel, in partnership with UC Berkeley's RLLM team, demonstrated that a 4 billion parameter model fine-tuned with reinforcement learning could outperform a 235 billion parameter reasoning model on financial analysis tool use tasks. The problem being addressed was that enterprises often default to using larger, more expensive models to improve performance in production settings, particularly for financial analysis tasks requiring tool use. By generating a high-quality expert-curated dataset and applying GRPO reinforcement learning for under $500 in a 21-hour training run, they achieved a doubling of pass-at-one performance. The key insight was that the failure mode wasn't reasoning capability but rather tool discipline—teaching the smaller model to properly inspect available tools, query schemas, and self-correct errors led to improvements that generalized across both single-table and multi-table query tasks.

Industry

Finance

Overview

This case study presents research conducted by Snorkel in partnership with UC Berkeley’s RLLM team, demonstrating how reinforcement learning with high-quality data can enable smaller models to outperform significantly larger models in production use cases. The specific focus was on financial analysis tool use, where a 4 billion parameter model was fine-tuned to exceed the performance of a 235 billion parameter reasoning model. The research addresses a common enterprise challenge: the tendency to solve performance problems by simply deploying larger models, which increases inference costs, complicates deployment, and raises security concerns, particularly for sensitive domains like finance and healthcare.

The work exemplifies several critical LLMOps principles: cost-effective model optimization, the importance of identifying specific failure modes rather than assuming more reasoning capability is always the answer, and the value of expert-curated datasets in achieving production-ready performance with smaller, more deployable models.

The Production Challenge

Snorkel positions itself as the “Frontier AI Data Lab,” focusing on providing datasets with assured quality levels through expert-in-the-loop processes. The company works with PhD-level experts and industry professionals to generate training data, emphasizing data quality as fundamental to their approach. In this research, they tackled a problem prevalent in enterprise LLM deployments: when initial model performance proves insufficient for production requirements, organizations typically respond by deploying larger models with greater reasoning capabilities, assuming this will automatically translate to better task performance.

This approach creates multiple challenges for production deployment. Larger models incur significantly higher inference costs, require more computational resources, complicate on-premise deployment scenarios, and increase concerns around data control and security—particularly critical in regulated industries like finance and healthcare. The research team hypothesized that for specific use cases, particularly those involving tool use rather than pure reasoning, smaller models could achieve comparable or superior performance if trained appropriately with reinforcement learning and high-quality data.

Technical Approach and Methodology

The research team’s approach centered on three key components: generating a high-quality expert-curated dataset, building a specialized evaluation environment, and applying reinforcement learning with Group Relative Policy Optimization (GRPO). The philosophy underlying this work is that behavior modification—teaching a model how to use tools effectively—is fundamentally different from knowledge acquisition and is well-suited to RL techniques.

The dataset generation process leveraged Snorkel’s platform for working with domain experts. They engaged financial analysis experts to create tasks and questions appropriate to the domain. Critically, the process included verification steps to ensure that defined tasks were appropriate, that queries could actually return results, and that verifiable correct answers existed. This attention to data quality represents a core LLMOps principle: garbage in, garbage out applies especially forcefully when training production models.

The team built a custom environment called FinQA specifically for financial analysis tool use evaluation. This environment is fully self-contained with no external dependencies, allowing it to be deployed anywhere without concerns about remote data access—a crucial consideration for financial services deployments. The FinQA environment provides a specific set of tools that models can interact with, including capabilities to discover available tables, inspect schemas, and execute SQL queries. The environment includes two benchmark sets: the standard FinQA with 290 samples, and FinQA Reasoning with 79 more challenging samples requiring multi-table queries.
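The three tool capabilities described above can be sketched as a small, self-contained class. This is an illustration only, not the published FinQA environment: the SQLite backing store, the `ToolEnv` class, the `sql_query` method name, the table layout, and the placeholder figures are all assumptions. Only the tool names `get_table_names` and `get_table_info` appear in the source.

```python
import sqlite3


class ToolEnv:
    """Toy stand-in for a self-contained tool environment (hypothetical)."""

    def __init__(self):
        # In-memory database: no network access or external files required.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE ads_revenue (fiscal_year INTEGER, youtube_ads REAL)"
        )
        self.conn.executemany(
            "INSERT INTO ads_revenue VALUES (?, ?)",
            [(2023, 100.0), (2024, 110.0)],  # placeholder figures, not real data
        )

    def get_table_names(self):
        """Discover which tables exist."""
        rows = self.conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return [name for (name,) in rows]

    def get_table_info(self, table):
        """Return (column name, declared type) pairs for one table."""
        # Check the name against known tables before interpolating it.
        if table not in self.get_table_names():
            return {"error": f"unknown table: {table}"}
        rows = self.conn.execute(f"PRAGMA table_info({table})").fetchall()
        return {"columns": [(row[1], row[2]) for row in rows]}

    def sql_query(self, query):
        """Run SQL; errors are returned as data so the caller can react."""
        try:
            return {"rows": self.conn.execute(query).fetchall()}
        except sqlite3.Error as exc:
            return {"error": str(exc)}
```

Returning errors as ordinary tool output, rather than raising, is one plausible design choice here: it gives the model something concrete to observe and correct.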

The environment has been published and made accessible through multiple channels: it’s available on PrimeIntellect’s infrastructure, in the OpenEnv repository on GitHub, and hosted in Hugging Face spaces through a collaboration between PyTorch and Hugging Face teams. This broad accessibility reflects good LLMOps practice around reproducibility and enabling others to validate and build upon research findings.

For the actual training, the team applied GRPO to a 4 billion parameter base model. The training run completed in 21 hours at a cost of under $500. This cost-effectiveness is significant from an LLMOps perspective: it shows that substantial performance improvements through RL do not require prohibitively expensive training runs, which puts such techniques within reach of enterprise teams with limited ML budgets.
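For readers unfamiliar with GRPO, its central step is to sample a group of responses for the same prompt and score each one relative to the rest of the group, so no separate value model is needed. The sketch below shows only that normalization step. The binary rewards, the use of the population standard deviation, and the `eps` constant are assumptions for illustration; the source does not describe the team's reward function or hyperparameters.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one question, rewarded 1.0 if the answer was correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is one reason task difficulty and data quality matter for this style of training.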

Comparative Performance Analysis

The research included a detailed comparison between the fine-tuned 4 billion parameter model and a 235 billion parameter quantized reasoning model. The comparison revealed fundamental differences in behavior that illuminate what “reasoning capability” actually means in production contexts. When asked a straightforward financial analysis question (“What is the year-over-year growth rate of YouTube ads revenue from ’23 to ’24?”), the larger reasoning model exhibited poor tool discipline despite its sophisticated reasoning abilities.

Specifically, the 235 billion parameter model attempted to query a non-existent table without first inspecting what tables were available in the environment. When that failed, it tried again with another guess, also failing. Having received no data from either attempt, the model then hallucinated an answer rather than taking corrective action or indicating it couldn’t complete the task. This behavior represents a critical failure mode for production systems: unreliable tool use leading to confident but incorrect outputs.

In contrast, the fine-tuned 4 billion parameter model demonstrated systematic tool use. It first called the get_table_names tool to discover what tables were available. It then used get_table_info to inspect the schema and understand what columns it could query. When it initially made a query that resulted in an error—requesting a “revenue” column that didn’t exist—the model observed the error and self-corrected, identifying and querying the correct column name. This error correction capability, learned through the RL process, represents exactly the kind of robust behavior required for production deployments.
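The discover, inspect, query, and self-correct sequence described above can be illustrated with a scripted episode. This is a hand-written stand-in for the behavior, not the trained model or the real environment: the SQLite table, the placeholder figures, the `run_episode` helper, and the rule used to pick a replacement column are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ads_revenue (fiscal_year INTEGER, youtube_ads REAL)")
conn.executemany(
    "INSERT INTO ads_revenue VALUES (?, ?)",
    [(2023, 100.0), (2024, 110.0)],  # placeholder figures, not real data
)


def run_episode(conn, guessed_column):
    """Scripted episode: list tables, read the schema, query, retry on error."""
    trace = []
    # Step 1: discover available tables instead of guessing a table name.
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    trace.append(("get_table_names", tables))
    table = tables[0]
    # Step 2: inspect the schema to learn the real column names.
    columns = [r[1] for r in conn.execute(f"PRAGMA table_info({table})")]
    trace.append(("get_table_info", columns))
    # Step 3: query; on an error, fall back to a column the schema lists.
    column = guessed_column
    for _ in range(2):
        try:
            rows = conn.execute(
                f"SELECT fiscal_year, {column} FROM {table} ORDER BY fiscal_year"
            ).fetchall()
            trace.append(("sql_query", "ok"))
            break
        except sqlite3.OperationalError as exc:
            trace.append(("sql_query", f"error: {exc}"))
            column = next(c for c in columns if c != "fiscal_year")
    else:
        return trace, None  # no data retrieved: decline rather than guess
    (_, previous), (_, current) = rows
    return trace, (current - previous) / previous


trace, growth = run_episode(conn, "revenue")  # "revenue" does not exist
```

The final branch matters as much as the retry: when no query succeeds, the episode returns no answer instead of inventing one, which is the opposite of the hallucination failure described for the larger model.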

Results and Key Findings

The quantitative results supported the approach. The fine-tuned 4 billion parameter model achieved roughly double the pass-at-one performance of the same model before RL fine-tuning. More importantly, it outperformed the 235 billion parameter reasoning model on these financial analysis tasks, suggesting that the training approach can matter more than raw model scale for specific use cases.

An interesting finding emerged from experiments with different training data compositions. The team tested three approaches: training only on single-table questions, training on a mix of single-table and multi-table questions, and curriculum learning that progressively introduced multi-table complexity. Surprisingly, training exclusively on single-table questions yielded the best overall performance. Even more remarkably, this single-table training regime produced similar performance improvements on the harder FinQA Reasoning benchmark containing multi-table questions. Performance on that benchmark jumped from 13.9% to 26.6%—again roughly a doubling.
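The reported jump can be checked with simple arithmetic. The percentages and the 79-sample benchmark size come from the text; the implied counts of solved questions are an inference, and they assume one scored attempt per question (if pass-at-one was averaged over several samples per question, the counts need not be whole numbers).

```python
n_samples = 79                # size of the FinQA Reasoning benchmark
before, after = 0.139, 0.266  # reported pass-at-one before and after RL

ratio = after / before                       # relative improvement, about 1.91x
implied_before = round(before * n_samples)   # 10.98 rounds to 11 questions
implied_after = round(after * n_samples)     # 21.01 rounds to 21 questions

print(f"improvement: {ratio:.2f}x, "
      f"implied solved: {implied_before} -> {implied_after} of {n_samples}")
```

With a benchmark this small, the difference amounts to roughly ten additional questions solved, which is worth keeping in mind when reading the percentage gains.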

This generalization finding has important implications for LLMOps practice. It suggests that the core skill being learned—tool discipline and systematic interaction with the environment—transfers across task complexities. The model wasn’t learning specific reasoning patterns for multi-table joins; it was learning to properly discover, inspect, and utilize available tools, and to self-correct when errors occurred. This tool discipline proved to be the critical failure mode that needed addressing, not reasoning capability.

LLMOps Insights and Production Considerations

From an LLMOps perspective, several crucial insights emerge from this work. First, the research demonstrates the importance of identifying specific failure modes rather than assuming more capability always means better performance. The 235 billion parameter model failed not because it lacked reasoning ability but because it lacked tool discipline. Throwing more reasoning capability at the problem wouldn’t fix the fundamental behavior issue.

Second, the work validates that reinforcement learning is becoming increasingly accessible for production teams. The sub-$500, 21-hour training run represents a feasible investment for most organizations working with models in production. This accessibility means enterprises don’t need to accept poor performance from smaller models or automatically escalate to larger, more expensive alternatives.

Third, the emphasis on expert-curated data quality reflects a maturing understanding in the field. Snorkel’s approach of bringing domain experts into the loop, verifying that tasks are well-defined and answerable, and ensuring high data quality represents best practices for training production models. The downstream performance gains validate this investment in data quality.

Fourth, the fully self-contained evaluation environment addresses real production concerns around data sovereignty and deployment flexibility. For financial services and healthcare applications, the ability to deploy entirely on-premise without external dependencies can be a hard requirement. The FinQA environment design demonstrates attention to these practical constraints.

The use of rubrics for evaluation, mentioned toward the end of the presentation, represents another important LLMOps practice. Rather than simply measuring binary success/failure, breaking down model responses into multiple dimensions through rubrics enables teams to identify specific areas needing improvement. This richer feedback can guide decisions about what training data to generate or what behaviors to target with RL. The single-value reward signal used by GRPO during actual training doesn’t preclude this richer analysis during evaluation and diagnosis.
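A rubric of this kind might look like the sketch below, which scores one tool-use trace along several dimensions instead of a single pass/fail bit. The dimensions, their names, and the trace format (a list of `(tool_name, succeeded)` pairs) are hypothetical; the source does not describe the rubric that was actually used.

```python
def score_trace(tool_calls, answer_correct):
    """Score one episode on several behavioral dimensions (hypothetical rubric)."""
    names = [name for name, _ in tool_calls]
    # Index of the first SQL query, or the end of the trace if there is none.
    first_query = names.index("sql_query") if "sql_query" in names else len(names)
    failures = [i for i, (_, ok) in enumerate(tool_calls) if not ok]
    # Every failed call must be followed by a later successful query.
    recovered = all(
        any(ok and name == "sql_query" for name, ok in tool_calls[i + 1:])
        for i in failures
    )
    return {
        "listed_tables_first": bool(names) and names[0] == "get_table_names",
        "inspected_schema_before_query": "get_table_info" in names[:first_query],
        "recovered_from_errors": recovered,
        "answer_correct": answer_correct,
    }


disciplined = score_trace(
    [("get_table_names", True), ("get_table_info", True),
     ("sql_query", False), ("sql_query", True)],
    answer_correct=True,
)
undisciplined = score_trace(
    [("sql_query", False), ("sql_query", False)],
    answer_correct=False,
)
```

Aggregating each dimension across a benchmark would show whether failures cluster around skipped discovery, skipped schema inspection, or missing error recovery, which in turn suggests what kind of training data to generate next.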

Deployment and Scaling Considerations

The case study addresses several practical deployment considerations relevant to enterprise LLMOps. The emphasis on smaller models directly affects deployment feasibility: a 4 billion parameter model can run on less expensive hardware, with lower latency and reduced infrastructure costs, compared to a model nearly 60x larger (235 billion versus 4 billion parameters). For organizations concerned about costs at scale, this matters enormously when serving potentially millions of requests.
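A rough sense of the hardware gap comes from estimating the memory needed just to hold the weights. The parameter counts are from the text; the precisions shown (16-bit and 4-bit) are assumptions, since the source does not state how either model was stored or how the larger one was quantized. The estimate ignores the KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for model weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


small_16bit = weight_memory_gb(4e9, 16)    # 4B model at 16 bits per weight
large_16bit = weight_memory_gb(235e9, 16)  # 235B model at 16 bits per weight
large_4bit = weight_memory_gb(235e9, 4)    # 235B model, assumed 4-bit quantization
size_ratio = 235e9 / 4e9                   # parameter-count ratio

print(f"4B @ 16-bit: {small_16bit:.1f} GB")
print(f"235B @ 16-bit: {large_16bit:.1f} GB, @ 4-bit: {large_4bit:.1f} GB")
print(f"parameter ratio: {size_ratio:.2f}x")
```

Under these assumptions the smaller model's weights fit on a single commodity accelerator, while the larger model needs multiple devices even when aggressively quantized.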

The on-premise deployment capability, enabled by the self-contained environment design and smaller model size, addresses regulatory and security requirements common in financial services. Organizations can maintain complete control over data without it leaving their infrastructure, while still achieving strong performance on specialized tasks.

The cost-effectiveness of the RL training loop also has implications for iterative improvement in production. If performance issues emerge or requirements change, teams can relatively inexpensively run new training loops rather than being locked into whatever a large pretrained model provides out of the box. This enables more agile adaptation to production needs.

Critical Assessment

While the results are impressive, several caveats warrant consideration. The task domain is relatively constrained—financial analysis with structured tools—which may be more amenable to this approach than more open-ended tasks. The presentation doesn’t detail how well the approach scales to environments with many more tools or to tasks requiring longer chains of tool interactions.

The comparison to the 235 billion parameter model, while striking, involves a quantized model, which may not represent the best possible performance at that scale. A fairer comparison would also include the non-quantized version or other large models specifically fine-tuned for tool use. That said, many organizations do deploy quantized large models for cost reasons, so the comparison remains relevant to real-world deployment decisions.

The presentation emphasizes Snorkel’s expert-in-the-loop approach but provides limited detail on the actual cost and time required for expert curation of the training dataset. For organizations considering this approach, understanding the full cost picture including data generation would be important. The $500 figure covers training but not the preceding data work.

The environment’s self-contained nature is a strength for deployment but might not reflect the complexity of production financial analysis systems that often need to integrate with multiple external data sources, APIs, and legacy systems. The generalization of these results to messier, more complex production environments remains an open question.

Broader Implications

This research contributes to an important ongoing conversation in the LLMOps community about the relationship between model scale, task performance, and deployment practicality. The “Terence Tao effect” concept—the idea that you don’t need a universally brilliant mathematician to perform specific financial calculations—provides a useful mental model for thinking about when smaller, specialized models might suffice.

The work also validates the maturation of reinforcement learning tooling for LLM applications. As RL becomes more accessible and cost-effective, it becomes a viable optimization strategy for production teams, not just research labs. The publication of the environment and integration with popular platforms like Hugging Face and OpenEnv lowers barriers to adoption.

Finally, the emphasis on behavior modification versus knowledge acquisition provides a useful framework for deciding when RL might be appropriate. Tasks requiring models to follow specific interaction patterns, use tools systematically, or exhibit particular behavioral characteristics may be especially amenable to RL approaches, even with smaller base models.

The partnership model between an industry-focused company like Snorkel and an academic research group like UC Berkeley’s RLLM team also represents a productive approach to advancing the state of practice in LLMOps, combining academic rigor with attention to practical deployment concerns.
