ZenML

Agentic Workflow Automation for Financial Operations

Ramp 2026
View original source

Ramp, a finance automation platform serving over 50,000 customers, built a comprehensive suite of AI agents to automate manual financial workflows including expense policy enforcement, accounting classification, and invoice processing. The company evolved from building hundreds of isolated agents to consolidating around a single agent framework with thousands of skills, unified through a conversational interface called Omnichat. Their Policy Agent product, which uses LLMs to interpret and enforce expense policies written in natural language, demonstrates significant production deployment challenges and solutions including iterative development starting with simple use cases, extensive evaluation frameworks, human-in-the-loop labeling sessions, and careful context engineering. Additionally, Ramp built an internal coding agent called Ramp Inspect that now accounts for over 50% of production PRs merged weekly, illustrating how AI infrastructure investments enable broader organizational productivity gains.

Industry

Finance

Technologies

Overview

Ramp is a finance automation platform that has deployed production LLM systems across their entire product suite to automate time-consuming financial operations tasks. The company serves over 50,000 customers and has systematically built AI capabilities that span from simple transaction classification to complex multi-step agentic workflows. The presentation covers both customer-facing AI products, particularly their Policy Agent for expense management, and internal tooling including Ramp Inspect, a coding agent that handles over 50% of their production pull requests.

Architectural Evolution and Strategic Shift

Ramp’s AI journey represents a significant architectural evolution. Initially, the company pursued a decentralized approach where individual teams built their own agents, resulting in hundreds of discrete agents with inconsistent implementations and potentially four different ways of doing the same thing. This experimentation phase, while valuable for learning, created technical debt and operational complexity.

Following major model releases in early 2026, Ramp underwent a fundamental paradigm shift in their AI architecture. Rather than maintaining hundreds of specialized agents, they pivoted to building a single unified agent with thousands of skills. This consolidation required rethinking their entire stack and represents a move from discrete task automation to more general-purpose agentic systems. The company emphasizes that modern AI-native software must handle all five components of a process: events, prompt instructions, guardrails like policies, context, and tools or APIs. Traditional software focused only on the last two, but the new paradigm requires managing the entire workflow.

Unified Conversational Interface: Omnichat

A key infrastructure decision was consolidating all conversational interactions into a single interface called Omnichat. Prior to this consolidation, Ramp had approximately five different conversational UX implementations across their product. Omnichat is described as omnipresent, deployed to every surface of the product, and designed to work harmoniously with traditional UI elements like tables and buttons since users don’t always want conversational interfaces for every interaction.

Omnichat demonstrates sophisticated orchestration capabilities. For example, when a user requests to onboard a new employee, the system can resolve the request to an employee ID, look up corporate structure through HRIS integrations, identify previously created agentic workflows like a new hire playbook, and ask for confirmation before execution. This capability is powered by a lightweight in-house agent framework that provides orchestration with tools that engineers can quickly build. Notably, one product manager reportedly vibe-coded about 20 tools, indicating that tool creation has been democratized beyond engineering teams.

The system also supports workflow definitions called “playbooks” where users can describe multi-step processes in natural language. For example, an employee onboarding playbook might specify giving the employee a card, ensuring receipt collection, sending a congratulations message on Slack, and scheduling a two-week check-in. These playbooks compile into runnable deterministic workflows that agents can execute, combining the flexibility of natural language description with the reliability of structured execution.

Policy Agent: Deep Dive on Production LLM Deployment

The Policy Agent represents Ramp’s most detailed case study in production LLM deployment. This agent enforces company expense policies in real-time as transactions occur, addressing a significant pain point for finance teams who traditionally spend considerable time reviewing receipts and transactions manually.

Problem Context and User Pain

Finance teams at large companies process hundreds or thousands of receipts daily, making manual review error-prone and time-consuming. A Fortune 500 customer approached Ramp with extensive lists of approval and rejection rules, creating an opportunity to rethink policy enforcement. Rather than adding more deterministic rules to their product, Ramp recognized that natural language expense policies could directly serve as the programming logic for the agent, embodying the principle that “English is the new programming language.”

Architectural Iterations

The Policy Agent went through several distinct architectural iterations, each adding complexity while trading off explainability for capability:

Initial Simple Architecture: The first version used a classic retrieve-and-generate pattern. When an expense came in, the system would retrieve relevant context, pass it through well-defined LLM calls to determine if the expense was in policy, generate reasoning for the decision, and produce user-facing output. This approach was straightforward and relatively traceable.

Conditional Classification: The team learned that different expense types required different handling, so they added expense classification (travel, meals, entertainment, etc.) and conditional prompting based on category. This allowed more targeted context retrieval and specialized reasoning for each expense type. They also began giving the agent tools so it could autonomously decide when it needed additional information like flight details or employee level.

Full Agentic Workflow: The final architecture evolved into a complete agentic system with complex, shared tools that can read across the entire Ramp platform. The agent gained write capabilities, creating decisions, reasoning, and auto-approvals on behalf of users. Most significantly, it operates in a loop rather than a single pass, making it fundamentally more autonomous but also more of a “black box” where the exact execution path isn’t predetermined.

Starting Small and Iterating

A critical cultural and technical learning was the importance of starting with constrained problems rather than attempting to automate all of finance immediately. Ramp began with an extremely narrow use case: coffee purchases with colleagues. These transactions are single-person, low-dollar-amount, and low-risk according to finance teams, making them ideal for building confidence in the system.

This narrow focus served multiple purposes. First, it allowed the team to learn what context was actually necessary rather than trying to anticipate everything upfront. Second, it provided early validation that the approach could work, building internal confidence and enabling customer acquisition as design partners. Third, it kept the system simple enough that iteration cycles remained fast.

The team emphasizes that even though the problem sounds simple (“Is this in policy or not?”), complexity grows rapidly when generalizing across different businesses and policies. Starting simple meant they could layer complexity incrementally based on actual learnings rather than speculation.

Context Engineering as Key Success Factor

One of the most important learnings was that model failures often stemmed from inadequate context rather than model limitations. Early production deployment revealed critical missing context that wasn’t obvious during initial design. For example, employee role and title proved essential because expense policies often vary by level, with C-suite executives potentially having higher limits or different rules like first-class flight permissions.

The team systematically added context based on production learnings: extracting more information from receipt images, pulling in HRIS fields already available in Ramp, accessing chart of accounts and ERP data for accounting classification, and retrieving historical transaction patterns. This iterative context discovery was only possible because they deployed early with a simple system and learned from real usage.

Ground Truth and Evaluation Framework

Ramp established rigorous evaluation practices that proved essential for rapid iteration and quality assurance. Central to their approach was a weekly cross-functional labeling session where team members across product, engineering, design, and other functions collaboratively labeled expenses as correctly approved or rejected.

This practice yielded two critical outcomes. First, it created a ground truth dataset for automated evaluation. Second, it ensured alignment across all functions—when the agent made an error or lacked context, everyone understood the issue simultaneously, reducing communication overhead and enabling prioritization consensus.

Initially, gathering cross-functional teams weekly and assigning homework to label 100 data points seemed expensive and sometimes tedious, with participants not always completing their assignments. To streamline this, Ramp evaluated third-party labeling tools but found them either too specific or too general for their use case. Instead, they built their own custom labeling tool using Claude and Streamlit, essentially vibe-coding the entire solution.

This custom tool proved remarkably effective because it was low maintenance, low risk, isolated in the codebase for easy fixes, deployed instantly, and could be personalized by non-engineers through vibe coding or Claude Code. With Claude Opus improvements, they expect this approach to become even more effective. The lesson here is that for specific internal use cases, quickly building custom tooling can be cheaper and more effective than integrating third-party solutions.

Online and Offline Evaluation Strategy

Ramp distinguishes between offline and online evaluation. Offline evaluation uses historical ground truth datasets to test changes before deployment. They started with just five critical examples they knew they couldn’t fail, gradually expanding the dataset. The key principles were making evals easy to run (anyone could execute them), ensuring results were immediately understandable (showing what the model did, what was good, what was bad), and integrating them into CI/CD so engineers could safely merge code.

A crucial insight was that seemingly beneficial changes like adding context or tools often had unexpected negative consequences such as context rot, confusing tool instructions, or conflicting docstrings. Comprehensive evaluation caught these regressions before production deployment.

Online evaluation measured system health during production operation. For Policy Agent, they tracked metrics like the rate of “unsure” decisions, which indicated insufficient information. While online evaluation can be more ambiguous than offline, establishing even simple health metrics provided valuable leading indicators of system performance.

The evaluation infrastructure also enabled confident model switching. When new models like Claude Opus 4.6 or GPT-5.3 are released, comprehensive evals allow teams to benchmark whether the new model actually improves performance for their specific use case or potentially introduces regressions, enabling data-driven model selection.

Human-in-the-Loop and Autonomy Graduation

Ramp’s rollout strategy carefully balanced autonomy with risk management. They initially targeted Fortune 500 enterprise customers with the most volume and therefore the most value at stake. However, the initial deployment provided only suggestions, not autonomous actions, framing the agent as offering recommendations rather than making final decisions.

This conservative approach built trust over time. Eventually, customers returned requesting to move from suggestions to auto-approvals, specifically for low-stakes transactions like anything under $20 where the agent was mostly correct. Ramp responded by providing an “autonomy slider” that customers could configure themselves, allowing them to control what types of transactions received automatic approval versus human review.

This graduated autonomy approach respects that trust must be earned through demonstrated reliability rather than promised capability. It also acknowledges that different organizations have different risk tolerances and should control their own automation boundaries.

User Correctness and Ground Truth Definition

An unexpected learning was that users aren’t always correct in their decisions. When the team initially assumed user decisions represented ground truth, they discovered that users sometimes didn’t know the expense policy, trusted their employees without verification, or made lazy decisions (particularly on weekends). Finance teams would later flag these as incorrect, revealing that user behavior couldn’t serve as the ground truth.

This forced Ramp to define their own correctness criteria independent of user behavior, which is where the cross-functional labeling sessions became essential. By establishing explicit ground truth through collaborative labeling, they could train and evaluate the agent against policy correctness rather than user behavior.

Policy as Code: Empowering Finance Teams

A powerful outcome of the Policy Agent approach is that it treats the expense policy document itself as executable logic. Finance teams can modify their policy documents to change agent behavior, creating a tight feedback loop. Initially, finance teams found this concept somewhat intimidating—policy documents are typically carefully controlled and modified through extensive approval processes.

However, once teams experienced the immediate feedback loop where policy changes directly affected agent decisions, they became enthusiastic about this capability. This represents a democratization of system configuration, allowing domain experts to directly control system behavior through natural language rather than requiring engineering intervention.

Production Architecture and Tool Orchestration

The Policy Agent operates through a sophisticated real-time workflow triggered by card swipes. It performs real-time policy review directly in the software, enforcing company spending requirements and making it safe to distribute cards to all employees. The agent hands off to an accounting coding agent that classifies transactions, applies back-office rules, and matches transactions to the appropriate general ledger accounts, solving a common problem where employees don’t know how to categorize transactions.

The system can auto-approve compliant transactions or involve humans in the loop for material transactions or out-of-policy spending. This multi-agent orchestration demonstrates how specialized agents can collaborate within a larger workflow while maintaining clear boundaries and handoff protocols.

Applied AI Infrastructure

Beyond product features, Ramp has invested heavily in internal AI infrastructure to provide leverage for engineering and cross-functional teams. This infrastructure serves as a force multiplier, abstracting complexity so product teams can focus on delivering customer value.

Applied AI Service: Centralized LLM Orchestration

The core of Ramp’s AI infrastructure is their Applied AI Service, which functions similarly to an LLM proxy like LiteLLM but with three critical extensions:

Structured Output and Consistent APIs: The service provides unified SDKs and APIs across different model providers (OpenAI, Anthropic, Google, etc.). Given how rapidly provider APIs change, Ramp didn’t want downstream product teams dealing with these differences. Teams can switch from GPT-5.3 to Claude Opus or Gemini 3 Pro with a simple config change, enabling rapid experimentation with semantic similarity, code sandboxing, and structured outputs without rewriting integration code.

Batch Processing and Workflow Handling: The service handles batch processing complexity including rate limit management and routing between online and offline jobs (like Anthropic’s batch API). Product teams simply submit work without worrying about batching strategy, particularly valuable for evaluation and bulk document or data analysis.

Cost Tracing and Attribution: The infrastructure traces costs across teams and products, enabling identification of cost-performance tradeoff curves and highlighting teams building unsustainable solutions. This removes the burden of cost tracking from individual product teams while providing organizational visibility into AI spending patterns.

A significant benefit is the ability to keep all services at the frontier of model releases. When a new model launches, it’s a one-line config change that impacts every downstream SDK. Rather than dozens of teams learning new APIs and updating multiple call sites, a single config change in the Applied AI Service upgrades every team to the latest models that have been vetted and integrated.

Internal Tool Catalog: Shared Agent Capabilities

Ramp has built an extensive catalog of internal tools that agents can leverage, described as a “company internal toolbox.” Examples include getting policy snippets, PDM rates, recent transactions, and hundreds of other capabilities. As of the presentation, they have many hundreds of tools with expectations to reach thousands over time.

These tools are shared across all agents, not siloed to specific products. This means a team prototyping a new reimbursement agent can immediately access the full catalog of tools, APIs, and system integrations without building anything from scratch. The catalog makes gaps in capabilities visible—if a tool doesn’t exist for a use case, it’s obvious—and enables rapid prototyping of new agentic workflows.

The tool catalog also supports vibe coding and rapid development by non-engineers. By providing well-documented, reusable components, the infrastructure lowers the barrier for experimentation and feature development across the organization.

Ramp Inspect: Internal Coding Agent

Ramp’s most ambitious internal AI project is Ramp Inspect, a background coding agent that addresses the fragmentation of engineering context across multiple systems. Engineers work across production databases, logs in Datadog, alerting systems, Incident.io, Slack messages, Notion docs, and tool-specific knowledge. This context fragmentation limits the effectiveness of generic coding assistants like Claude Code or Codex.

Architecture and Capabilities

Ramp Inspect runs autonomously in the background, particularly useful when engineers are in meetings or working on other tasks. It operates in isolated Modal code sandboxes that spin up quickly and match the development environment engineers use locally. Sessions create GitHub branches and integrate with context documents, Datadog, read replicas for query writing, and product team documentation.

The agent follows a series of tasks to stay on track and provides full visibility into its reasoning and actions. Critically, Ramp designed it as a multiplayer-first experience, enabling collaboration between engineers and cross-functional partners like designers or product managers. This design choice had outsized impact, as it allows less technical team members to level up their prompting skills, provide feedback on failures, and participate in the development process.

Sessions can be initiated through a Kanban UI, an API, or Slack threads. When kicked off via Slack, the agent captures the full thread context, eliminating the need to re-prompt with prior conversation. This integration recognizes that work discussions happen in Slack and brings the agent into that existing workflow.

Ramp Inspect also runs a full VS Code environment using VNC inside Modal sandboxes, enabling Chrome dev tools and Model Context Protocol (MCP) integrations. This allows full-stack development work. The agent has access to Ramp’s 150,000+ test suite, so it knows when things break, can respond to CI failures in GitHub, and can patch fixes before notifying engineers that a PR is ready.

Production Impact and Open Source

The impact of Ramp Inspect has been substantial: over 50% of PRs merged to production each month go through the system. Ramp has embraced this metric with a public dashboard showing sessions by department, creating healthy competition and inspiration. While Engineering leads in usage, Product, Design, Risk, Legal, Corporate Finance, Marketing, and Customer Experience teams all use Ramp Inspect for tasks like copy changes, logic fixes, incident response, and bug fixes.

Ramp has open-sourced the blueprint for building Ramp Inspect at builders.ramp.com, and there’s an open-source implementation called Open Inspect available on GitHub. This transparency reflects confidence in their approach and a desire to advance the broader industry.

Custom Tooling Philosophy

The Ramp Inspect project reinforces a broader philosophy about custom tooling. Rather than spending weeks evaluating third-party tools that might be too specific or too general, Ramp built their own solution using Claude and Streamlit. The result is low maintenance, low risk (isolated in the codebase), instantly deployable, and customizable by non-engineers through vibe coding.

This represents a pragmatic approach to AI tooling: when off-the-shelf solutions don’t fit, modern LLMs make custom development fast and cost-effective, particularly for internal use cases where perfect polish isn’t required.

Cultural Transformation and Engineering Evolution

Beyond technical infrastructure, Ramp emphasizes that successful AI adoption requires fundamental cultural shifts in how teams operate and how engineering roles evolve.

Team Qualities and the Divergence

Ramp frames the cultural challenge through a thought experiment contrasting two team archetypes. High-performing teams care about impact, handle ambiguous problems, understand product/business/data holistically, adopt new tools readily, find creative solutions, and obsess over user experience. Less effective teams debate libraries endlessly, add process when things feel chaotic, constantly complain about headcount, bikeshed details instead of focusing on users, build before understanding problems, and engage in performative code quality debates.

Ramp argues that the AI era will cause divergence between these team types. While media narratives often focus on junior versus senior experience levels, Ramp believes the real differentiator is the qualities listed above rather than years of experience. Coding was never the hardest part of software development; judgment, context, ability to see around corners, learning from experience, and scar tissue have always been more valuable.

The Value of Staff-Plus Engineers

Staff and staff-plus engineers are compensated primarily for judgment and context rather than raw coding speed. They know when proposed solutions won’t work or represent bad ideas, a capability that Claude Opus, despite its prowess, doesn’t inherently possess. This highlights that coding agents can help teams build the wrong thing much faster and create bigger messes if not guided by sound engineering judgment.

The critical skills that remain valuable include figuring out what to build, understanding users deeply enough to make good product decisions, selling ideas to skeptical stakeholders, making good design decisions with incomplete information, and maintaining momentum through the “long middle” of projects when excitement fades and hard problems emerge.

The Long Middle and Product-Market Fit

Ramp particularly emphasizes the “long middle” of projects—the phase between initial prototyping and successful deployment where teams must work through gnarly problems, user feedback, edge cases, and operational challenges. While it’s easy to vibe code a demo, getting from demo to production-ready product with real product-market fit requires exactly the kind of expertise that experienced engineers bring. Media narratives about SaaS and stock market reactions often gloss over this reality.

Capacity and Opportunity

Rather than viewing AI productivity gains as a path to headcount reduction, Ramp sees them as enabling pursuit of opportunities previously too expensive to consider. Software is perpetually unfinished (“jobs not finished” is an internal meme), so extra capacity doesn’t lead to downsizing but rather to expanding ambitions.

Ramp predicts four outcomes from increased AI-driven productivity: companies will chase opportunities they couldn’t afford to pursue before (like agentic financial workflows), they’ll enter adjacent markets and stitch together more value for customers, they’ll rebuild systems that were too expensive to touch (like building an internal coding agent for a financial operations company), and they’ll raise the bar for what “good enough” means, creating more mind-blowing user experiences.

Operating Principles for AI-Native Teams

Ramp’s cultural shift includes several operating principles. Teams should intentionally allow experimentation even if it creates temporary inconsistency (they let teams experiment with different agent approaches last year). They should start with simple, constrained problems rather than attempting comprehensive automation immediately. Everyone on the team—PMs, designers, engineers—must align on the expectation that AI products cannot be “one-shotted”; perfection on day one is impossible, and iteration is essential.

Teams should dogfood aggressively, deploying internally before external release to learn in a controlled environment. They should focus on building evaluation infrastructure early, even starting with just five critical examples, and making evaluation easy to run and understand. They should establish tight feedback loops, particularly for domain experts like finance teams who can directly modify policies, and they should think carefully about multiplayer experiences that enable collaboration and learning across skill levels.

Practical Deployment Considerations

Throughout the case study, several practical deployment considerations emerge:

Model Switching: Infrastructure should enable easy model switching because new model releases can significantly impact performance, but the direction isn’t always positive without prompt or context adjustments.

Explainability Trade-offs: As systems become more agentic and capable, they become less explainable. This makes auditability essential from the beginning—teams should assume inputs and outputs are all they can reliably verify, even if they understand internal mechanics.

Context Is Key: Model failures often stem from missing context rather than model limitations. Production deployment is the fastest way to discover what context is actually needed versus what seems theoretically relevant.

User Behavior ≠ Ground Truth: Users make mistakes, so evaluation cannot rely on user behavior alone. Explicit ground truth definition through expert labeling is essential.

Graduated Autonomy: Starting with suggestions rather than autonomous actions builds trust. Providing customers with autonomy controls respects different risk tolerances while enabling automation expansion as confidence grows.

Tool Catalog Investment: Building a shared catalog of well-documented, reusable tools accelerates development of new agentic capabilities and makes capability gaps visible.

Cross-Functional Alignment: Regular cross-functional labeling and review sessions keep everyone aligned on ground truth, priorities, and system behavior, reducing communication overhead.

Custom Tooling When Appropriate: Modern LLMs make building custom internal tools fast and cost-effective when off-the-shelf solutions don’t fit, particularly for internal use cases.

Balanced Assessment

Ramp’s presentation demonstrates sophisticated LLMOps practices and meaningful production impact. The company has clearly invested heavily in infrastructure, evaluation, and cultural transformation. However, several considerations warrant balanced assessment:

Self-Reported Metrics: The claim that Ramp Inspect accounts for over 50% of production PRs is impressive but lacks detail on what types of PRs these are, their complexity, and how much human involvement was required. Simple copy changes and major feature work count equally as PRs, but represent vastly different value.

Generalizability: Ramp’s approach requires significant infrastructure investment, dedicated AI platform teams, and cultural buy-in. The applicability to smaller companies or organizations with less mature engineering practices is unclear.

Black Box Trade-offs: Ramp acknowledges that agentic systems become black boxes where “we have no control over it, it’s going to do what it thinks is right.” This is honest about limitations, but raises questions about accountability in financial operations where regulatory compliance and auditability are critical.

Customer Validation: While Ramp mentions Fortune 500 design partners and positive feedback, specific adoption numbers, customer satisfaction metrics, or quantitative impact measurements are absent. The Super Bowl advertising reference suggests confidence in the product, but independent validation would strengthen claims.

Evaluation Rigor: The evaluation practices described are solid, particularly the cross-functional labeling, but the presentation doesn’t address how they handle distributional shift, adversarial inputs, or edge cases that weren’t in training data.

Timeline and Maturity: The presentation references starting these efforts “about 3 years ago” and “about a year and a half ago” for various initiatives, suggesting the described systems have been in production for meaningful periods. However, the specific state of maturity and what “production” means (beta, general availability, optional feature, default behavior) isn’t always clear.

Despite these considerations, Ramp’s case study represents genuine sophisticated LLMOps practices with real production deployment, comprehensive evaluation frameworks, thoughtful infrastructure investments, and honest acknowledgment of challenges like the explainability-capability trade-off and the need for iterative development.

More Like This

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

healthcare fraud_detection customer_support +90

Running LLM Agents in Production for Accounting Automation

Digits 2025

Digits, a company providing automated accounting services for startups and small businesses, implemented production-scale LLM agents to handle complex workflows including vendor hydration, client onboarding, and natural language queries about financial books. The company evolved from a simple 200-line agent implementation to a sophisticated production system incorporating LLM proxies, memory services, guardrails, observability tooling (Phoenix from Arize), and API-based tool integration using Kotlin and Golang backends. Their agents achieve a 96% acceptance rate on classification tasks with only 3% requiring human review, handling approximately 90% of requests asynchronously and 10% synchronously through a chat interface.

healthcare fraud_detection customer_support +50

Reinforcement Learning for Code Generation and Agent-Based Development Tools

Cursor 2025

This case study examines Cursor's implementation of reinforcement learning (RL) for training coding models and agents in production environments. The team discusses the unique challenges of applying RL to code generation compared to other domains like mathematics, including handling larger action spaces, multi-step tool calling processes, and developing reward signals that capture real-world usage patterns. They explore various technical approaches including test-based rewards, process reward models, and infrastructure optimizations for handling long context windows and high-throughput inference during RL training, while working toward more human-centric evaluation metrics beyond traditional test coverage.

code_generation code_interpretation data_analysis +61