Building a Custom Background Coding Agent with Cloud-Based Sandboxes

Ramp 2026

Ramp built Inspect, a custom background coding agent that writes and verifies code in isolated cloud-based environments. The system addresses the need for faster, more powerful development workflows by running sessions in sandboxed VMs on Modal with full development environments, integrated with production tools like Sentry, Datadog, and GitHub. Within months of deployment, approximately 30% of all pull requests merged to frontend and backend repositories were written by Inspect, demonstrating rapid internal adoption through voluntary usage rather than mandate. The platform enables unlimited concurrent sessions, supports multiple interaction modes (Slack, web, Chrome extension), includes multiplayer collaboration, and provides both automated code generation and verification capabilities.

Industry

Finance

Overview

Ramp developed Inspect, an internally built background coding agent designed to handle both code generation and verification within production-grade cloud environments. The case study is noteworthy as a build-versus-buy decision: the company chose to create proprietary tooling optimized for its internal workflows rather than adopt an off-the-shelf solution. Within a few months of rollout, roughly 30% of pull requests merged to frontend and backend repositories were written by Inspect. The company emphasizes that this adoption was organic rather than mandated, suggesting genuine product-market fit within their engineering organization.

The core value proposition centers on giving coding agents not just the ability to write code, but the full context and verification tools that human engineers would use. This includes running tests, reviewing telemetry, querying feature flags for backend work, and performing visual verification with screenshots and live previews for frontend work. The system is designed to remove context limitations, making model intelligence the only bottleneck rather than missing information or tooling access.

Infrastructure and Sandbox Architecture

The foundation of Inspect rests on Modal, a cloud platform for AI infrastructure that Ramp already uses across their organization. Modal’s sandboxing capabilities provide near-instant startup times through file system snapshots that can freeze and restore state. This addresses what Ramp identifies as the key challenge: spinning up complete development environments quickly while maintaining isolation between different work streams.

The sandbox architecture employs an image registry pattern where each code repository has a corresponding image that is rebuilt every 30 minutes. During these builds, the system clones repositories, installs runtime dependencies, and executes initial setup and build commands. The completed state is saved as a snapshot. When developers initiate sessions, new sandboxes spin up from these pre-built snapshots, ensuring the code is at most 30 minutes out of date. This design choice trades perfect currency for dramatically improved startup performance. When sessions conclude, another snapshot is taken for potential restoration if users send follow-up prompts after the sandbox exits.
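
The registry pattern above can be sketched in a few lines. This is a minimal illustration, not Ramp's implementation: the `Snapshot` and `ImageRegistry` names, the `snapshot_id` handle, and the in-memory storage are all assumptions made for the example. It shows only the bookkeeping (newest snapshot per repository, and how stale a session's starting point can be), not the sandbox provider calls.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

REBUILD_INTERVAL = timedelta(minutes=30)  # rebuild cadence stated in the text


@dataclass(frozen=True)
class Snapshot:
    repo: str
    built_at: datetime
    snapshot_id: str  # hypothetical opaque handle to a saved filesystem state


class ImageRegistry:
    """Hypothetical registry that keeps the newest snapshot per repository."""

    def __init__(self) -> None:
        self._latest: dict[str, Snapshot] = {}

    def publish(self, snap: Snapshot) -> None:
        # Ignore a build that finished out of order and is older than the current one.
        current = self._latest.get(snap.repo)
        if current is None or snap.built_at > current.built_at:
            self._latest[snap.repo] = snap

    def latest(self, repo: str) -> Snapshot:
        return self._latest[repo]

    def staleness(self, repo: str, now: datetime) -> timedelta:
        # How far behind the snapshot is when a new session starts from it.
        return now - self._latest[repo].built_at
```

As long as builds complete on schedule, `staleness` stays below `REBUILD_INTERVAL`, which is the "at most 30 minutes out of date" property described above.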

Several optimizations improve perceived speed. The system begins warming a sandbox as soon as a user starts typing a prompt, so cloning and setup can happen before the request is finished. For high-volume repositories, pools of pre-warmed sandboxes are maintained and rotated as new image builds complete. The agent can begin reading files immediately, even if synchronization from the latest base branch isn’t complete, on the reasonable assumption that a prompt rarely touches files changed in the last 30 minutes. File edits, however, are blocked until synchronization completes, so the agent never writes on top of stale code.
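
One way to express the "reads immediately, writes after sync" rule is a small gate around file access. The sketch below is an assumption about how such a gate could look; the `SyncGate` class and its methods are hypothetical, and a plain dictionary stands in for the sandbox filesystem.

```python
import threading
from typing import Optional


class SyncGate:
    """Hypothetical gate: reads pass through, writes wait for branch sync."""

    def __init__(self) -> None:
        self._synced = threading.Event()

    def mark_synced(self) -> None:
        # Called once the sandbox has caught up with the latest base branch.
        self._synced.set()

    def read(self, files: dict[str, str], path: str) -> str:
        # Reads never wait; slightly stale content is considered acceptable.
        return files[path]

    def write(self, files: dict[str, str], path: str, content: str,
              timeout: Optional[float] = None) -> bool:
        # Block until sync completes; return False if the timeout expires first.
        if not self._synced.wait(timeout):
            return False
        files[path] = content
        return True
```

With `timeout=None` a write simply blocks until `mark_synced` is called, which matches the described behavior; the timeout parameter exists only so callers can surface a "still syncing" status instead of hanging.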

The build process is heavily front-loaded with as much preparation as possible. This includes not just installation and compilation, but even running applications and test suites once to populate caches that subsequent runs will leverage. Since users interact with already-built images, they never experience the build latency, making lengthy preparation acceptable.

Agent Selection and Customization

Ramp selected OpenCode, an open-source coding agent, as their foundation. They articulate several reasons for this choice that reveal their evaluation criteria. OpenCode is structured as a server-first architecture with its terminal user interface and desktop applications acting as clients on top. This architectural pattern aligned with Ramp’s goal of creating multiple custom clients across different interaction surfaces. The agent provides a fully typed SDK and comprehensive plugin system, enabling extensive customization.

An underrated advantage they highlight is that OpenCode’s source code itself can serve as ground truth for the agent. When behavior is unclear, the AI can read OpenCode’s own implementation to understand exactly how it should function, reducing hallucinations about its capabilities. This suggests a sophisticated understanding of how LLMs behave in self-referential contexts. Ramp also notes that building something valuable with OpenCode creates opportunities to collaborate with the OpenCode team directly, indicating they value open-source ecosystem participation.

The agent is equipped with custom plugins and skills that encode Ramp-specific shipping practices. It supports multiple frontier models, suggesting they maintain flexibility in model selection rather than locking into a single provider. The system integrates with Model Context Protocol (MCP), indicating they’ve adopted emerging standards for agent-environment interaction.

One particularly interesting capability is that the agent can spawn additional sessions itself through custom tools. Ramp explicitly addresses the concern about runaway agent proliferation, asserting that frontier models are “smart enough to contain themselves.” The spawning capability enables research tasks across different repositories or decomposing major tasks into multiple smaller pull requests, essentially giving the agent orchestration capabilities.

Production Tool Integration

Inspect integrates deeply with Ramp’s production infrastructure and observability stack. The sandbox environments include Vite for frontend development, Postgres for database access, and Temporal for workflow orchestration. These aren’t simplified substitutes but the actual tools engineers use locally, ensuring the agent works with production-equivalent environments.

The agent connects to Sentry for error tracking, Datadog for monitoring and telemetry, LaunchDarkly for feature flag management, Braintrust (presumably for evaluation and testing), GitHub for version control, Slack for communication, and Buildkite for continuous integration. This comprehensive integration means the agent can access real-time production data, feature rollout status, and system health metrics when making decisions about code changes.

For backend verification, the agent runs tests and reviews telemetry directly. For frontend work, it performs visual verification and provides screenshots and live previews. This dual-mode verification reflects understanding that different types of code require different validation approaches—unit tests and observability for backend services versus visual confirmation for user interfaces.

The GitHub integration is more involved than a single bot user opening every pull request. Ramp implemented GitHub App authentication that generates installation tokens for cloning repositories. When opening pull requests, the system instead uses individual user tokens obtained through OAuth, so each pull request is attributed to the actual user rather than a bot account. Because the prompting user is the pull request author, they cannot approve their own AI-generated changes, which maintains code review integrity. Git operations use dynamic user configuration, updating user.name and user.email when committing and pushing so the work is properly attributed.
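
The dynamic git identity can be illustrated with the commands involved. This is a sketch under assumptions: the helper name `attributed_commit_commands` is hypothetical, and the source does not say exactly how Ramp sets the identity, only that user.name and user.email are updated before committing and pushing.

```python
def attributed_commit_commands(name: str, email: str, message: str) -> list[list[str]]:
    """Return git invocations that set the author identity, then commit."""
    return [
        ["git", "config", "user.name", name],
        ["git", "config", "user.email", email],
        ["git", "commit", "-m", message],
    ]
```

Each inner list can be passed to `subprocess.run(cmd, cwd=repo_dir, check=True)` inside the sandbox checkout, where `repo_dir` is whatever path the repository was cloned to.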

API and State Management Architecture

The backend API is built on Cloudflare Durable Objects, giving each session its own SQLite database. This architecture choice prioritizes performance isolation—no session can impact another’s performance even with hundreds running concurrently. Given that agents stream hundreds of token updates in short timeframes, this isolation prevents resource contention that could occur with shared database infrastructure.
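
The isolation property can be demonstrated locally with Python's built-in `sqlite3` module. This is an analogy rather than the Durable Objects API: the `SessionStore` class, the `events` table, and its columns are assumptions for illustration, with one in-memory database standing in for each session's private SQLite store.

```python
import sqlite3


class SessionStore:
    """Hypothetical stand-in: one private SQLite database per session."""

    def __init__(self) -> None:
        self._dbs: dict[str, sqlite3.Connection] = {}

    def _db(self, session_id: str) -> sqlite3.Connection:
        db = self._dbs.get(session_id)
        if db is None:
            # Each ":memory:" connection is a separate, independent database.
            db = sqlite3.connect(":memory:")
            db.execute(
                "CREATE TABLE events ("
                "seq INTEGER PRIMARY KEY AUTOINCREMENT, "
                "author TEXT NOT NULL, payload TEXT NOT NULL)"
            )
            self._dbs[session_id] = db
        return db

    def append(self, session_id: str, author: str, payload: str) -> int:
        db = self._db(session_id)
        cur = db.execute(
            "INSERT INTO events (author, payload) VALUES (?, ?)", (author, payload)
        )
        db.commit()
        return cur.lastrowid

    def events(self, session_id: str) -> list[tuple[str, str]]:
        rows = self._db(session_id).execute(
            "SELECT author, payload FROM events ORDER BY seq"
        )
        return rows.fetchall()
```

Because no table is shared, a burst of token updates in one session cannot contend for locks with another session's writes.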

Cloudflare’s Agents SDK provides abstractions over the WebSockets Hibernation API, allowing socket connections to remain open without incurring compute costs during idle periods. This is crucial for real-time streaming between the sandbox, API server, and connected clients. The state synchronization design enables multiple client types to interact with the same session consistently, whether through chat interfaces, Slack, Chrome extensions, or other inputs.

Multiplayer Collaboration Features

Ramp positions multiplayer support as “mission-critical” and claims it’s absent from competing products. The implementation allows any number of people to work in a session together, similar to collaborating on a code branch. Code changes are attributed to whichever participant sent the prompt that produced them. The use cases Ramp highlights reveal their mental model of who should use coding agents.

Teaching scenarios involve product managers and designers learning to leverage AI for their own work, suggesting democratization of code contribution beyond traditional engineering roles. Live QA sessions enable teams to queue changes in real-time as issues are discovered rather than creating tickets for later resolution. Pull request reviews become more interactive, with reviewers able to immediately request AI-generated changes rather than just commenting and waiting for authors to implement them.

The multiplayer implementation appears relatively straightforward given their state synchronization architecture. The data model doesn’t strongly tie sessions to single authors, and authorship information flows through to each prompt sent to the coding agent. This suggests they designed for multiplayer from the beginning rather than retrofitting it later.

Client Interfaces and Interaction Patterns

Slack Integration

The Slack client embodies their philosophy of meeting users where they work. Beyond convenience, it creates a virality loop—as people use it publicly in channels, others observe and learn through osmosis. The interface is designed to require no special syntax or bot commands; users simply chat naturally, and the system interprets intent.

A critical piece of infrastructure is a classifier that determines which repository to work in. The system feeds the user’s message, thread context, and channel name to a fast model (specifically GPT-4o without reasoning) along with descriptions of all accessible repositories. The prompt includes hints about common repositories and example classifications, and crucially includes an “unknown” option so the AI can ask clarifying questions rather than guessing incorrectly. This reflects practical prompt engineering—providing explicit escape hatches prevents confidently wrong behavior.
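
The classifier's escape hatch can be sketched as prompt assembly plus strict parsing of the answer. The prompt wording, the function names, and the repository descriptions below are all hypothetical, and the model call itself is left out; the point is only that anything outside the known list collapses to "unknown" so the bot asks instead of guessing.

```python
UNKNOWN = "unknown"


def build_prompt(message: str, thread_context: str, channel: str,
                 repos: dict[str, str]) -> str:
    """Assemble a classification prompt listing each repository and an escape hatch."""
    repo_lines = "\n".join(f"- {name}: {desc}" for name, desc in repos.items())
    return (
        "Pick the repository this request is about.\n"
        f"Repositories:\n{repo_lines}\n"
        f"- {UNKNOWN}: use this when the request does not clearly match a repository\n"
        f"Channel: {channel}\n"
        f"Thread context: {thread_context or '(none)'}\n"
        f"Message: {message}\n"
        "Answer with exactly one repository name from the list."
    )


def parse_choice(raw: str, repos: dict[str, str]) -> str:
    """Map the model's raw answer to a known repository, else UNKNOWN."""
    by_lower = {name.lower(): name for name in repos}
    cleaned = raw.strip().strip("\"'`").lower()
    return by_lower.get(cleaned, UNKNOWN)
```

A caller would send `build_prompt(...)` to the fast model, pass the reply through `parse_choice`, and post a clarifying question in the thread whenever the result is `UNKNOWN`.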

The agent can post updates to Slack at important inflection points, using Block Kit for structured, appealing layouts that include metadata about repository and working status. The bot’s status clearly differentiates between working and completed states. Ramp also recommends adding custom emojis specific to the bot, acknowledging that personality and fun matter for adoption beyond pure functionality.
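
A status update of this kind maps naturally onto two Block Kit blocks: a `section` for the headline and a `context` block for metadata. The layout and the `status_blocks` helper below are assumptions, since the source does not show Ramp's actual message format.

```python
def status_blocks(repo: str, status: str, summary: str) -> list[dict]:
    """Build a Block Kit payload: headline section plus repository metadata."""
    return [
        {
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*{status}*: {summary}"},
        },
        {
            "type": "context",
            "elements": [{"type": "mrkdwn", "text": f"Repository: `{repo}`"}],
        },
    ]
```

The returned list would be passed as the `blocks` argument when posting the message to Slack.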

Web Interface

The web client works across desktop and mobile, leveraging Cloudflare’s Agents SDK for real-time streaming. Beyond basic chat, it exposes unique capabilities. A hosted VS Code instance runs inside the sandbox, enabling manual edits without local repository cloning. For web projects, a streamed desktop view allows computer use-style interaction where users can work alongside the agent, watching it navigate and verify changes visually. The system captures before-and-after screenshots that can be appended to pull request descriptions, providing visual documentation of changes.

A statistics page surfaces organizational usage metrics, particularly highlighting the percentage of sessions resulting in merged pull requests. Ramp identifies this as the most important metric, as merged PRs indicate valuable work rather than just activity. Metrics are shown over time to inspire growth. A “live humans prompting” count based on users who sent prompts in the last five minutes provides real-time engagement visibility.
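
Both headline numbers are simple to compute once sessions and prompts are recorded. The record shapes used here (a `merged` flag per session, `(user, sent_at)` pairs per prompt) are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def merged_rate(sessions: list[dict]) -> float:
    """Fraction of sessions whose pull request was merged."""
    if not sessions:
        return 0.0
    return sum(1 for s in sessions if s["merged"]) / len(sessions)


def live_humans(prompts: list[tuple[str, datetime]], now: datetime,
                window: timedelta = timedelta(minutes=5)) -> int:
    """Count distinct users who sent a prompt within the trailing window."""
    recent = {
        user for user, sent_at in prompts
        if timedelta(0) <= now - sent_at <= window
    }
    return len(recent)
```

Counting distinct users rather than prompts keeps one very active person from inflating the "live humans prompting" figure.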

Chrome Extension

The Chrome extension targets non-engineering users, particularly for visual changes to React applications. Using the Chrome extension sidebar API, it provides a chat interface with screenshot functionality. Rather than sending actual images (which consume many tokens), the extension leverages the DOM and React internals to extract the full element tree in selected areas. Ramp built their own integration tied to their React app’s debugging internals, though they mention React Grab as a potential starting point for others.

Distribution occurs through managed device policies rather than the Chrome Web Store, increasing adoption by automatically installing it in employees’ browsers. This requires standing up an extension update server that returns manifests and CRX files, configured through the ExtensionInstallForcelist MDM property. This distribution strategy suggests they view the tool as critical internal infrastructure rather than an optional productivity enhancement.
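
The two pieces an update server needs are small. Each ExtensionInstallForcelist entry is an extension ID and an update URL joined by a semicolon, and the update URL returns an XML manifest pointing at the CRX file. The sketch below generates both; the extension ID, host name, and version are placeholders, not values from the source.

```python
import xml.etree.ElementTree as ET


def forcelist_entry(extension_id: str, update_url: str) -> str:
    """One ExtensionInstallForcelist value: '<extension id>;<update URL>'."""
    return f"{extension_id};{update_url}"


def update_manifest(extension_id: str, version: str, crx_url: str) -> str:
    """Render the update manifest XML that the update URL should serve."""
    root = ET.Element(
        "gupdate",
        {"xmlns": "http://www.google.com/update2/response", "protocol": "2.0"},
    )
    app = ET.SubElement(root, "app", {"appid": extension_id})
    ET.SubElement(app, "updatecheck", {"codebase": crx_url, "version": version})
    return ET.tostring(root, encoding="unicode")
```

For example, `forcelist_entry("abcdefghijklmnopabcdefghijklmnop", "https://extensions.example.com/updates.xml")` produces the string an MDM profile would list, and `update_manifest` produces the body served at that URL.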

Workflow Design and User Experience

The system supports multiple workflow patterns through deliberate design choices. Ramp decided to queue follow-up prompts sent during execution rather than inserting them immediately. This simplifies management and enables users to send thoughts about next steps while the agent continues working. Sessions can be stopped mid-execution, providing necessary control when work goes off-track.
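
The queueing decision can be modeled as a small state machine: a prompt starts immediately when the session is idle and waits in FIFO order otherwise. The `Session` class below is a hypothetical sketch; it carries the author with each prompt, and it deliberately leaves out stop semantics, which the source does not specify.

```python
from collections import deque
from typing import Optional

Prompt = tuple[str, str]  # (author, text)


class Session:
    """Hypothetical session that queues follow-up prompts while the agent runs."""

    def __init__(self) -> None:
        self.running = False
        self.current: Optional[Prompt] = None
        self.pending: deque[Prompt] = deque()

    def submit(self, author: str, text: str) -> str:
        # Follow-ups never interrupt the prompt currently being executed.
        if self.running:
            self.pending.append((author, text))
            return "queued"
        self.running = True
        self.current = (author, text)
        return "started"

    def finish_current(self) -> Optional[Prompt]:
        # Advance to the next queued prompt, or go idle when none remain.
        if self.pending:
            self.current = self.pending.popleft()
        else:
            self.current = None
            self.running = False
        return self.current
```

Queueing keeps the agent's context linear: each follow-up is processed against the result of the previous prompt rather than being injected into a half-finished run.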

A distinctive feature is that sessions are “fast to start and effectively free to run,” eliminating the need to ration local checkouts or worktrees. Users can launch multiple versions of the same prompt simultaneously to see which succeeds, or try different approaches and swap models without concern about resource consumption. Unlimited concurrent sessions mean laptops don’t need to be involved at all, enabling scenarios like noticing a bug before bed, starting a session (with optional voice interaction), and reviewing the pull request in the morning.

All changes sync across interaction surfaces—chat, Slack, Chrome extension, web interface, pull request discussions, and the web-based VS Code editor—ensuring continuity as users switch between modalities. This seamless context preservation reduces friction in mixed-mode workflows.

Model and Provider Strategy

While specific model choices aren’t extensively detailed, the system supports “all frontier models,” indicating they maintain provider flexibility rather than coupling to specific APIs. The mention of using GPT-4o for classification tasks suggests they employ different models for different purposes based on speed and capability requirements. The absence of reasoning for the classification task indicates awareness that simpler, faster inference suffices for straightforward intent detection.

The architecture appears designed to abstract model selection from core functionality, allowing experimentation with different models or swapping providers as the landscape evolves. This reflects mature LLMOps thinking about avoiding vendor lock-in and maintaining optionality as model capabilities and economics change.

Adoption Metrics and Organizational Impact

The adoption curve is described as “vertical,” reaching approximately 30% of all merged pull requests across frontend and backend repositories within just a couple of months. Ramp emphasizes this was organic adoption rather than mandated usage, achieved by building to people’s needs, creating virality loops through public workspace integration, and letting product quality drive adoption.

The 30% figure is notable but warrants careful interpretation. It indicates the agent successfully handles a substantial portion of code contribution, but the text doesn’t specify what types of changes dominate this percentage—whether it’s primarily simple refactorings, bug fixes, feature implementations, or a mix. The emphasis on merged pull requests as the key metric is sound, as it filters out sessions that produced code deemed not worth merging.

The rapid adoption timeline suggests either the tool addresses significant pain points in Ramp’s development workflow or that internal evangelism and visibility were highly effective. The virality loops through public Slack usage likely accelerated discovery, while the multiplayer features may have facilitated peer learning.

Build-Versus-Buy Philosophy and Replicability

Ramp takes a strong stance on ownership, arguing that “anyone should be able to build this” and that owning tooling enables something “significantly more powerful than an off-the-shelf tool will ever be.” Their reasoning is that internal tools only need to work on your code, allowing deep customization that generic products can’t provide.

To support this philosophy, they published detailed specifications of their implementation, explicitly inviting others to paste the blog post into coding agents and begin replicating the system. This openness is somewhat unusual for a competitive advantage and may reflect belief that execution and integration matter more than architectural secrets, or that they gain more from broader ecosystem development than from keeping implementation details proprietary.

The reliance on specific infrastructure choices—Modal for sandboxes, Cloudflare Durable Objects for state, OpenCode as the agent foundation—means replication requires either adopting these same components or finding equivalents. The architecture isn’t infrastructure-agnostic, though the specifications provide enough detail for adaptation.

Critical Assessment and LLMOps Considerations

This case study represents sophisticated LLMOps implementation with several notable aspects. The comprehensive integration with production tooling—observability, feature flags, CI/CD, error tracking—reflects mature understanding that agents need the same context as human engineers. The verification loop is particularly important; merely generating code without automated validation would require human review for every change, limiting scalability.

The sandbox architecture addresses critical concerns about safety and isolation. Running sessions in ephemeral VMs prevents state contamination between sessions and limits blast radius if agents behave unexpectedly. The snapshot-based approach to rapid startup demonstrates creative infrastructure engineering, trading minor staleness for substantial performance gains.

The multiplayer functionality is genuinely innovative if it works as described, enabling collaboration patterns that pure single-user agents don’t support. The attribution mechanism for individual contributions within shared sessions suggests thoughtful design around accountability and credit.

However, several aspects warrant scrutiny. The 30% adoption figure, while impressive, doesn’t reveal complexity distribution. If agents primarily handle routine refactoring and simple bugs while humans tackle complex features, the productivity impact differs substantially from agents handling a representative cross-section of work. The emphasis on merged PRs is the right metric, but absent information about review overhead or bug rates post-merge, the full cost-benefit picture remains unclear.

The claim that frontier models are “smart enough to contain themselves” when given session-spawning capabilities is bold. While limiting agent self-replication is probably wise, whether current models reliably self-regulate without explicit constraints is an empirical question. The prompt engineering for this capability would be fascinating to examine.

The build-it-yourself philosophy has tradeoffs. Custom tooling provides deep integration and optimization for specific workflows but incurs substantial development and maintenance costs. As off-the-shelf tools mature, the differentiation window may narrow. The reliance on specific infrastructure providers (Modal, Cloudflare) creates dependencies that affect portability and potentially introduce vendor risk, though both are well-regarded platforms.

The GitHub token management approach—using individual user tokens rather than a bot account—is thoughtful security engineering that maintains code review integrity. However, it requires users to grant OAuth access, and token rotation and revocation introduce operational complexity.

The published specifications for replication are generous but also serve Ramp’s interests. By lowering barriers for others to build similar systems, they potentially accelerate the ecosystem of tooling around OpenCode and Modal, creating network effects that benefit all users including Ramp. It’s community-minded but also strategically savvy.

The voice interaction capability mentioned in passing suggests multimodal interfaces, though details are absent. If well-implemented, voice could reduce friction further, particularly for mobile scenarios they highlight (resuming sessions from the couch).

Overall, this represents a sophisticated, production-grade LLMOps implementation with strong infrastructure engineering, thoughtful workflow design, and genuine organizational adoption. The emphasis on verification, not just generation, and on meeting users across multiple interaction surfaces reflects mature product thinking. The willingness to build rather than buy enabled deep customization but required substantial investment. The rapid adoption suggests they successfully identified and addressed real workflow pain points within their engineering organization.
