## Overview and Context
Ramp developed Inspect, an internal background coding agent designed to generate code with the full verification capabilities of a human engineer. This case study presents a notable example of LLMOps in production in the financial technology sector where Ramp operates. The fundamental problem Inspect addresses is that traditional coding agents can generate code but often lack the complete context and tooling necessary to verify their work comprehensively. Ramp's solution was to build an agent that doesn't just write code but can also run tests, review telemetry, query feature flags, and perform visual verification—giving it "agency" in the truest sense. The results have been impressive: within a couple of months, approximately 30% of all pull requests merged to Ramp's frontend and backend repositories were generated by Inspect, and this adoption occurred organically, without forcing engineers to abandon their preferred tools.
## Technical Architecture and Infrastructure
The core infrastructure choice for Inspect is Modal, a cloud platform for AI infrastructure that Ramp uses across the organization. Modal's sandboxing capabilities are central to the system's design, enabling near-instant startup of full development environments. Each Inspect session runs in its own sandboxed VM containing everything an engineer would have available locally, including Vite for frontend tooling, Postgres for database access, and Temporal for workflow orchestration. The guiding design principle is that session velocity should be limited only by the model provider's time-to-first-token, never by infrastructure overhead.
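As a concrete illustration, the sketch below shows how a per-session environment might be provisioned with Modal's Python SDK. The app name, image reference, and commands are assumptions for illustration, not Ramp's configuration, and exact signatures should be checked against Modal's current sandbox documentation.

```python
import modal

# Hypothetical session provisioning: each Inspect session gets its own
# isolated sandbox built from a pre-baked repository image.
app = modal.App.lookup("inspect-sessions", create_if_missing=True)
repo_image = modal.Image.from_registry("registry.example.com/frontend:latest")

sandbox = modal.Sandbox.create(
    app=app,
    image=repo_image,
    timeout=60 * 60,   # keep the session alive for up to an hour
    workdir="/repo",
)

# The agent can now run the same commands an engineer would run locally.
proc = sandbox.exec("bash", "-c", "npm test -- --watch=false")
print(proc.stdout.read())

sandbox.terminate()
```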
The system employs an image registry approach in which each code repository has a defined image that is rebuilt every 30 minutes. Each image contains the cloned repository, installed runtime dependencies, and the results of initial setup and build commands. This pre-building strategy means that when engineers start a new session, all of the time-consuming setup work has already been done. Modal's file system snapshots let Ramp freeze and restore state, which is crucial for maintaining session continuity and enabling engineers to pick up where they left off.
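A rough sketch of what such a pre-baked image definition could look like, again using Modal's Python SDK. The repository URL, branch name, and build commands are placeholders, and the snapshot call reflects Modal's publicly documented sandbox filesystem-snapshot feature rather than Ramp's exact pipeline.

```python
import modal

app = modal.App.lookup("inspect-images", create_if_missing=True)

# Hypothetical per-repository image: clone, install dependencies, and run
# the initial build so that a fresh session skips all of this work.
frontend_image = (
    modal.Image.debian_slim()
    .apt_install("git", "curl")
    .run_commands(
        "git clone https://github.com/example-org/frontend /repo",
        "cd /repo && npm ci && npm run build",
    )
)

# Mid-session state can be frozen into a new image and restored later,
# which is what lets engineers pick a session back up where they left off.
sandbox = modal.Sandbox.create(app=app, image=frontend_image, workdir="/repo")
sandbox.exec("git", "checkout", "-b", "agent/feature-branch").wait()
frozen_image = sandbox.snapshot_filesystem()
sandbox.terminate()
```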
## Integration Ecosystem
Inspect's power comes significantly from its deep integration with Ramp's existing engineering toolchain. The agent is wired into multiple critical systems: Sentry for error tracking, Datadog for monitoring and observability, LaunchDarkly for feature flag management, Braintrust (likely for AI evaluation and observability), GitHub for version control, Slack for communication, and Buildkite for continuous integration. These integrations mean that Inspect isn't operating in isolation—it has access to the same signals and data sources that human engineers use to make decisions about code quality and correctness.
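In practice, integrations like these surface to the model as tools. The manifest below is purely illustrative (an OpenAI-style function-calling schema with made-up tool names); it is not Inspect's actual interface, but it conveys how Datadog or LaunchDarkly data might be exposed to the agent.

```python
# Illustrative tool manifest; tool names and parameters are assumptions.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "query_datadog_logs",
            "description": "Search recent Datadog logs for a service.",
            "parameters": {
                "type": "object",
                "properties": {
                    "service": {"type": "string"},
                    "query": {"type": "string"},
                    "minutes": {"type": "integer", "default": 60},
                },
                "required": ["service", "query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_feature_flag",
            "description": "Read a LaunchDarkly flag's current targeting state.",
            "parameters": {
                "type": "object",
                "properties": {"flag_key": {"type": "string"}},
                "required": ["flag_key"],
            },
        },
    },
]
```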
For backend work, Inspect can run automated tests to verify functionality and query feature flags to understand deployment configurations. For frontend work, the agent performs visual verification and provides users with both screenshots and live previews, allowing stakeholders to see exactly what changes look like before merging. This comprehensive verification approach represents a significant advancement over coding agents that merely generate code without validating it.
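A minimal sketch of the backend verification step, assuming a pytest-based suite and the pytest-json-report plugin; the commands and paths are placeholders rather than Ramp's actual pipeline.

```python
import json
import pathlib
import subprocess

def verify_backend_change(repo_dir: str = "/repo") -> dict:
    """Run the test suite inside the session and return structured evidence
    the agent can attach to its pull request."""
    # Assumes the pytest-json-report plugin is installed in the image.
    result = subprocess.run(
        ["pytest", "-q", "--maxfail=5",
         "--json-report", "--json-report-file=/tmp/report.json"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    report_path = pathlib.Path("/tmp/report.json")
    report = json.loads(report_path.read_text()) if report_path.exists() else {}
    return {
        "exit_code": result.returncode,
        "summary": report.get("summary", {}),   # pass/fail counts
        "stdout_tail": result.stdout[-2000:],   # last lines for the PR comment
    }
```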
## Multi-Model Support and Flexibility
Inspect supports all frontier models, demonstrating Ramp's recognition that no single model is optimal for all tasks and that the LLM landscape continues to evolve rapidly. The system also supports Model Context Protocol (MCP), which likely provides standardized ways for the agent to interact with various external systems and data sources. Additionally, Inspect includes custom tools and skills that encode Ramp-specific knowledge about "how we ship at Ramp," representing a form of organizational knowledge capture that makes the agent more effective in Ramp's specific context.
This multi-model approach is pragmatic from an LLMOps perspective: it hedges against model-specific limitations, allows engineers to experiment with different models for different tasks, and future-proofs the system as new models emerge. The architecture separates the infrastructure and integration layer from the model selection, which is a sound design principle for production AI systems.
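A sketch of what that separation can look like in code: the session owns the infrastructure and tools, while the model sits behind a narrow interface that any provider (or an MCP-mediated backend) can implement. The names here are illustrative, not Inspect's internals.

```python
from dataclasses import dataclass
from typing import Protocol

class ChatModel(Protocol):
    """Provider-agnostic interface; a real system would also stream tokens
    and surface tool calls."""
    def complete(self, messages: list[dict], tools: list[dict]) -> dict: ...

@dataclass
class Session:
    model: ChatModel      # swapped per task or per engineer preference
    tools: list[dict]     # integrations exposed as tool definitions

    def step(self, messages: list[dict]) -> dict:
        # Sandboxes and integrations stay constant while the model behind
        # `complete` can be any frontier provider.
        return self.model.complete(messages, self.tools)
```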
## User Experience and Workflow Integration
Ramp designed Inspect to accommodate the diverse workflows that engineers actually use rather than forcing them into a single prescribed interaction pattern. Engineers can interact with Inspect through multiple interfaces: chatting with it in Slack (including sending screenshots), using a Chrome extension to highlight specific UI elements for modification, prompting through a web interface, discussing on GitHub pull requests, or dropping into a web-based VS Code editor for manual changes. Critically, all changes are synchronized to the session regardless of which interface was used, preventing the frustrating loss of work that can occur when switching between tools.
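One way to make that synchronization concrete is a single append-only event log per session that every surface writes to. The sketch below is a simplified assumption of such a model, not Ramp's schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class SessionEvent:
    source: str     # "slack", "chrome_extension", "github_pr", "web", "vscode"
    kind: str       # "prompt", "screenshot", "code_edit", "review_comment"
    payload: dict
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class InspectSession:
    session_id: str
    events: list[SessionEvent] = field(default_factory=list)

    def ingest(self, event: SessionEvent) -> None:
        # Every surface appends to the same ordered log, so an edit made in
        # the web IDE is visible to the Slack thread and vice versa.
        self.events.append(event)
```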
The system also includes voice interaction capabilities, which aligns with the scenario where an engineer might notice a bug late at night and want to quickly capture the issue and kick off a fix without typing. Every Inspect session is multiplayer by default, allowing engineers to share sessions with colleagues who can then collaborate on bringing the work to completion. This collaborative approach recognizes that coding is often a team activity and that getting work "across the finish line" may require multiple perspectives.
## Concurrency and Resource Management
A significant operational advantage of Inspect is its approach to concurrency and resource management. Because sessions are fast to start and "effectively free to run" (on Modal's infrastructure), engineers don't need to ration local checkouts or Git worktrees. This eliminates a common friction point in development workflows where managing multiple concurrent branches or experimental approaches requires careful coordination of local resources. Engineers can kick off multiple versions of the same prompt simultaneously and simply evaluate which one works best. They can experiment with different approaches or swap models without concern about resource constraints or cleanup overhead.
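The workflow this enables looks roughly like the fan-out below: the same prompt is dispatched to several isolated sessions in parallel and the engineer compares the results. The helper and model name are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def run_candidate(prompt: str, model: str, variant: int) -> dict:
    # Placeholder for the real agent loop: provision a sandbox, run the
    # model against the prompt, collect the diff and verification evidence.
    return {"variant": variant, "model": model, "prompt": prompt, "status": "pending"}

def fan_out(prompt: str, n: int = 3, model: str = "frontier-model") -> list[dict]:
    # Sessions are cheap and isolated, so generating several candidates and
    # picking the best one is a reasonable default workflow.
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(run_candidate, prompt, model, i) for i in range(n)]
        return [f.result() for f in futures]
```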
This unlimited concurrency model has important implications for how engineers interact with the system. Rather than carefully crafting a single perfect prompt, engineers can adopt a more exploratory approach, generating multiple candidates and selecting the best result. This aligns well with how frontier models actually perform—probabilistic systems that may produce varying quality results—and shifts the workflow toward generation and selection rather than iterative refinement of a single attempt.
## Validation and Testing Approach
The emphasis on verification distinguishes Inspect from simpler code generation tools. For backend code, the agent doesn't just write functions—it runs the test suite to verify correctness. It can review telemetry data (via Datadog integration) to understand performance characteristics or identify issues in production-like environments. It can query feature flags (via LaunchDarkly) to understand which code paths are active for different user segments. This comprehensive testing approach means that Inspect-generated pull requests arrive with evidence of correctness rather than requiring human engineers to perform all verification from scratch.
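For the feature-flag piece specifically, a query like the following (using recent versions of LaunchDarkly's server-side Python SDK, with placeholder keys) is the kind of check the agent can run to learn which code paths are live before editing them.

```python
import ldclient
from ldclient import Context
from ldclient.config import Config

# Placeholder SDK key and flag name; in production these would come from the
# session's secrets and the code being modified.
ldclient.set_config(Config("sdk-xxxxxxxx"))
client = ldclient.get()

ctx = Context.builder("inspect-agent").kind("user").build()
new_billing_flow_enabled = client.variation("new-billing-flow", ctx, False)
print(f"new-billing-flow active: {new_billing_flow_enabled}")

client.close()
```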
For frontend code, visual verification is particularly valuable. Frontend development often involves subjective assessments about layout, styling, and user experience that are difficult to capture in automated tests alone. By providing screenshots and live previews, Inspect allows product managers, designers, and engineers to quickly assess whether the generated code achieves the intended result. This closes a verification loop that purely code-based agents cannot address.
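A hedged sketch of that visual loop using Playwright against a locally running Vite dev server; the URL, viewport, and output path are assumptions.

```python
from playwright.sync_api import sync_playwright

def capture_preview(url: str = "http://localhost:5173",
                    path: str = "/tmp/preview.png") -> str:
    """Grab a full-page screenshot of the dev server so the change can be
    reviewed visually alongside the diff."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=path, full_page=True)
        browser.close()
    return path
```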
## Adoption and Organizational Impact
The adoption metrics are striking: reaching 30% of merged pull requests within just a couple of months represents rapid organizational uptake. Ramp emphasizes that this adoption was organic—they "didn't force anyone to use Inspect over their own tools." Instead, adoption was driven by building to people's needs, creating "virality loops through letting it work in public spaces" (likely referring to Slack integration and visible collaborative sessions), and letting the product demonstrate its value.
This organic adoption pattern is significant from an LLMOps perspective because it suggests that the system is genuinely providing value rather than being used merely to meet a mandate. When engineers voluntarily shift 30% of their PR creation to an agent, it indicates that the agent is competitive with manual coding in terms of quality, speed, or ease of use. The continued growth trajectory mentioned in the case study suggests that engineers are expanding their use of Inspect as they become more comfortable with its capabilities and limitations.
The system also democratizes contribution by providing builders of all backgrounds with "the tooling and setup an engineer would" have. This suggests that Inspect may be enabling product managers, designers, or other non-engineering roles to contribute code directly, which could represent a significant shift in how cross-functional teams operate.
## Build vs. Buy Philosophy
Ramp makes an explicit argument for building rather than buying coding agent infrastructure: "Owning the tooling lets you build something significantly more powerful than an off-the-shelf tool will ever be. After all, it only has to work on your code." This philosophy reflects a particular stance in the LLMOps landscape—that deeply integrated, company-specific tooling can outperform generic solutions precisely because it can be optimized for specific workflows, codebases, and organizational practices.
The custom skills that "encode how we ship at Ramp" represent organizational knowledge that would be difficult or impossible to capture in a general-purpose product. The deep integrations with Ramp's specific tool stack (their exact Sentry configuration, their Datadog dashboards, their LaunchDarkly flag structure) create a level of context-awareness that generic tools cannot match. However, this approach also requires significant engineering investment to build and maintain, which may not be feasible or worthwhile for all organizations.
To encourage others to replicate their approach, Ramp published a specification of their implementation, suggesting that they believe this architecture represents a generalizable pattern that other engineering organizations should consider. This open approach to sharing architectural details is relatively unusual in the LLMOps space and may indicate confidence that their competitive advantage lies in execution rather than in keeping the architecture secret.
## Infrastructure Choices and Trade-offs
The choice of Modal as the infrastructure provider is central to Inspect's design. Modal specializes in cloud infrastructure for AI workloads and provides primitives like sandboxes and file system snapshots that are specifically suited to the coding agent use case. The near-instant startup times for sandboxes make the unlimited concurrency model practical—if spinning up a new environment took minutes rather than seconds, the user experience would be fundamentally different.
However, this choice also creates a dependency on Modal's platform and pricing model. When Ramp describes sessions as "effectively free to run," this likely reflects Modal's specific pricing structure and possibly volume discounts that Ramp receives as a significant customer. Organizations evaluating similar architectures would need to carefully consider infrastructure costs at their own scale.
The decision to rebuild images every 30 minutes represents a trade-off between image freshness and build overhead. More frequent rebuilds would reduce the maximum staleness of development environments but would consume more compute resources and potentially create more churn. Less frequent rebuilds would risk developers working against outdated dependencies. The 30-minute interval suggests Ramp's codebase and dependencies change frequently enough that this refresh rate provides meaningful value.
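The scheduling side of that trade-off is straightforward to express with Modal's cron-style schedules; the sketch below omits the actual build logic and Modal's image-caching details.

```python
import modal

app = modal.App("inspect-image-refresh")

@app.function(schedule=modal.Cron("*/30 * * * *"), timeout=1800)
def rebuild_repo_images() -> None:
    # Placeholder for the real pipeline: re-clone each repository, rerun the
    # setup and build commands, and register the resulting image so that new
    # Inspect sessions never start more than ~30 minutes behind the default
    # branch. Repository names here are hypothetical.
    for repo in ("frontend", "backend"):
        print(f"rebuilding pre-baked image for {repo}")
```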
## GitHub Integration and Authentication
The case study mentions using a GitHub app with installation tokens that are generated on each clone, allowing the system to access repositories "without knowing what user will consume it." This approach to authentication is important for multi-tenancy and security. Rather than using a single service account or requiring users to provide their own credentials, the GitHub app model provides repository access while maintaining auditability and allowing fine-grained permission control.
This authentication pattern is particularly relevant for LLMOps systems that need to interact with version control systems on behalf of multiple users. The approach balances the need for automated access (the agent needs to clone repos and create PRs) with security considerations (limiting blast radius if credentials are compromised) and user experience (engineers shouldn't need to re-authenticate constantly).
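The token flow itself follows GitHub's standard App authentication: a short-lived JWT signed with the App's private key is exchanged for an installation access token scoped to the repositories that installation can reach. The sketch below uses PyJWT and requests; how Ramp wires this into each clone is not described in the case study.

```python
import time
import jwt        # PyJWT, with the cryptography extra for RS256
import requests

def installation_token(app_id: str, private_key_pem: str, installation_id: int) -> str:
    """Mint a short-lived installation access token for a GitHub App."""
    now = int(time.time())
    app_jwt = jwt.encode(
        {"iat": now - 60, "exp": now + 9 * 60, "iss": app_id},
        private_key_pem,
        algorithm="RS256",
    )
    resp = requests.post(
        f"https://api.github.com/app/installations/{installation_id}/access_tokens",
        headers={
            "Authorization": f"Bearer {app_jwt}",
            "Accept": "application/vnd.github+json",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["token"]   # valid ~1 hour, scoped to the installation

# The token can then be embedded in the clone URL inside the sandbox, e.g.:
# git clone https://x-access-token:<TOKEN>@github.com/org/repo.git
```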
## Critical Assessment and Limitations
While the case study presents impressive results, several caveats and questions warrant consideration. First, the 30% pull request metric doesn't directly indicate code quality or maintenance burden. If Inspect-generated PRs require more review time, generate more bugs, or create technical debt, the net productivity impact might be less positive than the raw number suggests. The case study doesn't provide data on bug rates, review turnaround times, or long-term maintainability of Inspect-generated code.
Second, the "effectively free" claim for session costs deserves scrutiny. While infrastructure costs may be low on a per-session basis, the aggregate cost of running thousands of concurrent sessions, storing snapshots, and maintaining the image registry could be substantial. Organizations considering similar approaches should model costs carefully at their expected scale.
Third, the case study acknowledges that session speed and quality are "only limited by model intelligence itself," which is an important limitation. Even with perfect tooling and infrastructure, the coding agents are ultimately bounded by the capabilities of the underlying language models. As of early 2026, frontier models still make mistakes, hallucinate, struggle with complex reasoning, and require human oversight. The infrastructure Inspect provides makes these models more effective, but doesn't eliminate their fundamental limitations.
Fourth, the build-vs-buy recommendation may not generalize to all organizations. Ramp has the engineering resources and AI infrastructure expertise to build and maintain a sophisticated internal tool. Smaller organizations or those with less AI-focused engineering cultures might be better served by off-the-shelf solutions, even if those solutions are less deeply integrated. The opportunity cost of building Inspect—the other projects those engineers could have worked on—isn't discussed in the case study.
Finally, the case study is incomplete (it appears to cut off mid-sentence), which limits our understanding of certain implementation details, particularly around Git operations and user attribution for agent-generated commits.
## LLMOps Maturity Indicators
This case study demonstrates several markers of LLMOps maturity. The multi-model support and MCP integration show architectural flexibility and recognition that the model landscape is evolving. The comprehensive integration with monitoring, logging, and feature flag systems indicates that Inspect is treated as a first-class part of the engineering infrastructure rather than as an experimental side project. The emphasis on verification and testing shows understanding that code generation is only part of the development workflow—validation is equally critical.
The organic adoption pattern and lack of mandated usage suggest organizational trust in the system's reliability. Engineers voting with their feet to use Inspect for 30% of their PRs indicates that the system has crossed a threshold of usefulness where it's competitive with manual coding for at least some tasks. The multiplayer sessions and multiple interaction modalities demonstrate attention to real-world workflows rather than forcing engineers into a prescribed interaction pattern.
The infrastructure investment in pre-built images, snapshot management, and instant session startup reflects an understanding that user experience—particularly latency—is critical for tool adoption. Engineers will abandon tools that feel slow or friction-filled, even if those tools are technically capable. By making Inspect sessions "strictly better than local" in terms of speed and resource availability, Ramp removed common objections to using a cloud-based development environment.
## Conclusion
Inspect represents a sophisticated implementation of coding agents in production, distinguished by its comprehensive approach to verification, deep integration with engineering tooling, and thoughtful infrastructure design. The system demonstrates that coding agents can achieve meaningful adoption in demanding production environments when they're given appropriate context and tools. However, the case study also illustrates the significant engineering investment required to build such systems and the ongoing dependency on frontier model capabilities. Organizations evaluating similar approaches should carefully consider whether the benefits of deep customization justify the build-and-maintain costs compared to evolving off-the-shelf alternatives.