Brex developed an autonomous agent platform to handle repetitive engineering tasks like gRPC migrations across 400+ services in their monorepo. The initial problem was that AI coding agents would complete changes but couldn't access feedback from CI systems, review bots, and test runners, requiring engineers to manually relay information. Brex solved this by building a platform that closes the feedback loop—automatically forwarding CI failures, bot comments, and test results back to agents running in isolated remote developer environments. The system now handles migrations end-to-end without human intervention until final review, eliminating the need for engineers to spend afternoons copying error logs and relaying automated feedback.
Brex’s engineering blog post from April 2026 details their journey building a production-ready autonomous agent platform for handling repetitive engineering tasks. The use case emerged from a concrete business need: migrating gRPC client factories across more than 400 services in their monorepo. What began as a tactical solution to a specific migration problem evolved into a general-purpose platform for delegating well-scoped engineering work to AI agents. The case study is notable for its focus on infrastructure and integration challenges rather than model capabilities, emphasizing that the gap between “agents can do this” and “agents are doing this reliably” is primarily an infrastructure and systems design problem.
The post offers a balanced perspective by acknowledging both what worked and what didn’t, and by being explicit about the limitations and requirements of their approach. It’s a real production deployment with engineers actively using the system, not a proof-of-concept or demo.
The initial business driver was a large-scale refactoring effort: centralizing gRPC client libraries that had been implemented inconsistently across hundreds of services over the years. Teams had built their own implementations, resulting in duplicated logic, inconsistent configurations, and dependency injection conflicts that only surfaced at deployment time. The fix was clear—centralize the client libraries and replace every factory—but the scope was too large for manual execution to be realistic.
Brex started by pairing human engineers with AI coding agents for these refactors. For simpler services (100-200 lines of changes), this worked well: agents could complete a task in a single pass in about 30 minutes. The work was repetitive enough that engineers could describe the pattern once and let the agent run. However, this initial success masked a fundamental problem that became apparent with more complex services.
The central technical challenge wasn’t model capability—it was connectivity and feedback loop closure. AI coding agents typically operate in isolated environments where they receive a repository, a task description, and a sandbox. They perform well in this controlled setting, but when deployed against real production infrastructure (live monorepos, actual CI/CD systems, production review bots), a specific failure mode emerged: agents would complete their changes, encounter automated feedback they couldn’t access, and either stop or continue blindly without incorporating that feedback.
The information agents needed already existed throughout Brex’s engineering infrastructure: CI pipelines reported build and test failures, review bots left comments on pull requests, and test runners produced results.
This infrastructure had been built for human engineers, but agents had no programmatic access to it. The result was a broken workflow where every migration followed the same tedious loop: spin up a remote developer environment, kick off the agent, wait for completion, run CI, wait for CI, check if review bots left comments, manually copy those comments back to the agent, ask it to fix issues, wait again, and repeat until all checks passed. Engineers became expensive messengers, spending afternoons manually relaying information from automated systems back to agents.
Brex explicitly identifies this as “an expensive solution to a plumbing problem”—the bottleneck wasn’t human decision-making or judgment, but simple information relay.
Brex’s solution centered on closing the feedback loop by giving agents programmatic access to the same automated systems that provide feedback to human engineers. They started with three Python scripts running continuously in the background, watching CI results, review bot comments, and test output and relaying anything new back to the running agent.
This simple initial implementation validated the core hypothesis: if agents could access the same feedback loops as human engineers, they could iterate independently until tasks were complete. The first migration that ran start-to-finish without human intervention confirmed that closing the feedback loop was sufficient—the agents didn’t need more sophisticated models or capabilities, they just needed access to information.
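A minimal sketch of what one of those background relay scripts could look like is below, assuming illustrative callables for the CI, review bot, and test-runner feeds; the post does not publish the actual scripts, so the names and data shapes here are stand-ins.

```python
# A minimal sketch of a feedback-relay loop in the spirit of the background scripts
# described above. Function names, data shapes, and polling strategy are illustrative
# assumptions, not Brex's actual interfaces.
import time
from typing import Callable, Iterable

# Each source returns feedback items shaped like {"id": "...", "body": "..."},
# e.g. a CI failure log, a review bot comment, or a failing test's output.
FeedbackSource = Callable[[], Iterable[dict]]

def relay_feedback(
    sources: list[FeedbackSource],          # CI failures, review bot comments, test results
    send_to_agent: Callable[[str], None],   # forwards one feedback item to the running agent
    all_checks_green: Callable[[], bool],   # true once every automated check passes
    poll_interval_s: float = 60.0,
) -> None:
    """Continuously forward new automated feedback to the agent until checks pass."""
    seen: set[str] = set()
    while True:
        new_items = [
            item
            for source in sources
            for item in source()
            if item["id"] not in seen
        ]
        for item in new_items:
            send_to_agent(item["body"])     # the agent incorporates this on its next iteration
            seen.add(item["id"])
        if not new_items and all_checks_green():
            return                          # nothing left to relay; the PR is ready for review
        time.sleep(poll_interval_s)
```

The essential property is the one the post emphasizes: the loop terminates only when the same signals a human engineer would wait on are green.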
The team then generalized this approach into a platform with several key architectural components:
Task Ingestion and Routing: Tasks enter the system from multiple sources—Slack messages, Linear tickets, GitHub issues, or cron jobs. An orchestrator routes incoming tasks to an agent pool, handling the distribution of work.
Environment Isolation: Each task receives its own remote developer environment (RDE). This isolation is critical for parallelization—it allows dozens of agents to work simultaneously on different tasks without interfering with each other. The post emphasizes that running one agent on one task is straightforward, but running 50 in parallel, each iterating independently, requires careful systems design around orchestration and environment isolation.
Full Toolchain Access: Agents receive the same toolchain that human engineers use. This includes access to the monorepo, build systems, test runners, CI/CD pipelines, code review bots, and other automated verification tools. The hypothesis, which the post notes is being validated by early PRs, is that “an agent without validation tooling is guessing.” Giving agents comprehensive tooling significantly improves output quality.
Closed Feedback Loops: This is the core innovation. Agents can read their own CI failures, access comments from code review bots, retrieve test results, and incorporate all this feedback into their iteration cycles. Instead of stopping when they encounter a failure, agents read the feedback, understand what needs to change, and apply fixes automatically.
Iterative Refinement: Agents work through multiple cycles independently—making changes, running tests, reading failures, applying fixes—until all automated checks pass. Only then do they surface a PR for human review.
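To make the shape of the platform concrete, here is a structural sketch of how these components might compose: tasks from multiple sources reach an orchestrator, each task gets its own isolated environment, and each agent iterates against its own feedback loop until checks pass. All names and interfaces below are illustrative assumptions rather than Brex’s actual platform API.

```python
# A structural sketch, not Brex's implementation: task routing, per-task environment
# isolation, iterative refinement against feedback, and bounded parallelism.
import asyncio
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Task:
    source: str        # "slack", "linear", "github", or "cron"
    description: str   # the well-scoped work the agent is asked to complete

class RemoteDevEnvironment(Protocol):
    """Assumed surface of an isolated remote developer environment."""
    async def agent_run(self, prompt: str) -> None: ...
    async def collect_feedback(self) -> str: ...      # CI failures, bot comments, test results
    async def open_pr(self) -> str: ...               # surfaces the change for human review
    async def teardown(self) -> None: ...

async def run_task(env: RemoteDevEnvironment, task: Task, max_iterations: int = 20) -> str:
    """One agent, one isolated environment, iterating until automated checks pass."""
    try:
        await env.agent_run(task.description)
        for _ in range(max_iterations):
            feedback = await env.collect_feedback()
            if not feedback:
                return await env.open_pr()            # all checks green: hand off to a human
            await env.agent_run(f"Address the following automated feedback:\n{feedback}")
        return "stalled"                              # escalate rather than loop forever
    finally:
        await env.teardown()

async def orchestrate(
    jobs: list[tuple[RemoteDevEnvironment, Task]],
    max_parallel: int = 50,
) -> list[str]:
    """Fan out many tasks, each in its own environment, with bounded concurrency."""
    sem = asyncio.Semaphore(max_parallel)
    async def bounded(env: RemoteDevEnvironment, task: Task) -> str:
        async with sem:
            return await run_task(env, task)
    return list(await asyncio.gather(*(bounded(e, t) for e, t in jobs)))
```

The orchestration itself is deliberately mundane; as the post argues, the hard part is that each environment’s feedback collection must reach the same CI, review bot, and test systems a human engineer would consult.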
The intended usage pattern is straightforward from an engineer’s perspective: send a Slack message describing a bug or refactor in the morning, and find a green PR (all checks passing) waiting by the end of the day. No afternoon check-ins, no copying error logs, no manual intervention until human review of the final output.
Brex ran dozens of gRPC migrations using this system, with the process improving with each iteration. The time savings came not from faster code generation but from eliminating the human-in-the-loop relay pattern. Engineers were no longer spending afternoons managing agent workflows.
The platform continues to expand in scope, with Brex extending what agents can access beyond CI results and code review bot comments.
As the accessible surface area expands, the scope of delegable work grows proportionally. The architecture scales cleanly because each agent operates in isolation with its own feedback loops.
The case study offers several insights specifically relevant to deploying LLMs and agents in production settings:
Infrastructure Over Models: The post’s central thesis is that the gap between “agents can do this” and “agents are doing this reliably” is primarily an infrastructure problem. Integration work—connecting agents to existing systems—matters more than model capability improvements. This represents a pragmatic, production-oriented perspective that’s often missing from AI agent discussions.
Feedback Loop Closure as Core Requirement: Agents stall when they need information sitting in systems they can’t reach. The integration work may be “less exciting than the model capability question,” but it has higher impact on production reliability. This suggests that LLMOps for agentic systems requires significant investment in integrations and tooling infrastructure.
Environment Quality Determines Output Quality: The quality of the development environment and available tooling directly impacts agent output quality. This parallels human engineering workflows—better tools lead to better work—and suggests that treating agents as first-class users of engineering infrastructure is important.
Parallelization as Systems Design: Scaling from one agent to many requires careful architectural decisions around orchestration, environment isolation, resource management, and coordination. This isn’t primarily an AI problem but a distributed systems problem.
Scope Matters: The platform works best for tasks that are “tedious, well-defined, and common enough that nobody wants to own them manually.” This is a larger category than might be expected, and it grows as the platform capabilities expand, but it’s still bounded. The post doesn’t claim agents can handle all engineering work, just a specific class of repetitive, well-scoped tasks.
Governance and Attribution: The post explicitly calls out that agents should only access information necessary for their specific tasks (principle of least privilege), and that agent-generated work needs clear attribution back to actual humans when presented for review. This shows awareness of security, compliance, and accountability concerns critical for production systems.
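As a hypothetical illustration of what least-privilege scoping and attribution could look like for a single delegated task (field names and values are invented; the post does not describe Brex’s actual policy format):

```python
# Hypothetical per-task policy illustrating least-privilege access and human attribution.
# Nothing here is from the post; it only makes the two governance points concrete.
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentTaskPolicy:
    task_id: str
    requested_by: str                       # the human the PR is attributed to at review time
    repo_paths: tuple[str, ...]             # only the code the task actually needs to touch
    readable_systems: tuple[str, ...]       # feedback sources the agent may query
    writable_systems: tuple[str, ...]       # systems the agent may act on
    pr_labels: tuple[str, ...] = ("agent-generated",)

example_policy = AgentTaskPolicy(
    task_id="grpc-client-factory-migration-0421",       # invented identifier
    requested_by="jane.doe",                             # invented requester, surfaced on the PR
    repo_paths=("services/example-service/",),
    readable_systems=("ci", "review-bots", "test-runner"),
    writable_systems=("git-branch", "pull-request"),
)
```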
While the case study presents a successful deployment, several caveats and limitations deserve attention:
Task Scope Constraints: The system is explicitly designed for “well-defined” and “well-scoped” tasks. The gRPC migration example fits this pattern—repetitive refactoring following a known pattern. It’s unclear how well the approach generalizes to more open-ended or creative engineering work. The post doesn’t claim universal applicability, which is intellectually honest, but readers should recognize this is a solution for a specific class of problems.
Human Review Still Required: The workflow still terminates with human review before merging. This is appropriate and responsible, but means the system doesn’t fully automate the end-to-end process. The value proposition is eliminating the relay loop during iteration, not eliminating human oversight entirely.
Integration Overhead: While the post frames integration work as “less exciting” than model capabilities, it’s also potentially quite costly. Building and maintaining integrations with CI systems, review bots, test runners, and other infrastructure requires ongoing engineering investment. Organizations without similar infrastructure maturity might face higher barriers to adoption.
Environment Costs: Running each agent in its own remote developer environment provides isolation but has cost implications. The post doesn’t discuss resource utilization, compute costs, or environment lifecycle management, which would be important considerations for organizations evaluating similar approaches.
Success Metrics Absent: The post discusses outcomes qualitatively (“minimal engineer-hours spent,” “no afternoon check-ins”) but doesn’t provide quantitative metrics. How much time is actually saved per migration? What’s the error rate? How often do agents produce PRs that pass human review on first attempt? These metrics would help assess ROI.
Model Details Omitted: The post doesn’t specify which LLMs power the agents, what prompting strategies are used, how context is managed for large codebases, or how the system handles cases where models produce incorrect code. These details matter for reproducibility and for understanding the technical depth.
Failure Modes Unexplored: What happens when an agent can’t complete a task even with feedback? How does the system detect agents stuck in unproductive loops? What monitoring and alerting exists? Production systems need robust failure handling, but the post doesn’t detail these aspects.
Despite these limitations, several indicators suggest this is a genuinely production-ready system rather than a prototype: engineers are actively using it for real migrations rather than demos, dozens of gRPC migrations have already run through it, and every change still passes through human review with clear attribution before merging.
This case study illustrates a maturation path for agentic LLM systems in production. Rather than focusing on increasingly powerful models, Brex invested in integration infrastructure that allows existing models to operate more effectively. This suggests that for many production use cases, the bottleneck isn’t model intelligence but system design and infrastructure.
The emphasis on closed feedback loops has implications beyond code generation. Any domain where agents interact with automated systems—testing, deployment, monitoring, incident response—could benefit from similar architectural patterns. The general template of “trigger, dedicated environment, full toolchain, closed feedback loops” could apply widely.
The focus on environment isolation and parallelization also points toward infrastructure patterns borrowed from traditional distributed systems and containerization. LLMOps platforms may increasingly resemble orchestration systems like Kubernetes, managing pools of agents in isolated environments with careful resource management and lifecycle controls.
Finally, the case study demonstrates that production LLMOps requires thinking beyond model serving and prompt engineering to encompass the full integration ecosystem. Organizations serious about deploying agents need to invest in infrastructure that connects them to existing systems, not just infrastructure that serves model inference.