Brex developed an autonomous agent platform to handle repetitive engineering tasks like gRPC migrations across 400+ services in their monorepo. The initial problem was that AI coding agents would complete changes but couldn't access feedback from CI systems, review bots, and test runners, requiring engineers to manually relay information. Brex solved this by building a platform that closes the feedback loop—automatically forwarding CI failures, bot comments, and test results back to agents running in isolated remote developer environments. The system now handles migrations end-to-end without human intervention until final review, eliminating the need for engineers to spend afternoons copying error logs and relaying automated feedback.
Brex’s engineering blog post from April 2026 details their journey building a production-ready autonomous agent platform for handling repetitive engineering tasks. The use case emerged from a concrete business need: migrating gRPC client factories across more than 400 services in their monorepo. What began as a tactical solution to a specific migration problem evolved into a general-purpose platform for delegating well-scoped engineering work to AI agents. The case study is notable for its focus on infrastructure and integration challenges rather than model capabilities, emphasizing that the gap between “agents can do this” and “agents are doing this reliably” is primarily an infrastructure and systems design problem.
The post offers a balanced perspective by acknowledging both what worked and what didn’t, and by being explicit about the limitations and requirements of their approach. It’s a real production deployment with engineers actively using the system, not a proof-of-concept or demo.
The initial business driver was a large-scale refactoring effort: centralizing gRPC client libraries that had been implemented inconsistently across hundreds of services over the years. Teams had built their own implementations, resulting in duplicated logic, inconsistent configurations, and dependency injection conflicts that only surfaced at deployment time. The fix was clear—centralize the client libraries and replace every factory—but the scope was too large for manual execution to be realistic.
Brex started by pairing human engineers with AI coding agents for these refactors. For simpler services (100-200 lines of changes), this worked well: agents could complete a task in a single pass in about 30 minutes. The work was repetitive enough that engineers could describe the pattern once and let the agent run. However, this initial success masked a fundamental problem that became apparent with more complex services.
The central technical challenge wasn’t model capability—it was connectivity and feedback loop closure. AI coding agents typically operate in isolated environments where they receive a repository, a task description, and a sandbox. They perform well in this controlled setting, but when deployed against real production infrastructure (live monorepos, actual CI/CD systems, production review bots), a specific failure mode emerged: agents would complete their changes, encounter automated feedback they couldn’t access, and either stop or continue blindly without incorporating that feedback.
The information agents needed already existed throughout Brex’s engineering infrastructure: CI pipelines reported build and test failures, review bots left comments on pull requests, and test runners produced results.
This infrastructure had been built for human engineers, but agents had no programmatic access to it. The result was a broken workflow where every migration followed the same tedious loop: spin up a remote developer environment, kick off the agent, wait for completion, run CI, wait for CI, check if review bots left comments, manually copy those comments back to the agent, ask it to fix issues, wait again, and repeat until all checks passed. Engineers became expensive messengers, spending afternoons manually relaying information from automated systems back to agents.
Brex explicitly identifies this as “an expensive solution to a plumbing problem”—the bottleneck wasn’t human decision-making or judgment, but simple information relay.
Brex’s solution centered on closing the feedback loop by giving agents programmatic access to the same automated systems that provide feedback to human engineers. They started with three Python scripts running continuously in the background, watching CI results, review bot comments, and test output and relaying anything new back to the running agent.
This simple initial implementation validated the core hypothesis: if agents could access the same feedback loops as human engineers, they could iterate independently until tasks were complete. The first migration that ran start-to-finish without human intervention confirmed that closing the feedback loop was sufficient—the agents didn’t need more sophisticated models or capabilities, they just needed access to information.
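A minimal sketch of what one of those background relay scripts could look like is below, assuming illustrative callables for the CI, review bot, and test-runner feeds; the post does not publish the actual scripts, so the names and data shapes here are stand-ins.

```python
# A minimal sketch of a feedback-relay loop in the spirit of the background scripts
# described above. Function names, data shapes, and polling strategy are illustrative
# assumptions, not Brex's actual interfaces.
import time
from typing import Callable, Iterable

# Each source returns feedback items shaped like {"id": "...", "body": "..."},
# e.g. a CI failure log, a review bot comment, or a failing test's output.
FeedbackSource = Callable[[], Iterable[dict]]

def relay_feedback(
    sources: list[FeedbackSource],          # CI failures, review bot comments, test results
    send_to_agent: Callable[[str], None],   # forwards one feedback item to the running agent
    all_checks_green: Callable[[], bool],   # true once every automated check passes
    poll_interval_s: float = 60.0,
) -> None:
    """Continuously forward new automated feedback to the agent until checks pass."""
    seen: set[str] = set()
    while True:
        new_items = [
            item
            for source in sources
            for item in source()
            if item["id"] not in seen
        ]
        for item in new_items:
            send_to_agent(item["body"])     # the agent incorporates this on its next iteration
            seen.add(item["id"])
        if not new_items and all_checks_green():
            return                          # nothing left to relay; the PR is ready for review
        time.sleep(poll_interval_s)
```

The essential property is the one the post emphasizes: the loop terminates only when the same signals a human engineer would wait on are green.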
The team then generalized this approach into a platform with several key architectural components:
Task Ingestion and Routing: Tasks enter the system from multiple sources—Slack messages, Linear tickets, GitHub issues, or cron jobs. An orchestrator routes incoming tasks to an agent pool, handling the distribution of work.
Environment Isolation: Each task receives its own remote developer environment (RDE). This isolation is critical for parallelization—it allows dozens of agents to work simultaneously on different tasks without interfering with each other. The post emphasizes that running one agent on one task is straightforward, but running 50 in parallel, each iterating independently, requires careful systems design around orchestration and environment isolation.
Full Toolchain Access: Agents receive the same toolchain that human engineers use. This includes access to the monorepo, build systems, test runners, CI/CD pipelines, code review bots, and other automated verification tools. The hypothesis, which the post notes is being validated by early PRs, is that “an agent without validation tooling is guessing.” Giving agents comprehensive tooling significantly improves output quality.
Closed Feedback Loops: This is the core innovation. Agents can read their own CI failures, access comments from code review bots, retrieve test results, and incorporate all this feedback into their iteration cycles. Instead of stopping when they encounter a failure, agents read the feedback, understand what needs to change, and apply fixes automatically.
Iterative Refinement: Agents work through multiple cycles independently—making changes, running tests, reading failures, applying fixes—until all automated checks pass. Only then do they surface a PR for human review.
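To make the shape of the platform concrete, here is a structural sketch of how these components might compose: tasks from multiple sources reach an orchestrator, each task gets its own isolated environment, and each agent iterates against its own feedback loop until checks pass. All names and interfaces below are illustrative assumptions rather than Brex’s actual platform API.

```python
# A structural sketch, not Brex's implementation: task routing, per-task environment
# isolation, iterative refinement against feedback, and bounded parallelism.
import asyncio
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Task:
    source: str        # "slack", "linear", "github", or "cron"
    description: str   # the well-scoped work the agent is asked to complete

class RemoteDevEnvironment(Protocol):
    """Assumed surface of an isolated remote developer environment."""
    async def agent_run(self, prompt: str) -> None: ...
    async def collect_feedback(self) -> str: ...      # CI failures, bot comments, test results
    async def open_pr(self) -> str: ...               # surfaces the change for human review
    async def teardown(self) -> None: ...

async def run_task(env: RemoteDevEnvironment, task: Task, max_iterations: int = 20) -> str:
    """One agent, one isolated environment, iterating until automated checks pass."""
    try:
        await env.agent_run(task.description)
        for _ in range(max_iterations):
            feedback = await env.collect_feedback()
            if not feedback:
                return await env.open_pr()            # all checks green: hand off to a human
            await env.agent_run(f"Address the following automated feedback:\n{feedback}")
        return "stalled"                              # escalate rather than loop forever
    finally:
        await env.teardown()

async def orchestrate(
    jobs: list[tuple[RemoteDevEnvironment, Task]],
    max_parallel: int = 50,
) -> list[str]:
    """Fan out many tasks, each in its own environment, with bounded concurrency."""
    sem = asyncio.Semaphore(max_parallel)
    async def bounded(env: RemoteDevEnvironment, task: Task) -> str:
        async with sem:
            return await run_task(env, task)
    return list(await asyncio.gather(*(bounded(e, t) for e, t in jobs)))
```

The orchestration itself is deliberately mundane; as the post argues, the hard part is that each environment’s feedback collection must reach the same CI, review bot, and test systems a human engineer would consult.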
The intended usage pattern is straightforward from an engineer’s perspective: send a Slack message describing a bug or refactor in the morning, and find a green PR (all checks passing) waiting by the end of the day. No afternoon check-ins, no copying error logs, no manual intervention until human review of the final output.
Brex ran dozens of gRPC migrations using this system, with the process improving with each iteration. The time savings came not from faster code generation but from eliminating the human-in-the-loop relay pattern. Engineers were no longer spending afternoons managing agent workflows.
The platform continues to expand in scope, with Brex extending what agents can access beyond CI results and code review bot comments.
As the accessible surface area expands, the scope of delegable work grows proportionally. The architecture scales cleanly because each agent operates in isolation with its own feedback loops.
The case study offers several insights specifically relevant to deploying LLMs and agents in production settings:
Infrastructure Over Models: The post’s central thesis is that the gap between “agents can do this” and “agents are doing this reliably” is primarily an infrastructure problem. Integration work—connecting agents to existing systems—matters more than model capability improvements. This represents a pragmatic, production-oriented perspective that’s often missing from AI agent discussions.
Feedback Loop Closure as Core Requirement: Agents stall when they need information sitting in systems they can’t reach. The integration work may be “less exciting than the model capability question,” but it has higher impact on production reliability. This suggests that LLMOps for agentic systems requires significant investment in integrations and tooling infrastructure.
Environment Quality Determines Output Quality: The quality of the development environment and available tooling directly impacts agent output quality. This parallels human engineering workflows—better tools lead to better work—and suggests that treating agents as first-class users of engineering infrastructure is important.
Parallelization as Systems Design: Scaling from one agent to many requires careful architectural decisions around orchestration, environment isolation, resource management, and coordination. This isn’t primarily an AI problem but a distributed systems problem.
Scope Matters: The platform works best for tasks that are “tedious, well-defined, and common enough that nobody wants to own them manually.” This is a larger category than might be expected, and it grows as the platform capabilities expand, but it’s still bounded. The post doesn’t claim agents can handle all engineering work, just a specific class of repetitive, well-scoped tasks.
Governance and Attribution: The post explicitly calls out that agents should only access information necessary for their specific tasks (principle of least privilege), and that agent-generated work needs clear attribution back to actual humans when presented for review. This shows awareness of security, compliance, and accountability concerns critical for production systems.
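As a hypothetical illustration of what least-privilege scoping and attribution could look like for a single delegated task (field names and values are invented; the post does not describe Brex’s actual policy format):

```python
# Hypothetical per-task policy illustrating least-privilege access and human attribution.
# Nothing here is from the post; it only makes the two governance points concrete.
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentTaskPolicy:
    task_id: str
    requested_by: str                       # the human the PR is attributed to at review time
    repo_paths: tuple[str, ...]             # only the code the task actually needs to touch
    readable_systems: tuple[str, ...]       # feedback sources the agent may query
    writable_systems: tuple[str, ...]       # systems the agent may act on
    pr_labels: tuple[str, ...] = ("agent-generated",)

example_policy = AgentTaskPolicy(
    task_id="grpc-client-factory-migration-0421",       # invented identifier
    requested_by="jane.doe",                             # invented requester, surfaced on the PR
    repo_paths=("services/example-service/",),
    readable_systems=("ci", "review-bots", "test-runner"),
    writable_systems=("git-branch", "pull-request"),
)
```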
While the case study presents a successful deployment, several caveats and limitations deserve attention:
Task Scope Constraints: The system is explicitly designed for “well-defined” and “well-scoped” tasks. The gRPC migration example fits this pattern—repetitive refactoring following a known pattern. It’s unclear how well the approach generalizes to more open-ended or creative engineering work. The post doesn’t claim universal applicability, which is intellectually honest, but readers should recognize this is a solution for a specific class of problems.
Human Review Still Required: The workflow still terminates with human review before merging. This is appropriate and responsible, but means the system doesn’t fully automate the end-to-end process. The value proposition is eliminating the relay loop during iteration, not eliminating human oversight entirely.
Integration Overhead: While the post frames integration work as “less exciting” than model capabilities, it’s also potentially quite costly. Building and maintaining integrations with CI systems, review bots, test runners, and other infrastructure requires ongoing engineering investment. Organizations without similar infrastructure maturity might face higher barriers to adoption.
Environment Costs: Running each agent in its own remote developer environment provides isolation but has cost implications. The post doesn’t discuss resource utilization, compute costs, or environment lifecycle management, which would be important considerations for organizations evaluating similar approaches.
Success Metrics Absent: The post discusses outcomes qualitatively (“minimal engineer-hours spent,” “no afternoon check-ins”) but doesn’t provide quantitative metrics. How much time is actually saved per migration? What’s the error rate? How often do agents produce PRs that pass human review on first attempt? These metrics would help assess ROI.
Model Details Omitted: The post doesn’t specify which LLMs power the agents, what prompting strategies are used, how context is managed for large codebases, or how the system handles cases where models produce incorrect code. These details matter for reproducibility and for understanding the technical depth.
Failure Modes Unexplored: What happens when an agent can’t complete a task even with feedback? How does the system detect agents stuck in unproductive loops? What monitoring and alerting exists? Production systems need robust failure handling, but the post doesn’t detail these aspects.
Despite these limitations, several indicators suggest this is a genuinely production-ready system rather than a prototype: engineers are actively using it for real migrations rather than demos, dozens of gRPC migrations have already run through it, and every change still passes through human review with clear attribution before merging.
This case study illustrates a maturation path for agentic LLM systems in production. Rather than focusing on increasingly powerful models, Brex invested in integration infrastructure that allows existing models to operate more effectively. This suggests that for many production use cases, the bottleneck isn’t model intelligence but system design and infrastructure.
The emphasis on closed feedback loops has implications beyond code generation. Any domain where agents interact with automated systems—testing, deployment, monitoring, incident response—could benefit from similar architectural patterns. The general template of “trigger, dedicated environment, full toolchain, closed feedback loops” could apply widely.
The focus on environment isolation and parallelization also points toward infrastructure patterns borrowed from traditional distributed systems and containerization. LLMOps platforms may increasingly resemble orchestration systems like Kubernetes, managing pools of agents in isolated environments with careful resource management and lifecycle controls.
Finally, the case study demonstrates that production LLMOps requires thinking beyond model serving and prompt engineering to encompass the full integration ecosystem. Organizations serious about deploying agents need to invest in infrastructure that connects them to existing systems, not just infrastructure that serves model inference.