Actionable CI: Intelligent Analysis and Auto-Remediation of CI Pipeline Failures

Block 2026

Block's engineering team faced a critical bottleneck where thousands of engineers struggled to understand complex CI pipeline failures across large, interconnected repositories. Their Developer Experience (DX) team built "Actionable CI," a three-layer intelligent system combining static analysis for known failure patterns, LLM-based analysis for grouping and explaining issues in plain language, and an agentic autofix capability using Goose to automatically generate, validate, and submit draft pull requests for eligible failures. The system integrates directly into CI results pages and exposes programmatic access via MCP servers, enabling both human developers and AI coding agents to efficiently diagnose and remediate build failures without manual intervention.

Industry

Finance

Overview

Block, a financial technology company, developed “Actionable CI” to address a critical developer experience bottleneck in their continuous integration infrastructure. With thousands of engineers working across large, interconnected repositories, the company’s thorough CI pipelines generated overwhelming failure output that developers struggled to interpret and act upon. The team recognized that the challenge wasn’t running CI tests but understanding what went wrong when they failed. This case study demonstrates a thoughtful, layered approach to LLMOps that combines deterministic methods with increasingly sophisticated AI capabilities, ultimately extending to full agentic automation.

Problem Context and Motivation

The core problem at Block was one of cognitive overload and developer productivity. When builds failed in their comprehensive CI pipelines, engineers were confronted with extensive logs that required significant time to parse and understand. A single root cause—such as a broken import—could cascade into fifteen or more failing test suites, creating the appearance of fifteen distinct problems when in reality only one issue existed. This lack of clarity created friction in the development workflow and slowed down the entire engineering organization. The DX team posed a transformative question: what if CI failures automatically came with explanations, root cause analysis, and ideally, proposed fixes?

Three-Layer Architecture

Block’s solution is architected as a three-layer pipeline, with each layer designed to handle different aspects of the problem with appropriate tools and techniques. This design philosophy is notable for its pragmatism—rather than applying LLMs indiscriminately, the system reserves AI capabilities for situations where they provide genuine value beyond simpler deterministic approaches.

Layer 1: Static Analysis Foundation

The first layer employs entirely deterministic static analysis to scan artifacts from failed CI jobs. This layer matches common, well-known failure patterns using rules-based detection—what the team describes as “the known knowns.” These patterns include dependency conflicts, import violations, and common configuration mistakes. The rationale for this layer is both economic and practical: language models are slower and more expensive than pattern matching with regular expressions. When a failure can be identified deterministically, that approach is preferred. This layer handles straightforward cases efficiently and produces structured signals that inform the subsequent layers. From an LLMOps perspective, this demonstrates an important principle: not every problem requires AI, and a well-designed production system should use the most appropriate tool for each task, reserving expensive LLM calls for scenarios where they add unique value.
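
To make this concrete, the sketch below shows one way such a rules-based detector could be structured. The rule names, regex patterns, and log formats are illustrative assumptions, not Block's actual implementation.

```python
import re
from dataclasses import dataclass

# Hypothetical rule set: each entry pairs a known failure signature with a
# structured label. The patterns are illustrative, not Block's actual rules.
RULES = [
    ("dependency_conflict", re.compile(r"could not resolve dependency|version conflict", re.I)),
    ("import_violation", re.compile(r"ImportError|cannot find symbol|unresolved import", re.I)),
    ("config_error", re.compile(r"missing required (property|config)|invalid configuration", re.I)),
]

@dataclass
class Signal:
    rule: str   # which known pattern matched ("the known knowns")
    line: str   # the offending log line, kept for the downstream LLM layers

def scan_log(log_text: str) -> list[Signal]:
    """Deterministic first pass: cheap regex matching over a failed job's log.
    No LLM call is made for failures this layer can already identify."""
    signals = []
    for line in log_text.splitlines():
        for rule, pattern in RULES:
            if pattern.search(line):
                signals.append(Signal(rule, line))
    return signals
```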

Layer 2: LLM-Based Analysis and Grouping

After static analysis, the system bundles logs from failing jobs and submits them to a language model with specific instructions. The LLM is tasked with identifying distinct issues present in the logs, explaining each issue in plain language with awareness of the code changes on the branch, and crucially, grouping similar issues across multiple failing jobs. This grouping capability transforms the developer experience—instead of seeing fifteen apparently separate failures from a single broken import, the developer sees one clearly explained root cause. The analysis is fetched and cached immediately when a build fails, so by the time a developer navigates to the CI results, the explanation is already available. This represents a core LLMOps pattern: using LLMs for synthesis, pattern recognition across unstructured data, and natural language explanation in situations where deterministic approaches fall short. The system maintains awareness of code changes on the branch, suggesting that the prompt includes context about what the developer modified, enabling more relevant and targeted analysis.
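
One plausible shape for this layer is sketched below, assuming a provider-agnostic complete() function for the LLM call. The prompt wording, JSON schema, and per-log truncation are illustrative stand-ins, since the case study doesn't disclose Block's model, prompts, or token budgeting.

```python
import json

def analyze_failures(job_logs: dict[str, str], branch_diff: str, complete) -> list[dict]:
    """Bundle failing-job logs with the branch diff and ask an LLM to identify,
    explain, and group distinct issues. `complete` is any text-in/text-out
    LLM call; truncating each log is a stand-in for real token budgeting."""
    bundled = "\n\n".join(
        f"### Job: {name}\n{log[-4000:]}" for name, log in job_logs.items()
    )
    prompt = (
        "The following CI jobs failed on a branch with this diff:\n\n"
        f"{branch_diff}\n\n"
        "Failing job logs:\n\n"
        f"{bundled}\n\n"
        "Identify each DISTINCT issue, explain it in plain language with "
        "reference to the diff, and group jobs that share a root cause. "
        'Respond with JSON only: [{"type": ..., "issue": ..., "explanation": ..., "jobs": [...]}]'
    )
    # A production system would validate and repair the model's JSON output.
    return json.loads(complete(prompt))
```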

Layer 3: Agentic Autofix with Validation Loop

The third layer extends beyond analysis to automated remediation for certain failure categories, including compile errors, dependency violations, lint failures, and some unit test regressions. This agentic capability is explicitly initiated by developers from the CI results interface rather than running automatically, giving engineers control over when to invoke AI-driven fixes.

The autofix workflow demonstrates sophisticated LLMOps orchestration. First, an eligibility check determines whether all detected issues are supported types and whether the total issue count is below a threshold—if any issue isn’t a good autofix candidate, the system skips this layer entirely. This gating mechanism prevents the agent from attempting fixes in scenarios where it’s unlikely to succeed, reducing wasted computational resources and avoiding unhelpful automation.
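
A minimal sketch of such a gate appears below; the supported failure categories echo those named in the case study, while the exact type labels and the threshold value are assumptions.

```python
# Illustrative gate: the case study says only "supported types" and
# "below a threshold"; the labels and limit here are assumptions.
AUTOFIX_SUPPORTED = {"compile_error", "dependency_violation", "lint_failure", "unit_test_regression"}
MAX_AUTOFIX_ISSUES = 5  # hypothetical threshold

def autofix_eligible(issues: list[dict]) -> bool:
    """All-or-nothing gate: skip the agentic layer entirely if any detected
    issue is unsupported or there are too many issues to fix reliably."""
    return (
        len(issues) <= MAX_AUTOFIX_ISSUES
        and all(issue["type"] in AUTOFIX_SUPPORTED for issue in issues)
    )
```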

When autofix is invoked, the analysis from the first two layers is sent to a headless instance of Goose (Block's open-source AI coding agent) with instructions to generate a fix. Goose then opens a draft pull request with proposed changes. Critically, the system includes a validation loop: CI runs on the draft PR, and if it passes, the process continues; if it fails, Goose analyzes the new failure and retries up to a configured limit. This retry mechanism with failure analysis represents a key agentic pattern—the system can iteratively refine its solution based on test feedback.

If the draft PR passes CI and the original branch hasn’t changed in the meantime, the draft is promoted to “ready for review” and the developer is notified that their build has been fixed. The system also includes graceful degradation: if the developer has already pushed their own fix while the agent was working, the autofix PR is quietly closed to avoid noise. This attention to developer experience and edge case handling reflects production-ready LLMOps engineering.
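
Taken together, the invocation, retry, and promotion logic described above might be orchestrated roughly as follows. The goose, ci, and prs client interfaces and the retry limit of three are hypothetical; the case study specifies only "a configured limit."

```python
MAX_ATTEMPTS = 3  # illustrative; the case study says only "a configured limit"

def run_autofix(analysis, branch, goose, ci, prs):
    """Generate a fix, validate it on a draft PR, retry on failure, and only
    promote the PR if the developer's branch is untouched. `goose`, `ci`,
    and `prs` stand in for hypothetical client interfaces."""
    base_sha = branch.head_sha()
    draft = None
    for _ in range(MAX_ATTEMPTS):
        fix = goose.generate_fix(analysis)          # headless agent invocation
        draft = prs.open_draft(branch, fix) if draft is None else prs.update(draft, fix)
        result = ci.run_and_wait(draft)             # validation loop: CI is the judge
        if result.passed:
            break
        analysis = result.failure_analysis()        # feed the new failure back in
    else:
        prs.close(draft, reason="retries exhausted")
        return

    if branch.head_sha() != base_sha:
        prs.close(draft, reason="developer already pushed a fix")  # close quietly
    else:
        prs.mark_ready_for_review(draft)
        prs.notify_author(draft, "Your build has been fixed")
```

The design point the sketch preserves is that CI itself, not the model, is the arbiter of whether a fix is acceptable: the agent only ever promotes work that the same pipeline that flagged the failure has since validated.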

Integration and Developer Experience

Block integrated Actionable CI directly into their existing CI results page rather than building a separate tool. This design decision reduces friction and ensures the capability feels like a native part of the workflow. For failures eligible for autofix, developers see an option directly in the CI results; clicking it initiates the background draft PR generation and validation process.

For failures not eligible for autofix, the interface provides multiple pathways for developers to take action. Each issue includes an “Explain” button that generates plain-language breakdowns grounded in actual code changes and logs. Developers can copy a fix prompt to their clipboard or launch it directly into their preferred AI tool. Raw logs remain one click away for engineers who want to investigate manually. This multi-tiered interface accommodates different developer preferences and skill levels while ensuring the AI assistance enhances rather than replaces developer agency.

Agent Experience as Infrastructure

A particularly forward-looking aspect of this implementation is the recognition that “Agent Experience” is becoming a key component of Developer Experience. Block built an MCP (Model Context Protocol) server and agent skills that expose CI analysis and fix capabilities programmatically. This means AI coding agents can access CI failure analysis, generate fixes, and validate them against CI in an automated loop without requiring a developer to manually check results.

This architectural decision transforms Actionable CI from a developer-facing tool into infrastructure that both humans and agents can consume. An AI coding agent working on behalf of a developer can pull failure analysis, propose a fix, submit it, observe the CI results, and iterate—all programmatically. This represents an emerging LLMOps pattern: building systems that serve both human and agent consumers, with APIs and interfaces designed for programmatic access by autonomous systems. The use of MCP, an emerging standard for agent-to-tool communication, suggests Block is thinking about interoperability and ecosystem integration.
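
As a minimal sketch of what such a server could look like, the example below uses the FastMCP helper from the open-source Python MCP SDK. The tool names, payloads, and backing functions are assumptions; Block has not published its actual schema.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("actionable-ci")

def load_cached_analysis(build_id: str) -> dict:
    # Hypothetical lookup into the analysis cache populated at failure time.
    return {"build_id": build_id, "issues": []}

def enqueue_autofix(build_id: str) -> dict:
    # Hypothetical handoff to the Goose-backed autofix worker.
    return {"build_id": build_id, "status": "queued"}

@mcp.tool()
def get_failure_analysis(build_id: str) -> dict:
    """Return the layer-1/layer-2 analysis for a failed build."""
    return load_cached_analysis(build_id)

@mcp.tool()
def request_autofix(build_id: str) -> dict:
    """Start the draft-PR autofix flow and return a handle for polling."""
    return enqueue_autofix(build_id)

if __name__ == "__main__":
    mcp.run()
```

With tools exposed this way, an agent can poll get_failure_analysis, call request_autofix, and watch CI results in a loop without any human clicking through the results page.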

LLMOps Considerations and Design Patterns

Several notable LLMOps patterns and considerations emerge from this case study:

Layered Intelligence Architecture: The three-layer design demonstrates how production LLM systems can combine deterministic, AI-powered, and agentic capabilities in a coherent pipeline. Each layer has clear responsibilities and the system routes work appropriately based on problem characteristics.

Prompt Engineering with Context: The LLM analysis layer receives not just raw logs but also awareness of code changes on the branch. This context injection is crucial for relevant analysis and represents thoughtful prompt design that goes beyond simply dumping data into an LLM.

Caching and Latency Management: Analysis is fetched and cached as soon as builds fail, ensuring minimal perceived latency when developers view results. This proactive caching pattern is essential for production LLM applications where user-facing latency is critical.
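
A sketch of this pattern, reusing the hypothetical analyze_failures() function from the Layer 2 example above, with an in-process dict standing in for whatever shared store a real deployment would presumably use:

```python
import threading

# In-process cache keyed by build ID; a real deployment would likely use a
# shared store, but the proactive-caching shape is the same.
ANALYSIS_CACHE: dict[str, list[dict]] = {}

def on_build_failed(build_id: str, job_logs: dict[str, str], branch_diff: str, complete):
    """Hypothetical CI webhook handler: start the LLM analysis the moment a
    build fails so it is already cached when the developer opens the page."""
    def work():
        ANALYSIS_CACHE[build_id] = analyze_failures(job_logs, branch_diff, complete)
    threading.Thread(target=work, daemon=True).start()

def get_analysis(build_id: str) -> list[dict] | None:
    """Called when the CI results page renders; usually a cache hit by then."""
    return ANALYSIS_CACHE.get(build_id)
```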

Eligibility Gating: Before invoking expensive agentic workflows, the system checks whether conditions are appropriate for success. This reduces wasted computation and prevents unhelpful automation, representing responsible resource management in production LLM systems.

Validation Loops with Retry Logic: The autofix agent doesn’t just generate code and hope for the best—it validates changes against CI and can retry with failure analysis if the first attempt doesn’t work. This represents robust error handling and self-correction in agentic systems.

Graceful Degradation and Edge Case Handling: The system handles scenarios like concurrent developer fixes, changed branches, and persistent failures without creating confusion or noise. This attention to edge cases is characteristic of production-ready systems.

Human-in-the-Loop Design: Autofix is explicitly initiated by developers rather than running automatically. This preserves developer agency and ensures AI assistance augments rather than replaces human judgment.

Multiple Output Pathways: The system provides plain-language explanations, copyable prompts, direct integration with developer tools, and access to raw logs. This accommodates different developer workflows and preferences.

Critical Assessment

While the case study presents an impressive implementation, several aspects warrant balanced consideration. First, the text comes from Block’s engineering blog and is explicitly framed as a recruitment tool (“Inspired by our work? Join Block’s engineering team”), so claims should be evaluated with appropriate skepticism. The case study doesn’t provide quantitative metrics on adoption rates, time savings, autofix success rates, or developer satisfaction, making it difficult to assess actual impact beyond the architectural description.

The autofix layer is tied to Goose, Block's own open-source AI coding agent, whose capabilities and limitations aren't fully explored in the case study. Questions remain about error rates, the percentage of failures eligible for autofix, how often the validation loop succeeds versus exhausts retries, and what happens when Goose generates fixes that pass CI but introduce subtle bugs or regressions not caught by tests.

The grouping of similar issues across failing jobs is described as an LLM capability, but the case study doesn’t detail how grouping quality is evaluated or what happens when the LLM incorrectly groups unrelated issues or fails to group related ones. Similarly, the “awareness of code changes on the branch” suggests sophisticated context management, but the implementation details—such as how much context is included, how it’s formatted, and token budget management—aren’t addressed.

The MCP server and agent skills for programmatic access represent forward-looking infrastructure, but adoption patterns, security considerations for agents making automated code changes, and governance mechanisms aren’t discussed. In production environments with thousands of engineers, questions about access control, audit trails, and preventing agent-generated technical debt become important.

Despite these unknowns, the case study demonstrates several strengths. The layered architecture shows mature thinking about when to use AI versus deterministic approaches. The validation loop with retry logic represents robust agentic design. The integration directly into existing developer workflows reduces adoption friction. And the dual focus on human and agent consumers suggests strategic thinking about the evolving landscape of software development.

Conclusion

Block’s Actionable CI represents a thoughtful application of LLMOps principles to developer tooling, specifically addressing CI pipeline failure diagnosis and remediation. The three-layer architecture demonstrates how production systems can combine static analysis, LLM-powered synthesis and explanation, and agentic automation in a coherent workflow that matches capabilities to problem characteristics. The system’s design reflects several emerging best practices in LLMOps: appropriate use of AI only where it adds value, proactive caching for latency management, validation loops for agentic workflows, eligibility gating to prevent wasted computation, and infrastructure designed to serve both human and agent consumers. While the case study would benefit from quantitative metrics and deeper discussion of limitations and failure modes, it provides valuable insights into how large engineering organizations are operationalizing LLMs to enhance developer productivity in continuous integration workflows.
