ZenML

Security-Focused LLM Agent Harness for Automated Vulnerability Discovery

Cloudflare 2026

Cloudflare deployed Anthropic's Mythos Preview model as part of Project Glasswing to identify security vulnerabilities across their own infrastructure and codebases. The problem was that traditional vulnerability scanning tools and generic coding agents proved insufficient for comprehensive security research at scale, missing complex exploit chains and generating excessive false positives. Cloudflare developed a sophisticated multi-stage harness architecture that orchestrates multiple specialized agents working in parallel, each with narrow, focused scopes. This harness includes reconnaissance, hunting, validation, gap-filling, deduplication, tracing, feedback loops, and structured reporting stages. The results showed Mythos Preview represents a significant advance over previous frontier models, particularly in exploit chain construction and proof-of-concept generation, though challenges remain around model refusals, signal-to-noise ratios, and the need for architectural defenses rather than just faster patching.

Industry

Tech

Overview

Cloudflare’s Project Glasswing case study presents a sophisticated production deployment of security-focused LLMs for automated vulnerability discovery across their infrastructure. The company participated in Anthropic’s Project Glasswing program, gaining early access to Mythos Preview, a security-specialized frontier model. The case study is notable not just for deploying a cutting-edge model, but for the extensive harness architecture Cloudflare built around it to make the model operationally viable at scale. This represents a mature LLMOps implementation where the infrastructure and orchestration layer proves as critical as the model itself.

The context is important: Cloudflare had already been running security-focused LLMs against their code for months before accessing Mythos Preview. This prior experience gave them baseline comparisons and shaped their architectural decisions. They tested Mythos Preview against more than fifty of their own repositories, spanning their runtime, edge data path, protocol stack, control plane, and open-source dependencies. This wasn’t a proof-of-concept but a production security tool deployment with real operational consequences.

Model Capabilities and Advancement

Mythos Preview demonstrated two key capabilities that differentiate it from general-purpose frontier models. The first is exploit chain construction: the ability to take multiple small vulnerabilities and reason about how to combine them into working exploits. Real attacks rarely exploit single bugs in isolation; they chain primitives together, for instance by turning a use-after-free bug into arbitrary read-write primitives, hijacking control flow, and constructing return-oriented programming (ROP) chains. Cloudflare notes that Mythos Preview’s reasoning in this domain resembles the work of senior security researchers rather than automated scanner output.

The second capability is proof generation with iterative refinement. The model doesn’t just identify potential bugs—it writes code to trigger them, compiles that code in a scratch environment, executes it, and evaluates the results. If the proof-of-concept doesn’t work as expected, the model reads the failure output, adjusts its hypothesis, and tries again. This closed-loop validation transforms speculative findings into confirmed vulnerabilities, dramatically improving signal quality. Cloudflare emphasizes that suspected flaws without working proofs are just speculation, and Mythos Preview closes this gap autonomously.

Interestingly, Cloudflare found that other frontier models could identify many of the same underlying bugs when run through their harness, and sometimes demonstrated reasonable exploit reasoning. However, these models consistently failed at the integration step—they’d identify interesting bugs and explain their significance but leave the exploit chain incomplete and exploitability uncertain. The key advancement with Mythos Preview is the ability to synthesize these pieces into coherent, demonstrable exploits, turning low-severity findings that would languish in backlogs into actionable high-severity reports.

Model Behavior Challenges: Organic Refusals

An intriguing aspect of this deployment involves what Cloudflare calls “organic refusals”: emergent guardrail behavior that the model exhibited even though additional safeguards had been removed for Project Glasswing. The Mythos Preview build provided to Cloudflare lacked the safety layers present in generally available models like Opus 4.7 or GPT-5.5, yet it still pushed back unpredictably on certain legitimate security research tasks.

The critical issue is inconsistency. The same task, framed differently or presented in different contexts, could produce completely different outcomes. In one instance, the model refused to perform vulnerability research on a project, then agreed to the identical research after an unrelated environmental change—the code hadn’t changed at all. In another case, the model found and confirmed serious memory bugs but then refused to write a demonstration exploit, though rephrasing the request yielded compliance. The probabilistic nature of LLMs means even semantically equivalent requests can produce opposite responses across runs.

From an LLMOps perspective, this presents a significant reliability challenge. Organic refusals aren’t consistent enough to serve as reliable safety boundaries, which is precisely why Anthropic includes additional safeguards in production models. For Cloudflare, this meant building operational processes around potential refusals and likely implementing retry logic with varied phrasings. This highlights a broader LLMOps principle: when deploying capable models, you cannot rely solely on model-internal behavior for either safety or reliability—you need architectural controls and orchestration logic that handles inconsistent responses gracefully.

The Signal-to-Noise Challenge

Cloudflare identifies the signal-to-noise problem as one of the hardest aspects of production LLM deployment for security. Triaging vulnerabilities—deciding which bugs are real, exploitable, and urgent—was difficult before AI, but AI vulnerability scanners and AI-generated code have amplified the problem considerably. Cloudflare built multiple post-validation stages specifically to handle this.

Two factors dominate the noise rate. Programming language matters significantly: C and C++ codebases, with direct memory control and bug classes like buffer overflows, consistently produced more false positives than memory-safe languages like Rust. This is partly because memory-unsafe languages have more genuine vulnerabilities, but also because models appear biased toward finding memory corruption issues even where they don’t exist.

Model bias is the second factor and perhaps more insidious. Cloudflare notes that models don’t naturally communicate confidence levels the way human researchers do. When instructed to find bugs, models will generate findings regardless of whether vulnerabilities actually exist, hedging with qualifiers like “possibly,” “potentially,” or “could in theory.” These hedged findings vastly outnumber confident ones. While this bias is reasonable for exploratory tools, it’s destructive in triage queues where every speculative finding consumes human attention and compute tokens to dismiss. The cost compounds across thousands of findings.
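To make the triage cost concrete, here is a small sketch of how a queue might push hedged findings behind confident ones. It is illustrative only: the three qualifiers come from the text above, but `is_hedged`, `triage_order`, and the record fields (`summary`, `has_poc`) are hypothetical and not something the case study describes.

```python
# Qualifiers quoted in the write-up; substring matching is a deliberate simplification.
HEDGES = ("possibly", "potentially", "could in theory")


def is_hedged(summary: str) -> bool:
    lowered = summary.lower()
    return any(hedge in lowered for hedge in HEDGES)


def triage_order(findings: list[dict]) -> list[dict]:
    """Findings with a working PoC first, then unhedged claims, then hedged ones."""
    return sorted(
        findings,
        key=lambda f: (not f["has_poc"], is_hedged(f["summary"])),
    )


queue = triage_order([
    {"id": "A", "summary": "Parser could in theory overflow", "has_poc": False},
    {"id": "B", "summary": "Heap overflow in decoder, PoC attached", "has_poc": True},
    {"id": "C", "summary": "Missing bounds check on length field", "has_poc": False},
])
```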

Mythos Preview represents a clear improvement here, particularly through its exploit chaining capabilities. Findings that arrive with working proof-of-concept code are inherently more actionable and require less validation effort. Cloudflare deliberately tunes their harnesses to over-report (favoring recall over precision to minimize missed vulnerabilities), which increases noise, but notes that Mythos Preview’s output shows noticeably higher quality: fewer hedged findings, clearer reproduction steps, and less work required to reach fix-or-dismiss decisions. This improvement in precision while maintaining recall is a key production metric for their LLMOps deployment.

Architectural Insight: Why Generic Coding Agents Fail

A significant portion of Cloudflare’s learnings addresses why simply pointing generic coding agents at repositories doesn’t work for security research. When they began AI-assisted vulnerability research the previous year, their initial approach was the obvious one: direct a general-purpose coding agent at a repository and ask it to discover vulnerabilities. This produces findings but fails to provide meaningful coverage or high-value results.

Cloudflare identifies two fundamental mismatches. First is context window management. Coding agents are optimized for single focused streams of work—building features, fixing bugs, writing refactors. They ingest large amounts of source code, maintain a single hypothesis, and iterate against it. This is the wrong shape for vulnerability research, which is inherently narrow and parallel. Human security researchers pick specific targets—a single complex feature, transitions across security boundaries, or a particular vulnerability class like command injections—and investigate thoroughly. Then they repeat this process thousands of times across different targets throughout the codebase. A single agent session, even with subagents, against a hundred-thousand-line repository can cover perhaps a tenth of a percent of the surface usefully before the context window fills and compaction begins, potentially discarding earlier findings that would matter later.

The second mismatch is throughput architecture. Single-stream agents process sequentially, but real codebases demand many hypotheses tested concurrently against many components, with dynamic fan-out when interesting patterns emerge. While you can push a single agent harder, you eventually hit limitations imposed not by the model’s capabilities but by the interaction pattern itself. Using the model directly through a coding agent interface works for manual investigation when researchers already have leads, but it’s fundamentally the wrong tool for achieving broad coverage.

This realization drove Cloudflare to abandon the coding agent approach and build a specialized harness architecture instead. This represents a mature LLMOps insight: the right deployment architecture depends on the task structure, not just the model capabilities. Generic agent frameworks optimized for conversational code generation don’t translate to batch security analysis without fundamental redesign.

The Harness Architecture: Production LLMOps at Scale

Cloudflare’s harness represents sophisticated production LLMOps orchestration. Four key lessons shaped its design, each translating into specific architectural choices:

Narrow scope produces better findings. Rather than broad directives like “Find vulnerabilities in this repository,” the harness generates specific, contextualized tasks: “Look for command injection in this specific function, with this trust boundary above it, here’s the architecture document and here’s prior coverage of this area.” This mimics how human researchers actually work—focused investigation rather than aimless exploration.

Adversarial review reduces noise. Cloudflare implements a validation stage with a second agent using a different prompt, different model, and no ability to generate its own findings. This agent’s sole job is reviewing the first agent’s work. Putting two agents in deliberate disagreement proves far more effective than having one agent check its own output. This is an interesting application of ensemble techniques and separation of concerns in LLMOps.

Splitting questions across agents improves reasoning. “Is this code buggy?” and “Can an attacker actually reach this bug from outside the system?” are distinct questions requiring different reasoning. The harness asks them sequentially through specialized agents rather than combining them, because each narrower question yields better results than the compound version. This reflects an understanding of how task decomposition affects LLM performance.

Parallel narrow tasks beat exhaustive agents. Coverage improves when many agents work concurrently on tightly scoped questions with downstream deduplication, rather than asking one agent to be comprehensive. This is fundamentally a distributed systems insight applied to LLMOps.
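The first lesson, narrow scope, is easiest to see as a data structure. The sketch below assumes a hypothetical `HuntTask` record whose fields mirror the example prompt quoted above (attack class, scope, trust boundary, architecture document, prior coverage); the field names and prompt wording are illustrative, not Cloudflare’s schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HuntTask:
    attack_class: str      # e.g. "command injection"
    scope_hint: str        # the specific function or component to inspect
    trust_boundary: str    # what sits above the scope and who controls it
    architecture_doc: str  # shared context about the repository
    prior_coverage: tuple[str, ...] = ()  # what earlier passes already examined

    def to_prompt(self) -> str:
        """Render a tightly scoped instruction instead of a broad directive."""
        covered = "; ".join(self.prior_coverage) or "none recorded"
        return (
            f"Look for {self.attack_class} in {self.scope_hint}.\n"
            f"Trust boundary above it: {self.trust_boundary}.\n"
            f"Architecture notes: {self.architecture_doc}\n"
            f"Prior coverage of this area: {covered}"
        )


task = HuntTask(
    attack_class="command injection",
    scope_hint="run_hook() in hooks/exec.py",
    trust_boundary="HTTP handler accepting unauthenticated input",
    architecture_doc="Hooks are spawned via a shell wrapper.",
)
prompt = task.to_prompt()
```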

The production harness implements eight distinct stages:

The Recon stage employs an agent that reads repositories top-down, fans out to subsystem-specific subagents, and produces architecture documentation covering build commands, trust boundaries, entry points, and attack surface. It also generates the initial task queue for subsequent stages. This provides shared context for all downstream agents and eliminates aimless exploration.

The Hunt stage is where the primary work happens. Each task pairs one attack class with a scope hint. Hunter agents run concurrently—typically around fifty simultaneously—with each fanning out to several exploration subagents. Hunters have access to tools that compile and execute proof-of-concept code in isolated per-task scratch directories. This parallel narrow-task architecture directly addresses the throughput and coverage limitations of sequential agents.
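Bounded concurrency of this kind is commonly expressed with a semaphore. The following is a minimal sketch, assuming an async `hunt` callable per task; `run_hunters` and `MAX_CONCURRENT_HUNTERS` are hypothetical names, and the limit of fifty simply echoes the “around fifty” figure above.

```python
import asyncio

MAX_CONCURRENT_HUNTERS = 50  # "around fifty" per the write-up


async def run_hunters(tasks, hunt, limit=MAX_CONCURRENT_HUNTERS):
    """Run hunt(task) for every task, with at most `limit` in flight at once."""
    gate = asyncio.Semaphore(limit)

    async def guarded(task):
        async with gate:  # wait for a free hunter slot
            return await hunt(task)

    # gather() returns results in the same order as the input tasks.
    return await asyncio.gather(*(guarded(task) for task in tasks))


async def _demo():
    active = 0
    peak = 0

    async def fake_hunt(task):
        # Stand-in for a hunter agent; records how many run at the same time.
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0)
        active -= 1
        return f"done:{task}"

    results = await run_hunters(range(200), fake_hunt, limit=50)
    return results, peak


results, peak = asyncio.run(_demo())
```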

The Validate stage implements adversarial review. An independent agent re-reads code and attempts to disprove original findings using a different prompt and without the ability to generate new findings. This catches a meaningful fraction of false positives that hunters wouldn’t identify when reviewing their own work.
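The constraint that the reviewer cannot generate findings of its own can be enforced structurally rather than by prompt alone. In this hedged sketch, `Finding`, `Verdict`, and `validate` are hypothetical names: the reviewer is only allowed to return a verdict on the finding it was handed, so it has no channel for adding new ones.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Finding:
    finding_id: str
    claim: str


@dataclass(frozen=True)
class Verdict:
    finding_id: str
    disproved: bool
    reason: str


def validate(
    findings: Iterable[Finding],
    reviewer: Callable[[Finding], Verdict],
) -> list[Finding]:
    """Keep only the findings the independent reviewer failed to disprove."""
    kept = []
    for finding in findings:
        verdict = reviewer(finding)
        if verdict.finding_id != finding.finding_id:
            raise ValueError("reviewer must judge the finding it was given")
        if not verdict.disproved:
            kept.append(finding)
    return kept


def _demo_reviewer(finding: Finding) -> Verdict:
    # Stand-in for the second agent: rejects claims about unreachable code.
    dead = "unreachable" in finding.claim
    return Verdict(finding.finding_id, dead, "dead code" if dead else "holds up")


survivors = validate(
    [
        Finding("F1", "overflow in unreachable debug path"),
        Finding("F2", "integer truncation in length check"),
    ],
    _demo_reviewer,
)
```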

The Gapfill stage addresses coverage blind spots. Hunters flag areas they touched but didn’t thoroughly investigate, and those areas get re-queued for additional passes. This counteracts the model’s tendency to drift toward attack classes where it has already succeeded, ensuring broader coverage.

Dedupe collapses findings sharing the same root cause into single records. Variant analysis is valuable, but duplicates inflate triage queues without adding information.
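A simple way to picture this stage is grouping by a root-cause key while keeping the variants attached to the surviving record. The key used below (file, function, bug class) is an assumption made for illustration; the case study does not say how Cloudflare identifies a shared root cause.

```python
from collections import defaultdict


def dedupe(findings: list[dict]) -> list[dict]:
    """Collapse findings that share a root-cause key into one record each."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for finding in findings:
        # Hypothetical root-cause key; a real harness may need something richer.
        key = (finding["file"], finding["function"], finding["bug_class"])
        groups[key].append(finding)
    return [
        {"root_cause": key, "canonical": members[0], "variants": members[1:]}
        for key, members in groups.items()
    ]


records = dedupe([
    {"id": 1, "file": "buf.c", "function": "grow", "bug_class": "overflow"},
    {"id": 2, "file": "buf.c", "function": "grow", "bug_class": "overflow"},
    {"id": 3, "file": "tls.c", "function": "read", "bug_class": "uaf"},
])
```

Keeping `variants` on the record preserves the value of variant analysis while the triage queue sees one entry per root cause.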

The Trace stage is particularly sophisticated and addresses what Cloudflare considers most important: determining reachability. For each confirmed finding in a shared library, a tracer agent fans out—one instance per consumer repository—uses a cross-repository symbol index, and determines whether attacker-controlled input actually reaches the bug from outside the system. This transforms abstract findings (“there is a flaw”) into actionable vulnerabilities (“there is a reachable vulnerability accessible to attackers”).

Feedback implements closed-loop improvement. Reachable traces become new hunt tasks in consumer repositories where bugs are actually exposed. The pipeline improves as it runs, building on its own discoveries.

Report generates structured output against a predefined schema. An agent writes reports, validates them against the schema, fixes any validation errors itself, and submits via an ingest API. This produces queryable structured data rather than free-form prose, enabling downstream automation and analytics.
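The write, validate, fix, submit cycle can be illustrated with a toy schema. The sketch below is an assumption-laden stand-in: `REPORT_SCHEMA`, its fields, `write_report`, and the `fix` and `submit` callables are hypothetical, and the actual schema and ingest API are not described in the case study.

```python
import json

# Hypothetical report schema: field name -> required Python type.
REPORT_SCHEMA = {"title": str, "severity": str, "reachable": bool, "reproduction": str}


def schema_errors(report: dict) -> list[str]:
    errors = []
    for name, expected in REPORT_SCHEMA.items():
        if name not in report:
            errors.append(f"missing field: {name}")
        elif not isinstance(report[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    return errors


def write_report(draft: dict, fix, submit, max_rounds: int = 3) -> dict:
    """Validate the draft, let the writer repair errors, then submit as JSON."""
    report = draft
    for _ in range(max_rounds):
        errors = schema_errors(report)
        if not errors:
            break
        report = fix(report, errors)  # the writer sees its own validation errors
    errors = schema_errors(report)
    if errors:
        raise ValueError(f"report still invalid: {errors}")
    submit(json.dumps(report))
    return report


def _demo_fix(report: dict, errors: list[str]) -> dict:
    # Stand-in for the report-writing agent repairing its own output.
    patched = dict(report)
    for error in errors:
        if error == "missing field: reachable":
            patched["reachable"] = True
        elif error.startswith("severity:"):
            patched["severity"] = str(report["severity"])
    return patched


submitted = []  # stand-in for the ingest API
final = write_report(
    {"title": "UAF in cache", "severity": 9, "reproduction": "run poc.c"},
    _demo_fix,
    submitted.append,
)
```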

The harness itself was partially built using Mythos Preview. Cloudflare used the model to build on, tailor, and improve their original harness designs to suit the model’s specific strengths. This represents a form of co-evolution between tooling and model capabilities.

Operational Considerations and Balanced Assessment

Cloudflare provides candid operational insights that temper the technical achievements. The loudest external reaction from other security leaders focused on speed—scanning faster, patching faster, compressing response cycles. Some teams are now operating under two-hour SLAs from CVE release to patched production deployment. Cloudflare argues this approach is fundamentally misguided.

Patching faster doesn’t change the pipeline that produces patches. If regression testing takes a day, achieving two-hour SLAs requires skipping it, and the bugs introduced by skipping regression testing are often worse than the original vulnerabilities. Cloudflare learned this lesson when they tried letting models write their own patches and observed several that fixed the original bug while quietly breaking dependencies.

The harder question is architectural: what should defenses look like? The principle is making exploitation harder even when bugs exist, so the gap between disclosure and patching matters less. This means defenses that block bugs from being reached, application designs that isolate failures to prevent lateral movement, and deployment systems that can roll out fixes universally and instantaneously rather than waiting on individual teams.

Cloudflare acknowledges the dual-use nature of these capabilities. The same techniques that helped them find bugs in their own code will, in adversarial hands, accelerate attacks against every application on the internet. As a company providing protective services to millions of applications, they recognize responsibility for helping customers apply the architectural principles they’ve developed. They promise further communication on this topic.

Critical Evaluation and LLMOps Maturity Indicators

This case study demonstrates several markers of mature LLMOps practice. Cloudflare doesn’t just deploy a model; they build comprehensive infrastructure around it that handles the model’s limitations and optimizes for its strengths. They recognize that the orchestration layer matters as much as the model, perhaps more. Their harness architecture reflects deep understanding of both LLM behavior (context management, task decomposition, ensemble techniques) and their domain (security research workflows, triage economics, coverage requirements).

The case study also demonstrates operational realism. Cloudflare openly discusses false positive rates, model refusals, coverage limitations, and the gap between finding bugs and shipping patches. They recognize that speed without architecture creates more problems than it solves. This balanced perspective contrasts with vendor claims that often present technology as a complete solution rather than a capable tool requiring extensive operational support.

However, some aspects deserve scrutiny. The case study comes from Cloudflare’s blog and serves marketing purposes alongside technical communication. Claims about Mythos Preview’s advantages over previous models are difficult to evaluate without independent benchmarks or side-by-side comparisons. Cloudflare acknowledges clean comparisons are difficult because Mythos Preview does “a different kind of work,” but this also makes it hard to assess the magnitude of the improvement objectively.

The organic refusal behavior suggests these models aren’t yet reliably controllable in production contexts. Building operational processes around unpredictable refusals and implementing retry strategies with varied phrasings adds complexity and fragility. The fact that semantically identical requests can produce opposite responses is fundamentally a reliability problem for production systems.

The harness architecture, while sophisticated, also represents significant engineering investment. Organizations considering similar deployments should recognize that pointing a model at code is perhaps 10% of the work—the other 90% is building the orchestration, validation, deduplication, tracing, and reporting infrastructure. This case study describes a mature system built by a large engineering team at a well-resourced company. Smaller organizations might struggle to replicate this approach.

The signal-to-noise challenge persists despite improvements. Cloudflare still deliberately over-reports and requires multiple validation stages to manage false positives. While Mythos Preview shows improvement, security teams deploying these systems should expect significant ongoing operational costs in triage and validation. The model doesn’t eliminate the noise problem; it reduces it to potentially manageable levels.

Finally, the case study raises important questions about dual-use technology and responsible disclosure. Cloudflare tested this against their own code in controlled environments and followed formal vulnerability management processes. But as these capabilities become more accessible, the attack-defense balance shifts. Cloudflare hints at this with their promise to share customer-focused guidance, but the broader industry implications remain uncertain.

Overall, this case study represents one of the more technically detailed and operationally honest accounts of production LLM deployment for security purposes currently available. It demonstrates that success requires far more than model access—it demands sophisticated orchestration, domain expertise, engineering investment, and realistic expectations about capabilities and limitations.
