Mozilla built an AI-powered security auditing pipeline to identify and fix latent security vulnerabilities in Firefox, using advanced language models like Claude Mythos Preview and Claude Opus 4.6. The problem was that traditional fuzzing and manual code review were insufficient to find complex security bugs, particularly sandbox escapes and intricate race conditions across Firefox's multi-process architecture. Mozilla's solution involved developing an agentic harness that could not only statically analyze code but also dynamically create and run reproducible test cases to validate hypotheses about vulnerabilities. The results were unprecedented: 271 bugs identified by Claude Mythos Preview alone were fixed in Firefox 150, 180 of them rated sec-high, and 423 security bugs in total were fixed across the April 2026 releases. The pipeline successfully identified vulnerabilities ranging from 15-year-old bugs to complex sandbox escapes that had evaded extensive fuzzing.
Mozilla’s Firefox team developed and deployed a production-scale AI-powered security vulnerability detection pipeline that represents a sophisticated implementation of LLMOps principles. This case study demonstrates how the rapid evolution of language model capabilities between early 2024 and early 2026 fundamentally changed the economics and effectiveness of AI-assisted security auditing. The project moved from experimental prompting with GPT-4 and Claude Sonnet 3.5 to a full production pipeline capable of identifying hundreds of critical security vulnerabilities that had evaded traditional detection methods including extensive fuzzing and manual code review.
The context for this deployment is particularly important: Mozilla acknowledges that just months before this work, AI-generated security bug reports were largely considered “unwanted slop” that imposed asymmetric costs on maintainers. The transformation came from two factors: dramatically improved model capabilities and, critically, Mozilla’s development of sophisticated techniques for harnessing, steering, scaling, and stacking models to generate signal while filtering noise.
The core innovation in Mozilla’s approach was building an agentic harness atop their existing fuzzing infrastructure. This represents a crucial LLMOps pattern: rather than relying purely on static analysis where models generate hypotheses that humans must validate, the harness provides models with the capability to dynamically test their hypotheses. The key distinguishing feature is that given appropriate interfaces and instructions, the system can create and run reproducible test cases to validate suspected vulnerabilities.
This architectural decision addresses one of the fundamental challenges in deploying LLMs for code analysis: the high false positive rate that plagued earlier attempts. By enabling the model to validate its own findings through dynamic testing, Mozilla created a system that could both discover real bugs and dismiss unreproducible speculation, making the entire pipeline scalable in a way that pure static analysis could not achieve.
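The validate-before-report idea can be illustrated with a minimal sketch. This is not Mozilla's harness: `Hypothesis`, `run_testcase`, and `triage` are hypothetical names, and a "testcase" here is just a Python callable whose crash is simulated by an exception, where a real harness would run a generated testcase against an instrumented browser build.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    """A suspected bug plus the testcase written to prove it (hypothetical shape)."""
    description: str
    testcase: Callable[[], None]  # stand-in for a testcase run against a real build


def run_testcase(testcase: Callable[[], None], attempts: int = 3) -> bool:
    """Return True only if the testcase 'crashes' on every attempt."""
    for _ in range(attempts):
        try:
            testcase()
        except Exception:
            continue  # crash observed on this attempt, try again
        return False  # a clean run: not treated as reproducible in this sketch
    return True


def triage(hypotheses: List[Hypothesis]) -> List[Hypothesis]:
    """Keep hypotheses backed by a reproducible testcase; drop speculation."""
    return [h for h in hypotheses if run_testcase(h.testcase)]


def crashing() -> None:
    raise MemoryError("simulated use-after-free")


def harmless() -> None:
    pass


confirmed = triage([
    Hypothesis("reentrant call frees backing store", crashing),
    Hypothesis("speculative overflow with no trigger", harmless),
])
print([h.description for h in confirmed])
```

Requiring a crash on every attempt is a deliberately strict assumption; timing-dependent bugs such as races would likely need a looser reproducibility threshold.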
Mozilla’s deployment strategy demonstrates mature LLMOps thinking around model selection and upgrading. They began experiments with publicly available models like GPT-4 and Claude Sonnet 3.5, showing some promise but facing impractical false positive rates. They then moved to Claude Opus 4.6 for small-scale experiments targeting sandbox escapes, which already identified “an impressive amount of previously-unknown vulnerabilities requiring complex reasoning over multiprocess browser engine code.”
Critically, Mozilla built their pipeline architecture to make model swapping trivial. This architectural decision paid dividends when Claude Mythos Preview became available - they could immediately leverage the new capabilities without rebuilding their infrastructure. The text notes that model upgrades increase effectiveness across the entire pipeline simultaneously: better at finding potential bugs, creating proof-of-concept test cases, and articulating pathology and impact. This suggests that the prompting and orchestration layer was designed to be model-agnostic, focusing on the interaction patterns rather than model-specific quirks.
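One way to picture a model-agnostic layer is a narrow interface that the rest of the pipeline depends on. The sketch below is an assumption about how such a seam could look, not a description of Mozilla's code: the `Model` protocol, the two stub classes, and `audit` are all hypothetical, and the stubs return canned strings instead of calling any real model API.

```python
from typing import Protocol


class Model(Protocol):
    """The only surface the pipeline relies on (hypothetical interface)."""

    def complete(self, prompt: str) -> str: ...


class StubModelA:
    """Stand-in for one model backend; returns a canned reply."""

    def complete(self, prompt: str) -> str:
        return f"[model-a] {prompt}"


class StubModelB:
    """Stand-in for a newer backend with the same interface."""

    def complete(self, prompt: str) -> str:
        return f"[model-b] {prompt}"


def audit(model: Model, target: str) -> str:
    """Run the inner-loop prompt against whichever model is plugged in."""
    prompt = f"There is a bug in {target}, please find it and build a testcase."
    return model.complete(prompt)


# Swapping models is a one-argument change; the orchestration code is untouched.
print(audit(StubModelA(), "example/target.cpp"))
print(audit(StubModelB(), "example/target.cpp"))
```

Because `audit` only sees the protocol, upgrading the backend does not require touching targeting, execution, or result handling, which is the property the paragraph above describes.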
Mozilla’s prompting strategy evolved through direct observation and iteration. They began with supervised terminal sessions to observe the process in real-time and tune prompts and logic. The initial prompts were apparently quite simple, described as “not dissimilar from those described here” with reference to basic vulnerability scanning approaches. Through iteration they built “a lot of orchestration and tooling to optimize and scale the pipeline,” but note that “the essence of the inner loop remains the same: there is a bug in this part of the code, please find it and build a testcase.”
This reveals an important LLMOps lesson: sophisticated results don’t necessarily require complex prompts, but rather the right scaffolding and feedback loops. The orchestration layer handles the complexity of directing the model to specific targets, managing the execution environment, and processing results, while the core prompting remains relatively straightforward.
Once the basic harness proved effective, Mozilla parallelized operations across multiple ephemeral VMs. Each VM was tasked to hunt for bugs within a specific target file and write findings back to a bucket. This represents a classic LLMOps scaling pattern: partition the work into independent units that can be distributed, then aggregate results. The use of ephemeral VMs suggests attention to both resource management and isolation concerns - each audit runs in a clean environment and resources are deallocated after completion.
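The partition-and-aggregate pattern can be sketched locally. Everything here is an illustrative assumption: threads stand in for ephemeral VMs, a temporary directory stands in for the bucket, the file paths are made up, and `audit_file` is a placeholder for the real per-file audit.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def audit_file(target: str) -> List[str]:
    """Placeholder for one worker's audit of a single file (hypothetical)."""
    return [f"suspected issue in {target}"] if "ipc" in target else []


def run_worker(target: str, bucket: Path) -> Path:
    """Audit one target and write its findings to the shared bucket."""
    out = bucket / (target.replace("/", "_") + ".json")
    out.write_text(json.dumps({"target": target, "findings": audit_file(target)}))
    return out


def fan_out(targets: List[str], bucket: Path, workers: int = 4) -> List[dict]:
    """Run one independent worker per target, then aggregate from the bucket."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda t: run_worker(t, bucket), targets))
    return [json.loads(p.read_text()) for p in sorted(bucket.glob("*.json"))]


bucket = Path(tempfile.mkdtemp())  # local stand-in for a storage bucket
results = fan_out(["ipc/channel.cpp", "layout/table.cpp"], bucket)
for r in results:
    print(r["target"], len(r["findings"]))
```

Workers share nothing except the output location, so a failed or slow worker affects only its own target.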
The targeting strategy is also notable from an LLMOps perspective. Initially, scanning was “largely focused on specific areas of the code (files, functions) where we instruct the system to look, based on a mix of human judgement and automated signals.” This hybrid approach - combining human expertise about where vulnerabilities are likely with automated signals - demonstrates sophisticated orchestration that goes beyond pure automated scanning.
Mozilla emphasizes that the discovery subsystem, while necessary, is insufficient alone. They built a pipeline around it that covers the complete security bug lifecycle.
This pipeline is explicitly described as “inherently project-specific, reflecting each codebase’s semantics, tooling, and processes.” Standing it up required significant iteration with a tight feedback loop alongside Firefox engineers who were fielding incoming bugs. This highlights a crucial LLMOps reality: the model and harness are just components in a larger system that must integrate with existing development, security, and release processes.
The integration work included handling the unprecedented volume of findings - over 100 people contributed code to shipping fixes, with additional staff building and scaling the pipeline, triaging, testing fixes, and managing releases. This organizational scaling challenge is as much a part of the LLMOps story as the technical infrastructure.
The case study provides extensive detail on the sophistication of bugs discovered, which speaks to the model’s reasoning capabilities:
Complex temporal bugs: 15-year-old bugs requiring “meticulous orchestration of edge cases across distant parts of the browser, including recursion stack depth limits, expando properties, and cycle collection”
Race conditions over IPC: Reliably exploiting race conditions that allow compromised content processes to manipulate parent process state, requiring understanding of multiprocess architecture and timing
Sandbox escapes: Multiple bugs allowing escape from content process sandbox to parent process, including exploiting race conditions with thousands of operations to stretch timing windows
Novel attack vectors: Simulating malicious DNS servers by intercepting glibc function calls to reproduce edge cases, demonstrating creative problem-solving
Ancient XSLT vulnerabilities: 20-year-old bugs involving reentrant calls causing hash table rehashing that frees backing store while raw pointers are in use
Extremely compact exploits: Small testcases exploiting special HTML table semantics to overflow 16-bit layout bitfields, suggesting the model can identify minimal reproduction cases
Particularly notable are the sandbox escape vulnerabilities, which the text describes as “notoriously difficult to find with fuzzing.” These bugs presume an already-compromised sandboxed process and require reasoning about trust boundaries and privilege escalation paths. The model is even permitted to patch Firefox source code as long as modifications only run in the sandboxed process - demonstrating sophisticated constraint-following in the vulnerability discovery process.
Mozilla’s approach to evaluation includes several layers:
Dynamic validation: The harness creates and runs reproducible test cases, providing concrete evidence rather than speculation.
Severity classification: Findings are classified as sec-critical, sec-high, sec-moderate, or sec-low based on exploitability and user behavior requirements. Of the 271 bugs from Claude Mythos Preview: 180 were sec-high, 80 were sec-moderate, and 11 were sec-low.
Defense validation: Mozilla notes with interest what the models didn’t find despite trying - specifically, attempts to exploit prototype pollution in the parent process were thwarted by architectural changes that freeze prototypes by default. This demonstrates that the evaluation framework captures both successful exploits and failed attempts, providing feedback on defensive measures.
Human review: Every bug requires care and attention to fix properly, which suggests that human engineers validate and address each finding rather than relying on automated patching.
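As a quick consistency check on the severity breakdown quoted above, the three reported counts can be summed and expressed as shares. The counts and the totals of 271 and 423 come from the text; the variable names are only illustrative.

```python
# Severity counts reported for the Claude Mythos Preview findings.
severity_counts = {"sec-high": 180, "sec-moderate": 80, "sec-low": 11}

total = sum(severity_counts.values())
shares = {k: round(100 * v / total, 1) for k, v in severity_counts.items()}

# Security bugs fixed in April that are not attributed to Claude Mythos Preview.
other_sources = 423 - total

print(total)          # 271
print(shares)         # {'sec-high': 66.4, 'sec-moderate': 29.5, 'sec-low': 4.1}
print(other_sources)  # 152
```

The three categories account for all 271 findings, so none of them is listed as sec-critical in this breakdown.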
Looking forward, Mozilla plans to integrate this analysis into their continuous integration system to scan patches as they land. They note that “models are quite flexible with the form of context provided” and expect patch-based scanning to work as well as or better than file-based scanning. This represents a significant LLMOps evolution - moving from batch analysis of existing code to continuous analysis of changes, catching vulnerabilities before they reach production.
This CI integration would create a tight feedback loop where every code change is automatically audited for security implications, representing a shift from periodic security review to continuous security validation. The architectural flexibility to switch from file-based to patch-based context demonstrates robust prompt engineering and context management.
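The difference between file-based and patch-based context can be sketched as two ways of filling the same prompt. This is an illustration under assumptions, not Mozilla's prompt format: the function names, the wrapper text, and the C snippet are invented, and only the closing instruction is taken from the inner loop quoted earlier.

```python
import difflib

INSTRUCTION = (
    "There is a bug in this part of the code, "
    "please find it and build a testcase."
)


def file_context(path: str, source: str) -> str:
    """Context for file-based scanning: the whole file."""
    return f"File under review: {path}\n{source}"


def patch_context(path: str, before: str, after: str) -> str:
    """Context for patch-based scanning: a unified diff of the change."""
    diff = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
        lineterm="",
    )
    return "Patch under review:\n" + "\n".join(diff)


def build_prompt(context: str) -> str:
    """The instruction stays the same; only the context changes."""
    return f"{context}\n\n{INSTRUCTION}"


before = "int get(int* buf, int i) {\n    return buf[i];\n}"
after = "int get(int* buf, int i) {\n    return buf[i + 1];\n}"

file_prompt = build_prompt(file_context("example/get.c", after))
patch_prompt = build_prompt(patch_context("example/get.c", before, after))
print(patch_prompt)
```

Keeping the instruction fixed and varying only the context builder is one way a pipeline could move from batch file scans to per-patch scans in CI without changing the rest of the loop.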
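The numbers are striking: 271 bugs identified by Claude Mythos Preview were fixed in Firefox 150, and 423 security bugs in total were fixed across the April 2026 releases.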
Beyond Firefox 150, additional fixes shipped in versions 149.0.2, 150.0.1, and 150.0.2. The scale required over 100 contributors writing and reviewing patches, with additional staff on pipeline operations, triage, testing, and release management.
The case study provides rare insight into how model capability improvements translate to practical impact. Early experiments with GPT-4 and Sonnet 3.5 showed promise but produced impractically high false positive rates. Claude Opus 4.6 delivered impressive results on sandbox escapes. Claude Mythos Preview represented another significant jump, finding hundreds of additional vulnerabilities including increasingly subtle and complex bugs.
Mozilla explicitly states that model upgrades improve the entire pipeline simultaneously - better discovery, better proof-of-concept generation, and better articulation of findings. This suggests that their LLMOps architecture successfully abstracts over model-specific details, allowing them to capture value from each capability increase without significant re-engineering.
Several operational challenges emerge from the case study:
Volume management: The “unprecedented volume” of findings led to “a lot of work and long days over the last few months.” Even with effective automation, validating and fixing hundreds of security bugs requires substantial human effort.
Pipeline specificity: While harnesses may be reusable across projects, the full pipeline is “inherently project-specific.” Standing up the integration with Firefox’s development processes required significant iteration.
Deduplication: With findings coming from multiple models, fuzzing, manual inspection, and external reports, deduplication becomes critical to avoid wasted effort.
Transparency trade-offs: Mozilla made the “calculated decision” to unhide sample bug reports earlier than normal given the extraordinary interest and ecosystem urgency, balancing transparency against protecting users who haven’t updated.
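The deduplication challenge listed above can be sketched as grouping findings by a shared signature. The source does not say how Mozilla deduplicates; the `Finding` fields, the choice of crash kind plus top stack frame as the key, and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Finding:
    """One report from any source (hypothetical shape)."""
    source: str      # e.g. "model", "fuzzing", "manual", "external"
    crash_kind: str  # e.g. "heap-use-after-free"
    top_frame: str   # top stack frame of the crash


def dedupe(findings: List[Finding]) -> Dict[Tuple[str, str], List[str]]:
    """Group findings that share a crash signature, keeping every source."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for f in findings:
        groups[(f.crash_kind, f.top_frame)].append(f.source)
    return dict(groups)


reports = [
    Finding("model", "heap-use-after-free", "ExampleTable::Rehash"),
    Finding("fuzzing", "heap-use-after-free", "ExampleTable::Rehash"),
    Finding("external", "stack-overflow", "ExampleParser::Recurse"),
]
unique = dedupe(reports)
print(len(unique), "unique bugs from", len(reports), "reports")
```

A signature this coarse can merge distinct bugs that crash in the same place, so in practice a human or a finer key would still be needed to confirm each group.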
While the results are impressive, several considerations merit attention:
Selection bias in examples: Mozilla acknowledges the sample of unhidden bug reports was “somewhat arbitrary” despite attempting to draw from a range of browser subsystems. The most impressive findings may not be representative of typical performance.
Attribution complexity: Of 423 total bugs fixed in April, 271 came from Claude Mythos Preview; the remaining 152 were split between other models, fuzzing, and manual inspection. The text doesn’t provide detailed comparative metrics on precision/recall across methods.
Exploitability assumptions: Mozilla explicitly states they “generally don’t build exploits to see whether a bug could be used by an attacker in the real world,” classifying sec-high based on crash symptoms. Some findings may not be practically exploitable, though the threat model conservatively assumes they could be.
Infrastructure requirements: The success required significant engineering investment in building the harness, pipeline, VM infrastructure, and CI integration. Smaller projects may face challenges replicating this approach.
Human effort still critical: Despite automation, over 100 people contributed to shipping fixes. The system augments rather than replaces security expertise.
Mozilla provides concrete advice for teams looking to adopt similar approaches.
The emphasis on getting started immediately with simple approaches, then iterating based on observation, reflects pragmatic LLMOps thinking. The recommendation to build infrastructure early proved valuable for Mozilla when Claude Mythos Preview became available.
Mozilla positions this work in the context of an “asymmetric” security landscape where attackers can use the same models to find vulnerabilities. Their call for “defenders to begin applying these techniques” and assertion that “the current moment is a perilous one, but also full of opportunity” frames this as an ecosystem-wide challenge requiring coordinated response.
The dramatic shift from AI-generated security reports being “unwanted slop” to finding hundreds of critical vulnerabilities in just months represents a phase change in LLM capabilities for specialized technical tasks. Mozilla’s experience suggests that organizations with critical security surfaces should be actively building these capabilities now, as the defensive advantage from early adoption may be significant.
The project demonstrates mature LLMOps practices: model-agnostic architecture, tight integration with existing processes, sophisticated orchestration, continuous improvement through observation and iteration, and realistic assessment of both capabilities and limitations. It represents one of the most sophisticated public examples of LLMs deployed in production for high-stakes technical work where false positives and false negatives both carry significant costs.