Cursor's PR velocity increased 5x over nine months, making traditional static analysis and code ownership insufficient for security at scale. In response, they used Cursor Automations to build a fleet of autonomous security agents that continuously identify and repair vulnerabilities in their codebase. The solution includes four main automation templates: Agentic Security Review (which has run on thousands of PRs and prevented hundreds of issues in two months), Vuln Hunter (for scanning existing code), Anybump (which automates dependency patching), and Invariant Sentinel (for daily compliance monitoring). These agents operate through a custom security MCP tool deployed as a serverless Lambda function, providing persistent data storage, deduplication of LLM-generated findings, and consistent output formatting.
Cursor, a development tools company, presents a compelling case study of using LLM-powered autonomous agents in production to address security challenges at scale. The core problem they faced was the mismatch between their rapidly increasing development velocity (5x increase in PR volume over nine months) and the limitations of traditional security tooling based on static analysis and rigid code ownership. While these traditional approaches remained helpful, they proved insufficient at the new scale of operations.
The company’s solution was to leverage their own product, Cursor Automations, to build a “fleet” of security agents that operate continuously to identify and remediate vulnerabilities. This case study is particularly interesting from an LLMOps perspective because it demonstrates a company using their own AI-powered product internally to solve real production challenges, effectively “dogfooding” their technology while creating reusable templates that other security teams can customize.
The foundation of Cursor’s approach rests on two critical features that make agents useful for security workloads. First, they built out-of-the-box integrations for receiving webhooks, responding to GitHub pull requests, and monitoring codebase changes. This integration layer allows background agents to understand when they need to activate and take action, creating an event-driven architecture for security automation.
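A minimal sketch of what such an event-driven trigger layer could look like, assuming a GitHub webhook as the event source; the route, the `trigger_automation` helper, and the automation name are our own illustrative stand-ins, not Cursor's API:

```python
import hashlib
import hmac
import os

from flask import Flask, abort, request

app = Flask(__name__)
WEBHOOK_SECRET = os.environ.get("GITHUB_WEBHOOK_SECRET", "")

def trigger_automation(name: str, **context) -> None:
    """Stub: in Cursor's system this would kick off the named background agent."""
    print(f"triggering {name} with {sorted(context)}")

def signature_valid(payload: bytes, signature: str | None) -> bool:
    # Verify GitHub's HMAC-SHA256 webhook signature before acting on an event.
    expected = "sha256=" + hmac.new(WEBHOOK_SECRET.encode(), payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature or "")

@app.post("/webhooks/github")
def handle_github_event():
    if not signature_valid(request.get_data(), request.headers.get("X-Hub-Signature-256")):
        abort(401)
    payload = request.get_json()
    # Route new or updated pull requests to the security review automation.
    if request.headers.get("X-GitHub-Event") == "pull_request" \
            and payload.get("action") in ("opened", "synchronize"):
        trigger_automation("agentic-security-review", pr=payload["pull_request"])
    return "", 204
```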
Second, they provide a rich agent harness and environment powered by cloud agents, which gives these security automations access to all the tools, skills, and observability capabilities that their cloud agent infrastructure provides. This suggests a substantial investment in agent infrastructure that goes beyond simple API calls to LLMs, creating a full execution environment for autonomous operations.
A particularly noteworthy technical decision was building a custom security MCP (Model Context Protocol) tool deployed as a serverless Lambda function. This design choice reflects thoughtful LLMOps practice - the tool is “available just-in-time when needed, and not otherwise running,” which optimizes for cost and resource utilization while maintaining availability. The serverless deployment pattern is well-suited for intermittent security scanning workloads that may have variable execution patterns.
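A hedged sketch of this just-in-time pattern follows; the tool names, payload shape, and dispatch scheme are assumptions, and only the Lambda entry-point signature (`event`, `context`) is standard:

```python
import json

def record_finding(args: dict) -> dict:
    return {"status": "recorded"}   # placeholder: persist the finding and metrics

def check_duplicate(args: dict) -> dict:
    return {"duplicate": False}     # placeholder: calls the deduplication classifier

def post_slack_report(args: dict) -> dict:
    return {"status": "sent"}       # placeholder: formats and posts to Slack

TOOLS = {
    "record_finding": record_finding,
    "check_duplicate": check_duplicate,
    "post_slack_report": post_slack_report,
}

def lambda_handler(event: dict, context) -> dict:
    """Runs just-in-time per invocation and is billed only while executing."""
    handler = TOOLS.get(event.get("tool"))
    if handler is None:
        return {"statusCode": 400, "body": json.dumps({"error": "unknown tool"})}
    return {"statusCode": 200, "body": json.dumps(handler(event.get("arguments", {})))}
```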
The security MCP tool addresses three critical challenges that arise when deploying LLMs in production security contexts:
Persistent Data and Metrics: The MCP provides storage capabilities so agents can track and measure security impact over time. This data feeds back into the system to “continually refine when and how we trigger automations,” suggesting an iterative improvement loop. This addresses a common LLMOps challenge: understanding the actual effectiveness of LLM-powered systems requires instrumentation and long-term measurement, not just deployment.
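As an illustration of what such a persistence layer might look like, here is a sketch that writes findings to a DynamoDB table; the table name, schema, and status values are all our assumptions:

```python
import time
import uuid

import boto3

# Table name and schema are assumptions for illustration.
table = boto3.resource("dynamodb").Table("security-findings")

def record_finding(automation: str, repo: str, severity: str, description: str) -> str:
    """Persist one finding so impact and trigger quality can be measured over time."""
    finding_id = str(uuid.uuid4())
    table.put_item(Item={
        "finding_id": finding_id,
        "automation": automation,   # which agent produced this, e.g. "vuln-hunter"
        "repo": repo,
        "severity": severity,
        "description": description,
        "status": "open",           # later transitions: fixed / dismissed / snoozed
        "created_at": int(time.time()),
    })
    return finding_id
```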
Deduplication of LLM-Generated Findings: One of the most interesting technical challenges highlighted is that because multiple review agents run on every change and their findings are generated by LLMs, different agents can describe the same underlying issue using different words. This is a classic problem with LLM outputs - semantic consistency versus lexical consistency. Their solution deploys a classifier powered by Gemini 2.5 Flash that determines when two differently worded findings describe the same underlying problem. This meta-application of LLMs (using an LLM to deduplicate other LLM outputs) is a sophisticated approach to managing the non-deterministic nature of LLM-generated content in production.
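A minimal sketch of this LLM-as-deduplicator pattern using the public Gemini API; the prompt wording and the YES/NO protocol are ours, not Cursor's:

```python
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-flash")

def same_underlying_issue(finding_a: str, finding_b: str) -> bool:
    """Ask the model whether two differently worded findings describe one problem."""
    prompt = (
        "Two automated security reviewers produced the findings below. "
        "Answer with exactly YES or NO: do they describe the same underlying issue?\n\n"
        f"Finding A: {finding_a}\n\nFinding B: {finding_b}"
    )
    response = model.generate_content(prompt)
    return response.text.strip().upper().startswith("YES")
```

A cheap, fast model is a sensible choice here because the classifier runs on every pair of candidate findings, so latency and cost dominate over raw capability.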
Consistent Output Formatting: The MCP standardizes how vulnerabilities are reported, sending consistently formatted Slack messages and handling actions like dismissing or snoozing findings. This abstraction layer is important for production LLM systems because it separates the variable, natural-language outputs of the agents from the structured interfaces that humans and downstream systems need.
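To make this concrete, here is an illustrative formatter that renders any finding into one fixed Slack Block Kit layout with dismiss and snooze buttons; the field names are assumptions, and the interactive buttons would additionally require a Slack app with an interactivity endpoint to handle clicks:

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder

def post_finding_to_slack(severity: str, title: str, location: str, finding_id: str) -> None:
    """Render every finding identically, regardless of which agent wrote the prose."""
    payload = {
        "blocks": [
            {"type": "section",
             "text": {"type": "mrkdwn", "text": f"*[{severity.upper()}]* {title}\n`{location}`"}},
            {"type": "actions",
             "elements": [
                 {"type": "button", "text": {"type": "plain_text", "text": "Dismiss"},
                  "action_id": "dismiss_finding", "value": finding_id},
                 {"type": "button", "text": {"type": "plain_text", "text": "Snooze"},
                  "action_id": "snooze_finding", "value": finding_id},
             ]},
        ]
    }
    requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=10).raise_for_status()
```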
Cursor uses Terraform to manage changes to security tooling, ensuring “all changes to security tooling go through a standard review and deployment process.” This reflects mature LLMOps practices where prompt engineering, agent configuration, and automation logic are treated as critical infrastructure that requires version control, review, and controlled deployment. This is an important detail because it suggests they’re not just running ad-hoc LLM queries but have formalized the operational aspects of their agent-based systems.
Agentic Security Review represents their most mature automation, evolved through a careful staged rollout that demonstrates thoughtful LLMOps deployment practices. They started with an existing general-purpose tool (Bugbot) for code quality reviews but recognized its limitations for security: it couldn’t be prompt-tuned to their specific threat model, and they needed the ability to block CI on security findings specifically without blocking on every code quality issue.
The deployment progression shows prudent risk management: they first forwarded findings to a private Slack channel monitored by the security team, allowing human validation of the agent’s outputs. Once they gained confidence in the quality of identified issues, they enabled PR commenting. Finally, they implemented a blocking gate check that can prevent merges. This graduated rollout strategy is exemplary LLMOps practice - it allows for validation of model behavior in production contexts before granting the system decision-making authority that could block developer workflows.
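One way to express that progression in code is an explicit enforcement mode, sketched below; the mode names, the `PullRequest` wrapper, and `notify_security_channel` are our own stand-ins inferred from the described stages:

```python
from enum import Enum

class EnforcementMode(Enum):
    OBSERVE = "observe"   # stage 1: forward findings to a private Slack channel only
    COMMENT = "comment"   # stage 2: also comment on the PR once output quality is trusted
    BLOCK = "block"       # stage 3: also fail a CI gate check so the PR cannot merge

class PullRequest:
    """Minimal stand-in for a GitHub API wrapper (hypothetical)."""
    def post_review_comments(self, findings): print(f"commented: {len(findings)} finding(s)")
    def set_status_check(self, name, state): print(f"status {name} -> {state}")

def notify_security_channel(findings):
    print(f"{len(findings)} finding(s) forwarded to the security Slack channel")

def apply_findings(findings: list, mode: EnforcementMode, pr: PullRequest) -> None:
    if not findings:
        return
    notify_security_channel(findings)  # every stage keeps humans in the loop
    if mode in (EnforcementMode.COMMENT, EnforcementMode.BLOCK):
        pr.post_review_comments(findings)
    if mode is EnforcementMode.BLOCK:
        pr.set_status_check("security-review", state="failure")
```

Promoting the automation is then a one-line configuration change rather than a code change, which keeps the rollout auditable.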
The reported results - running on thousands of PRs and preventing hundreds of issues in just two months - sound impressive, though the text provides no false positive rate or precision metrics. From an LLMOps evaluation perspective, this is a notable omission. While preventing hundreds of issues is valuable, understanding how many legitimate PRs were incorrectly flagged or delayed would provide a more complete picture of the system’s production performance.
Vuln Hunter extends the agent approach from reviewing new code to scanning the existing codebase. The automation divides code into logical segments and searches each for vulnerabilities. The human-in-the-loop aspect is preserved here - the security team triages findings and typically fixes them, often invoking "@Cursor" from Slack to generate PRs. This workflow demonstrates an interesting integration pattern where LLM agents identify issues and then different LLM interfaces (the Slack integration) help humans remediate them. The segmentation approach suggests awareness of context window limitations and the need to break down large codebases into manageable chunks for analysis.
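The text doesn't specify how segmentation works, but one plausible approach is to group files under a rough size budget so that each agent run fits comfortably within a context window; everything in this sketch is an assumption:

```python
from pathlib import Path

MAX_SEGMENT_CHARS = 200_000  # crude proxy for a per-agent token budget

def segment_codebase(root: str) -> list[list[Path]]:
    """Greedily pack source files into segments under a size budget."""
    segments, current, size = [], [], 0
    for path in sorted(Path(root).rglob("*.py")):
        file_size = path.stat().st_size
        if current and size + file_size > MAX_SEGMENT_CHARS:
            segments.append(current)
            current, size = [], 0
        current.append(path)
        size += file_size
    if current:
        segments.append(current)
    return segments  # each segment becomes one Vuln Hunter agent run
```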
Anybump tackles the notoriously tedious problem of dependency patching. The text notes that this work is “so time intensive that most security teams eventually give up and push it to engineering, where it sits in backlogs.” This automation demonstrates a more complex multi-step agent workflow: it runs reachability analysis to prioritize vulnerabilities that are actually impactful (reducing false positives from dependencies that are installed but whose vulnerable code paths aren’t used), traces through relevant code paths, runs tests, checks for breakage, and only opens a PR once tests pass.
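A hedged sketch of that multi-step sequence follows; every helper below is a placeholder we invented to make the flow concrete, not Cursor's implementation, and the failure path is left to human triage since the source doesn't specify it:

```python
from dataclasses import dataclass

@dataclass
class Advisory:
    package: str
    fixed_version: str
    vulnerable_symbols: list[str]

@dataclass
class TestResult:
    passed: bool

def is_reachable(repo: str, symbols: list[str]) -> bool:
    return True  # placeholder: the real step walks the call graph from repo entry points

def bump_dependency(repo: str, package: str, version: str) -> str:
    return f"anybump/{package}-{version}"  # placeholder: edits the lockfile on a branch

def run_test_suite(repo: str, branch: str) -> TestResult:
    return TestResult(passed=True)  # placeholder: runs CI against the branch

def open_pr(repo: str, branch: str, title: str) -> None:
    print(f"PR opened on {repo}: {title}")

def anybump(repo: str, advisories: list[Advisory]) -> None:
    for advisory in advisories:
        # 1. Reachability analysis: skip advisories whose vulnerable code is never called.
        if not is_reachable(repo, advisory.vulnerable_symbols):
            continue
        # 2. Apply the bump on a branch, then 3. run tests before proposing anything.
        branch = bump_dependency(repo, advisory.package, advisory.fixed_version)
        if run_test_suite(repo, branch).passed:
            # 4. Only open a PR once tests pass; canary deployment is the final gate.
            open_pr(repo, branch, title=f"Bump {advisory.package} to {advisory.fixed_version}")
```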
The integration with “Cursor’s canary deployment pipeline” provides a final safety gate before production. This multi-layered approach to risk mitigation is sophisticated LLMOps - the agent doesn’t just propose patches, it validates them through testing, and then the deployment infrastructure provides additional safety mechanisms. However, the text doesn’t specify what happens when tests fail or when reachability analysis is inconclusive, which would be valuable implementation details for understanding the robustness of this automation.
Invariant Sentinel monitors daily for drift against security and compliance properties. It divides the repository into logical segments and “spins up subagents” to validate code against a list of invariants. This multi-agent architecture where a coordinator spawns specialized subagents is an advanced pattern that suggests they’ve built substantial orchestration capabilities.
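The coordinator/subagent fan-out might look roughly like the following asyncio sketch, where `check_invariants_with_agent` is a hypothetical stand-in for one subagent run inside the agent harness:

```python
import asyncio

async def check_invariants_with_agent(segment: str, invariants: list[str]) -> list[str]:
    """Stub standing in for one subagent run inside the agent harness."""
    return []  # a real run returns invariant violations with specific code locations

async def sentinel_run(repo_segments: list[str], invariants: list[str]) -> list[str]:
    # Fan out: one subagent per segment, all validating the same invariant list.
    reports = await asyncio.gather(
        *(check_invariants_with_agent(seg, invariants) for seg in repo_segments)
    )
    # The coordinator flattens per-segment reports into one daily summary.
    return [violation for report in reports for violation in report]
```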
The automation uses a memory feature to compare current state against previous runs. When drift is detected, it revalidates to ensure correctness (a sensible safeguard against false positives), updates its memory, and sends a detailed Slack report with specific code locations as evidence. The fact that this automation “runs in a full development environment” where “the agent can write and execute code to validate its own assumptions” is particularly interesting from an LLMOps perspective - the agents aren’t just analyzing static code, they can execute programs to test hypotheses. This capability significantly expands what’s possible but also introduces additional risks around sandbox security and resource consumption that the text doesn’t address.
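A simplified sketch of the memory-compare-revalidate loop, with the storage location and the `revalidate` helper as assumptions:

```python
import json
from pathlib import Path

MEMORY_FILE = Path("sentinel_memory.json")  # hypothetical persistence location

def revalidate(finding: str) -> bool:
    return True  # placeholder: re-run the invariant check before trusting a new finding

def detect_drift(current_findings: set[str]) -> set[str]:
    """Compare today's findings against the last run; report only confirmed drift."""
    previous = set(json.loads(MEMORY_FILE.read_text())) if MEMORY_FILE.exists() else set()
    candidates = current_findings - previous
    # Revalidate anything new, guarding against one-off false positives.
    confirmed = {f for f in candidates if revalidate(f)}
    MEMORY_FILE.write_text(json.dumps(sorted(previous | confirmed)))
    return confirmed
```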
While the case study presents an impressive application of LLM agents to production security workflows, several important LLMOps considerations are either unaddressed or warrant scrutiny:
Evaluation and Metrics: The text provides limited quantitative evaluation of the agents’ performance. “Prevented hundreds of issues” is mentioned for Agentic Security Review, but there’s no discussion of false positive rates, precision, recall, or how they measure whether an issue was genuinely prevented versus flagged unnecessarily. For production LLM systems, especially those that can block CI/CD pipelines, understanding the precision-recall tradeoff is critical.
Prompt Engineering and Model Selection: The case study mentions that agents can be "prompt-tuned to our specific threat model" but provides no details about how prompts are designed, tested, or evolved. Similarly, while Gemini 2.5 Flash is mentioned for the deduplication classifier, there's no discussion of which models power the main security agents or how model selection decisions were made. The reference to "cloud agents" with "tools, skills, and observability" suggests a sophisticated system, but the underlying LLM infrastructure remains opaque.
Cost and Resource Management: Running multiple review agents on every code change, executing tests for dependency patching, and daily scanning of the entire codebase likely involves substantial compute costs and API usage. The serverless Lambda deployment for the MCP tool suggests cost consciousness, but the overall economics of running these automations at scale aren’t discussed. For organizations considering similar approaches, understanding the cost-benefit tradeoff would be valuable.
Error Handling and Failure Modes: What happens when agents hallucinate vulnerabilities, when the deduplication classifier fails, when test execution times out, or when agents disagree? The text mentions that the deduplication classifier handles cases where “different agents can end up using different words to describe the same underlying issue,” but what about cases where agents miss issues or report different severity assessments for the same problem?
Human Oversight and Trust Calibration: The progression from monitoring to commenting to blocking for Agentic Security Review shows appropriate caution, but the case study doesn't discuss ongoing human oversight. Are there mechanisms for developers to contest or override agent decisions? How do they guard against over-reliance on automation, which could cause security teams to miss subtle issues that agents cannot detect?
Generalizability: The case study presents Cursor’s internal use of their own product. This raises questions about how well these approaches generalize. Cursor’s team likely has deep expertise in their own tooling and can customize it in ways external users cannot. The claim that “other security teams can customize these templates” is encouraging, but the level of LLM expertise, infrastructure, and iteration required to achieve similar results for other organizations remains unclear.
Template Release and Reusability: The text states they’re “releasing four new automation templates with the exact blueprints of the security agents we’ve found to be most helpful.” This is positive for the community, but the case study doesn’t discuss how much customization these templates require, what security-specific knowledge needs to be encoded, or how transferable the approaches are across different codebases, languages, or threat models.
Despite these gaps, the case study demonstrates several valuable LLMOps patterns: event-driven triggering of agents from webhooks, just-in-time serverless tooling via a custom MCP, LLM-based deduplication of LLM outputs, infrastructure-as-code governance of agent configuration, graduated rollout from observation to enforcement, and multi-agent orchestration with human-in-the-loop triage.
The case study represents an advanced application of LLMs in production, moving beyond simple code completion or chat interfaces to autonomous agents that make consequential decisions about security. The architecture - with its custom MCP tool, multi-agent coordination, integration with infrastructure-as-code practices, and staged deployment - demonstrates sophisticated LLMOps maturity. However, the lack of detailed evaluation metrics, error handling discussion, and cost analysis means readers should approach the claimed benefits with appropriate skepticism while recognizing the architectural patterns as potentially valuable for their own LLM deployments.