GitHub Security Lab developed and deployed the Taskflow Agent, an LLM-based framework for automating the triage of security vulnerabilities in code scanning alerts. The system addresses the challenge of repetitive manual security alert triage by using LLMs to identify patterns that traditional static analysis tools struggle with, such as authentication checks and sanitization functions. By breaking down complex triage workflows into discrete tasks organized in YAML files, the team successfully triaged large volumes of CodeQL alerts for GitHub Actions and JavaScript/TypeScript projects, discovering approximately 30 real-world vulnerabilities since August. The framework leverages Claude 3.5 Sonnet as the primary model, uses Model Context Protocol (MCP) servers for programmatic tasks, and maintains intermediate state in databases to enable efficient debugging and iteration.
This case study examines a sophisticated production deployment of LLMs in a security research workflow: the Taskflow Agent automates the triage of alerts produced by CodeQL, GitHub’s static analysis engine, and its output feeds the Security Lab’s vulnerability reporting process.
The fundamental insight driving this work is that security alert triage is highly repetitive, with false positives often caused by patterns that are immediately obvious to human auditors but extremely difficult to encode as formal code patterns. GitHub Security Lab recognized that LLMs excel at matching these “fuzzy patterns” that traditional static analysis tools struggle with, making them ideal candidates for automating this workflow.
The Taskflow Agent framework is built around YAML-based taskflow definitions that describe sequences of tasks to be executed by an LLM. This architecture choice emerged from practical limitations: LLMs have limited context windows, and complex multi-step tasks often result in steps being skipped or incomplete when presented as a single large prompt. By decomposing workflows into discrete tasks, the framework achieves better control, debuggability, and the ability to accomplish larger and more complex analytical goals.
The framework supports batch processing with “for loop”-style asynchronous execution, allowing the same set of prompts and tasks to be applied across multiple alerts with templated prompts that replace alert-specific details at runtime. This design is particularly crucial for security research workflows where hundreds or thousands of alerts need to be processed with consistent methodology.
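A minimal sketch of what such a taskflow definition might look like (the field names here are illustrative, not the actual seclab-taskflows schema):

```yaml
# Illustrative taskflow definition -- field names are hypothetical,
# not the actual seclab-taskflows schema.
name: triage_actions_code_injection
tasks:
  - name: collect_trigger_events
    prompt: |
      List every trigger event of the workflow {{workflow_path}} in
      {{repo}}. Cite the file and line number for each event.
  - name: audit_reachability
    prompt: |
      Based on the audit notes so far, determine whether an attacker
      can trigger this workflow in a privileged context.
  - name: generate_report
    prompt: |
      Write a concise report for alert {{alert_id}}, using only
      information already recorded in the audit notes.
```

At runtime the `{{...}}` placeholders are filled in per alert, so the same task sequence can be applied across an entire batch of alerts.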
A key architectural decision was the use of databases to store intermediate state after each task completion. This approach provides significant operational benefits: when failures occur (API call failures, MCP server bugs, token limits, quota issues), the system can resume from the failed task rather than rerunning the entire workflow. This also enables granular debugging and iteration, as developers can tweak individual tasks and reuse results from earlier stages stored in the database.
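The effect of this design can be pictured in a few lines (a simplified sketch assuming a `Task` type with a name and an `execute` callable; the real storage layer is internal to the framework):

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    name: str
    execute: Callable[[str], str]  # alert_id -> serialized task output

def run_taskflow(db: sqlite3.Connection, alert_id: str, tasks: list[Task]):
    db.execute(
        "CREATE TABLE IF NOT EXISTS task_state (alert_id TEXT, task TEXT,"
        " output TEXT, PRIMARY KEY (alert_id, task))"
    )
    for task in tasks:
        row = db.execute(
            "SELECT output FROM task_state WHERE alert_id = ? AND task = ?",
            (alert_id, task.name),
        ).fetchone()
        if row is not None:
            continue  # completed in an earlier run -- reuse the stored result
        output = task.execute(alert_id)  # may fail: API errors, quotas, ...
        db.execute(
            "INSERT INTO task_state VALUES (?, ?, ?)",
            (alert_id, task.name, output),
        )
        db.commit()  # persist immediately so a later crash can resume here
```

Because each result is committed as soon as it is produced, a failed run restarts at the first task with no stored output instead of repeating the whole flow.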
The system delegates programmatic tasks to MCP servers rather than relying on LLMs for straightforward computational work. Initially, the team incorporated some information gathering tasks directly in prompts, assuming the LLM could extract information from source code. However, they observed inconsistencies due to the non-deterministic nature of LLMs—for example, the LLM would sometimes only record a subset of workflow trigger events or make inconsistent conclusions about privilege contexts.
By moving tasks that can be done programmatically to MCP server tools, the team achieved much more consistent outcomes. This division of labor leverages LLMs for complex logical reasoning (like finding permission checks) while keeping deterministic results for well-defined operations. The framework uses MCP servers for tasks like GitHub API calls, file fetching, and searching.
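For instance, a deterministic trigger-event extractor could be exposed as an MCP tool along these lines (a hypothetical tool written against the official Python MCP SDK, not the team’s actual server):

```python
import yaml  # PyYAML
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("actions-triage-tools")

@mcp.tool()
def get_trigger_events(workflow_yaml: str) -> list[str]:
    """Deterministically list the trigger events declared in a GitHub
    Actions workflow, instead of asking the LLM to read the YAML."""
    doc = yaml.safe_load(workflow_yaml)
    # YAML 1.1 parses the bare key `on` as boolean True, so check both.
    triggers = doc.get("on", doc.get(True, {}))
    if isinstance(triggers, str):
        return [triggers]
    if isinstance(triggers, list):
        return triggers
    return list(triggers.keys())

if __name__ == "__main__":
    mcp.run()
```

Unlike an LLM reading the same file, a tool like this returns the complete trigger list on every invocation.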
The taskflows are organized into several distinct stages that reflect the manual triage process:
Information Collection Stage: Tasks in this stage gather relevant information about alerts based on the threat model and domain knowledge. For GitHub Actions alerts, this includes checking workflow permissions, trigger events, whether workflows are disabled, and other contextual factors. Each information gathering task is kept independent and follows simple, well-defined instructions to ensure consistency. To reduce hallucination, prompts explicitly require precise source code references including file and line numbers. Each task appends findings to “audit notes”—a running commentary that gets serialized to a database for subsequent tasks. The end result is essentially a “bag of information” that forms the foundation for later analysis.
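The shape of an individual note might be pictured like this (an assumed structure for illustration; the actual serialization is internal to the framework):

```python
from dataclasses import dataclass

@dataclass
class AuditNote:
    task: str   # task that recorded the fact, e.g. "collect_trigger_events"
    fact: str   # e.g. "workflow is triggered by pull_request_target"
    file: str   # e.g. ".github/workflows/ci.yml" -- required reference
    line: int   # line number backing the fact, to keep claims verifiable
```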
Audit Stage: This stage reviews the collected information and performs specific checks to filter out false positives. For GitHub Actions alerts, this might involve checking whether trigger events can be activated by attackers or whether they run in privileged contexts. The team designed these checks based on their manual triage experience, encoding the common patterns that lead to false positives.
Decision-Making and Report Generation: Alerts that pass the audit stage proceed to report generation. The prompts are very precise about format requirements and necessary information. Reports must be concise but include sufficient detail for verification, with precise code references and code blocks. Critically, no further analysis occurs at this stage—the LLM only looks at source code to fetch snippets needed in the report, using information already gathered in previous stages. This strict task separation reduces hallucination.
Report Validation and Issue Creation: Before creating GitHub Issues, another task validates that reports contain all necessary information and that the information is consistent. Missing or inconsistent information often indicates hallucination or failed analysis steps, and these cases are rejected. This validation step serves as a critical quality gate before human review.
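In the deployed system this validation is itself an LLM task, but its acceptance criteria are in the spirit of the following sketch (assumed section names and file extensions, purely illustrative):

```python
import re

REQUIRED_SECTIONS = ["Summary", "Trigger events", "Attack vector"]  # assumed
CODE_REF = re.compile(r"\S+\.(?:yml|yaml|js|ts):\d+")  # file:line references

def validate_report(report: str) -> bool:
    """Reject reports with missing sections or no verifiable code
    references -- both frequent signs of hallucinated or failed analysis."""
    if any(section not in report for section in REQUIRED_SECTIONS):
        return False
    if not CODE_REF.search(report):
        return False
    return True
```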
The team focused on two specific CodeQL queries for GitHub Actions: checkout of untrusted code in privileged contexts and code injection vulnerabilities. These queries share substantial analytical overlap—both require checking workflow trigger events, permissions, and tracking workflow callers. The main differences involve local analysis of vulnerability-specific details.
Common false positive patterns that the taskflows learned to identify include workflows whose trigger events cannot be activated by attackers and workflows that never run in a privileged context (for example, distinguishing `pull_request` from `pull_request_target` trigger events).

The GitHub Actions triage taskflow consists of several specialized tasks. The workflow trigger analysis task performs both information gathering and auditing in a single step, collecting trigger events, permissions, and secrets while checking if the workflow is disabled. Since this analysis is local to the vulnerable workflow, it combines both stages efficiently. The code injection point analysis task similarly analyzes the vulnerable workflow, collecting information about injection locations and user inputs while performing local auditing to check input validity and sanitizer presence.
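As a concrete illustration of the injection pattern these queries flag (a generic textbook example, not one of the reported findings):

```yaml
# Generic example of the flagged pattern, not one of the reported findings.
on: pull_request_target   # privileged: runs with secrets and write access
jobs:
  greet:
    runs-on: ubuntu-latest
    steps:
      - run: echo "New PR: ${{ github.event.pull_request.title }}"
        # The PR title is attacker-controlled and is interpolated directly
        # into the shell command before execution: command injection.
```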
For workflow user analysis, the task is divided into separate information gathering and auditing stages because it potentially retrieves and analyzes large numbers of files. The information gathering task retrieves callers of vulnerable workflows and records their trigger events, permissions, and secret usage. The auditing task then determines whether the vulnerable workflow is reachable by attackers based on this information.
After these stages, the notes contain comprehensive information about trigger events, permissions, secrets, and (for reusable workflows) calling workflows with their corresponding attributes. This forms the basis for bug reports, with a review task checking for completeness and consistency before report creation.
The framework then creates GitHub Issues containing sufficient information and code references for quick verification. These issues also serve as summaries enabling further analysis. The team developed a secondary workflow (review_actions_injection_issues) that collects alert dismissal reasons from repositories and checks them against issues. Since issues contain all relevant information and code references, the LLM can use issues and dismissal reasons to discover additional false positives, incorporating repository-specific security measures into the analysis.
The team also triaged client-side cross-site scripting (js/xss) alerts in JavaScript/TypeScript codebases. These alerts have more variety in sources, sinks, and data flows compared to GitHub Actions alerts. The prompts focus on helping human triagers make educated decisions rather than making autonomous decisions, highlighting aspects that make alerts exploitable and, more importantly, what likely prevents exploitation.
Common false positive patterns for XSS alerts were encoded the same way, with the team iteratively extending the prompts and the active personality as new false positives were encountered. Triage produces GitHub Issues that either highlight exploitable vulnerabilities (with detailed attack vectors) or are labeled “FP” for false positives, with an explanation of why the alert is not exploitable.
The team employed several sophisticated prompt engineering strategies to improve reliability and reduce hallucination:
Precise Instructions and Formatting Requirements: Prompts specify exact formats for reports and what information must be included. This precision reduces ambiguity and helps ensure consistent outputs across different alerts and runs.
Code Reference Requirements: Prompts explicitly require the LLM to include precise references with file and line numbers to back up collected information. This grounds the analysis in verifiable source code locations and makes hallucinations more detectable.
Task Decomposition: Complex multi-step analyses are broken into smaller, independent tasks that each start with a fresh context. This addresses issues with large context windows and reduces the likelihood of steps being skipped. The use of templated repeat_prompt tasks creates new contexts for each item in a list rather than processing lists within a single prompt.
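An illustrative repeat_prompt-style task (hypothetical syntax; see the seclab-taskflows repository for the real format):

```yaml
# Hypothetical syntax: the framework re-instantiates this prompt with a
# fresh context for each item produced by an earlier task, rather than
# asking a single prompt to iterate over the whole list.
- name: analyze_caller
  repeat_prompt: true
  items_from: find_workflow_callers
  prompt: |
    Record the trigger events, permissions, and secrets of the calling
    workflow {{item}}. Cite file and line numbers for each fact.
```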
Separation of Concerns: Information gathering tasks are kept independent of each other and don’t read each other’s notes, allowing each to focus on its scope without distraction. Auditing tasks then synthesize information from the “bag of information” created by gathering tasks.
Validation Checkpoints: Multiple validation steps check for completeness and consistency of information. Missing or inconsistent information triggers rejection, as it often indicates hallucination or failed analysis steps.
The team primarily uses Claude 3.5 Sonnet as their LLM, though they built in model configuration features that allow updating model versions across taskflows easily. This flexibility is important given the rapid pace of LLM development and enables experimentation with different models.
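Such a configuration might be as simple as a shared alias (a hypothetical snippet; the model ID follows Anthropic’s public naming):

```yaml
# Hypothetical shared model configuration: updating the version here
# propagates to every taskflow that references the alias.
models:
  default: claude-3-5-sonnet-latest
```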
The framework is resource-intensive by nature: taskflows can result in many tool calls that consume substantial API quota, and the team explicitly warns users about this consideration. The system creates GitHub Issues as outputs, and the team emphasizes being considerate and seeking repository owners’ consent before running taskflows on third-party repositories.
Critically, while the taskflows automate much of the analysis, GitHub researchers carefully review all generated output before sending vulnerability reports. This human-in-the-loop approach ensures quality and accuracy before security disclosures are made.
Both the seclab-taskflow-agent and seclab-taskflows repositories are open source, allowing the broader community to develop similar LLM taskflows. The team built significant reusability features into the framework, most importantly the ability to share tasks across taskflows. These features emerged from practical development needs: as the team built multiple triage taskflows, they realized many tasks could be shared, and reusability ensures that improvements and tweaks can be applied consistently across different workflows without extensive copy-pasting.
Since deploying these taskflows (with work beginning around August), the team discovered approximately 30 real-world vulnerabilities. The taskflows substantially reduced false positives without requiring dynamic validation of alerts. The LLMs were only given basic file fetching and searching tools, without access to static or dynamic code analysis tools beyond CodeQL’s initial alert generation.
The system’s accuracy is noteworthy given that it doesn’t create exploits or have runtime environments for testing conclusions. Instead, it produces detailed bug reports with all necessary information for human verification, striking a balance between automation and human oversight.
The team is transparent about several important limitations and considerations:
No Dynamic Validation: The taskflows don’t perform end-to-end analysis or create exploits to validate findings. Results require human verification, though the team notes the system remains “fairly accurate” even without automated validation.
Hallucination Management: Despite various mitigation strategies (precise prompts, code references, validation steps), hallucination remains a concern. The architecture includes multiple checkpoints specifically to detect and filter out hallucinations.
Resource Intensive: The system can consume substantial API quota due to numerous tool calls. This is a practical consideration for deployment at scale.
Domain-Specific Design: The taskflows are designed for specific vulnerability types based on the team’s manual triage experience. Creating effective taskflows requires clear formulation of tasks into well-defined instructions that LLMs can consume.
The team identifies characteristics that make workflows good candidates for this approach: highly repetitive analyses, decisions that hinge on “fuzzy patterns” that are obvious to human auditors but hard to encode as formal rules, and work that decomposes into well-defined, independent steps. These represent “sweet spots” for LLM automation, where the flexibility of language models provides value over traditional programmatic approaches while MCP servers handle the well-defined computational tasks.
The development process involved significant iteration and learning. Initially, the team tried to have LLMs gather all information from source code through prompts, but encountered inconsistencies. Moving deterministic checks to MCP servers improved consistency substantially. Similarly, the team learned to break down complex tasks into smaller ones after encountering issues with steps being skipped in large contexts.
The database-backed state management emerged as a critical operational feature, enabling efficient debugging and recovery from failures. The ability to rerun taskflows from failed tasks rather than starting over significantly improved development velocity.
The team also developed patterns for extending analysis beyond initial triage. By creating GitHub Issues with comprehensive information, they enabled subsequent taskflows to incorporate repository-specific knowledge through alert dismissal reasons. This creates a feedback loop where the system learns from human decisions and becomes more effective at detecting false positives over time.
This case study represents a mature, production-deployed application of LLMs for security research workflows, demonstrating sophisticated engineering practices around task decomposition, state management, prompt engineering, and human-AI collaboration. The open-source release enables the broader community to apply these patterns to their own security research and code analysis workflows.