TRM Labs evolved their initial single-purpose vulnerability patching agent into a unified Slack-native AI orchestrator that autonomously handles multiple security workflows across their entire infrastructure. The original system automated CVE remediation across 150+ repositories using reinforcement learning, but TRM recognized that all security workflows share the same five-step pattern: alert, investigate, diagnose, fix, and close. They rebuilt the architecture around Claude Opus as a central orchestrator with 14 skills and 56 tools, handling security alert triage, PR reviews, helpdesk requests, and vulnerability remediation. The platform now processes approximately 10,000 interactions monthly, auto-closes 17% of security alerts without human intervention, resolves 45% of helpdesk requests without creating tickets, and autonomously approves low-risk infrastructure PRs while escalating complex cases with enriched context. The system operates as a production service with per-workflow SLAs, comprehensive OpenTelemetry instrumentation, and a knowledge flywheel that continuously improves through captured observations.
TRM Labs, a blockchain intelligence and crypto crime detection company, has built and evolved a sophisticated AI-powered security orchestrator that represents a compelling case study in production LLM operations. The journey began with a focused vulnerability remediation agent and evolved into a unified platform handling multiple critical security workflows across their engineering organization. Published in June 2026, this case study provides detailed insights into how they designed, deployed, and operated an autonomous AI system at scale while maintaining strict safety controls and production-grade reliability.
The fundamental insight that drove their architectural evolution was recognizing that every security workflow follows the same five-step pattern: alert, investigate, diagnose, fix, and close. Rather than building separate automation for each workflow, they created a unified orchestrator that could learn across workflows and compound its effectiveness over time. This shift from function-oriented automation to platform-oriented orchestration represents a mature approach to LLMOps in production environments.
TRM’s original vulnerability agent (V1) was essentially a specialized function: input a CVE, output a pull request. It used reinforcement learning trained on months of merge, revert, and comment signals to generate production-ready patches across 150+ repositories, clearing hundreds of critical vulnerabilities monthly with minimal human intervention. While effective for its narrow purpose, this architecture couldn’t generalize to other security workflows without complete reimplementation.
The V2 architecture fundamentally reimagined the system as an orchestrator-workers pattern with Claude Opus serving as the central dispatcher. The key architectural decision was to preserve the original patch agent’s Cloud Run and Claude Code worker loop as the kernel for one skill among many, then build outward from there. This allowed them to retain proven functionality while enabling horizontal expansion to new workflows.
The platform now runs as a single Cloud Run service handling every Slack event through one orchestrator loop with adaptive thinking. The system exposes 14 user-invocable skills plus operational triage capabilities, controlling 56 tools in total. What makes this architecture particularly effective is tool subsetting: rather than presenting all 56 tools to the model on every interaction, the system narrows the available toolset based on the current channel and workflow context. For example, when operating in the security alerts channel, the agent sees only CSPM, DSPM, Cloudflare, Brain (their knowledge system), and a few read operations. During PR review, it sees only GitHub, Recon, and Brain tools. This subsetting produces significantly sharper tool selection decisions and tighter authorization boundaries than exposing the full tool surface.
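The subsetting idea can be sketched as a lookup from workflow to permitted tool families. This is a minimal illustration, not TRM's code: the tool names, family labels, and workflow keys below are hypothetical stand-ins for the 56 real tools.

```python
from typing import Dict, FrozenSet, List

# Hypothetical tool registry (tool name -> tool family); names are illustrative only.
ALL_TOOLS: Dict[str, str] = {
    "cspm_read": "cspm",
    "cspm_close": "cspm",
    "dspm_read": "dspm",
    "cloudflare_read": "cloudflare",
    "brain_search": "brain",
    "github_read": "github",
    "github_approve_pr": "github",
    "github_comment_pr": "github",
    "recon_lookup": "recon",
}

# Assumed mapping from workflow identity to the tool families it may see.
WORKFLOW_FAMILIES: Dict[str, FrozenSet[str]] = {
    "security-alerts": frozenset({"cspm", "dspm", "cloudflare", "brain"}),
    "pr-review": frozenset({"github", "recon", "brain"}),
}


def tools_for(workflow: str) -> List[str]:
    """Return the sorted tool names exposed to the model for one workflow.

    Unknown workflows get an empty list, so the default is deny.
    """
    families = WORKFLOW_FAMILIES.get(workflow, frozenset())
    return sorted(name for name, family in ALL_TOOLS.items() if family in families)
```

Because the subset is computed before the model is called, the same table acts as both a prompt-size reduction and an authorization boundary: a tool that is not in the list cannot be selected at all.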
The deployment architecture demonstrates sophisticated operational thinking around high availability, testing, and cost management. TRM runs two Cloud Run services from the same container image: production and staging. Both share one Slack app, one Firestore database, one set of secrets, and a single Pub/Sub fanout mechanism. Channel ownership is declared in a single YAML configuration file (agents-config.yaml) that defines which Slack channels each service owns.
The fanout mechanism is simple: Slack posts to one public load balancer, which verifies the HMAC signature once and republishes events to a Pub/Sub topic. Both Cloud Run services subscribe via push, so every Slack event lands at both services. Each service then consults the configuration file to determine whether the event’s channel is one it owns; if not, the event is silently dropped without log noise or double-handling. This design lets engineers ship features to staging-owned channels, test against real Slack and real Firestore state, and merge to main only once validated, at which point GitHub Actions automatically deploys to production.
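A rough sketch of the two checks in this path follows. The signature scheme is Slack's documented `v0` request signing (HMAC-SHA256 over `v0:{timestamp}:{body}`); the channel IDs, the `CHANNEL_OWNERS` shape, and the function names are assumptions, since the source does not show the contents of agents-config.yaml.

```python
import hashlib
import hmac
import time
from typing import Dict, Optional, Set


def slack_signature(signing_secret: str, timestamp: str, body: bytes) -> str:
    """Compute the value Slack sends in the X-Slack-Signature header."""
    base = b"v0:" + timestamp.encode() + b":" + body
    return "v0=" + hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()


def verify_slack_signature(signing_secret: str, timestamp: str, body: bytes,
                           signature: str, now: Optional[float] = None,
                           max_skew_seconds: int = 300) -> bool:
    """Reject stale timestamps (replay protection), then compare in constant time."""
    now = time.time() if now is None else now
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    if abs(now - sent_at) > max_skew_seconds:
        return False
    expected = slack_signature(signing_secret, timestamp, body)
    return hmac.compare_digest(expected, signature)


# Hypothetical parsed form of the channel-ownership config (service -> owned channels).
CHANNEL_OWNERS: Dict[str, Set[str]] = {
    "production": {"C_SEC_ALERTS", "C_HELPDESK"},
    "staging": {"C_STAGING_TEST"},
}


def should_handle(service: str, channel_id: str) -> bool:
    """Each service drops events for channels it does not own."""
    return channel_id in CHANNEL_OWNERS.get(service, set())
```

Verifying once at the edge means the downstream services can trust the Pub/Sub payload and only need the cheap ownership lookup.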
For parallel development, the staging service can host multiple sibling revisions, each with its own routing entry in Firestore, allowing two engineers to iterate simultaneously without contention. This separation of concerns between routing configuration and service deployment exemplifies mature infrastructure design for AI systems.
The platform implements three distinct execution patterns optimized for different latency and interaction requirements. Inline skills like triage, CSPM, audit, GitHub, recon, Cloudflare, and DSPM run directly within the orchestrator’s HTTP request, complete in under 30 seconds, and can post interactive Slack buttons. Cloud Task skills like patch and brain-evolve enqueue work and post a “queued” message within three seconds, then dispatch to a worker that runs the Claude Code CLI as a subprocess for several minutes and updates the same Slack message in place when complete. The same Cloud Run image serves both roles. Workers are isolated by setting maximum instance request concurrency to one, so each job gets its own instance and Cloud Run scales out instead of queueing: ten patch jobs that would take 50 minutes sequentially (about five minutes each) complete in approximately five minutes across parallel instances.
Workflow skills represent the most sophisticated execution pattern, where users never explicitly invoke a skill name. Instead, the Slack event handler detects patterns—a PR URL, a bot-posted alert, a message in a helpdesk channel—and automatically pins the agent loop to one skill with a narrowed tool surface. This pattern recognition approach creates seamless user experience while maintaining strict operational boundaries around what the agent can access and execute.
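The pattern detection described here might look roughly like the following. The event fields (`channel`, `bot_id`, `ts`, `thread_ts`, `text`) are standard Slack message event fields; the channel IDs, workflow names, the check order, and the choice to ignore other bot messages are assumptions made for the sketch.

```python
import re
from typing import Optional

PR_URL = re.compile(r"https://github\.com/[\w.-]+/[\w.-]+/pull/\d+")

# Hypothetical channel IDs.
ALERT_CHANNELS = {"C_SEC_ALERTS"}
HELPDESK_CHANNELS = {"C_HELPDESK"}


def pin_workflow(event: dict) -> Optional[str]:
    """Map a Slack message event to a workflow identity, or None to ignore it."""
    channel = event.get("channel")
    is_bot = bool(event.get("bot_id"))
    thread_ts = event.get("thread_ts")
    # A top-level message has no thread_ts, or thread_ts equal to its own ts.
    is_top_level = thread_ts is None or thread_ts == event.get("ts")

    if channel in ALERT_CHANNELS and is_bot and is_top_level:
        return "security-alerts"
    if is_bot:
        return None  # assumption: other bot messages are ignored to avoid loops
    if PR_URL.search(event.get("text", "")):
        return "pr-review"
    if channel in HELPDESK_CHANNELS:
        return "helpdesk"
    return None
```

The returned identity would then select the narrowed tool surface for the agent loop, so the user never has to name a skill.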
The security alerts workflow demonstrates the platform’s autonomous capabilities most clearly. At TRM, security alerts from their telemetry stack centralize in a dedicated Slack channel where engineers previously triaged, investigated, and remediated every issue manually. Now every bot-posted message in that channel flows through Claude Opus before reaching a human.
The handler triggers on any top-level message in the channel authored by a bot. It builds the alert body, applies prompt injection fencing, and runs the agent loop with an in-memory workflow identity of “security-alerts.” This workflow identity acts as a scoped allowlist, granting the agent approximately 25 tools: CSPM read and close, Recon, GitHub read, Cloudflare read, DSPM read, Brain, and a meta-signal called security_alert_resolve. Critically, the agent never impersonates a human security team member.
Triage resolves to one of three outcomes: auto-closing well-understood, low-risk alerts at the source system; enriching and flagging alerts requiring human judgment; or escalating with gathered context attached. In the past 30 days, approximately 238 alerts (17%) were closed without paging anyone, while the remainder escalated cleanly with enriched investigation context. This outcome distribution demonstrates appropriate autonomy boundaries—the agent handles clear-cut cases but escalates ambiguity rather than making potentially incorrect closure decisions.
The patch skill represents the direct descendant of TRM’s V1 vulnerability agent, but with the reinforcement learning loop replaced by a dramatically simpler approach. The two-stage shape survived: the agent makes code changes in an isolated, sandboxed worker, then opens a pull request. What changed is that the bespoke ML pipeline became a general-purpose coding agent pointed at a narrow, well-scoped task producing human-reviewable PRs.
Platform integration adds several sophisticated capabilities beyond the original implementation. Before every run, a dedup check verifies whether in-flight or recently completed patch tasks exist for the target repository. If one is running, the agent links back to that Slack thread rather than creating duplicate work. If the requested package was already patched, it returns the existing PR URL, ensuring the model never wastes computation on redundant work.
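The dedup decision reduces to a three-way branch. The sketch below assumes a hypothetical `PatchTask` record and status values; the real task store lives in Firestore and its schema is not described in the source.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class PatchTask:
    """Hypothetical record of a patch task; field names are assumptions."""
    repo: str
    packages: FrozenSet[str]
    status: str                 # "running" or "completed"
    slack_thread_url: str
    pr_url: Optional[str] = None


def dedup_check(repo: str, package: str,
                recent: List[PatchTask]) -> Tuple[str, Optional[str]]:
    """Decide whether to link to running work, reuse a PR, or enqueue a new job."""
    # An in-flight task for the repository wins: point the user at its thread.
    for task in recent:
        if task.repo == repo and task.status == "running":
            return ("link_to_thread", task.slack_thread_url)
    # Otherwise, reuse a PR that already patched the requested package.
    for task in recent:
        if (task.repo == repo and task.status == "completed"
                and package in task.packages and task.pr_url):
            return ("return_existing_pr", task.pr_url)
    return ("enqueue", None)
```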
The version selection logic demonstrates careful prompt engineering: the agent reads the lock file first to check if patching is needed, then picks the best version at or above the scanner-recommended minimum, preferring the latest patch in the same minor version and never crossing a major version boundary. The system supports npm, yarn, pnpm, pip, poetry, uv, go mod, Maven, and RubyGems with targeted lock regeneration for each ecosystem.
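In TRM's system these rules are given to the agent in its prompt; the deterministic sketch below only illustrates what they imply. It assumes plain `MAJOR.MINOR.PATCH` version strings, takes "same minor" to mean the minor version currently installed, and adds one fallback the source does not specify: when the installed minor has no fixed release, pick the latest patch of the lowest qualifying minor within the same major.

```python
from typing import List, Optional, Tuple

Version = Tuple[int, int, int]


def parse(version: str) -> Version:
    """Parse a plain MAJOR.MINOR.PATCH string (no pre-release tags)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def needs_patch(installed: str, minimum: str) -> bool:
    """True when the lock-file version is below the scanner-recommended minimum."""
    return parse(installed) < parse(minimum)


def pick_version(installed: str, minimum: str, available: List[str]) -> Optional[str]:
    """Pick an upgrade target, or None if only a major bump would fix it."""
    current, floor = parse(installed), parse(minimum)
    # Keep releases at or above the minimum that stay within the current major.
    ok = [v for v in map(parse, available) if v >= floor and v[0] == current[0]]
    if not ok:
        return None
    same_minor = [v for v in ok if v[1] == current[1]]
    if same_minor:
        best = max(same_minor)
    else:
        # Assumed fallback: smallest minor jump, latest patch within it.
        lowest_minor = min(v[1] for v in ok)
        best = max(v for v in ok if v[1] == lowest_minor)
    return ".".join(str(part) for part in best)
```

For example, with a hypothetical release list `["1.2.4", "1.2.5", "1.2.9", "1.3.0", "2.0.0"]`, an installed `1.2.3`, and a minimum of `1.2.5`, the function returns `1.2.9`: the latest patch in the installed minor, never `2.0.0`.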
WebFetch grounding operates against an allowlist including NVD, OSV, npm, PyPI, GitHub advisories, and vendor PSIRTs. Fetched content is explicitly treated as untrusted data, with URLs cited in a Sources section in the PR. Safety hooks block dangerous operations—force-push, hard reset, rebase, rm -rf outside /tmp, secret dumps, SQL DROP statements—before they execute in the subprocess.
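A pre-execution hook of this kind can be approximated with a deny list plus a path check. This is a simplified sketch under stated assumptions: the patterns and reason labels are illustrative, secret-dump detection is omitted, and the `rm` check only inspects commands whose first token is `rm`, so a production hook would need to handle pipelines, `sudo`, and shell operators as well.

```python
import posixpath
import re
import shlex
from typing import List, Optional

# Illustrative deny patterns; not TRM's actual rule set.
DENY_PATTERNS = [
    ("force-push", re.compile(r"\bgit\s+push\b.*\s(--force(-with-lease)?|-f)(\s|=|$)")),
    ("hard-reset", re.compile(r"\bgit\s+reset\b.*\s--hard(\s|$)")),
    ("rebase", re.compile(r"\bgit\s+rebase\b")),
    ("sql-drop", re.compile(r"\bDROP\s+(TABLE|DATABASE|SCHEMA)\b", re.IGNORECASE)),
]


def _rm_rf_outside_tmp(tokens: List[str]) -> bool:
    """True for a recursive, forced rm whose targets are not all under /tmp/."""
    if not tokens or tokens[0] != "rm":
        return False
    args = tokens[1:]
    short = "".join(t[1:] for t in args if t.startswith("-") and not t.startswith("--"))
    recursive = "r" in short or "R" in short or "--recursive" in args
    force = "f" in short or "--force" in args
    if not (recursive and force):
        return False
    targets = [t for t in args if not t.startswith("-")]
    # normpath collapses ".." so /tmp/../etc is treated as /etc.
    return any(not posixpath.normpath(t).startswith("/tmp/") for t in targets)


def check_command(command: str) -> Optional[str]:
    """Return a block reason, or None if the command may run."""
    for reason, pattern in DENY_PATTERNS:
        if pattern.search(command):
            return reason
    try:
        tokens = shlex.split(command)
    except ValueError:
        return "unparseable"  # fail closed on malformed quoting
    if _rm_rf_outside_tmp(tokens):
        return "rm-rf-outside-tmp"
    return None
```

Running the check before the subprocess executes anything means a blocked command never reaches the shell, regardless of what the model proposed.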
A critical operational improvement is bundled PRs: when multiple packages need patching in one repository, a single Claude Opus run produces one PR covering all changes. This eliminated the “100 CVEs → 100 PRs” sprawl from early iterations. Interactive amendments via patch_amend_pr allow reviewers to request changes mid-thread (“bump h11 down to 0.15.2 instead” or “remove all the highs from this PR”), with the bot pushing new commits to the same branch via a shorter amend subprocess (10-minute budget versus 30 for fresh patches).
The vuln-fix skill provides bulk entry: a command like @trm-security-agent fix my SLA CVEs queries BigQuery for the SLA-CVE table backing their Looker dashboard, groups by repository, deduplicates packages, skips anything with in-flight tasks, and enqueues one patch job per repository. With Cloud Run scaling to ten parallel instances, a single Slack command fans out across a team’s open vulnerabilities in parallel, typically completing the entire remediation wave in approximately five minutes.
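The grouping step between the BigQuery result and the task queue can be sketched as below. The row shape `(repo, package)` and the repo-level in-flight check are assumptions; the sample repository and package names in the usage note are placeholders.

```python
from typing import Dict, Iterable, List, Set, Tuple


def plan_patch_jobs(rows: Iterable[Tuple[str, str]],
                    in_flight_repos: Set[str]) -> Dict[str, List[str]]:
    """Group (repo, package) rows into one job per repo, skipping in-flight repos."""
    plan: Dict[str, Set[str]] = {}
    for repo, package in rows:
        if repo in in_flight_repos:
            continue  # an existing task already covers this repository
        plan.setdefault(repo, set()).add(package)  # the set removes duplicate packages
    return {repo: sorted(packages) for repo, packages in sorted(plan.items())}
```

Given rows for `svc-a` (with `h11` listed twice plus one other package), `svc-b`, and an in-flight `svc-c`, the plan contains two jobs, one per remaining repository, each of which would become a single bundled PR.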
The review workflow makes the system feel autonomous to engineers outside the security team. When anyone posts a TRM infrastructure-related PR URL in a channel the bot owns, an automatic review is triggered. The pipeline is designed for cost efficiency as much as for review quality.
Each Slack message first passes through regex matching (essentially free). On match, a Haiku intent classifier determines in approximately 200ms whether the message actually requests review (PTAL, LGTM?) or represents casual chatter (merged, FYI, link in passing). Only on intent match does the system check the Firestore verdict cache, keyed by ${repo}-${pr}-${headSha}. New commits force fresh review automatically, ensuring the agent never approves based on stale code state.
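The staged gate can be sketched with the model calls injected as plain callables. Everything except the cache key format is an assumption here: `wants_review` stands in for the Haiku classifier, `full_review` for the Opus review, `head_sha_for` for a GitHub lookup, and a dict stands in for the Firestore cache.

```python
import re
from typing import Callable, Dict, Optional, Tuple

PR_URL = re.compile(r"https://github\.com/([\w.-]+/[\w.-]+)/pull/(\d+)")


def cache_key(repo: str, pr: int, head_sha: str) -> str:
    """Mirror the ${repo}-${pr}-${headSha} key so new commits miss the cache."""
    return f"{repo}-{pr}-{head_sha}"


def review_pipeline(text: str,
                    head_sha_for: Callable[[str, int], str],
                    wants_review: Callable[[str], bool],
                    full_review: Callable[[str, int], str],
                    cache: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Run the cheapest checks first; return (stage reached, verdict or None)."""
    match = PR_URL.search(text)
    if not match:
        return ("skipped:no-pr-url", None)      # regex stage, no model call
    if not wants_review(text):
        return ("skipped:no-intent", None)      # small-model intent stage
    repo, pr = match.group(1), int(match.group(2))
    key = cache_key(repo, pr, head_sha_for(repo, pr))
    if key in cache:
        return ("cache-hit", cache[key])        # same commit already reviewed
    verdict = full_review(repo, pr)             # expensive stage
    cache[key] = verdict
    return ("reviewed", verdict)
```

Because the head SHA is part of the key, pushing a new commit changes the key and forces a fresh review without any explicit invalidation logic.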
On a cache miss, the system proceeds to Claude Opus for a full review. The model emits a structured verdict (APPROVE, FLAG, or ABSTAIN) with reasoning. For a curated set of low-risk patterns, including additive read-only IAM on non-production resources, single-line egress allowlist additions, additive activate_apis, and terraform fmt-only diffs, the bot calls github_approve_pr as the trm-security-agent GitHub user. Notably, this is not a bot account but a real user with a personal access token, so CODEOWNERS rules listing @trm-security-agent are satisfied and the approval counts within existing GitHub workflows.
For anything outside the approved patterns—organization-level IAM, deletions, owner roles, anything touching production—the agent calls github_comment_pr with categorized findings and tags the security-product team. Every action mirrors to the originating Slack thread. Agent approvals are never silent, maintaining transparency and audit trails.
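The routing from verdict to action can be sketched as a small fail-safe function. The pattern labels, the `mirror_to_slack` and `tag_security_product_team` action names, and the handling of ABSTAIN are assumptions: the source names the three verdicts and the two GitHub tools but does not say what ABSTAIN triggers.

```python
from enum import Enum
from typing import List, Optional


class Verdict(Enum):
    APPROVE = "APPROVE"
    FLAG = "FLAG"
    ABSTAIN = "ABSTAIN"


# Hypothetical labels for the low-risk patterns listed in the text.
LOW_RISK_PATTERNS = frozenset({
    "additive-readonly-iam-nonprod",
    "single-line-egress-allowlist",
    "additive-activate-apis",
    "terraform-fmt-only",
})


def actions_for(verdict: Verdict, matched_pattern: Optional[str]) -> List[str]:
    """Approve only when the verdict AND an allowlisted pattern agree."""
    if verdict is Verdict.APPROVE and matched_pattern in LOW_RISK_PATTERNS:
        return ["github_approve_pr", "mirror_to_slack"]
    if verdict is Verdict.ABSTAIN:
        return ["mirror_to_slack"]  # assumption: no GitHub action on ABSTAIN
    # FLAG, or APPROVE outside the allowlist, falls through to human review.
    return ["github_comment_pr", "tag_security_product_team", "mirror_to_slack"]
```

Requiring both conditions means a model that wrongly emits APPROVE on an unlisted change still cannot approve it, and every branch includes the Slack mirror so no action is silent.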
At current PR volume, the review workflow costs approximately $5-12 daily in model spend. More importantly, it unblocks engineers at near-instant speed on well-structured, low-risk PRs that would otherwise wait for human reviewer availability. The cost optimization through the regex → Haiku → cache → Opus pipeline demonstrates mature thinking about production LLM economics: avoiding expensive model calls through cheap filters and caching where appropriate.
TRM’s knowledge management system, called Brain, is the platform’s institutional memory, and it compounds over time through normal team operations. The knowledge base comprises curated internal documentation including runbooks, SOPs, incident response playbooks, end-user guides, and the SCF control catalog. It is indexed in memory at startup and exposed to every skill through a search interface.
The interesting aspect is the closed improvement loop. During every triage, helpdesk reply, and PR review, a parallel tool called capture_brain_observation allows the agent to silently record observations like “this would have gone faster with a runbook on X.” Observations land in Firestore with rate limiting per category. When a security team member triggers brain_evolve_run, a Cloud Task reads stale observations (over 24 hours old), drafts new or improved runbooks via Claude Opus, and opens PRs against the brain documentation source. Every brain-evolve PR passes through human review before merge.
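The capture side of this loop can be sketched as an in-memory store with a per-category rate limit and a staleness filter. The 24-hour threshold comes from the text; the class and field names and the limit of five observations per category per hour are assumptions, and the real store is Firestore.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class Observation:
    category: str
    note: str
    captured_at: datetime


@dataclass
class ObservationStore:
    """In-memory stand-in for the Firestore collection (hypothetical shape)."""
    max_per_category_per_hour: int = 5            # assumed limit
    stale_after: timedelta = timedelta(hours=24)  # threshold from the text
    items: List[Observation] = field(default_factory=list)

    def capture(self, category: str, note: str, now: datetime) -> bool:
        """Record an observation unless the category is over its hourly limit."""
        window_start = now - timedelta(hours=1)
        recent = sum(1 for o in self.items
                     if o.category == category and o.captured_at > window_start)
        if recent >= self.max_per_category_per_hour:
            return False
        self.items.append(Observation(category, note, now))
        return True

    def stale(self, now: datetime) -> List[Observation]:
        """Observations old enough for a brain-evolve run to pick up."""
        return [o for o in self.items if now - o.captured_at > self.stale_after]
```

The rate limit keeps one noisy workflow from flooding the backlog, and the staleness delay gives the team a day to see an observation before it is drafted into a runbook PR.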
The flywheel operates both passively and actively. Engineers can drive it directly with phrases like “add this improvement to your todo list for the IR workflow” to capture targeted observations on demand, and “show me your todo list” surfaces them grouped by workflow. The brain doubles as a shared scratchpad that both the team and the agent contribute to, creating a living documentation system that improves through use rather than requiring dedicated documentation maintenance cycles.
This approach to knowledge management demonstrates sophisticated LLMOps thinking: rather than treating documentation as a static input to the LLM, the system creates feedback loops where the LLM helps identify and fill documentation gaps based on actual operational friction encountered during real work.
Despite extensive autonomous capabilities, TRM maintains deliberate human gates on irreversible actions, demonstrating mature thinking about AI safety in production systems. High-blast-radius operations follow a Block Kit risk-report flow: the agent analyzes context up-front, posts a risk report with Proceed/Cancel buttons, and waits for human click before executing. Claude is never in the confirmation loop; Slack interactivity handles it. This design ensures that even if prompt injection or model failure causes the agent to propose a dangerous action, the actual execution requires human approval outside the LLM’s control.
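The gate can be sketched as two pieces: a Block Kit message with two buttons, and a handler that acts only on a `block_actions` payload from the Proceed button. The block and payload structures follow Slack's Block Kit and interactivity formats; the `action_id` values, function names, and the use of `value` to carry an operation ID are assumptions.

```python
from typing import Any, Dict, List, Optional


def risk_report_blocks(summary: str, operation_id: str) -> List[Dict[str, Any]]:
    """Build a risk report with Proceed/Cancel buttons for a pending operation."""
    return [
        {"type": "section",
         "text": {"type": "mrkdwn", "text": f"*Risk report*\n{summary}"}},
        {"type": "actions",
         "elements": [
             {"type": "button", "style": "danger",
              "text": {"type": "plain_text", "text": "Proceed"},
              "action_id": "risk_proceed", "value": operation_id},
             {"type": "button",
              "text": {"type": "plain_text", "text": "Cancel"},
              "action_id": "risk_cancel", "value": operation_id},
         ]},
    ]


def approved_operation(payload: Dict[str, Any]) -> Optional[str]:
    """Return the operation ID only when a human clicked Proceed.

    Plain messages (including anything the model writes) never qualify,
    because only Slack's interactivity callback has type "block_actions".
    """
    if payload.get("type") != "block_actions":
        return None
    for action in payload.get("actions", []):
        if action.get("action_id") == "risk_proceed":
            return action.get("value")
    return None
```

Since execution is keyed on the interactivity payload rather than on model output, a prompt-injected "proceed" in the conversation has no path to trigger the operation.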
Patch workers create PRs autonomously but never merge them; CODEOWNERS still applies to all code changes. The review workflow approves only the narrow patterns described earlier; anything ambiguous goes to FLAG status, routing to the security-product team. Brain-evolve PRs undergo review before merge. The team notes that auto-merge on green CI for the narrowest repeated patterns is on the H2 2026 roadmap, but currently every code-changing PR ends with a human clicking Merge.
This graduated autonomy approach—full autonomy for well-understood low-risk operations, human-in-the-loop for ambiguous cases, hard blocks on high-risk operations—represents a pragmatic middle ground between full automation and manual processes. The system captures value from automation where it’s safe while maintaining human oversight where it matters.
TRM treats their security agent as a production service rather than an automation script, which proved essential for multi-workflow scale. The V1 system used homegrown telemetry: custom counters in their application database, manually maintained dashboards, and separate code for pulling platform metrics. This worked for one workflow but became painful when comparing multiple workflows.
The V2 observability stack underwent two deliberate upgrades. First, they adopted OpenTelemetry GenAI semantic conventions, with every model call (orchestrator turns, classifiers, worker subprocesses) emitting spec-defined histograms for call duration, token usage by type, and tool calls per operation. Because it follows the public specification, the data is portable to any GenAI-aware backend without re-instrumentation.
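To make the shape of this telemetry concrete, the sketch below builds the data points one model call would emit, using metric and attribute names from the OpenTelemetry GenAI semantic conventions as I understand them (`gen_ai.client.operation.duration`, `gen_ai.client.token.usage`, `gen_ai.token.type`). It deliberately uses no SDK: the `Point` tuple and `genai_points` helper are hypothetical, and the names should be checked against the current version of the spec, which is still evolving.

```python
from typing import Dict, List, Tuple

# (metric name, value, attributes); a stand-in for real histogram recordings.
Point = Tuple[str, float, Dict[str, str]]


def genai_points(operation: str, model: str, duration_s: float,
                 input_tokens: int, output_tokens: int) -> List[Point]:
    """Build the per-call data points: one duration, one usage point per token type."""
    base = {"gen_ai.operation.name": operation, "gen_ai.request.model": model}
    return [
        ("gen_ai.client.operation.duration", duration_s, dict(base)),
        ("gen_ai.client.token.usage", float(input_tokens),
         {**base, "gen_ai.token.type": "input"}),
        ("gen_ai.client.token.usage", float(output_tokens),
         {**base, "gen_ai.token.type": "output"}),
    ]
```

Note that the attributes carry only metadata such as the operation and model name. No prompt or response text appears in any data point, which is what allows this kind of telemetry to leave the environment safely.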
Second, they deployed Groundcover AI Observability as the backend, exporting OTel data via OTLP. This immediately illuminated LLM performance and tool performance: per-agent latency, prompt cache hit ratio, token-type mix, p95 duration, and error rate across all 56 instrumented tools. Critically, they achieve this visibility without shipping prompt or response bodies to third-party vendors, important for a security tool whose prompts can include sensitive alert content.
The platform now maintains per-workflow operational dashboards the team uses daily, not just for blog posts. A persistent 180-day record of every workflow interaction enables ad-hoc analysis at any time. Cost transparency operates at the transaction level: helpdesk runs at $0.16 per interaction, PR review at $0.25 per interaction, alert triage at $0.29 per interaction. This granularity drives architectural decisions like the regex → Haiku → cache → Opus pipeline in the review workflow, designed specifically to avoid hitting Opus for intent classification.
Production metrics from the past 30 days show approximately 10,000 interactions across all workflows and approximately 51,000 LLM calls instrumented end-to-end. The helpdesk workflow’s “45% resolved without a ticket” represents the strict bar; every other interaction still does real work by pre-populating Zendesk tickets with full Slack thread context, so on-call staff don’t rebuild context from scratch. The assisted-resolution rate is meaningfully higher than the strict auto-resolution figure.
Security alerts triage shows approximately 238 alerts (17%) closed without paging anyone, with the remainder escalated cleanly with enriched context. The review workflow, being newest, is the fastest-growing, with every infrastructure PR URL posted in bot-owned channels flowing through the review pipeline before human pings occur.
TRM maintains over 1,100 QA tests that must pass before every merge, organized across three axes: performance (cache behavior, TTLs, cardinality), availability (timeout discipline, webhook correctness, deployment-shape invariants), and security (auth model, prompt-injection fences, SSRF allowlists, hard-blocks on destructive git operations). Every test reads like an incident postmortem, annotated with the bug, date, and channel where it manifested. Every production incident produces a new test file. Notably, the same class of incident hasn’t recurred, demonstrating that their test-driven approach to reliability actually prevents regressions.
This comprehensive test suite demonstrates mature software engineering practices applied to LLM systems. Rather than relying solely on prompt engineering or model capabilities to prevent failures, they’ve built systematic quality gates that catch issues before deployment. The incident-to-test pipeline creates a ratchet effect where the system becomes progressively more robust over time.
Cost optimization appears throughout the architecture as a first-class design concern rather than an afterthought. The regex → Haiku → cache → Opus pipeline in PR review exemplifies this: cheap regex filtering eliminates obvious non-reviews, Haiku handles intent classification for a fraction of Opus cost, SHA-keyed caching avoids re-reviewing unchanged code, and only validated review requests reach the expensive Opus tier.
Tool subsetting lowers cost as well as improving decision quality: fewer tool definitions in the prompt mean fewer input tokens and lower per-interaction costs. The per-workflow cost transparency ($0.16-0.29 per interaction depending on workflow) enables clear ROI calculations: is the auto-resolution rate worth the spend? Where does caching pay off? What’s the right model tier for each step?
The Cloud Task execution pattern for long-running operations like patch jobs prevents tying up expensive HTTP request handlers during multi-minute coding sessions, allowing the system to scale horizontally across multiple instances efficiently. Setting maximum instance request concurrency to one for worker instances ensures isolation while allowing parallel execution across instances, avoiding both resource contention and idle capacity.
TRM surfaces several non-obvious operating disciplines that shape every skill on the platform. Human gates on irreversible actions ensure Claude never sits in the confirmation loop for destructive operations; Slack interactivity handles it. Cost transparency per workflow drives architectural trade-offs with data rather than intuition. High availability through environment isolation means a single bad push can’t take down production; the staging/prod split with per-branch revisions enables safe experimentation.
The “test every incident, prevent the next one” philosophy has produced 1,100+ tests organized by the incidents they prevent, with no incident class recurring. The institutional memory flywheel means the system gets smarter as the team works, with live operations continuously feeding back into improved documentation and runbooks.
The shift from measuring “remediation rate” to publishing per-workflow SLAs represents mature operational thinking. They stopped treating metrics as something to report in blog posts and started treating them as service-level objectives for a production internal service: throughput, success rate, latency, error rate, and user feedback per workflow against published SLAs.
TRM outlines three deep-dive follow-ups coming in this series: enterprise security and IT support (how they replaced ClearFeed with a Slack-native helpdesk built on the same agent), incident response (how the security agent became the on-call security responder from page to post-mortem), and GRC (how the agent automates continuous compliance and evidence collection across FedRAMP High and SOC 2 audits using the SCF control catalog).
The common thread across all three is an orchestrator that meets engineers in Slack, holds 14 skills behind a single interface, and is allowed to act—carefully—when the next step is unambiguous. Auto-merge on green CI for the narrowest repeated patterns appears on the H2 2026 roadmap as the obvious next step in graduated autonomy.
This case study represents sophisticated LLMOps engineering with several noteworthy strengths. The architectural evolution from single-purpose function to unified orchestrator demonstrates clear thinking about what generalizes across workflows. The tool subsetting approach addresses a real problem with large tool surfaces in production LLM systems. The safety controls show mature thinking about where to place human gates. The observability implementation using standard telemetry protocols creates portable, vendor-agnostic insights. The cost transparency and optimization strategies demonstrate treating LLM operations as a first-class infrastructure concern.
However, several aspects warrant balanced assessment. The 17% auto-resolution rate for security alerts, while presented positively, means 83% of alerts still require human intervention. Whether this represents appropriate conservatism or insufficient autonomy depends on the nature of the alerts being escalated. The 45% strict helpdesk auto-resolution (higher with assists) seems strong, but the case study doesn’t detail what types of requests remain challenging or what the improvement trajectory looks like over time.
The replacement of reinforcement learning with general-purpose LLM prompting is described as “dramatically simpler” but the case study doesn’t address whether this maintains equivalent or better quality on the core vulnerability remediation task. The original RL approach trained on “months of merge / revert / comment signals” which provided explicit quality feedback; it’s unclear how the new approach captures equivalent learning signals beyond the brain-evolve documentation improvement loop.
The cost figures ($5-12 daily for PR review, $0.16-0.29 per interaction for various workflows) seem very reasonable, but the case study doesn’t provide comparison baselines for human labor costs or total workflow volume to assess true ROI. The 10,000 interactions monthly across all workflows could represent tremendous value or modest impact depending on what those interactions replaced and how much human time each saves.
The emphasis on Slack as the UI makes sense for their engineering culture but creates tight coupling to a specific communication platform. While this likely drives adoption and reduces friction, it would make the system difficult to port to organizations using different collaboration tools. The architecture appears Slack-native rather than having Slack as one interface among several.
Overall, this represents a mature, thoughtfully designed LLMOps implementation with clear operational discipline, appropriate safety boundaries, and sophisticated cost management. The evolution from single-workflow to multi-workflow platform demonstrates genuine architectural insight rather than simply adding features. The observability and testing practices show production-ready engineering rather than experimental automation. The balanced approach to autonomy—full automation where safe, enriched escalation where human judgment matters—seems pragmatic and likely to age well as the technology evolves.