ZenML

AI-Powered Security Operations Center with Agentic AI for Threat Detection and Response

Trellix 2025

Trellix, in partnership with AWS, developed an AI-powered Security Operations Center (SOC) using agentic AI to address the challenge of overwhelming security alerts that human analysts cannot effectively process. The solution leverages AWS Bedrock with multiple models (Amazon Nova for classification, Claude Sonnet for analysis) to automatically investigate security alerts, correlate data across multiple sources, and provide detailed threat assessments. The system uses a multi-agent architecture where AI agents autonomously select tools, gather context from various security platforms, and generate comprehensive incident reports, significantly reducing the burden on human analysts while improving threat detection accuracy.

Industry

Tech


This case study presents a comprehensive implementation of agentic AI in cybersecurity operations by Trellix, showcasing one of the most sophisticated production deployments of multi-agent LLM systems in the security domain. The presentation was delivered by Martin Holste, CTO for Cloud and AI at Trellix, alongside Jason Garman from AWS, demonstrating a mature partnership between a security vendor and cloud infrastructure provider.

Problem Context and Business Challenge

The fundamental challenge addressed by this implementation stems from the overwhelming volume of security alerts generated by modern security tools. Traditional Security Operations Centers (SOCs) face a critical resource constraint where human analysts can realistically investigate only about 10 high-quality alerts per day, while systems generate thousands of alerts daily. This creates a dangerous blind spot where sophisticated attackers can hide in low-priority alerts that never receive human attention. The speakers emphasized that the most skilled adversaries deliberately avoid triggering critical alerts, instead operating within the noise of routine security events.

The traditional approach of using static playbooks and rule-based automation proved insufficient because these systems lack the contextual understanding and adaptability needed to handle the nuanced nature of security investigations. Each security tool operates in isolation, providing limited visibility into the broader attack context, which makes it difficult to distinguish between genuine threats and false positives.

Technical Architecture and LLMOps Implementation

The solution employs a sophisticated multi-agent architecture built on AWS Bedrock, utilizing different models for different stages of the investigation process. The system architecture demonstrates several key LLMOps principles:

Model Selection Strategy: The implementation uses a tiered approach with different models optimized for specific tasks. Amazon Nova Micro handles initial alert classification due to its cost-effectiveness (100x cheaper than larger models) and speed, while Claude Sonnet performs detailed analysis requiring deeper reasoning capabilities. This strategic model selection balances cost, performance, and capability requirements - a critical consideration in production LLM deployments.
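The routing logic behind this tiering can be sketched as a small dispatcher: a cheap model triages every alert, and only suspicious ones are escalated to the larger model. This is a minimal illustration, not Trellix's actual implementation; the model IDs are illustrative Bedrock-style identifiers, and `classify` stands in for the call to the small model.

```python
# Illustrative model identifiers (assumptions, not confirmed by the talk).
TRIAGE_MODEL = "amazon.nova-micro-v1:0"            # cheap, fast classification tier
ANALYSIS_MODEL = "anthropic.claude-3-5-sonnet-v1"  # deeper, costlier reasoning tier

def route_alert(alert: dict, classify) -> tuple[str, str]:
    """Route an alert through the two-tier pipeline.

    `classify` is a callable standing in for an invocation of the small
    model; it returns 'benign' or 'suspicious'. Only suspicious alerts
    pay for the larger model, which is how the tiering controls cost.
    """
    verdict = classify(alert)
    if verdict == "benign":
        return TRIAGE_MODEL, "closed: triaged as benign"
    return ANALYSIS_MODEL, "escalated for deep analysis"
```

The key design point is that the expensive model never sees the bulk of the alert volume; only the triage tier runs on every event.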

Agent Architecture: The system implements what the speakers describe as "agentic AI" - autonomous software systems that can reason, plan, and execute tasks using LLMs as a central decision-making brain. Unlike traditional playbooks that follow predetermined paths, these agents dynamically select tools and investigation paths based on the evidence they discover.
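The difference from a static playbook can be seen in a minimal agent loop: at each step the model inspects the evidence gathered so far and names the next tool, rather than following a fixed sequence. This is a hypothetical sketch; `decide` stands in for an LLM call, and the tool names are invented.

```python
def run_agent(alert: dict, decide, tools: dict, max_steps: int = 8) -> dict:
    """Minimal agentic loop. `decide` (a stand-in for the LLM) looks at
    the accumulated evidence and returns either the name of the next
    tool to run or 'report' to finish. The investigation path is chosen
    step by step from the evidence, not scripted in advance."""
    evidence = {"alert": alert}
    for _ in range(max_steps):
        choice = decide(evidence)
        if choice == "report":
            break
        evidence[choice] = tools[choice](evidence)
    return evidence
```

A playbook would hard-code the tool order; here the same loop serves any alert type because the ordering lives in the model's decisions.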

Prompt Engineering and Model Management

The implementation demonstrates sophisticated prompt engineering practices that are crucial for production LLM deployments. The speakers emphasized several key principles:

Structured Prompting Framework: The system uses a carefully structured prompting approach that treats each LLM interaction as onboarding a new employee who will exist for only 15 seconds. This framework includes:

Constrained Decision Making: Rather than allowing models to generate free-form responses, the system constrains LLM outputs to specific choices from predefined lists. This approach significantly reduces hallucinations and ensures consistent, predictable behavior - a critical requirement for production security applications.
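Output constraint of this kind is straightforward to enforce in code: present the model with a fixed option set, reject anything outside it, and fall back to a safe default after a few retries. A minimal sketch, assuming a retry-then-escalate policy (the fallback behavior is an assumption, not a detail from the talk); `ask` stands in for the LLM call.

```python
def constrained_choice(ask, allowed: set, retries: int = 2) -> str:
    """Force the model's answer into a fixed choice set.

    `ask` stands in for an LLM call that receives the allowed options in
    its prompt. Any answer outside the set is rejected and retried; if
    retries run out, a conservative default is returned instead of
    accepting a free-form (possibly hallucinated) response.
    """
    for _ in range(retries + 1):
        answer = ask(allowed).strip().lower()
        if answer in allowed:
            return answer
    return "escalate_to_human"  # assumed safe fallback for this sketch
```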

Context Integration: The system leverages the models’ pre-trained knowledge of security frameworks like MITRE ATT&CK, combined with tactical information specific to the organization’s environment. This allows the AI to understand both the theoretical aspects of threats and the practical implications within the specific organizational context.
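One way to picture this combination of framework knowledge and local context is a prompt builder that annotates each alert with its MITRE ATT&CK technique and the organization's environment details. The two mappings shown are real ATT&CK entries, but the prompt shape and field names are illustrative assumptions.

```python
# Two real MITRE ATT&CK technique mappings; a production system would
# carry the full framework (or rely on the model's pretrained knowledge).
ATTACK_CONTEXT = {
    "T1059": ("Execution", "Command and Scripting Interpreter"),
    "T1566": ("Initial Access", "Phishing"),
}

def build_prompt(alert: dict) -> str:
    """Combine framework-level threat context with environment-specific
    detail so the model reasons about both (field names are assumptions)."""
    tactic, technique = ATTACK_CONTEXT.get(alert["technique_id"], ("Unknown", "Unknown"))
    return (
        f"Alert: {alert['summary']}\n"
        f"MITRE ATT&CK: {alert['technique_id']} ({technique}, tactic: {tactic})\n"
        f"Environment: {alert.get('environment', 'unspecified')}\n"
        "Assess whether this activity is expected in this environment."
    )
```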

Data Integration and Tool Orchestration

The solution addresses one of the most challenging aspects of LLMOps: integrating diverse data sources and external tools. The system demonstrates several advanced capabilities:

Multi-Source Data Correlation: The agents can correlate information across multiple security platforms including network monitoring tools, endpoint detection and response (EDR) systems, vulnerability management platforms, and threat intelligence feeds. This comprehensive data integration is essential for accurate threat assessment but represents a significant technical challenge in production environments.
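The core of such correlation is joining events from independent platforms on shared indicators, so the agent reasons over one combined picture instead of isolated tool outputs. A minimal sketch under assumed field names (`indicator`, `source`):

```python
from collections import defaultdict

def correlate(events: list[dict]) -> dict:
    """Group events from different security platforms by a shared
    indicator (IP, file hash, hostname). An indicator reported by
    multiple independent sources is a stronger signal than any single
    tool's alert, which is the point of cross-platform correlation."""
    by_indicator = defaultdict(list)
    for event in events:
        by_indicator[event["indicator"]].append(event["source"])
    return {ind: sources for ind, sources in by_indicator.items() if len(sources) > 1}
```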

Tool Registry and Dynamic Selection: The system maintains a registry of available tools and their capabilities, allowing agents to dynamically select appropriate tools based on the investigation needs. This includes both AWS native services and third-party integrations through their 500+ partner ecosystem.
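A tool registry of this kind can be as simple as a capability catalog that is rendered into the model's prompt, paired with a dispatcher that refuses any tool name outside the catalog. The tool names and descriptions below are invented for illustration; the pattern, not the inventory, is the point.

```python
# Hypothetical capability catalog; real deployments would register
# hundreds of partner integrations here.
TOOL_REGISTRY = {
    "sandbox_detonate": "Run a file in an isolated sandbox and report its behavior",
    "threat_intel_lookup": "Query threat intelligence feeds for an indicator",
    "edr_process_tree": "Fetch the process tree around an endpoint event",
}

def tool_menu() -> str:
    """Render the registry as the option list handed to the model, so it
    can only select tools that actually exist."""
    return "\n".join(f"- {name}: {desc}" for name, desc in sorted(TOOL_REGISTRY.items()))

def dispatch(name: str, handlers: dict, **kwargs):
    """Execute a model-selected tool, rejecting names outside the registry."""
    if name not in TOOL_REGISTRY:
        raise ValueError(f"model selected unknown tool: {name}")
    return handlers[name](**kwargs)
```

Keeping selection (the menu) and execution (the dispatcher) behind the same registry is what lets agents pick tools dynamically without being able to invoke anything unregistered.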

Real-time Decision Making: The system processes alerts in real-time, performing comprehensive investigations that would typically require human analysts hours or days to complete. The agents can execute complex investigation workflows including sandbox analysis, threat intelligence lookups, and remediation planning.

Model Evaluation and Quality Assurance

The implementation includes sophisticated evaluation frameworks specifically designed for LLM-based security applications:

Custom Benchmarking: Rather than relying on generic model benchmarks, Trellix developed security-specific evaluation criteria. This includes testing the models’ ability to decode and understand encoded commands (like Base64), correlate threat intelligence, and make accurate threat assessments.
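A single case from such a security-specific benchmark might look like the following: the model is shown a Base64-encoded command (a common obfuscation in PowerShell-based attacks) and is scored on whether its answer recovers the plaintext. The scoring rule here is a simplified assumption; real evaluation would grade the full assessment, not just the decode.

```python
import base64

def eval_decode_case(model_answer: str, encoded_command: str) -> bool:
    """Score one benchmark case: does the model's answer contain the
    decoded plaintext of the obfuscated command? A crude substring
    check stands in for a fuller grading rubric."""
    plaintext = base64.b64decode(encoded_command).decode()
    return plaintext.lower() in model_answer.lower()
```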

Quality Metrics: The system measures response quality across multiple dimensions, including the accuracy of threat classification, the completeness of investigation reports, and the appropriateness of recommended actions. The speakers emphasized that traditional software QA approaches don’t apply to LLM systems due to their non-deterministic nature.

Continuous Improvement: The system incorporates feedback mechanisms that allow it to learn from analyst decisions and case outcomes. This includes reading through closed case notes to understand analyst preferences and generating dynamic guidance for future investigations.

Production Deployment Considerations

The case study reveals several critical considerations for deploying LLMs in production security environments:

Privacy and Data Isolation: Trellix made a deliberate decision not to fine-tune models with customer data, ensuring complete privacy isolation between customers. This approach also prevents the system from learning bad behaviors or biases from individual customer environments.

Human-in-the-Loop Integration: While the system can operate autonomously, it includes configurable human validation points for critical actions. Organizations can define risk thresholds that determine when human approval is required before taking remediation actions.
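Such a validation point reduces to a risk gate: actions below the organization's configured threshold execute automatically, everything above it is queued for analyst approval. The threshold semantics and return shape below are assumptions for illustration.

```python
def gate_action(action: str, risk_score: float, auto_threshold: float = 0.3) -> tuple[str, str]:
    """Configurable human-validation point. `risk_score` (0..1) reflects
    the blast radius of the proposed remediation; `auto_threshold` is the
    organization's risk tolerance. Anything above it waits for a human."""
    if risk_score <= auto_threshold:
        return ("execute", action)
    return ("await_approval", action)
```

Tagging an alert might score 0.1 and run unattended, while isolating a production host would score high and always pause for approval.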

Transparency and Auditability: Every decision made by the system is fully documented and auditable. The agents generate detailed reports explaining their reasoning, the evidence they considered, and the actions they recommend. This transparency is crucial for security operations where accountability is paramount.

Cost Optimization: The implementation demonstrates sophisticated cost management through model selection, caching strategies, and tiered processing approaches. The speakers emphasized that cost considerations can significantly impact the feasibility of LLM deployments at scale.
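One of the simplest caching strategies the passage alludes to can be sketched as memoizing model responses by prompt hash, so that identical investigation steps (for example, summarizing the same indicator twice in one day) never pay for a second model call. This is a generic pattern, not a description of Trellix's caching layer; a production cache would also need expiry and size bounds.

```python
import hashlib

_cache: dict[str, str] = {}

def cached_invoke(prompt: str, invoke) -> str:
    """Memoize model responses by prompt hash. `invoke` stands in for
    the actual (billed) model call; repeats of an identical prompt are
    served from the cache for free."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = invoke(prompt)
    return _cache[key]
```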

Performance and Reliability

The system has been in production for approximately 18 months, providing substantial real-world validation. Key performance achievements include:

Alert Processing Scale: The system can process thousands of alerts daily, providing detailed investigation reports for each one. This represents a significant improvement over traditional approaches where most alerts go uninvestigated.

Investigation Quality: The generated investigation reports are sophisticated enough to be used for training human analysts, demonstrating the high quality of the analysis. The system can perform complex reasoning tasks like decoding encoded commands and correlating them with threat intelligence.

Adaptability: The system has demonstrated the ability to handle custom alerts and investigation requirements, making it suitable for diverse organizational needs and threat landscapes.

Lessons Learned and Best Practices

The presentation shared several important lessons learned from production deployment:

Model Selection Importance: The choice of model significantly impacts both performance and cost. Organizations should develop evaluation frameworks specific to their use cases rather than relying on generic benchmarks.

Prompt Engineering Criticality: The quality of prompts directly impacts system performance. The speakers emphasized the importance of structured, constrained prompting approaches for production deployments.

Integration Complexity: While the technical implementation is sophisticated, the real challenge lies in integrating diverse data sources and maintaining consistent data quality across multiple platforms.

Change Management: The transition from traditional SOC operations to AI-powered investigation requires significant change management and analyst training.

This case study represents one of the most comprehensive examples of agentic AI deployment in production, demonstrating how sophisticated LLM systems can be successfully implemented in critical business applications while maintaining the reliability, transparency, and performance requirements necessary for cybersecurity operations.
