This case study presents a comprehensive implementation of agentic AI in cybersecurity operations by Trellix, showcasing one of the most sophisticated production deployments of multi-agent LLM systems in the security domain. The presentation was delivered by Martin Holste, CTO for Cloud and AI at Trellix, alongside Jason Garman from AWS, demonstrating a mature partnership between a security vendor and a cloud infrastructure provider.
**Problem Context and Business Challenge**
The fundamental challenge addressed by this implementation stems from the overwhelming volume of security alerts generated by modern security tools. Traditional Security Operations Centers (SOCs) face a critical resource constraint where human analysts can realistically investigate only about 10 high-quality alerts per day, while systems generate thousands of alerts daily. This creates a dangerous blind spot where sophisticated attackers can hide in low-priority alerts that never receive human attention. The speakers emphasized that the most skilled adversaries deliberately avoid triggering critical alerts, instead operating within the noise of routine security events.
The traditional approach of using static playbooks and rule-based automation proved insufficient because these systems lack the contextual understanding and adaptability needed to handle the nuanced nature of security investigations. Each security tool operates in isolation, providing limited visibility into the broader attack context, which makes it difficult to distinguish between genuine threats and false positives.
**Technical Architecture and LLMOps Implementation**
The solution employs a sophisticated multi-agent architecture built on AWS Bedrock, utilizing different models for different stages of the investigation process. The system architecture demonstrates several key LLMOps principles:
**Model Selection Strategy**: The implementation uses a tiered approach with different models optimized for specific tasks. Amazon Nova Micro handles initial alert classification due to its cost-effectiveness (100x cheaper than larger models) and speed, while Claude Sonnet performs detailed analysis requiring deeper reasoning capabilities. This strategic model selection balances cost, performance, and capability requirements - a critical consideration in production LLM deployments.
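A minimal sketch of this tiered routing is shown below. The stage names and model identifiers are illustrative assumptions (standard Bedrock-style IDs), not Trellix's actual configuration:

```python
# Hypothetical sketch of tiered model selection: a cheap, fast model for
# high-volume triage, a stronger reasoning model only for alerts that
# survive triage. Model IDs are illustrative, not Trellix's real config.
MODEL_TIERS = {
    "triage": "amazon.nova-micro-v1:0",
    "deep_analysis": "anthropic.claude-3-5-sonnet-20240620-v1:0",
}

def pick_model(stage: str) -> str:
    """Return the model ID to invoke for a given investigation stage."""
    if stage not in MODEL_TIERS:
        raise ValueError(f"unknown stage: {stage!r}")
    return MODEL_TIERS[stage]
```

In a deployment, the triage model's verdict would decide whether an alert is ever routed to the more expensive analysis tier at all.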
**Agent Architecture**: The system implements what the speakers describe as "agentic AI" - autonomous software systems that can reason, plan, and execute tasks using LLMs as a central decision-making brain. Unlike traditional playbooks that follow predetermined paths, these agents dynamically select tools and investigation paths based on the evidence they discover. The architecture includes:
- **Supervisory Agent**: Orchestrates the overall investigation process and coordinates between specialized sub-agents
- **Tool Selection**: Agents have access to a comprehensive toolbox of security tools and data sources, including AWS services like GuardDuty, Security Hub, and Security Lake, as well as third-party integrations
- **Memory Systems**: Both short-term and long-term memory capabilities allow agents to maintain context across investigation steps and learn from previous interactions
- **Dynamic Planning**: Agents can replan and adapt their investigation strategy based on findings, rather than following fixed workflows
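The supervisor/sub-agent pattern described above can be sketched as follows, with stubbed tools and a fixed plan standing in for the LLM's dynamic replanning (all names here are invented for illustration):

```python
# Toy supervisor loop: gather evidence from stub tools into short-term
# memory, then decide. A real agent would replan after each finding.
def lookup_threat_intel(alert):
    # stub: flag indicators on a toy blocklist
    return {"malicious": alert["indicator"].endswith(".evil")}

def query_edr(alert):
    # stub: pretend to query an EDR for the affected host
    return {"compromised": False}

TOOLS = {"threat_intel": lookup_threat_intel, "edr": query_edr}

def supervisor(alert, plan=("threat_intel", "edr")):
    memory = []  # short-term memory: evidence gathered this investigation
    for tool_name in plan:
        memory.append((tool_name, TOOLS[tool_name](alert)))
    hit = any(v for _, finding in memory for v in finding.values())
    return ("escalate" if hit else "close"), memory
```

The key difference from a static playbook is that in the production system the plan itself is an LLM output that can change mid-investigation as evidence accumulates.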
**Prompt Engineering and Model Management**
The implementation demonstrates sophisticated prompt engineering practices that are crucial for production LLM deployments. The speakers emphasized several key principles:
**Structured Prompting Framework**: The system uses a carefully structured prompting approach that treats each LLM interaction as onboarding a new employee who will exist for only 15 seconds. This framework includes:
- Clear goal definition and company policies
- Specific step-by-step instructions using ordinal numbers
- Explicit return format specifications
- Warnings and guidance based on previous learnings
- Raw data injection at the end of the prompt
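A prompt assembled in that order might look like the sketch below (the section labels and helper are assumptions, not Trellix's actual template):

```python
# Hypothetical prompt builder following the structure described above:
# goal and policy first, numbered instructions, explicit return format,
# learned warnings, and the raw alert data injected last.
def build_prompt(goal, policies, steps, return_format, warnings, raw_data):
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, 1))
    return (
        f"GOAL: {goal}\n"
        f"POLICIES: {policies}\n"
        f"INSTRUCTIONS:\n{numbered}\n"
        f"RETURN FORMAT: {return_format}\n"
        f"WARNINGS: {warnings}\n"
        f"DATA:\n{raw_data}"
    )
```

Putting the raw data last keeps the instructions stable across calls, which also plays well with prompt caching.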
**Constrained Decision Making**: Rather than allowing models to generate free-form responses, the system constrains LLM outputs to specific choices from predefined lists. This approach significantly reduces hallucinations and ensures consistent, predictable behavior - a critical requirement for production security applications.
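One simple way to enforce this constraint at the output boundary is a validator like the following sketch (choice list and fallback are illustrative):

```python
# Map free-form model output onto a fixed choice list; anything outside
# the list falls back to a safe default instead of flowing downstream,
# which curbs hallucinated verdicts.
def constrain(raw_output, choices=("benign", "suspicious", "malicious"),
              fallback="suspicious"):
    cleaned = raw_output.strip().strip(".").lower()
    return cleaned if cleaned in choices else fallback
```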
**Context Integration**: The system leverages the models' pre-trained knowledge of security frameworks like MITRE ATT&CK, combined with tactical information specific to the organization's environment. This allows the AI to understand both the theoretical aspects of threats and the practical implications within the specific organizational context.
**Data Integration and Tool Orchestration**
The solution addresses one of the most challenging aspects of LLMOps: integrating diverse data sources and external tools. The system demonstrates several advanced capabilities:
**Multi-Source Data Correlation**: The agents can correlate information across multiple security platforms including network monitoring tools, endpoint detection and response (EDR) systems, vulnerability management platforms, and threat intelligence feeds. This comprehensive data integration is essential for accurate threat assessment but represents a significant technical challenge in production environments.
**Tool Registry and Dynamic Selection**: The system maintains a registry of available tools and their capabilities, allowing agents to dynamically select appropriate tools based on the investigation needs. This includes both AWS native services and third-party integrations through their 500+ partner ecosystem.
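A registry of this kind can be sketched as a capability index the agent queries at plan time; tool names and capability labels below are invented for illustration:

```python
# Toy registry mapping tool names to declared capabilities so an agent
# can pick tools by what an investigation step needs.
class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name, capabilities, fn):
        self._tools[name] = (set(capabilities), fn)

    def select(self, needed_capability):
        return [name for name, (caps, _) in self._tools.items()
                if needed_capability in caps]

registry = ToolRegistry()
registry.register("guardduty", {"aws", "threat-detection"}, lambda alert: alert)
registry.register("sandbox", {"file-detonation"}, lambda sample: sample)
```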
**Real-time Decision Making**: The system processes alerts in real time, performing comprehensive investigations that would take human analysts hours or days to complete. The agents can execute complex investigation workflows including sandbox analysis, threat intelligence lookups, and remediation planning.
**Model Evaluation and Quality Assurance**
The implementation includes sophisticated evaluation frameworks specifically designed for LLM-based security applications:
**Custom Benchmarking**: Rather than relying on generic model benchmarks, Trellix developed security-specific evaluation criteria. This includes testing the models' ability to decode and understand encoded commands (like Base64), correlate threat intelligence, and make accurate threat assessments.
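A single case from such a benchmark could look like the harness below; the scoring function and reference solver are assumptions sketched for illustration, not Trellix's actual evaluation code:

```python
# Minimal eval harness for one security-specific benchmark case: does a
# model-like function recover the plaintext of a Base64-encoded command?
import base64

def eval_decode_case(model_fn, encoded_cmd):
    expected = base64.b64decode(encoded_cmd).decode()
    return model_fn(encoded_cmd) == expected

# reference solver, used here only to demonstrate the harness
reference = lambda s: base64.b64decode(s).decode()

sample_case = base64.b64encode(b"whoami /priv").decode()
```

In practice each case would be an `(input, expected)` pair and the model function an actual LLM call, with aggregate pass rates compared across candidate models.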
**Quality Metrics**: The system measures response quality across multiple dimensions, including the accuracy of threat classification, the completeness of investigation reports, and the appropriateness of recommended actions. The speakers emphasized that traditional software QA approaches don't apply to LLM systems due to their non-deterministic nature.
**Continuous Improvement**: The system incorporates feedback mechanisms that allow it to learn from analyst decisions and case outcomes. This includes reading through closed case notes to understand analyst preferences and generating dynamic guidance for future investigations.
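The feedback mechanism might distill closed cases into prompt guidance along these lines (field names and phrasing are hypothetical):

```python
# Turn analyst dispositions on closed cases into one-line hints that
# could be injected into the warnings section of future prompts.
def guidance_from_cases(closed_cases):
    hints = []
    for case in closed_cases:
        if case["analyst_verdict"] != case["ai_verdict"]:
            hints.append(
                f"For '{case['alert_type']}' alerts, analysts previously "
                f"ruled {case['analyst_verdict']}; weigh that history."
            )
    return hints
```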
**Production Deployment Considerations**
The case study reveals several critical considerations for deploying LLMs in production security environments:
**Privacy and Data Isolation**: Trellix made a deliberate decision not to fine-tune models with customer data, ensuring complete privacy isolation between customers. This approach also prevents the system from learning bad behaviors or biases from individual customer environments.
**Human-in-the-Loop Integration**: While the system can operate autonomously, it includes configurable human validation points for critical actions. Organizations can define risk thresholds that determine when human approval is required before taking remediation actions.
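The risk-threshold gate can be sketched as a simple dispatcher; the threshold scale and action names here are assumptions for illustration:

```python
# Route a remediation action: at or above the configured risk threshold
# it waits in a queue for human approval, below it runs autonomously.
def dispatch(action, risk, auto_threshold, approval_queue, execute):
    if risk >= auto_threshold:
        approval_queue.append(action)
        return "pending_approval"
    execute(action)
    return "executed"
```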
**Transparency and Auditability**: Every decision made by the system is fully documented and auditable. The agents generate detailed reports explaining their reasoning, the evidence they considered, and the actions they recommend. This transparency is crucial for security operations where accountability is paramount.
**Cost Optimization**: The implementation demonstrates sophisticated cost management through model selection, caching strategies, and tiered processing approaches. The speakers emphasized that cost considerations can significantly impact the feasibility of LLM deployments at scale.
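One common caching tactic in such pipelines is memoizing enrichment lookups so repeated indicators never trigger a second expensive call; a minimal sketch, assuming an in-process cache rather than whatever store the production system uses:

```python
from functools import lru_cache

LOOKUPS = {"count": 0}  # counts how often the "expensive" lookup runs

@lru_cache(maxsize=4096)
def enrich(ioc):
    """Stand-in for an expensive enrichment (intel feed or LLM call);
    repeated IOCs are served from cache instead of re-querying."""
    LOOKUPS["count"] += 1
    return ("enriched", ioc)
```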
**Performance and Reliability**
The system has been in production for approximately 18 months, providing substantial real-world validation. Key performance achievements include:
**Alert Processing Scale**: The system can process thousands of alerts daily, providing detailed investigation reports for each one. This represents a significant improvement over traditional approaches where most alerts go uninvestigated.
**Investigation Quality**: The generated investigation reports are detailed enough to be used as training material for human analysts, a strong signal of the quality of the analysis. The system can perform complex reasoning tasks such as decoding encoded commands and correlating them with threat intelligence.
**Adaptability**: The system has demonstrated the ability to handle custom alerts and investigation requirements, making it suitable for diverse organizational needs and threat landscapes.
**Lessons Learned and Best Practices**
The presentation shared several important lessons learned from production deployment:
**Model Selection Importance**: The choice of model significantly impacts both performance and cost. Organizations should develop evaluation frameworks specific to their use cases rather than relying on generic benchmarks.
**Prompt Engineering Criticality**: The quality of prompts directly impacts system performance. The speakers emphasized the importance of structured, constrained prompting approaches for production deployments.
**Integration Complexity**: While the technical implementation is sophisticated, the real challenge lies in integrating diverse data sources and maintaining consistent data quality across multiple platforms.
**Change Management**: The transition from traditional SOC operations to AI-powered investigation requires significant change management and analyst training.
This case study represents one of the most comprehensive examples of agentic AI deployment in production, demonstrating how sophisticated LLM systems can be successfully implemented in critical business applications while maintaining the reliability, transparency, and performance requirements necessary for cybersecurity operations.