
AI Agent Automation of Security Operations Center Analysis

Doppel 2025

Doppel implemented an AI agent using OpenAI's o1 model to automate the analysis of potential security threats in their Security Operations Center (SOC). The system processes over 10 million websites, social media accounts, and mobile apps daily to identify phishing attacks. Through a combination of initial expert knowledge transfer and training on historical decisions, the AI agent achieved human-level performance, reducing SOC workloads by 30% within 30 days while maintaining lower false-positive rates than human analysts.

Industry

Tech

Overview

Doppel is a cybersecurity company that provides an AI-powered social engineering defense platform. Their services include brand protection, executive protection, phishing simulation, and connecting customer-detected scams to takedown actions. The company recently announced a $70M Series C funding round, indicating significant traction in the market. This case study, published in January 2025, describes how Doppel deployed an AI agent using OpenAI’s o1 model to automate portions of their Security Operations Center (SOC) workload, claiming a 30% reduction in manual work within 30 days of deployment.

The case study is presented from the perspective of a company promoting its own technology and success, so the claims should be viewed with appropriate skepticism. However, the technical approaches described offer useful insights into production LLM deployment for security operations.

The Problem: Scale and Complexity in Security Operations

Doppel’s platform ingests more than 10 million data points daily, including websites, social media accounts, and mobile applications, to identify phishing attacks worldwide. While traditional machine learning models can effectively filter out obvious false positives at this scale, the company identified that nuanced decision-making remained a significant bottleneck requiring human analyst intervention.

The decisions that required human judgment were complex and multi-faceted. Analysts had to determine whether a detected site warranted a takedown, identify which platforms should be targeted, and decide under which policies the takedown should be requested. These decisions required interpreting unstructured data such as screenshots, time-series activity patterns, and customer-specific policies. Critically, analysts also needed to explain their rationale for each decision, adding another layer of complexity.
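
These multi-part judgments can be pictured as a single structured record. The sketch below is purely illustrative; the field names (`detection_id`, `target_platforms`, and so on) are assumptions, not Doppel's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class TakedownDecision:
    """One analyst (or agent) judgment on a flagged detection (hypothetical schema)."""
    detection_id: str
    is_threat: bool                      # does this warrant a takedown at all?
    target_platforms: list = field(default_factory=list)  # e.g. registrar, host
    policy: str = ""                     # policy the takedown request cites
    rationale: str = ""                  # required explanation of the judgment

decision = TakedownDecision(
    detection_id="det-001",
    is_threat=True,
    target_platforms=["registrar", "hosting_provider"],
    policy="brand-impersonation",
    rationale="Cloned login page on a look-alike domain using the customer's logo.",
)
```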

The stakes were high: incorrect decisions could either miss genuine threats or disrupt legitimate activity. Even for trained human analysts, achieving consistent accuracy required extensive knowledge and ongoing training.

The Solution: Training an AI Agent as a Cybersecurity Expert

Doppel’s approach to building their AI agent involved several key LLMOps practices that are worth examining in detail.

Knowledge Transfer from Human Training Materials

The initial phase of development involved what Doppel describes as “knowledge transfer.” They took the same training methods and materials used to train human analysts—covering topics like phishing, malware, and brand abuse—and applied them directly to training the AI model. This approach mirrors the concept of prompt engineering and fine-tuning, where domain expertise is encoded into the model’s behavior through carefully curated inputs. The case study notes this produced a “noticeable jump in performance” but that the AI still struggled with non-obvious scenarios.
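
One plausible way to implement this kind of knowledge transfer, shown here purely as an illustration (the function and topic names are assumptions, not Doppel's actual prompts), is to fold the analyst training material into the model's system prompt:

```python
def build_system_prompt(training_sections: dict) -> str:
    """Fold human analyst training material into one system prompt (hypothetical)."""
    parts = ["You are a SOC analyst deciding whether detections warrant takedowns."]
    for topic, material in training_sections.items():
        parts.append(f"## {topic}\n{material}")
    parts.append("Always explain your rationale for each decision.")
    return "\n\n".join(parts)

prompt = build_system_prompt({
    "Phishing": "Look for credential-harvesting forms and look-alike domains.",
    "Brand abuse": "Flag unauthorized use of customer logos and trademarks.",
})
```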

Incorporating Historical Decisions

The breakthrough, according to Doppel, came when they incorporated “thousands of well-curated historical decisions” into the model. This approach effectively distilled years of analyst experience into the AI agent. This technique resembles few-shot learning or fine-tuning approaches where high-quality labeled data from previous human decisions is used to teach the model the nuances of expert judgment. The emphasis on “well-curated” suggests significant effort went into data quality assurance for this training dataset.
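
In few-shot terms, each curated historical decision becomes an example pair in the model's context. A minimal sketch with invented field names, not Doppel's format:

```python
def to_few_shot_messages(cases: list) -> list:
    """Render curated historical decisions as user/assistant message pairs."""
    messages = []
    for case in cases:
        messages.append({"role": "user", "content": case["evidence"]})
        messages.append({"role": "assistant", "content": case["decision"]})
    return messages

history = [
    {"evidence": "Domain acme-logln.example serving a cloned Acme sign-in page.",
     "decision": "Takedown via registrar and host under brand-impersonation policy; "
                 "rationale: credential-harvesting form plus trademark misuse."},
]
msgs = to_few_shot_messages(history)
```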

This approach is notable for production LLM systems because it addresses a common challenge: getting models to handle edge cases that require domain expertise beyond what general pre-training provides.

Continuous Learning and Feedback Loops

The case study emphasizes that the agent is “constantly learning as it sees new examples.” This represents a critical LLMOps consideration: maintaining model performance over time as the threat landscape evolves. Phishing attacks are described as “a high-speed cat-and-mouse game,” making continuous adaptation essential.

However, the text does not provide specific technical details about how this continuous learning is implemented. Key questions that remain unanswered include: How frequently is the model retrained or updated? What mechanisms ensure new learning doesn’t degrade performance on previously mastered scenarios? How are new examples curated and validated before being used for training? The lack of detail here suggests this may be an area where the implementation is still evolving.
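
Whatever the actual implementation, a common pattern is to gate new examples through validation before they reach the training pool. The check below is a hypothetical curation gate, not Doppel's pipeline:

```python
def curate_example(example: dict, reviewed_by_human: bool, agreement: float,
                   min_agreement: float = 0.9) -> bool:
    """Admit a new example into the training pool only if it passes validation."""
    required = {"evidence", "decision", "rationale"}
    if not required.issubset(example):
        return False          # incomplete records never enter the pool
    if not reviewed_by_human:
        return False          # require human sign-off before the agent learns from it
    return agreement >= min_agreement  # require cross-reviewer agreement

ok = curate_example(
    {"evidence": "cloned login page", "decision": "takedown", "rationale": "phishing"},
    reviewed_by_human=True,
    agreement=0.95,
)
```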

Model Selection: OpenAI’s o1 Model

Doppel selected OpenAI’s o1 model for their AI agent, which was showcased at OpenAI DevDay. The o1 model family is notable for its enhanced reasoning capabilities compared to earlier GPT models, using chain-of-thought processing to work through complex problems. This choice aligns with the described use case: the decisions being automated require judgment and reasoning, not just pattern recognition.

The selection of a reasoning-focused model suggests that Doppel’s prompting strategy likely leverages the model’s ability to think through multi-step decisions, though specific prompting techniques are not disclosed.

Production Deployment and Results

Claimed Performance

Doppel claims their AI agent “exceeded human-level benchmarks” with a lower false-positive rate and higher detection of genuine threats compared to human analysts. These are significant claims, though the text provides no specific metrics, sample sizes, or details about how the benchmarks were established and measured.

From an LLMOps perspective, the evaluation methodology is a critical gap in this case study. Questions such as how the comparison was conducted, whether it was on a holdout test set or live production traffic, and how statistical significance was determined remain unanswered. The claim of “human-level” performance is compelling marketing but would benefit from more rigorous documentation.
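
As a point of reference, comparing false-positive rates on a labeled holdout set is one standard way such a benchmark could be run. This toy calculation is illustrative only, not Doppel's methodology or data:

```python
def fp_rate(preds, labels):
    """False-positive rate: benign cases incorrectly flagged as threats."""
    false_positives = sum(1 for p, y in zip(preds, labels) if p and not y)
    negatives = sum(1 for y in labels if not y)
    return false_positives / negatives if negatives else 0.0

# Invented holdout labels (True = genuine threat) and agent predictions.
labels = [True, False, False, True, False]
agent  = [True, False, False, True, True]   # one benign case flagged in error
rate = fp_rate(agent, labels)               # 1 false positive / 3 benign cases
```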

Workload Reduction

The headline claim is a 30% reduction in SOC workload within 30 days. This suggests the deployment was rapid and the impact was measurable quickly. However, the text does not clarify what metrics were used to measure “workload”—whether it’s analyst hours, number of cases processed, or some other operational metric.

Operational Benefits

For Doppel’s own operations, the deployment allowed human analysts to focus on complex threat patterns while AI handled routine decisions at scale. For customers, the benefits are described as faster response times and more threats eliminated. These benefits are plausible outcomes of successful automation but again lack specific quantification.

LLMOps Considerations and Lessons

Hybrid Architecture

The described system represents a hybrid approach where traditional ML handles initial filtering and LLM-based agents handle nuanced decision-making. This is a pragmatic architecture that leverages each technology’s strengths: traditional ML for high-throughput, low-latency pattern matching, and LLMs for complex reasoning tasks.
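
A routing layer for such a hybrid system can be sketched in a few lines; the thresholds and labels here are invented for illustration, not Doppel's values:

```python
def route(ml_score: float,
          clear_benign: float = 0.05, clear_threat: float = 0.98) -> str:
    """Traditional ML filters the obvious ends; the LLM agent gets the middle."""
    if ml_score < clear_benign:
        return "auto_dismiss"        # cheap model is confident: false positive
    if ml_score > clear_threat:
        return "auto_takedown"       # cheap model is confident: clear threat
    return "llm_agent"               # nuanced case: escalate to the reasoning agent
```

Routing on a cheap classifier score keeps the high-volume path fast and reserves the expensive reasoning model for the ambiguous middle band.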

Explainability

The case study notes that decisions require explanation of rationale. LLMs, particularly reasoning-focused models like o1, are well-suited for generating explanations alongside decisions. This is an important consideration for security operations where audit trails and decision justification may be required for compliance or post-incident analysis.

Risk Management

The text acknowledges the high stakes of these decisions: missing threats or disrupting legitimate activity. However, it does not describe what safeguards are in place for the AI agent’s decisions. Questions about human oversight, confidence thresholds for automated action, and rollback mechanisms are not addressed.
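
One common safeguard pattern, shown here as a hypothetical sketch rather than anything Doppel describes, is confidence-thresholded dispatch with a human-review fallback:

```python
def dispatch(decision: dict, confidence: float, threshold: float = 0.9) -> str:
    """Act automatically only above a confidence bar; otherwise queue for a human."""
    if confidence >= threshold and decision.get("is_threat") is False:
        return "close"                # confident dismissal: no analyst needed
    if confidence >= threshold and decision.get("is_threat"):
        return "file_takedown"        # confident threat: act, keep an audit trail
    return "human_review"             # uncertain either way: human in the loop
```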

Data Handling

Processing screenshots, time-series data, and unstructured content requires multi-modal capabilities or sophisticated preprocessing. The text does not detail how these data types are prepared for the LLM or whether multi-modal models are involved.

Critical Assessment

While this case study describes an interesting application of LLMs in security operations, several aspects warrant skepticism. The account comes from Doppel itself, so the headline numbers serve a promotional purpose. No concrete metrics, sample sizes, or benchmark methodology support the claim of exceeding human-level performance, and the 30% "workload reduction" is never defined in terms of a measurable quantity. The continuous-learning pipeline, the safeguards around automated takedowns, and the handling of multi-modal inputs are all left undescribed.

That said, the general approach described—using domain expertise to train agents, incorporating historical decisions, and maintaining continuous learning loops—represents sound LLMOps practices that are applicable across industries.

Future Directions

Doppel indicates they are “just getting started” and their engineering team is “re-imagining what’s possible in the SOC using AI agents from the ground up.” This suggests ongoing investment in LLM-based automation for security operations, with potential expansion of automated capabilities beyond the initial 30% of workload currently addressed.
