ZenML

Lessons from Red Teaming 100+ Generative AI Products

Microsoft 2025
View original source

Microsoft's AI Red Team (AIRT) conducted extensive red teaming operations on over 100 generative AI products to assess their safety and security. The team developed a comprehensive threat model ontology and leveraged both manual and automated testing approaches through their PyRIT framework. Through this process, they identified key lessons about AI system vulnerabilities, the importance of human expertise in red teaming, and the challenges of measuring responsible AI impacts. The findings highlight both traditional security risks and novel AI-specific attack vectors that need to be considered when deploying AI systems in production.

Industry

Tech

Technologies

Overview

Microsoft’s AI Red Team (AIRT) presents a comprehensive overview of their experience red teaming over 100 generative AI products, providing critical insights into how LLMs and other AI systems can be tested for safety and security vulnerabilities in production environments. The team was officially established in 2018 and has evolved significantly as AI capabilities have expanded, particularly following the release of ChatGPT in 2022 and the subsequent proliferation of AI copilots and agentic systems.

This paper is particularly valuable for LLMOps practitioners because it addresses the practical realities of securing AI systems in production, moving beyond academic benchmarks to real-world vulnerability assessment. The work spans both traditional security concerns (data exfiltration, privilege escalation) and AI-specific responsible AI (RAI) harms (harmful content generation, bias, psychosocial impacts).

Threat Model Ontology

AIRT developed a structured ontology for modeling GenAI system vulnerabilities that is essential for organizing red teaming operations. The ontology consists of several key components:

Key Lessons for LLMOps

Lesson 1: Context-Aware Testing

The paper emphasizes that effective red teaming must consider both what the AI system can do (capability constraints) and where it is applied (downstream applications). Larger models often acquire capabilities that introduce new attack vectors—for example, understanding advanced encodings like base64 or ASCII art that can be exploited for malicious instructions. The same model deployed as a creative writing assistant versus a healthcare records system requires fundamentally different risk assessments.

Lesson 2: Simplicity Over Sophistication

A crucial finding for production AI systems is that “real attackers don’t compute gradients, they prompt engineer.” Gradient-based adversarial methods, while academically interesting, are computationally expensive and typically require full model access that commercial systems don’t provide. Simple techniques like manually crafted jailbreaks (Skeleton Key, Crescendo) and basic image manipulations often work better in practice. This has significant implications for LLMOps: defense strategies should prioritize protection against simple attacks that are actually likely to be attempted by real adversaries.

The paper advocates for a system-level adversarial mindset. AI models are deployed within broader systems including infrastructure, input filters, databases, and cloud resources. Attacks that combine multiple techniques across the system stack are often most effective. One example describes an attack that used low-resource language prompt injections for reconnaissance, cross-prompt injection to generate malicious scripts, and then executed code to exfiltrate private data.

Lesson 3: Beyond Benchmarking

AI red teaming is fundamentally different from safety benchmarking. Benchmarks measure preexisting notions of harm on curated datasets, while red teaming explores unfamiliar scenarios and helps define novel harm categories. The paper describes investigating how LLMs could be weaponized for automated scamming—connecting jailbroken models to text-to-speech systems to create end-to-end scam operations. This represents a category of harm that wouldn’t be captured by traditional benchmarks.

Lesson 4: Automation with PyRIT

To address the challenge of testing at scale, Microsoft developed PyRIT (Python Risk Identification Tool), an open-source framework for AI red teaming. PyRIT provides several key components:

The framework enables coverage of the risk landscape that would be impossible with fully manual testing while accounting for the non-deterministic nature of AI models. However, the paper is careful to note that PyRIT is a tool that leverages the same powerful capabilities it tests against—uncensored models can be used to automatically jailbreak target systems.

Lesson 5: Human-in-the-Loop Requirement

Despite automation capabilities, human judgment remains essential in AI red teaming. Subject matter experts are needed for specialized domains like medicine, cybersecurity, and CBRN (chemical, biological, radiological, nuclear) content. Cultural competence is critical as AI systems are deployed globally—harm definitions vary across political and cultural contexts, and most AI safety research has been conducted in Western, English-dominant contexts.

Emotional intelligence is perhaps the most uniquely human contribution: assessing how model responses might be interpreted in different contexts, whether outputs feel uncomfortable, and how systems respond to users in distress (depressive thoughts, self-harm ideation). The paper acknowledges that red teamers may be exposed to disturbing content and emphasizes the importance of mental health support and processes for disengagement.

Lesson 6: Responsible AI Challenges

RAI harms present unique measurement challenges compared to security vulnerabilities. Key issues include:

The paper distinguishes between adversarial actors who deliberately subvert guardrails and benign users who inadvertently trigger harmful content. The latter case may actually be worse because it represents failures the system should prevent without requiring attack techniques.

Lesson 7: Amplified and Novel Security Risks

LLMs both amplify existing security risks and introduce new ones. Traditional application security vulnerabilities (outdated dependencies, improper error handling, lack of input sanitization) remain critical. The paper describes discovering a token-length side channel in GPT-4 and Microsoft Copilot that allowed adversaries to reconstruct encrypted LLM responses—an attack that exploited transmission methods rather than the AI model itself.

AI-specific vulnerabilities include cross-prompt injection attacks (XPIA) in RAG architectures, where malicious instructions hidden in retrieved documents can alter model behavior or exfiltrate data. The paper notes that defenses require both system-level mitigations (input sanitization) and model-level improvements (instruction hierarchies), but emphasizes that fundamental limitations mean one must assume any LLM supplied with untrusted input will produce arbitrary output.

Lesson 8: Continuous Security Posture

The paper pushes back against the notion that AI safety is a solvable technical problem. Drawing parallels to cybersecurity economics, the goal is to increase the cost required to successfully attack a system beyond the value an attacker would gain. Theoretical and experimental research shows that for any output with non-zero probability of generation, a sufficiently long prompt exists that will elicit it. RLHF and other alignment techniques make jailbreaking more difficult but not impossible.

The paper advocates for break-fix cycles—multiple rounds of red teaming and mitigation—and purple teaming approaches that continually apply offensive and defensive strategies. The aspiration is that prompt injections become like the buffer overflows of the early 2000s: not eliminated entirely, but largely mitigated through defense-in-depth and secure-first design.

Case Studies

The paper includes five detailed case studies demonstrating the ontology in practice:

Operational Statistics

Since 2021, AIRT has conducted over 80 operations covering more than 100 products. The breakdown shows evolution from security-focused assessments in 2021 toward increasing RAI testing in subsequent years, while maintaining security assessment capabilities. Products tested include both standalone models and integrated systems (copilots, plugins, AI applications).

Open Questions

The paper identifies several areas for future development relevant to LLMOps practitioners:

This work represents one of the most comprehensive public accounts of production AI red teaming at scale, providing actionable frameworks for organizations seeking to secure their own generative AI deployments.

More Like This

Large-Scale AI Assistant Deployment with Safety-First Evaluation Approach

Discord 2023

Discord implemented Clyde AI, a chatbot assistant that was deployed to over 200 million users, focusing heavily on safety, security, and evaluation practices. The team developed a comprehensive evaluation framework using simple, deterministic tests and metrics, implemented through their open-source tool PromptFu. They faced unique challenges in preventing harmful content and jailbreaks, leading to innovative solutions in red teaming and risk assessment, while maintaining a balance between casual user interaction and safety constraints.

chatbot content_moderation high_stakes_application +18

Scaling Generative AI in Gaming: From Safety to Creation Tools

Roblox 2023

Roblox has implemented a comprehensive suite of generative AI features across their gaming platform, addressing challenges in content moderation, code assistance, and creative tools. Starting with safety features using transformer models for text and voice moderation, they expanded to developer tools including AI code assistance, material generation, and specialized texture creation. The company releases new AI features weekly, emphasizing rapid iteration and public testing, while maintaining a balance between automation and creator control. Their approach combines proprietary solutions with open-source contributions, demonstrating successful large-scale deployment of AI in a production gaming environment serving 70 million daily active users.

content_moderation code_generation speech_recognition +35

Large-Scale AI Red Teaming Competition Platform for Production Model Security

HackAPrompt, LearnPrompting 2025

Sandra Fulof from HackAPrompt and LearnPrompting presents a comprehensive case study on developing the first AI red teaming competition platform and educational resources for prompt engineering in production environments. The case study covers the creation of LearnPrompting, an open-source educational platform that trained millions of users worldwide on prompt engineering techniques, and HackAPrompt, which ran the first prompt injection competition collecting 600,000 prompts used by all major AI companies to benchmark and improve their models. The work demonstrates practical challenges in securing LLMs in production, including the development of systematic prompt engineering methodologies, automated evaluation systems, and the discovery that traditional security defenses are ineffective against prompt injection attacks.

chatbot question_answering content_moderation +21