Microsoft's AI Red Team (AIRT) conducted extensive red teaming operations on over 100 generative AI products to assess their safety and security. The team developed a comprehensive threat model ontology and leveraged both manual and automated testing approaches through their PyRIT framework. Through this process, they identified key lessons about AI system vulnerabilities, the importance of human expertise in red teaming, and the challenges of measuring responsible AI impacts. The findings highlight both traditional security risks and novel AI-specific attack vectors that need to be considered when deploying AI systems in production.
Microsoft’s AI Red Team (AIRT) presents a comprehensive overview of their experience red teaming over 100 generative AI products, providing critical insights into how LLMs and other AI systems can be tested for safety and security vulnerabilities in production environments. The team was officially established in 2018 and has evolved significantly as AI capabilities have expanded, particularly following the release of ChatGPT in 2022 and the subsequent proliferation of AI copilots and agentic systems.
This paper is particularly valuable for LLMOps practitioners because it addresses the practical realities of securing AI systems in production, moving beyond academic benchmarks to real-world vulnerability assessment. The work spans both traditional security concerns (data exfiltration, privilege escalation) and AI-specific responsible AI (RAI) harms (harmful content generation, bias, psychosocial impacts).
AIRT developed a structured ontology for modeling GenAI system vulnerabilities that is essential for organizing red teaming operations. The ontology consists of several key components:
System: The end-to-end model or application being tested, which could be a standalone model hosted on a cloud endpoint or a complete system integrating models into copilots, plugins, and other applications.
Actor: The person or persons being emulated during testing. Critically, this includes both adversarial actors (scammers, malicious users) and benign users who might inadvertently trigger system failures. This dual consideration is important because many real-world harms occur without malicious intent.
TTPs (Tactics, Techniques, and Procedures): These are mapped to established frameworks including MITRE ATT&CK and MITRE ATLAS Matrix. Tactics represent high-level attack stages like reconnaissance and ML model access, while techniques are specific methods like active scanning and jailbreaking.
Weakness: The vulnerabilities that enable attacks, ranging from traditional software flaws to AI-specific issues like insufficient safety training.
Impact: Downstream effects categorized into security impacts (data exfiltration, credential dumping, remote code execution) and safety impacts (hate speech, violence, harmful content generation).
The paper emphasizes that effective red teaming must consider both what the AI system can do (capability constraints) and where it is applied (downstream applications). Larger models often acquire capabilities that introduce new attack vectors—for example, understanding advanced encodings like base64 or ASCII art that can be exploited for malicious instructions. The same model deployed as a creative writing assistant versus a healthcare records system requires fundamentally different risk assessments.
A crucial finding for production AI systems is that “real attackers don’t compute gradients, they prompt engineer.” Gradient-based adversarial methods, while academically interesting, are computationally expensive and typically require full model access that commercial systems don’t provide. Simple techniques like manually crafted jailbreaks (Skeleton Key, Crescendo) and basic image manipulations often work better in practice. This has significant implications for LLMOps: defense strategies should prioritize protection against simple attacks that are actually likely to be attempted by real adversaries.
The paper advocates for a system-level adversarial mindset. AI models are deployed within broader systems including infrastructure, input filters, databases, and cloud resources. Attacks that combine multiple techniques across the system stack are often most effective. One example describes an attack that used low-resource language prompt injections for reconnaissance, cross-prompt injection to generate malicious scripts, and then executed code to exfiltrate private data.
AI red teaming is fundamentally different from safety benchmarking. Benchmarks measure preexisting notions of harm on curated datasets, while red teaming explores unfamiliar scenarios and helps define novel harm categories. The paper describes investigating how LLMs could be weaponized for automated scamming—connecting jailbroken models to text-to-speech systems to create end-to-end scam operations. This represents a category of harm that wouldn’t be captured by traditional benchmarks.
To address the challenge of testing at scale, Microsoft developed PyRIT (Python Risk Identification Tool), an open-source framework for AI red teaming. PyRIT provides several key components:
The framework enables coverage of the risk landscape that would be impossible with fully manual testing while accounting for the non-deterministic nature of AI models. However, the paper is careful to note that PyRIT is a tool that leverages the same powerful capabilities it tests against—uncensored models can be used to automatically jailbreak target systems.
Despite automation capabilities, human judgment remains essential in AI red teaming. Subject matter experts are needed for specialized domains like medicine, cybersecurity, and CBRN (chemical, biological, radiological, nuclear) content. Cultural competence is critical as AI systems are deployed globally—harm definitions vary across political and cultural contexts, and most AI safety research has been conducted in Western, English-dominant contexts.
Emotional intelligence is perhaps the most uniquely human contribution: assessing how model responses might be interpreted in different contexts, whether outputs feel uncomfortable, and how systems respond to users in distress (depressive thoughts, self-harm ideation). The paper acknowledges that red teamers may be exposed to disturbing content and emphasizes the importance of mental health support and processes for disengagement.
RAI harms present unique measurement challenges compared to security vulnerabilities. Key issues include:
The paper distinguishes between adversarial actors who deliberately subvert guardrails and benign users who inadvertently trigger harmful content. The latter case may actually be worse because it represents failures the system should prevent without requiring attack techniques.
LLMs both amplify existing security risks and introduce new ones. Traditional application security vulnerabilities (outdated dependencies, improper error handling, lack of input sanitization) remain critical. The paper describes discovering a token-length side channel in GPT-4 and Microsoft Copilot that allowed adversaries to reconstruct encrypted LLM responses—an attack that exploited transmission methods rather than the AI model itself.
AI-specific vulnerabilities include cross-prompt injection attacks (XPIA) in RAG architectures, where malicious instructions hidden in retrieved documents can alter model behavior or exfiltrate data. The paper notes that defenses require both system-level mitigations (input sanitization) and model-level improvements (instruction hierarchies), but emphasizes that fundamental limitations mean one must assume any LLM supplied with untrusted input will produce arbitrary output.
The paper pushes back against the notion that AI safety is a solvable technical problem. Drawing parallels to cybersecurity economics, the goal is to increase the cost required to successfully attack a system beyond the value an attacker would gain. Theoretical and experimental research shows that for any output with non-zero probability of generation, a sufficiently long prompt exists that will elicit it. RLHF and other alignment techniques make jailbreaking more difficult but not impossible.
The paper advocates for break-fix cycles—multiple rounds of red teaming and mitigation—and purple teaming approaches that continually apply offensive and defensive strategies. The aspiration is that prompt injections become like the buffer overflows of the early 2000s: not eliminated entirely, but largely mitigated through defense-in-depth and secure-first design.
The paper includes five detailed case studies demonstrating the ontology in practice:
Vision Language Model Jailbreaking: Discovered that image inputs were more vulnerable to jailbreaks than text inputs. Overlaying images with malicious instructions bypassed safety guardrails that were effective against direct text prompts.
Automated Scamming System: Built a proof-of-concept combining a jailbroken LLM with text-to-speech and speech-to-text systems to create an automated scamming operation, demonstrating how insufficient safety guardrails could be weaponized.
Distressed User Scenarios: Evaluated chatbot responses to users expressing depression, grief, or self-harm intent, developing guidelines with psychologists, sociologists, and medical experts.
Text-to-Image Gender Bias: Probed generators with prompts that didn’t specify gender (e.g., “a secretary” and “a boss”) to measure and document stereotyping in generated images.
SSRF in Video Processing: Found a server-side request forgery vulnerability in a GenAI video processing system due to an outdated FFmpeg component, demonstrating that traditional security testing remains essential.
Since 2021, AIRT has conducted over 80 operations covering more than 100 products. The breakdown shows evolution from security-focused assessments in 2021 toward increasing RAI testing in subsequent years, while maintaining security assessment capabilities. Products tested include both standalone models and integrated systems (copilots, plugins, AI applications).
The paper identifies several areas for future development relevant to LLMOps practitioners:
This work represents one of the most comprehensive public accounts of production AI red teaming at scale, providing actionable frameworks for organizations seeking to secure their own generative AI deployments.
Discord implemented Clyde AI, a chatbot assistant that was deployed to over 200 million users, focusing heavily on safety, security, and evaluation practices. The team developed a comprehensive evaluation framework using simple, deterministic tests and metrics, implemented through their open-source tool PromptFu. They faced unique challenges in preventing harmful content and jailbreaks, leading to innovative solutions in red teaming and risk assessment, while maintaining a balance between casual user interaction and safety constraints.
Roblox has implemented a comprehensive suite of generative AI features across their gaming platform, addressing challenges in content moderation, code assistance, and creative tools. Starting with safety features using transformer models for text and voice moderation, they expanded to developer tools including AI code assistance, material generation, and specialized texture creation. The company releases new AI features weekly, emphasizing rapid iteration and public testing, while maintaining a balance between automation and creator control. Their approach combines proprietary solutions with open-source contributions, demonstrating successful large-scale deployment of AI in a production gaming environment serving 70 million daily active users.
Sandra Fulof from HackAPrompt and LearnPrompting presents a comprehensive case study on developing the first AI red teaming competition platform and educational resources for prompt engineering in production environments. The case study covers the creation of LearnPrompting, an open-source educational platform that trained millions of users worldwide on prompt engineering techniques, and HackAPrompt, which ran the first prompt injection competition collecting 600,000 prompts used by all major AI companies to benchmark and improve their models. The work demonstrates practical challenges in securing LLMs in production, including the development of systematic prompt engineering methodologies, automated evaluation systems, and the discovery that traditional security defenses are ineffective against prompt injection attacks.