ZenML

Large-Scale AI Red Teaming Competition Platform for Production Model Security

HackAPrompt, LearnPrompting 2025

Sander Schulhoff of HackAPrompt and LearnPrompting presents a case study on building the first AI red teaming competition platform and educational resources for prompt engineering in production environments. It covers LearnPrompting, an open-source educational platform that has trained millions of users worldwide in prompt engineering techniques, and HackAPrompt, which ran the first prompt injection competition and collected 600,000 prompts that all major AI companies use to benchmark and improve their models. The work highlights practical challenges in securing LLMs in production: developing systematic prompt engineering methodologies, building automated evaluation systems, and the finding that traditional security defenses are ineffective against prompt injection attacks.

Industry

Tech

This case study surveys production-scale AI security and prompt engineering challenges through Sander Schulhoff’s work at HackAPrompt and LearnPrompting. It shows how educational platforms and competitive red teaming can address core LLMOps challenges in model security, evaluation, and deployment.

Background and Problem Context

The case study starts from the fundamental challenge of deploying large language models safely in production. Sander Schulhoff, CEO of both HackAPrompt and LearnPrompting, identified early gaps in educational resources for prompt engineering and in systematic approaches to AI security testing. The work began with the first comprehensive guide on prompt engineering, which grew from a college English project into a resource used by millions worldwide and cited by major organizations including OpenAI, Google, BCG, and the US government.

LearnPrompting: Educational Platform for Production Prompt Engineering

LearnPrompting is an LLMOps educational initiative that addressed the lack of systematic knowledge about prompt engineering in production environments. The platform has trained millions of users globally on prompt engineering techniques and is the only external resource cited in Google’s official prompt engineering documentation. It also collaborated with OpenAI to develop courses on ChatGPT and prompt engineering, which underlines the need for structured approaches to prompt optimization in production systems.

The educational content covered advanced prompt engineering techniques essential for production deployments, including systematic taxonomies of prompting methods, evaluation frameworks, and best practices for different model architectures. The platform’s success indicates the widespread need for standardized prompt engineering knowledge in production LLM deployments.

The Prompt Report: Systematic Literature Review and Benchmarking

A central component of the case study is “The Prompt Report,” described as the largest systematic literature review on prompting techniques. A team of 30 researchers from major labs and universities spent 9 to 12 months analyzing approximately 200 prompting and agentic techniques, including 58 text-based, English-only prompting techniques.

The report established taxonomies for prompt components (roles, examples, formats) and conducted both manual and automated benchmarks comparing different techniques. The systematic approach included:

HackAPrompt: First AI Red Teaming Competition Platform

The HackAPrompt platform represents a pioneering approach to systematic AI security testing in production environments. As the first prompt injection competition, it collected 600,000 prompts that became the standard dataset used by every major AI company for benchmarking and improving their models. The competition data is cited extensively by OpenAI, Google, and other major labs in their security research.

Key Technical Findings for Production Deployment

The competition revealed several critical insights for LLMOps practitioners:

Production Security Implications

The case study argues that AI security differs fundamentally from traditional cybersecurity. A classical vulnerability can be patched with a binary outcome (the system is either protected or not), whereas AI defenses are probabilistic and offer no 100% guarantee. This idea, referred to as the “jailbreak persistence hypothesis,” implies that production AI systems cannot be fully secured through traditional patching approaches.
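
The arithmetic behind a probabilistic defense can be illustrated with a minimal sketch. It assumes, purely for illustration, that every attack attempt is independent and is blocked with the same probability; real attackers adapt between attempts, so this is not a model taken from the case study. The function name `breach_probability` and the 99% block rate are hypothetical.

```python
def breach_probability(block_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries gets through.

    Assumption (not from the case study): each attempt is blocked
    independently with the same probability `block_rate`.
    """
    if not 0.0 <= block_rate <= 1.0:
        raise ValueError("block_rate must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # P(at least one success) = 1 - P(every attempt is blocked)
    return 1.0 - block_rate ** attempts


if __name__ == "__main__":
    # Hypothetical defense that blocks 99% of attempts.
    for n in (1, 10, 100, 1000):
        print(f"{n:>5} attempts -> {breach_probability(0.99, n):.3f}")
```

Under these assumptions, a defense that blocks most individual attempts still gives a persistent attacker a high chance of eventually getting through, which is the practical difference from a patched, binary vulnerability.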

The non-deterministic nature of LLMs compounds security challenges, as the same prompt can produce different outputs across runs, making consistent security evaluation difficult. This affects both attack success measurement and defense validation in production environments.
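
One common way to handle this non-determinism is to run the same prompt many times and report a success rate with an uncertainty interval instead of a single pass/fail result. The sketch below assumes a hypothetical callable `run_attack` that sends the prompt once and returns whether the attack succeeded; here it is replaced by a random stand-in, and the Wilson score interval is a choice made for this example, not a method described in the case study.

```python
import math
import random
from typing import Callable, Tuple


def attack_success_rate(
    run_attack: Callable[[], bool], trials: int, z: float = 1.96
) -> Tuple[float, float, float]:
    """Repeat one attack `trials` times; return (rate, low, high).

    `low` and `high` form a Wilson score interval (z=1.96 is roughly 95%).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    successes = sum(1 for _ in range(trials) if run_attack())
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)


# Stand-in for a real model call (assumption): succeeds about 30% of the time.
_rng = random.Random(0)


def simulated_attack() -> bool:
    return _rng.random() < 0.3


if __name__ == "__main__":
    rate, low, high = attack_success_rate(simulated_attack, trials=200)
    print(f"success rate {rate:.2f} (interval {low:.2f} to {high:.2f})")
```

The same routine can be pointed at a defended and an undefended system to check whether a defense changes the success rate by more than the run-to-run noise.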

Practical Production Challenges and Solutions

Manual vs. Automated Prompt Engineering

The case study includes extensive experimentation comparing human prompt engineers against automated systems. In a mental health classification task (detecting suicidal ideation markers), manual prompt engineering over 20 hours reached a certain level of performance, but automated prompt engineering using DSPy (a prompt optimization library) significantly outperformed the human effort. Automated optimization followed by human refinement achieved the best results, suggesting that LLMOps workflows should integrate both approaches.
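
The core loop of automated prompt selection can be sketched in a few lines. This is not DSPy and not the method used in the study: it is a plain exhaustive search over hand-written candidate prompts, scored by F1 on a labeled development set. The `classify` callable, the `fake_classify` stand-in, and the toy data (deliberately unrelated to the clinical task) are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, bool]  # (text, gold label)


def f1_score(preds: Sequence[bool], golds: Sequence[bool]) -> float:
    """F1 for the positive class; 0.0 when there are no true positives."""
    tp = sum(1 for p, g in zip(preds, golds) if p and g)
    fp = sum(1 for p, g in zip(preds, golds) if p and not g)
    fn = sum(1 for p, g in zip(preds, golds) if not p and g)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_prompt(
    candidates: Sequence[str],
    dev_set: Sequence[Example],
    classify: Callable[[str, str], bool],
) -> Tuple[float, str]:
    """Score every candidate prompt on the dev set; return (best F1, prompt)."""
    golds = [gold for _, gold in dev_set]
    scored: List[Tuple[float, str]] = []
    for prompt in candidates:
        preds = [classify(prompt, text) for text, _ in dev_set]
        scored.append((f1_score(preds, golds), prompt))
    return max(scored, key=lambda item: item[0])


def fake_classify(prompt: str, text: str) -> bool:
    # Stand-in for an LLM call (assumption): flags the text when it
    # contains the last word of the prompt.
    return prompt.split()[-1] in text.lower()


if __name__ == "__main__":
    dev = [("refund please", True), ("hello there", False), ("need a refund", True)]
    prompts = ["Flag messages mentioning hello", "Flag messages mentioning refund"]
    print(best_prompt(prompts, dev, fake_classify))
```

A human can then refine the winning prompt and re-score it with the same function, which mirrors the combined workflow the case study found most effective.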

Model-Specific Optimization Challenges

The case demonstrates that prompts optimized for one model often don’t transfer effectively to others. In red teaming experiments, only 40% of prompts that successfully attacked GPT-3.5 also worked against GPT-4, highlighting the need for model-specific optimization in production deployments. This creates significant challenges for organizations using multiple models or planning model upgrades.
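
A transfer rate of this kind can be computed directly from per-model success records. The sketch below assumes each successful prompt is tracked by an identifier; the function name `transfer_rate` and the example IDs are hypothetical and are not HackAPrompt data.

```python
from typing import Set


def transfer_rate(source_success: Set[str], target_success: Set[str]) -> float:
    """Fraction of prompts that worked on the source model and also on the target."""
    if not source_success:
        raise ValueError("no successful prompts on the source model")
    return len(source_success & target_success) / len(source_success)


if __name__ == "__main__":
    # Hypothetical prompt IDs: five succeeded on model A, two of those on model B.
    worked_on_a = {"p1", "p2", "p3", "p4", "p5"}
    worked_on_b = {"p1", "p2", "p9"}
    print(transfer_rate(worked_on_a, worked_on_b))
```

Tracking this number before a model upgrade gives a concrete signal of how much of an existing prompt or attack suite needs to be re-validated.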

Evaluation and Benchmarking Considerations

The case study reveals important limitations in current AI evaluation practices. Benchmark results depend heavily on prompting methodology, with factors like few-shot example ordering, label distribution, and output format significantly affecting performance. The case argues for more standardized evaluation protocols and warns against over-relying on benchmark comparisons without understanding the underlying prompt engineering choices.
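
Sensitivity to few-shot ordering can be checked by scoring the same examples in every order. The sketch below is an illustration under stated assumptions: `build_prompt` uses a made-up prompt format, and `recency_biased_model` is a toy stand-in that simply repeats the label of the last example, an exaggerated version of the recency effect. Neither comes from the case study.

```python
from itertools import permutations
from typing import Callable, Dict, Sequence, Tuple

Shot = Tuple[str, str]  # (input text, label)


def build_prompt(shots: Sequence[Shot], query: str) -> str:
    """Format few-shot examples followed by the unlabeled query."""
    blocks = [f"Input: {text}\nLabel: {label}" for text, label in shots]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


def accuracy_by_order(
    shots: Sequence[Shot],
    test_set: Sequence[Shot],
    model: Callable[[str], str],
) -> Dict[Tuple[int, ...], float]:
    """Accuracy on `test_set` for every ordering of the few-shot examples."""
    results: Dict[Tuple[int, ...], float] = {}
    for order in permutations(range(len(shots))):
        ordered = [shots[i] for i in order]
        correct = sum(
            1 for query, gold in test_set if model(build_prompt(ordered, query)) == gold
        )
        results[order] = correct / len(test_set)
    return results


def recency_biased_model(prompt: str) -> str:
    # Toy stand-in for an LLM (assumption): echoes the last few-shot label.
    labels = [
        line.split(": ", 1)[1]
        for line in prompt.splitlines()
        if line.startswith("Label: ")
    ]
    return labels[-1]


if __name__ == "__main__":
    shots = [("great", "pos"), ("awful", "neg")]
    tests = [("fine", "pos"), ("nice", "pos")]
    for order, acc in accuracy_by_order(shots, tests, recency_biased_model).items():
        print(order, acc)
```

The number of orderings grows factorially with the number of examples, so with more than a handful of shots it is more practical to sample random orderings and report the spread of scores alongside the benchmark result.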

Agent Security and Future Production Challenges

The case study concludes with analysis of agentic systems, arguing that current LLM security limitations make powerful production agents impractical. For agents to operate effectively in real-world environments (web browsing, physical robots, tool use), they must be robust against adversarial inputs they might encounter. The case provides examples of how simple prompt injection attacks could compromise agent behavior in production scenarios.

The speaker suggests that adversarial robustness must be solved before truly powerful agents can be safely deployed at scale. This represents a fundamental LLMOps challenge, as agent failures could cause direct financial harm to organizations rather than just reputational damage.

Data-Driven Security Improvement

The case study demonstrates how systematic data collection through red teaming competitions can improve production model security. The HackAPrompt dataset enables AI companies to understand attack patterns and develop more robust defenses. The approach suggests that continuous red teaming and data collection should be integral parts of LLMOps workflows.

Recent statements from OpenAI leadership suggest they believe they can achieve 95-99% mitigation of prompt injection through better data, supporting the case study’s emphasis on systematic attack data collection for improving production security.

Implications for LLMOps Practice

This case study provides several key insights for LLMOps practitioners:

The case study is a foundational contribution to understanding AI security in production environments, providing both theoretical frameworks and practical tools that have become industry standards for LLMOps practitioners.
