Adversarial datasets to red team Generative AI models
Learn about different datasets that can be used against generative AI models to test their safety and security.
One of the primary concerns when building a Generative AI model is preventing users from turning it to malicious purposes. Such misuse may include:
leveraging the model to write malicious software, such as ransomware.
poisoning the model to generate obscene or racist responses.
tricking the model into revealing dangerous information, such as steps to create a bomb or synthesize a drug.
coercing the model into generating fabricated responses that appear legitimate (i.e. hallucinations).
To counter this, companies leverage pre-built datasets to test and refine their models. A red team can also leverage these datasets to identify vulnerabilities such as mishandling of toxic language, bypassable safety measures and so on.
In this post, I discuss four such datasets. All of them have been used in real-world Generative AI red team exercises.
Wikipedia Toxic Comments (WTC)
The Wikipedia Toxic Comments (WTC) dataset originates from real-world discussions on Wikipedia talk pages and is designed to identify and mitigate harmful language online. This dataset comprises thousands of user comments annotated for various types of toxicity, such as insults, severe toxicity, profanity, threats, and identity-based hate. Its detailed annotations make it an invaluable resource for training systems that can detect and filter out toxic content.
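As an illustration, a red teamer might mine the flagged comments and replay them as probe prompts against a target model. The sketch below assumes a local copy of the Jigsaw/Kaggle release of WTC; the column names, and the generate and toxicity_score callables, are assumptions standing in for your own model call and scoring function.

```python
import pandas as pd

# Local copy of the WTC/Jigsaw training CSV. Column names ("comment_text",
# "toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate")
# are assumptions based on the public Kaggle release.
df = pd.read_csv("train.csv")

label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Keep only comments flagged for at least one toxicity category; these become probe inputs.
toxic_comments = df[df[label_cols].any(axis=1)]["comment_text"].tolist()

def probe_model(generate, prompts, toxicity_score, threshold=0.5):
    """Send each toxic comment to the target model and record unsafe completions.

    `generate` (prompt -> response) and `toxicity_score` (text -> float) are
    placeholders for your own model call and toxicity scorer.
    """
    failures = []
    for prompt in prompts:
        response = generate(prompt)
        if toxicity_score(response) >= threshold:  # threshold is arbitrary
            failures.append((prompt, response))
    return failures
```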
Build-it Break-it Fix-it (BBF)
The Build-it Break-it Fix-it process works in three stages: developers first build models with safety or robustness guarantees; adversaries then attempt to "break" these models by crafting inputs that expose weaknesses; finally, developers fix the vulnerabilities revealed by the observed failures. The BBF dataset collects the adversarial examples produced during the break-it phase, offering structured insight into the systematic failure modes of various systems. This iterative cycle encourages continuous improvement by revealing the specific conditions under which a model falters.
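A minimal sketch of that cycle is shown below, with the build, break and fix components left as placeholder callables; none of these names come from the original BBF release.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]          # (text, is_unsafe label)
Model = Callable[[str], int]       # text -> predicted label

def build_break_fix(
    train: Callable[[List[Example]], Model],
    seed_data: List[Example],
    collect_breaks: Callable[[Model], List[Example]],
    rounds: int = 3,
) -> Model:
    """Sketch of the build-it / break-it / fix-it loop.

    - `train` builds a safety model from labelled examples (build-it).
    - `collect_breaks` stands in for the break-it phase: adversaries (or the
      BBF adversarial examples themselves) supply inputs the current model mishandles.
    - Each round, the broken examples are folded back into training (fix-it).
    All callables are placeholders for your own components.
    """
    data = list(seed_data)
    model = train(data)
    for _ in range(rounds):
        broken = collect_breaks(model)   # break-it: examples that fool the current model
        if not broken:
            break
        data.extend(broken)              # fix-it: retrain on the failures
        model = train(data)
    return model
```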
Bot-Adversarial Dialogue (BAD)
The Bot-Adversarial Dialogue (BAD) dataset focuses on conversational vulnerabilities by capturing interactions where human adversaries engage with chatbots using carefully crafted prompts. These adversarial dialogues are designed to induce unsafe, biased, or otherwise undesirable responses from the system. By documenting both the provocative inputs and the chatbot's outputs, BAD provides a detailed overview of how dialogue systems can be manipulated in real-world conversational scenarios, highlighting areas where safety measures may need reinforcement.
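One way a red team might reuse BAD is to replay the human adversarial turns against its own chatbot and flag unsafe replies. The sketch below assumes the conversations have been exported to a local JSONL file with a "dialogue" key holding the human utterances; that file format, and the chat and safety_check callables, are assumptions (the official release is distributed through the ParlAI framework).

```python
import json

def replay_adversarial_dialogues(path, chat, safety_check):
    """Replay human adversarial turns from BAD against your own chatbot.

    - `path`: local JSONL export, one conversation per line with a "dialogue"
      list of human utterances (format is an assumption, not the official layout).
    - `chat`: callable taking the conversation history and returning the next reply.
    - `safety_check`: callable returning True when a reply is considered unsafe.
    """
    unsafe_transcripts = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            history = []
            for human_turn in record["dialogue"]:
                history.append(human_turn)
                reply = chat(history)
                if safety_check(reply):
                    unsafe_transcripts.append((list(history), reply))
                history.append(reply)
    return unsafe_transcripts
```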
Anthropic Red-Team Attempts (ANT-Red)
The Anthropic Red-Team Attempts (ANT-Red) dataset is a collection of adversarial inputs generated by expert red teamers with an in-depth understanding of model architectures. These prompts are meticulously designed to probe the subtle and known vulnerabilities of generative models. Enhanced with rich metadata detailing the strategy behind each attack, ANT-Red enables a systematic evaluation of a model’s responses, thereby guiding targeted improvements in its safety protocols and robustness measures.
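For instance, the published red-team transcripts can be mined for opening attack prompts to replay against your own model. The sketch below assumes the data is hosted on Hugging Face alongside Anthropic's HH-RLHF release under a red-team-attempts directory, with a "transcript" field per row; both the repository path and the field name are assumptions and may need adjusting.

```python
from datasets import load_dataset

# Repository id, data_dir and field names are assumptions based on the public
# release of Anthropic's red-team data; adjust them to match the actual dataset.
ds = load_dataset("Anthropic/hh-rlhf", data_dir="red-team-attempts", split="train")

def extract_attack_prompts(dataset, limit=100):
    """Pull the first human turn out of each red-team transcript to reuse as a probe."""
    prompts = []
    for row in dataset.select(range(min(limit, len(dataset)))):
        transcript = row["transcript"]                 # full multi-turn attack (assumed field)
        first_turn = transcript.split("Assistant:")[0] # text before the model's first reply
        prompts.append(first_turn.replace("Human:", "").strip())
    return prompts

attack_prompts = extract_attack_prompts(ds)
```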
Each of these datasets can be used to uncover specific vulnerabilities within generative AI models. For example:
The Wikipedia Toxic Comments dataset helps identify issues related to the handling of toxic and hateful language, ensuring that models do not inadvertently propagate harmful narratives.
The Build-it Break-it Fix-it dataset exposes systematic weaknesses by challenging the model with adversarial examples that mimic potential real-world attacks.
The Bot-Adversarial Dialogue dataset exposes vulnerabilities in conversational contexts, revealing how dialogue systems might be coaxed into producing biased or unsafe outputs.
The Anthropic Red-Team Attempts dataset, with its expert-crafted prompts, digs deeper into subtle model weaknesses that standard testing might overlook.
Red Team Notes
- A red team can use datasets such as Wikipedia Toxic Comments, Build-it Break-it Fix-it, Bot-Adversarial Dialogue and Anthropic Red-Team Attempts to test a Generative AI model and uncover vulnerabilities such as improper handling of toxic language, coercion of the model into producing biased or unsafe responses and so on.
Follow my journey of 100 Days of Red Team on WhatsApp, Telegram or Discord.
References