Red Teaming Language Models with Language Models

Key observations from the research paper Red Teaming Language Models with Language Models published by Google DeepMind and NYU.

Feb 21, 2025

Imagine you're testing a new security system for a high-tech building. Traditionally, you'd hire human testers to try breaking in by picking locks, bypassing alarms, or sneaking past security guards. But no matter how skilled these testers are, they can only try so many tricks before running out of ideas.

Now, instead of relying only on human testers, what if you built a robot that could think like an attacker? This robot could automatically perform thousands of break-in attempts, testing every possible weakness in the system much faster than a human ever could. It could try picking locks, scaling walls, or even tricking the guards with fake IDs—all at a massive scale.

Follow my journey of 100 Days of Red Team on WhatsApp, Telegram or Discord.

This is exactly what the researchers in this research did. The research paper can be accessed here.

Instead of relying on human testers to manually come up with tricky questions, the researchers used another AI model to generate these test cases. This automated approach helped them uncover thousands of potential failures in large-scale language models, such as offensive responses, privacy violations, and bias. By using different strategies—like prompting AI in specific ways or using reinforcement learning—the team was able to generate a wide range of challenging test inputs to stress-test the model.

The research team applied this method to an AI model called 280B parameter Dialogue-Prompted Gopher and found some surprising results:

The AI-generated test cases were often more diverse and insightful than those created by human testers.
The model sometimes shared personal information, like phone numbers and email addresses, when prompted in the right way.
The team also found that certain groups of people were more likely to be targeted with offensive language, revealing underlying biases in the AI’s responses.
The research team also noticed that conversations with the model could escalate in a harmful direction over time. The team discovered patterns where the AI would become more offensive or misleading the longer the exchange went on.
Leveraging a Language Model to red team a Language Model proved to be a better approach as it can test thousands of different scenarios much faster, covering a wider range of potential failures.

While this research was specific to red teaming Language Models in an automated manner, it highlights the overall trend of leveraging LLMs for various red team use cases.

Red Team Notes
- The paper "Red Teaming Language Models with Language Models" explored using AI to red team AI by leveraging one language model to generate adversarial test cases for another. 
- You can read this paper here.

Follow my journey of 100 Days of Red Team on WhatsApp, Telegram or Discord.

References