Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou1,2, Zifan Wang2, Nicholas Carlini3, Milad Nasr3, J. Zico Kolter1,4, Matt Fredrikson1
1Carnegie Mellon University, 2Center for AI Safety, 3Google DeepMind, 4Bosch Center for AI
GitHub Repository
The official repository containing the code, demo notebooks, and reproducible experiments for the paper. It includes nanogcg, a fast implementation of the GCG (Greedy Coordinate Gradient) algorithm for running adversarial attacks.
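Below is a minimal sketch of how nanogcg might be invoked against a HuggingFace chat model. The model ID, prompt strings, and configuration values are purely illustrative, and the GCGConfig / nanogcg.run interface is assumed from the nanogcg package rather than taken from the paper itself, so details may differ across versions.

```python
# Minimal sketch: running a GCG attack with nanogcg (interface assumed from
# the nanogcg package; exact names and defaults may vary between versions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import nanogcg
from nanogcg import GCGConfig

# Any chat-tuned HuggingFace model can be targeted; this ID is illustrative.
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GCG appends an optimized suffix to `message` so the model's response begins
# with `target` (placeholder strings shown here for red-team evaluation).
message = "<request used for red-team evaluation>"
target = "Sure, here is"

config = GCGConfig(
    num_steps=250,     # number of gradient-guided coordinate updates
    search_width=64,   # candidate suffixes evaluated per step
    topk=64,           # top-k token substitutions considered per position
    seed=42,
)

result = nanogcg.run(model, tokenizer, message, target, config)
print(result.best_string)  # the optimized adversarial suffix
```

The configuration knobs mirror the search procedure described in the paper: at each step the token gradients propose top-k substitutions per suffix position, a batch of candidate suffixes is sampled from those substitutions, and the candidate with the lowest target loss is kept.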
Jailbroken: How Does LLM Safety Training Fail?
Content Warning: This paper contains examples of harmful language.
Alexander Wei1, Nika Haghtalab1, Jacob Steinhardt1
1UC Berkeley
Note: This page will be regularly updated with new papers and resources as they become available.