Research Papers

Prompt Injection Detection and Prevention

Universal and Transferable Adversarial Attacks on Aligned Language Models

Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou1,2, Zifan Wang2, Nicholas Carlini3, Milad Nasr3, J. Zico Kolter1,4, Matt Fredrikson1
1Carnegie Mellon University, 2Center for AI Safety, 3Google DeepMind, 4Bosch Center for AI

GitHub Repository
The official repository containing code implementation, demo notebooks, and reproducible experiments for the paper. Includes a fast implementation called nanogcg of the GCG (Gradient-based Controlled Generation) algorithm for testing adversarial attacks.

Jailbroken: How Does LLM Safety Training Fail?

Jailbroken: How Does LLM Safety Training Fail?
Content Warning: This paper contains examples of harmful language.
Alexander Wei1, Nika Haghtalab1, Jacob Steinhardt1
1UC Berkeley

Note: This page will be regularly updated with new papers and resources as they become available.