Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou1,2, Zifan Wang2, Nicholas Carlini3, Milad Nasr3, J. Zico Kolter1,4, Matt Fredrikson1
1Carnegie Mellon University, 2Center for AI Safety, 3Google DeepMind, 4Bosch Center for AI
GitHub Repository
The official repository containing the code, demo notebooks, and reproducible experiments for the paper. It includes nanogcg, a fast implementation of the GCG (Greedy Coordinate Gradient) algorithm for running adversarial attacks.
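Below is a minimal sketch of how nanogcg might be invoked against a HuggingFace chat model. The model ID, prompt strings, and configuration values are purely illustrative, and the GCGConfig / nanogcg.run interface is assumed from the nanogcg package rather than taken from the paper itself, so details may differ across versions.

```python
# Minimal sketch: running a GCG attack with nanogcg (interface assumed from
# the nanogcg package; exact names and defaults may vary between versions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import nanogcg
from nanogcg import GCGConfig

# Any chat-tuned HuggingFace model can be targeted; this ID is illustrative.
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GCG appends an optimized suffix to `message` so the model's response begins
# with `target` (placeholder strings shown here for red-team evaluation).
message = "<request used for red-team evaluation>"
target = "Sure, here is"

config = GCGConfig(
    num_steps=250,     # number of gradient-guided coordinate updates
    search_width=64,   # candidate suffixes evaluated per step
    topk=64,           # top-k token substitutions considered per position
    seed=42,
)

result = nanogcg.run(model, tokenizer, message, target, config)
print(result.best_string)  # the optimized adversarial suffix
```

The configuration knobs mirror the search procedure described in the paper: at each step the token gradients propose top-k substitutions per suffix position, a batch of candidate suffixes is sampled from those substitutions, and the candidate with the lowest target loss is kept.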
Jailbroken: How Does LLM Safety Training Fail?
Content Warning: This paper contains examples of harmful language.
Alexander Wei1, Nika Haghtalab1, Jacob Steinhardt1
1UC Berkeley
Note: This page will be regularly updated with new papers and resources as they become available.