sae_jailbreak_unlearning

Investigating how well intervening on Sparse Autoencoder internals prevents adversaries from accessing dangerous knowledge.

This is the code behind the paper "Don't Forget It! Conditional Sparse Autoencoder Clamping Works for Unlearning" by Matthew Khoriaty, Andrii Shportko, Gustavo Mercier, and Zach Wood-Doughty of Northwestern University.
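For readers new to the idea named in the paper title, the following is a minimal sketch of what conditionally clamping a sparse autoencoder (SAE) feature could look like. It is an illustration under stated assumptions, not this repository's implementation: the toy identity SAE weights, the function names `encode`, `decode`, and `conditional_clamp`, and the `threshold` and `clamp_value` parameters are all hypothetical. The assumed reading of "conditional" is that a targeted feature is overwritten only when it fires above a threshold, and is left untouched otherwise.

```python
from typing import List, Sequence, Set


def encode(x: Sequence[float],
           w_enc: Sequence[Sequence[float]],
           b_enc: Sequence[float]) -> List[float]:
    """Toy SAE encoder: f = ReLU(W_enc @ x + b_enc), one row of w_enc per feature."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def decode(f: Sequence[float],
           w_dec: Sequence[Sequence[float]],
           b_dec: Sequence[float]) -> List[float]:
    """Toy SAE decoder: x_hat = sum_i f[i] * w_dec[i] + b_dec, one row of w_dec per feature."""
    out = list(b_dec)
    for activation, row in zip(f, w_dec):
        for j, w in enumerate(row):
            out[j] += activation * w
    return out


def conditional_clamp(f: Sequence[float],
                      target_features: Set[int],
                      threshold: float,
                      clamp_value: float) -> List[float]:
    """Overwrite a targeted feature with clamp_value only if it exceeds threshold.

    Features outside target_features, and targeted features that stay at or
    below the threshold, pass through unchanged (hypothetical rule for this sketch).
    """
    return [
        clamp_value if (i in target_features and a > threshold) else a
        for i, a in enumerate(f)
    ]


# Hypothetical 2-feature SAE with identity weights, so features equal inputs.
W_ENC = [[1.0, 0.0], [0.0, 1.0]]
B_ENC = [0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0]]
B_DEC = [0.0, 0.0]

# Feature 1 stands in for a feature tied to the knowledge being unlearned.
features = encode([0.5, 3.0], W_ENC, B_ENC)
clamped = conditional_clamp(features, target_features={1},
                            threshold=1.0, clamp_value=-2.0)
steered = decode(clamped, W_DEC, B_DEC)
```

In this sketch the intervention is a no-op on inputs where the targeted feature is inactive or weak, which is the motivation for making the clamp conditional rather than applying it to every input.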

The folder structure is based on the one described on this website.

data
models
results
src
steer_dfs_final_7
.DS_Store
.gitattributes
.gitignore
.gitmodules
=4.2.0
README.md
Slivka Abstract.docx
Slivka Abstract.pdf
experiments.csv
req.txt
setup.py
tests.ipynb
