UNHSAILLab / LM-exp-logit-lens Public

forked from nrimsky/LM-exp


LLM experiments done during SERI MATS, focusing on activation steering and interpreting activation spaces


Experiments done during SERI MATS (Summer 2023)

Relation to research writeups

`/refusal`

Activation steering with a "refusal vector" that causes the llama-2-chat model to stop refusing to answer harmful questions.

Writeup: Red-teaming language models via activation engineering
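As a rough illustration of the idea, the sketch below builds a steering vector as a difference of mean activations and then subtracts it from a new activation. This is a minimal sketch, not the notebooks' code: the difference-of-means construction, the function names (`mean_vector`, `refusal_vector`, `steer`) and the toy 3-dimensional activations are all assumptions made for illustration.

```python
def mean_vector(rows):
    """Element-wise mean of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_vector(refusing_acts, complying_acts):
    """Difference of means, pointing from 'complying' toward 'refusing'.

    Assumption: the vector is a mean difference over contrastive examples.
    """
    mu_refuse = mean_vector(refusing_acts)
    mu_comply = mean_vector(complying_acts)
    return [r - c for r, c in zip(mu_refuse, mu_comply)]


def steer(activation, vector, multiplier):
    """Add multiplier * vector to a single activation vector."""
    return [a + multiplier * v for a, v in zip(activation, vector)]


# Toy activations, made up for illustration only.
refusing = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
complying = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

vec = refusal_vector(refusing, complying)

# A negative multiplier pushes the activation away from the refusal direction.
steered = steer([1.0, 1.0, 1.0], vec, -1.0)
print(vec, steered)
```

In a real model the activations would come from one layer of the residual stream, and the multiplier and layer would be tuned empirically.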

`/sycophancy`

Activation steering to modulate sycophancy in the llama-2-chat and llama-2 base models.
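"Modulate" here means steering in both directions. One way to picture this is a sweep over signed multipliers, checking how far the steered activation moves along the steering direction. The sketch below is illustrative only: the 2-dimensional `direction`, the `projection` readout and the multiplier values are assumptions, not values from the experiments.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def projection(activation, direction):
    """Scalar projection of an activation onto the steering direction."""
    return dot(activation, direction) / math.sqrt(dot(direction, direction))


def sweep(activation, direction, multipliers):
    """Steer with each multiplier and record the resulting projection."""
    results = {}
    for m in multipliers:
        steered = [a + m * d for a, d in zip(activation, direction)]
        results[m] = projection(steered, direction)
    return results


# Hypothetical steering direction and activation (toy values).
direction = [3.0, 4.0]
activation = [1.0, 1.0]

scores = sweep(activation, direction, [-1.0, 0.0, 1.0])
for multiplier, score in scores.items():
    print(multiplier, round(score, 3))
```

Positive multipliers increase the projection and negative ones decrease it, which is the sense in which a single vector can both amplify and suppress a behavior.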

`/steering`

Activation addition experiments (pure activation additions, each taken from a single forward pass).
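A toy sketch of an activation addition follows. It assumes the added vector is the difference between the activations of two contrasting inputs, each captured in one forward pass, and that it is added back at the same layer during a later pass. The two-layer "model" is a stand-in made of plain functions, and `forward`, `capture_at` and `add_at` are hypothetical names.

```python
def forward(x, layers, add_at=None, addition=None, capture_at=None):
    """Run x through the layers in order.

    capture_at: index of the layer whose output is copied and returned.
    add_at: index of the layer after which `addition` is added.
    """
    captured = None
    for i, layer in enumerate(layers):
        x = layer(x)
        if i == capture_at:
            captured = list(x)
        if i == add_at:
            x = [a + b for a, b in zip(x, addition)]
    return x, captured


# Stand-in "model": two simple element-wise layers.
layers = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a + 1.0 for a in v],
]

# One forward pass per contrasting input, capturing layer-0 activations.
_, act_pos = forward([1.0, 0.0], layers, capture_at=0)
_, act_neg = forward([0.0, 1.0], layers, capture_at=0)
act_add = [p - n for p, n in zip(act_pos, act_neg)]

# Compare an unmodified pass with a pass that adds the vector at layer 0.
baseline, _ = forward([1.0, 1.0], layers)
steered, _ = forward([1.0, 1.0], layers, add_at=0, addition=act_add)
print(baseline, steered)
```

With a real transformer the same pattern is usually implemented with forward hooks on a chosen layer rather than a hand-written loop.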

`/intermediate_decoding`

Logit-lens experiments (directly decoding intermediate activations by passing them through the unembedding layer).

Writeup: Decoding intermediate activations in llama-2-7b
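The logit-lens idea can be sketched with a toy unembedding matrix: multiply an intermediate hidden state by the unembedding weights, apply a softmax, and read off the most likely tokens at that layer. Everything below (the three-token `vocab`, the 2-dimensional hidden states, the `logit_lens` helper) is made up for illustration. Real models typically also apply a final normalization before the unembedding, which this sketch omits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden, unembed, vocab, top_k=2):
    """Decode one hidden state into its top_k (token, probability) pairs.

    unembed holds one row of weights per vocabulary token.
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Toy vocabulary and unembedding matrix (assumptions for illustration).
vocab = ["cat", "dog", "tree"]
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

# Hypothetical hidden states at two layers for the same token position.
layer_states = {0: [0.1, 0.2], 1: [2.0, 0.5]}

for layer, hidden in layer_states.items():
    top_tokens = [tok for tok, _ in logit_lens(hidden, unembed, vocab)]
    print(layer, top_tokens)
```

Comparing the decoded tokens across layers shows how the model's prediction takes shape as depth increases.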

Other directories

`/data_generation`

Code for generating datasets with the GPT-4, GPT-3.5, and Claude APIs

`/probability_calibration`

Early-stage experiments that try to measure whether LLMs are aware of their internal uncertainty over a prediction
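The README does not say how calibration is measured. One common choice, shown here purely as an assumed example, is expected calibration error (ECE): bin predictions by confidence and average the gap between mean confidence and accuracy in each bin. The function name, the bin count and the toy data are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece


# Toy data: stated confidence per answer and whether the answer was right.
confidences = [0.9, 0.9, 0.6, 0.6]
correct = [True, True, True, False]

ece = expected_calibration_error(confidences, correct)
print(round(ece, 3))
```

A value near zero means stated confidence tracks actual accuracy; larger values indicate over- or under-confidence.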

`/unlearning`

Early-stage attempt at Google's Machine Unlearning Challenge


Languages

Jupyter Notebook 97.7%
Python 2.3%