Clone the repository and move to the corresponding directory by running:

```shell
git clone git@github.com:zilaeric/othello-gpt-probing.git
cd othello-gpt-probing
```
Install dependencies and download the necessary data (takes ~15 minutes) by running:

```shell
sbatch jobs/install_env.job
```
Transform the data to the necessary format (takes ~10 minutes) by running:

```shell
sbatch jobs/transform_data.job
```
Log in to your WandB account by activating the environment and entering the requested information when running:

```shell
module purge
module load 2021
module load Anaconda3/2021.05
source activate othello
wandb login
```
Finally, a probe can be trained with a ready-made `.job` file, for example, by running:

```shell
sbatch jobs/probe_6_hook_resid_post.job
```
In order to define and train a new probe, the arguments `--layer` and `--place` must be set in line with the TransformerLens library's definitions (the arguments are subsequently combined as `f"block.{layer}.{place}"`). An example of training a probe on the final output (`--place "hook_resid_post"`) of the sixth self-attention block (`--layer 6`) follows:

```shell
python mechanistic_interpretability/tl_probing_v1.py --layer 6 --place "hook_resid_post"
```
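As a minimal sketch of how the two CLI arguments select an activation site, the snippet below composes them into a hook name string. The `argparse` wiring is an illustrative assumption; the actual argument handling lives in `tl_probing_v1.py`.

```python
# Sketch: how --layer and --place combine into an activation hook name.
# The argparse setup below is an assumption for illustration only.
import argparse

def hook_name(layer: int, place: str) -> str:
    # Arguments are combined exactly as described above.
    return f"block.{layer}.{place}"

parser = argparse.ArgumentParser()
parser.add_argument("--layer", type=int, required=True)
parser.add_argument("--place", type=str, required=True)

args = parser.parse_args(["--layer", "6", "--place", "hook_resid_post"])
print(hook_name(args.layer, args.place))  # block.6.hook_resid_post
```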
Neel Nanda recently released a TransformerLens version of Othello-GPT (Colab, Repo Notebook), boosting mechanistic interpretability research on it. Based on his work, a tool was made to inspect each MLP neuron in Othello-GPT; see, for example, the differing activations of neuron 255 in layer 3 and neuron 250 in layer 8.
This repository provides the code for training, probing, and intervening on the Othello-GPT described in Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task, presented at ICLR 2023.
The implementation is based on minGPT, thanks to Andrej Karpathy.
Language models show a surprising range of capabilities, but the source of their apparent competence is unclear. Do these networks just memorize a collection of surface statistics, or do they rely on internal representations of the process that generates the sequences they see? We investigate this question by applying a variant of the GPT model to the task of predicting legal moves in a simple board game, Othello. Although the network has no a priori knowledge of the game or its rules, we uncover evidence of an emergent nonlinear internal representation of the board state. Interventional experiments indicate this representation can be used to control the output of the network and create "latent saliency maps" that can help explain predictions in human terms.
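The synthetic task above is predicting legal Othello moves. To make that task concrete, the sketch below enumerates legal moves from the standard starting position; it is an illustrative re-implementation, not the repository's data pipeline (which generates full game transcripts for training).

```python
# Sketch of the synthetic task: enumerating legal Othello moves.
# This is an illustrative re-implementation, not the repo's data pipeline.
EMPTY, BLACK, WHITE = 0, 1, -1
DIRS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def initial_board():
    b = [[EMPTY] * 8 for _ in range(8)]
    b[3][3], b[4][4] = WHITE, WHITE
    b[3][4], b[4][3] = BLACK, BLACK
    return b

def is_legal(board, r, c, player):
    if board[r][c] != EMPTY:
        return False
    for dr, dc in DIRS:
        rr, cc, seen_opponent = r + dr, c + dc, False
        # Walk over a contiguous run of opponent discs in this direction.
        while 0 <= rr < 8 and 0 <= cc < 8 and board[rr][cc] == -player:
            rr, cc, seen_opponent = rr + dr, cc + dc, True
        # The run must end on one of the player's own discs to flank.
        if seen_opponent and 0 <= rr < 8 and 0 <= cc < 8 and board[rr][cc] == player:
            return True
    return False

def legal_moves(board, player):
    return [(r, c) for r in range(8) for c in range(8) if is_legal(board, r, c, player)]

# Black has exactly four legal opening moves.
print(legal_moves(initial_board(), BLACK))  # [(2, 3), (3, 2), (4, 5), (5, 4)]
```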
- Installation
- Training Othello-GPT
- Probing Othello-GPT
- Intervening Othello-GPT
- Attribution via Intervention Plots
- How to Cite
Some plotting functions require LaTeX on your machine: check this FAQ for how to install it.
Then use these commands to set up:

```shell
conda env create -f environment.yml
conda activate othello
python -m ipykernel install --user --name othello --display-name "othello"
mkdir -p ckpts/battery_othello
```
Download the championship dataset and the synthetic dataset and save them in the `data` subfolder.
Then see `train_gpt_othello.ipynb` for training and validation. Alternatively, checkpoints can be downloaded from here to skip this step.
To run the default experiment setting in the background, use:

```shell
jupyter nbconvert --execute --to notebook --allow-errors --ExecutePreprocessor.timeout=-1 train_gpt_othello.ipynb --inplace --output ckpts/checkpoint.ipynb
```
Then we use `train_probe_othello.py` to train probes. For example, to train a nonlinear probe with hidden size 64 on layer 6 of the model trained on the championship dataset, run:

```shell
python train_probe_othello.py --layer 6 --twolayer --mid_dim 64 --championship
```
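The `--twolayer --mid_dim 64` flags describe a two-layer readout from a model activation. Below is a minimal forward-pass sketch of that shape; all dimensions and the random weights are assumptions for illustration, and the real probe and its training loop live in `train_probe_othello.py`.

```python
# Sketch of the "--twolayer --mid_dim 64" probe shape: a two-layer MLP
# reading board state out of an activation vector. All dimensions and
# weights here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_model, mid_dim, n_squares, n_classes = 512, 64, 64, 3  # 3: empty/black/white

W1 = rng.normal(0, 0.02, (d_model, mid_dim))
W2 = rng.normal(0, 0.02, (mid_dim, n_squares * n_classes))

def probe(activation):
    # activation: (d_model,) vector taken at the probed layer.
    h = np.maximum(activation @ W1, 0.0)           # hidden layer with ReLU
    logits = (h @ W2).reshape(n_squares, n_classes)
    return logits  # per-square board-state logits

logits = probe(rng.normal(size=d_model))
print(logits.shape)  # (64, 3)
```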
Checkpoints will be saved to `ckpts/battery_othello` or can alternatively be downloaded from here. These checkpoints are produced by `produce_probes.sh`.
See `intervening_probe_interact_column.ipynb` for the intervention experiment, where we can customize (1) which model to intervene on, (2) the pre-intervention board state, and (3) which square(s) to intervene on.
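One way to picture an intervention is editing an activation along a probe direction so the model represents a different board state. The sketch below does this with a random vector; the direction, target value, and update rule are assumptions for illustration, not the notebook's actual procedure.

```python
# Sketch of the intervention idea: set an activation's component along a
# probe direction to a chosen target value. Direction, target, and update
# rule are illustrative assumptions, not the notebook's procedure.
import numpy as np

def intervene(activation, direction, target=1.0):
    d = direction / np.linalg.norm(direction)
    # Remove the current component along the probe direction,
    # then write in the desired target component.
    return activation - (activation @ d) * d + target * d

rng = np.random.default_rng(0)
x = rng.normal(size=8)
d = rng.normal(size=8)
x_new = intervene(x, d, target=2.0)
print(round(float(x_new @ (d / np.linalg.norm(d))), 6))  # 2.0
```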
See `plot_attribution_via_intervention_othello.ipynb` for the attribution via intervention experiment, where we can also customize (1) which model to intervene on, (2) the pre-intervention board state, and (3) which square(s) to attribute.
@inproceedings{
li2023emergent,
title={Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task},
author={Kenneth Li and Aspen K Hopkins and David Bau and Fernanda Vi{\'e}gas and Hanspeter Pfister and Martin Wattenberg},
booktitle={The Eleventh International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=DeG07_TcZvT}
}