This project implements autoencoders with graph neural networks using PyTorch Geometric for application in anomaly detection in particle collisions at the Large Hadron Collider.
An autoencoder trained to reconstruct data can be used to separate background from potentially anomalous signals by cutting on the reconstruction loss: events that the model reconstructs poorly are kept as anomaly candidates. We can then analyze the filtered data using a bump hunt. More details can be found in the community submission of our preliminary results, GraphAutoencoderLHCO2020PaperContribution.pdf (download link), and in section 3.7 of the community paper on arXiv.
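As a rough illustration of cutting on the reconstruction loss, the sketch below keeps the events whose loss falls in the top fraction of the distribution. The function name `select_anomalous`, the `keep_fraction` parameter, and the toy loss values are assumptions made for this example; they are not taken from the repository.

```python
import numpy as np

def select_anomalous(losses, keep_fraction=0.01):
    """Return a boolean mask for the highest-loss events and the threshold used."""
    losses = np.asarray(losses, dtype=float)
    # Threshold at the (1 - keep_fraction) quantile of the loss distribution.
    threshold = np.quantile(losses, 1.0 - keep_fraction)
    return losses > threshold, threshold

rng = np.random.default_rng(0)
# Toy losses (hypothetical): background reconstructs well, signal-like events do not.
background = rng.exponential(scale=1.0, size=10_000)
signal = rng.exponential(scale=1.0, size=50) + 8.0
losses = np.concatenate([background, signal])

mask, threshold = select_anomalous(losses, keep_fraction=0.01)
print(f"threshold={threshold:.3f}, kept {mask.sum()} of {len(losses)} events")
```

The mask can then be applied to the per-event quantities (for example the dijet invariant mass) before they are passed to the bump hunt.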
The dataset comes from the LHC Olympics 2020 Anomaly Detection Challenge. The data and further details can be found at the following link:
Background Data and Black Boxes
Follow Nautilus instructions to get access to PRP under the cms-ml namespace and set up kubectl on your computer
Fork the repo and clone it locally
git clone [the URL of your fork]
In the directory that contains anomaly-pod.yml, run:
kubectl -n cms-ml create -f anomaly-pod.yml
kubectl -n cms-ml exec -it anom-pod -- bash
This lets you access a pod (remote environment) where our data and models are stored.
To work on the code within the pod and edit with Vim or Emacs:
cd ~/work
git clone [the URL of your fork]
cd AnomalyDetection4Jets/code
If you want to use Jupyter Notebook, run the following on your local machine:
kubectl -n cms-ml port-forward anom-pod 8888:8888
Then once in the pod run the following command in whichever directory you want to work in:
jupyter notebook --ip 0.0.0.0 --port 8888 --no-browser
To see what's stored in our volume, run cd /anomalyvol/ and look around the directory. Important directories include:
/anomalyvol/experiments
/anomalyvol/data
/anomalyvol/emd_models
Once you are done using your pod, terminate it with:
kubectl -n cms-ml delete pods anom-pod
- It's mandatory to read the Nautilus documentation for usage policies and other details
- There are 2 volumes where we store our data. You can see them by running
kubectl -n cms-ml get pvc
in your local environment. anomalyvol-2 is more up to date, but if you need extra storage or want more space to experiment, you can switch to anomalyvol by changing any instances of anomalyvol-2 to anomalyvol in the .yml files.
------------------------
anomalyvol Bound pvc-a5cb2fae-e8e3-4c0d-b8ef-e69ff52f5aad 1000Gi RWX rook-cephfs 357d
anomalyvol-2 Bound pvc-6bb604aa-5a40-44d4-af1e-fb6f38b4d1fb 1000Gi RWX rook-cephfs 301d
...
- Pods only have a lifespan of 6 hours, so save your work frequently by pushing to GitHub. Alternatively, edit the code on your local machine instead of through a pod if you don't need to run any commands. It can be convenient to generate a small sample of the dataset on your local machine to test your code once you have set up all the required packages locally.
- You can set the default namespace to cms-ml so that you don't have to add the -n cms-ml flag to every kubectl command:
kubectl config set-context nautilus --namespace=cms-ml
- Any changes you save in the home directory (~) of a pod will be deleted once the pod terminates. Things saved in /anomalyvol/ will remain.
To learn how to run the code and what each flag does, I recommend looking through the argparse section of the corresponding files. Below are some examples of how the commands will look.
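For orientation, an argparse section generally looks something like the sketch below. It declares the flags that appear in the training command later in this README; the types, defaults, and help strings are assumptions made for illustration, so always check the actual script for the authoritative definitions.

```python
import argparse

def build_parser():
    """Illustrative parser; flag names come from the README, everything else is assumed."""
    parser = argparse.ArgumentParser(description="Train a graph autoencoder (sketch)")
    parser.add_argument("--mod-name", required=True, help="name for the saved model")
    parser.add_argument("--input-dir", required=True, help="directory with the processed dataset")
    parser.add_argument("--output-dir", default="/anomalyvol/experiments/", help="where to save results")
    parser.add_argument("--box-num", type=int, default=0, help="which box to train on")
    parser.add_argument("--model", default="EdgeNet", help="model class to train")
    parser.add_argument("--batch-size", type=int, default=16)
    parser.add_argument("--lr", type=float, default=0.01, help="learning rate")
    parser.add_argument("--loss", default="emd_loss", help="loss function name")
    parser.add_argument("--emd-model-name", default=None, help="EMD network checkpoint file")
    parser.add_argument("--num-data", type=int, default=None, help="number of samples to use")
    parser.add_argument("--patience", type=int, default=10)
    return parser

# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--mod-name", "demo", "--input-dir", "/anomalyvol/data/bb_train_sets/bb0_xyz/"]
)
print(args.mod_name, args.batch_size, args.lr)
```

Note that argparse converts dashes in flag names to underscores, so --mod-name is read back as args.mod_name.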
In the volume you can already find the processed dataset in /anomalyvol/data, but if you need to generate the data in the future:
Make sure you have a directory somewhere with the raw data in its raw/ subdirectory. Look at /anomalyvol/data/bb_train_sets/bb0_xyz/ for reference. The raw data can be downloaded from the Zenodo page linked above and sent to the volume using:
kubectl -n cms-ml cp events_LHCO2020_backgroundMC_Pythia.h5 cms-ml/anom-pod:/anomalyvol/data/bb_train_sets/your_directory/raw/events_LHCO2020_backgroundMC_Pythia.h5
Replace the name of the file or the path with whatever you're sending over.
To process a sample of the raw dataset, run:
python graph_data.py --dataset /anomalyvol/data/[examplepath] --n-proc 1 --n-events 1000 --bb 0 --n-events-merge 100
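To give a feel for what this kind of preprocessing involves, here is a minimal sketch of one plausible step: converting a raw event into Cartesian momenta. It assumes each raw event is a flat, zero-padded row of up to 700 (pT, eta, phi) triplets, and it guesses from the directory name bb0_xyz that the processed features are (px, py, pz). The function `to_cartesian` is hypothetical and is not the code in graph_data.py.

```python
import numpy as np

def to_cartesian(row, max_particles=700):
    """Convert one flat event of (pT, eta, phi) triplets into an (n, 3) array of (px, py, pz)."""
    triplets = np.asarray(row, dtype=float)[: 3 * max_particles].reshape(-1, 3)
    pt, eta, phi = triplets[:, 0], triplets[:, 1], triplets[:, 2]
    keep = pt > 0  # zero-padded slots carry no particle
    px = pt[keep] * np.cos(phi[keep])
    py = pt[keep] * np.sin(phi[keep])
    pz = pt[keep] * np.sinh(eta[keep])
    return np.stack([px, py, pz], axis=1)

# Toy event (hypothetical values): two particles followed by one padded slot.
event = [10.0, 0.0, 0.0, 5.0, 0.0, np.pi / 2, 0.0, 0.0, 0.0]
print(to_cartesian(event))
```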
Creating the whole dataset usually takes a lot of time and memory, so we use a job instead. Check anomaly-graph-job.yml for details on how that looks. Assuming all the parameters are set correctly and the dataset exists, run the job with:
kubectl -n cms-ml create -f anomaly-graph-job.yml
You can delete the job once it is done using:
kubectl -n cms-ml delete -f anomaly-graph-job.yml
Notes:
- bb0 (the background dataset we use for training) is already preprocessed in /anomalyvol/data/bb_train_sets/bb0_xyz/
- You do not want to generate the whole dataset on your local machine. Once processed, it takes up a lot of space.
- It is easy to use up all of the volume's storage by generating datasets, so be careful.
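One way to be careful about storage is to check the free space before generating anything. The sketch below uses only the Python standard library; the helper names `free_gib` and `has_room` are made up for this example, and inside a pod the path would be the mounted volume (for example /anomalyvol/).

```python
import shutil

def free_gib(path):
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30

def has_room(path, needed_gib):
    """True if at least `needed_gib` GiB are free under `path`."""
    return free_gib(path) >= needed_gib

# The current directory is used here so the sketch runs anywhere.
print(f"{free_gib('.'):.1f} GiB free in the current directory")
```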
python train_script.py --mod-name [REPLACE WITH A NAME] --input-dir /anomalyvol/data/bb_train_sets/bb0_xyz/ --box-num 0 --model EdgeNet --batch-size 16 --lr 0.01 --loss emd_loss --emd-model-name EmdNNRel.best.pth --num-data 256 --patience 10
You can find the saved model in /anomalyvol/experiments/ in the directory with the corresponding model name. Alternatively, you can change the output path using the --output-dir flag.
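The training command above passes --patience 10. Assuming this flag has the usual early-stopping meaning (stop once the validation loss has not improved for that many consecutive epochs), the logic can be sketched as follows. The `EarlyStopping` class is a hypothetical illustration, not the implementation in train_script.py.

```python
class EarlyStopping:
    """Signal a stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience

# Toy validation losses (hypothetical) with a short patience to keep the demo small.
stopper = EarlyStopping(patience=3)
for epoch, val_loss in enumerate([1.0, 0.8, 0.9, 0.85, 0.81, 0.7]):
    if stopper.step(val_loss):
        print(f"stopping after epoch {epoch}")
        break
```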
You won't have enough memory to train a model on the whole dataset in a pod, so set up a train job with the appropriate parameters. Check gae_train_job.yml for an example.
If your pod or job has multiple GPUs allocated to it, training will use all of them by default.
If using the EMD network as a loss, check /anomalyvol/emd_models/ for the model names (I only recommend using the ones suffixed with Spl and Rel). If using multiple GPUs to train the EMD network, set --model to EdgeNetEMD.
Same idea as the prior sections, though you will almost always want to run a job. Look at bump_hunt.py for the flags. By default you will find the generated graphs in /anomalyvol/experiments.
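For readers new to bump hunting, the idea can be sketched with a toy sliding-window scan: histogram the invariant mass of the events that pass the loss cut, estimate the background in each bin from its neighbouring sideband bins, and look for the largest excess. This is a deliberately simplified sketch under stated assumptions, not what bump_hunt.py does. The function, the sideband averaging, and the toy spectrum are all hypothetical, and a real analysis would typically fit a smooth background shape instead, since averaging sidebands overestimates the background on a steeply falling spectrum.

```python
import numpy as np

def bump_hunt(masses, bins, sideband=2):
    """Scan mass bins and return (best significance, (low edge, high edge)) of the largest excess."""
    counts, edges = np.histogram(masses, bins=bins)
    best = (0.0, None)
    for i in range(sideband, len(counts) - sideband):
        left = counts[i - sideband:i]
        right = counts[i + 1:i + 1 + sideband]
        # Background estimate: mean count of the sideband bins on both sides.
        expected = np.concatenate([left, right]).mean()
        if expected <= 0:
            continue
        z = (counts[i] - expected) / np.sqrt(expected)
        if z > best[0]:
            best = (z, (edges[i], edges[i + 1]))
    return best

rng = np.random.default_rng(1)
# Toy spectrum (hypothetical): falling background plus a narrow resonance near 3900.
background = 2000 + rng.exponential(scale=1000, size=20_000)
signal = rng.normal(loc=3900, scale=40, size=400)
masses = np.concatenate([background, signal])

z, window = bump_hunt(masses, bins=np.arange(2000, 7000, 200))
print(f"largest excess: z={z:.1f} in window {window}")
```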