Code and documentation template for PyTorch research projects. This repository is intended to be cloned at the beginning of any new deep learning research project based on PyTorch.
Every TODO comment in the code indicates a portion of the code that should be adapted for every specific project. The rest of the code should usually remain almost unchanged.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
The project folder, including files excluded from git versioning, has the following structure:
pytorch-template/ [main folder]
│ .gitignore [files ignored by git]
│ LICENSE [code license]
│ README.md [this file]
│ requirements.txt [package dependencies]
│ setup.py [package setup script]
│
├───data [data folder excluded from git tracking]
│ │ targets.csv [targets for train, dev and test data]
│ │
│ ├───train
│ │ ...
│ ├───dev
│ │ ...
│ └───test
│ ...
│
├───docs [documentation folder]
│ ...
│
└───pytorch_template [package source code folder]
__main__.py [package command-line interface]
...
You should comply with this structure in all your projects.
In particular, you should structure the data folder containing your dataset according to the hierarchy shown.
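As a quick sanity check of the hierarchy shown, the sketch below lists which expected entries are missing from a data folder. The function name `missing_data_entries` is hypothetical and not part of the template; the expected names (`train`, `dev`, `test`, `targets.csv`) are taken from the tree above.

```python
from pathlib import Path


def missing_data_entries(data_dir):
    """Return the expected entries that are absent from data_dir."""
    root = Path(data_dir)
    # Layout taken from the tree above: three split folders plus one targets file.
    expected_dirs = ["train", "dev", "test"]
    missing = [name for name in expected_dirs if not (root / name).is_dir()]
    if not (root / "targets.csv").is_file():
        missing.append("targets.csv")
    return missing


# An empty list means the folder matches the expected layout.
print(missing_data_entries("data"))
```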
In order to run the code you need to have Python 3.6 installed.
You can install the package on MacOS/Linux with the following commands:
git clone https://github.com/nextbitlabs/pytorch-template.git
cd pytorch-template
python3 setup.py sdist
python3 setup.py bdist_wheel
pip3 install --no-index --find-links=dist pytorch_template -r requirements.txt
The data used here are synthetic. To generate them, run:
python3 generate_data.py
A command line interface is available to easily interact with the package.
It is included inside the package in the file __main__.py.
To see more details about the command line interface, show the help page with the command:
python3 -m pytorch_template --help
The available commands are:
- ingest: preprocess raw data and export it in a suitable format for model training;
- train: train the deep learning model on ingested data;
- restore: restore the training from a saved checkpoint;
- eval: evaluate the model on ingested validation data;
- test: produce model output on a single raw data sample.
Every command has its own help page, which can be displayed with
python3 -m pytorch_template <command> --help
The ingestion phase is useful if preprocessing is computationally expensive and many transformations are required. Here, for example, it is not really necessary but it is included to show the code structure.
In some cases an additional safe-ingest step can be used to check and ensure label
consistency across the different dataset splits, or to perform transformations
that depend on other splits. Here it is not needed: the example task is a
regression, so there is no fixed set of labels.
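For a classification task, the label check described above could be sketched as follows. This is only an illustration under assumptions: the function name `check_label_coherence` and the toy labels are hypothetical, and the template itself does not ship this step.

```python
def check_label_coherence(labels_by_split, reference="train"):
    """Return, per split, the labels that never occur in the reference split."""
    known = set(labels_by_split[reference])
    unseen = {}
    for split, labels in labels_by_split.items():
        if split == reference:
            continue
        extra = sorted(set(labels) - known)
        if extra:
            unseen[split] = extra
    return unseen


# Toy labels, invented for the example.
splits = {
    "train": ["cat", "dog", "cat"],
    "dev": ["dog", "bird"],
    "test": ["cat"],
}
print(check_label_coherence(splits))  # dev contains a label missing from train
```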
Only the training set and the development set have to be ingested, which can be done with the following lines:
python3 -m pytorch_template ingest data train
python3 -m pytorch_template ingest data dev
For more details on the usage you can access the help page with the command
python3 -m pytorch_template ingest --help
The training phase always has the same structure, and the template is built to keep all the models tried in files separate from the main training function.
The path to the best weight checkpoint according to the metric is printed to console at the end of the computation.
The command has many optional training-related parameters commonly tuned by the
experimenter, such as batch-size, epochs and lr.
The most basic training run only requires the directory containing the dataset,
already split into train (compulsory) and dev (optional) folders; the other
parameters take their default values:
python3 -m pytorch_template train data/tensors
An equivalent form of the previous command with all the default values manually specified is:
python3 -m pytorch_template train \
data/tensors \
--output-dir . \
--batch-size 20 \
--epochs 40 \
--lr 0.1
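To see why the short and the long form are equivalent, here is a minimal argparse sketch that mirrors the documented options and defaults. It is an assumption about how such a parser could look, not the template's actual code in __main__.py; `build_train_parser` is a hypothetical name.

```python
import argparse


def build_train_parser():
    """Hypothetical parser mirroring the documented train options and defaults."""
    parser = argparse.ArgumentParser(prog="pytorch_template train")
    parser.add_argument("data_dir", help="directory with train and (optionally) dev sub-folders")
    parser.add_argument("--output-dir", default=".")
    parser.add_argument("--batch-size", type=int, default=20)
    parser.add_argument("--epochs", type=int, default=40)
    parser.add_argument("--lr", type=float, default=0.1)
    return parser


# Short form: only the positional argument, everything else falls back to defaults.
short = build_train_parser().parse_args(["data/tensors"])
# Long form: every default spelled out explicitly.
full = build_train_parser().parse_args(
    ["data/tensors", "--output-dir", ".", "--batch-size", "20", "--epochs", "40", "--lr", "0.1"]
)
assert short == full
```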
For more details on the usage you can access the help page with the command
python3 -m pytorch_template train --help
When the model has not converged at the end of the training phase, it can be useful to restore it from the last saved checkpoint and that is exactly the role of this command.
The command has the same optional parameters as the train command.
It has just one additional compulsory parameter: the path to the model checkpoint
to be restored.
The most basic restored training run only requires the checkpoint path and the
directory containing the dataset, already split into train (compulsory) and dev
(optional) folders; the other parameters take their default values:
python3 -m pytorch_template restore \
runs/<secfromepochs>/checkpoints/model-<epoch>-<metric>.ckpt data/tensors
An equivalent form of the previous command with all the default values manually specified is:
python3 -m pytorch_template restore \
runs/<secfromepochs>/checkpoints/model-<epoch>-<metric>.ckpt \
data/tensors \
--output-dir . \
--batch-size 20 \
--epochs 40 \
--lr 0.1
For more details on the usage you can access the help page with the command
python3 -m pytorch_template restore --help
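When several checkpoints are available, picking the last one by hand can be error-prone, because plain string sorting puts epoch 12 before epoch 3. The sketch below selects the checkpoint with the highest epoch. It assumes the file names follow model-<epoch>-<metric>.ckpt with an integer epoch and a decimal metric; both the helper `latest_checkpoint` and the example file names are hypothetical.

```python
import re
from pathlib import Path

# Assumed naming: model-<epoch>-<metric>.ckpt, integer epoch, decimal metric.
CKPT_PATTERN = re.compile(r"model-(\d+)-(\d+(?:\.\d+)?)\.ckpt")


def latest_checkpoint(paths):
    """Return the path with the highest epoch number, or None if none match."""
    best = None
    best_epoch = -1
    for path in paths:
        match = CKPT_PATTERN.fullmatch(Path(path).name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch = int(match.group(1))
            best = path
    return best


# Invented file names, only to exercise the helper.
names = [
    "checkpoints/model-3-0.4200.ckpt",
    "checkpoints/model-12-0.0310.ckpt",
    "checkpoints/model-7-0.1100.ckpt",
]
print(latest_checkpoint(names))
```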
The eval command reproduces the validation performed at the end of every epoch during the training phase.
It is particularly useful when several datasets are available and transfer learning performance has to be evaluated.
The evaluation only requires the model checkpoint to be evaluated and the
directory containing the dataset, which must include a dev sub-folder.
The size of the evaluation batches can be specified manually; otherwise it
defaults to 20.
A full call to the command is:
python3 -m pytorch_template eval \
runs/<secfromepochs>/checkpoints/model-<epoch>-<metric>.ckpt \
data/tensors \
--batch-size 20
For more details on the usage you can access the help page with the command
python3 -m pytorch_template eval --help
The test command performs inference on a single file.
The test of the model is performed specifying the model checkpoint to be evaluated and the path to a sample, for example:
python3 -m pytorch_template test \
runs/<secfromepochs>/checkpoints/model-<epoch>-<metric>.ckpt \
data/test/<sample>.pt
For more details on the usage you can access the help page with the command
python3 -m pytorch_template test --help
The model converges to perfect predictions using default parameters.
The template can be deployed on an NGC optimized instance. The steps below configure it on an AWS EC2 g4dn.xlarge instance with the NVIDIA Deep Learning AMI environment.
- Log in via ssh following the instructions on the EC2 Management Dashboard.
- Clone the repo pytorch-template in the home directory.
- Download the most up-to-date PyTorch container by running
  docker pull nvcr.io/nvidia/pytorch:YY.MM-py3
  (replace “YY” and “MM” with the most recent version, which is updated monthly).
- Create a container with
docker run --gpus all --name template -e HOME=$HOME -e USER=$USER \
-v $HOME:$HOME -p 6006:6006 --shm-size 60G -it nvcr.io/nvidia/pytorch:YY.MM-py3
At the end of the procedure you will have access to a terminal on a Docker container configured to work on the GPU, where you can simply run the commands above and benefit from the speed of parallel computing.
The $HOME directory on the Docker container is linked to the $HOME directory
of the host machine, so the repository can be found in $HOME. Similarly,
port 6006, used by TensorBoard, is mapped from the container to port 6006
of the host machine.
Useful commands to interact with the Docker container are:
- docker start template: start the container;
- docker exec -it template bash: open a terminal on the container;
- docker stop template: stop the container;
- docker rm template: remove the container.
In order to monitor training you can run the following commands from the container console:
- watch -n 1 nvidia-smi to monitor GPU usage;
- tensorboard --logdir runs/<run_id> --bind_all to start TensorBoard.
The template also includes an implementation of a cool new optimizer, Ranger. Ranger uses the Lookahead optimization method together with the RAdam optimizer. It is not used here, as it is way too slow for such a simple model, but it reportedly performs better than other Adam variants on deeper models.
You can use it simply by calling:
from pytorch_template.models.optimizer.ranger import Ranger
optimizer = Ranger(module.parameters())
If you want, you can specify many more hyper-parameters. If you use a learning rate scheduler, make sure that the learning rate remains constant for a rather long time, so that RAdam can start correctly and Lookahead can take advantage of its exploration.
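One way to keep the learning rate constant for a long initial phase is a "flat, then decay" schedule. The sketch below is a plain Python multiplier function. The name `flat_then_cosine`, the 0.7 flat fraction and the cosine decay are illustrative assumptions, not values prescribed by the template or by Ranger.

```python
import math


def flat_then_cosine(step, total_steps, flat_fraction=0.7):
    """Multiplier for the base learning rate: 1.0 while flat, then cosine decay to 0."""
    # flat_fraction is an illustrative choice; tune it for your own training length.
    flat_steps = int(total_steps * flat_fraction)
    if step < flat_steps:
        return 1.0
    decay_steps = max(1, total_steps - flat_steps)
    progress = min(1.0, (step - flat_steps) / decay_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


# Print the multiplier at a few points of a 100-step run.
for step in (0, 50, 85, 100):
    print(step, round(flat_then_cosine(step, 100), 3))
```

A function of this shape can be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, for example `LambdaLR(optimizer, lr_lambda=lambda s: flat_then_cosine(s, total_steps))`, where `total_steps` is whatever training length you choose.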
An implementation of the Mish activation function is also included. Mish seems to perform slightly better than ReLU when training deep models, and works well in conjunction with Ranger.
To use Mish, you just need to call it:
from pytorch_template.models.mish import Mish
mish = Mish()
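For reference, Mish is defined as x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x). The scalar sketch below only illustrates the formula in plain Python. The name `mish_scalar` is hypothetical, and the template's Mish class may be implemented differently.

```python
import math


def mish_scalar(x):
    """Mish for a single float: x * tanh(softplus(x))."""
    # Numerically stable softplus: max(x, 0) + log1p(exp(-|x|)).
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return x * math.tanh(softplus)


# Mish is close to the identity for large positive inputs and close to zero
# for large negative ones, with a small negative dip in between.
for value in (-20.0, -1.0, 0.0, 1.0, 20.0):
    print(value, mish_scalar(value))
```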
This project is licensed under Apache License 2.0, see the LICENSE file for details.