This repository contains the code used to generate the results reported in the paper: A Systematic Assessment of Deep Learning Models for Molecule Generation.
@article{rigoni2020systematic,
title={A Systematic Assessment of Deep Learning Models for Molecule Generation},
author={Rigoni, Davide and Navarin, Nicol{\`o} and Sperduti, Alessandro},
journal={arXiv preprint arXiv:2008.09168},
year={2020}
}
The folders whose names begin with the "_" character contain the code and supporting files needed to test all the models.
The other folders contain the original models. For more information about each model, see the README.md file within each folder.
This project uses conda environments.
In the _environment folder you can find, for each model, the .yml file for configuring the conda environment and the .txt files for the pip environment.
Note that some dependency versions can cause problems when configuring the environment. For this reason, although a setup.bash file is provided to configure each project, it is better to configure the environments manually.
NOTE: some environments may be set up to use the CPU only. In this case, to use the GPU you need to replace the tensorflow line in the environment file with tensorflow-gpu.
Depending on the model, some lines of code must also be changed.
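The dependency swap described in the note above can be sketched in Python. This is only an illustration: the function name `use_gpu_tensorflow` is hypothetical, and it assumes the environment file lists the package on its own line, optionally preceded by "- " and followed by a version specifier. It does not touch any model code, which may still need manual changes.

```python
import re

# Matches a dependency entry whose package name is exactly `tensorflow`
# (optionally preceded by "- " and followed by a version specifier).
# Entries such as `tensorflow-gpu` or `tensorflow-estimator` are left alone.
_TF_ENTRY = re.compile(
    r"^([ \t]*(?:-[ \t]*)?)tensorflow(?=[ \t]*(?:[=<>!~]|$))",
    re.MULTILINE,
)


def use_gpu_tensorflow(env_text: str) -> str:
    """Return the environment file text with `tensorflow` renamed to `tensorflow-gpu`."""
    return _TF_ENTRY.sub(r"\1tensorflow-gpu", env_text)
```

The function works on the file contents as a string, so the rewritten text can be inspected before it is written back to the .yml or .txt file.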
The project is structured as follows:
_analysis: contains the code to execute to test the molecules generated by the models. It also contains the code to analyze the datasets;
_datasets: contains the datasets QM9 and ZINC;
_environments: contains the file setup.bash used to configure each environment;
_utils: contains all the utility code;
gVAE: contains both Character VAE and Grammar VAE code;
sdVAE: contains the Syntax Directed VAE code;
molGAN: contains the MolGAN code;
rGVAE: contains the Regularized Graph VAE code;
jtVAE: contains the Junction Tree VAE code;
constrainedGVAE: contains the Constrained Graph VAE code.
First you need to download the necessary files by running the following commands:
cd _dataset/QM9
sh download_dataset.sh
The test set consists, for both datasets, of the first 5000 molecules. Since each model can use a different validation procedure, the choice of how to divide the remaining molecules into validation and training sets is left to each model, following the code used by the model's author.
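As a minimal sketch of the split described above, the following Python function separates the first 5000 molecules from the rest. The function name `split_test_set` is hypothetical, and the molecules are assumed to be already loaded as a sequence of strings in dataset order; how the remaining molecules are divided into training and validation sets is deliberately left out, since that depends on each model.

```python
from typing import List, Sequence, Tuple

TEST_SET_SIZE = 5000  # size of the test set stated in the text above


def split_test_set(
    molecules: Sequence[str], test_size: int = TEST_SET_SIZE
) -> Tuple[List[str], List[str]]:
    """Return (test, remaining): the first `test_size` molecules and all the others."""
    test = list(molecules[:test_size])
    remaining = list(molecules[test_size:])
    return test, remaining
```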
For training and molecule generation it is necessary to execute the model code in the appropriate folders.
For new models, remember to add the functions for reading and saving the molecules, following the implementation in the existing models.
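A possible shape for such reading and saving functions is sketched below. The function names and the file layout (one molecule string per line, UTF-8 text) are assumptions made for illustration; the implementation in the existing model folders remains the reference.

```python
from pathlib import Path
from typing import Iterable, List


def save_molecules(molecules: Iterable[str], path: str) -> None:
    """Write one molecule per line (assumed layout), skipping empty entries."""
    lines = [m.strip() for m in molecules if m.strip()]
    text = "\n".join(lines) + ("\n" if lines else "")
    Path(path).write_text(text, encoding="utf-8")


def read_molecules(path: str) -> List[str]:
    """Read the molecules back, ignoring blank lines."""
    content = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in content.splitlines() if line.strip()]
```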
Within each model folder there is a README.md file that links to the original repository of the code.
Refer to the original repository for the commands used to train the models.
Once the molecules have been generated with a model and saved in the molecules.txt file, you can use the files in the _analysis/models folder to calculate their properties.
The file model_results_generation.py analyzes the new molecules generated by sampling from the latent space, while model_results_bias.py performs the reconstruction analysis.
Example, given $path as the full path to the ComparisonsDGM folder and $my_folder as the name of the folder where the results are saved:
conda activate analysis
cd _analysis/models
python model_results_bias.py $my_folder $path/gVAE/results/qm9_vae_str_L56_E100_val_decRes.txt qm9
python model_results_bias.py $my_folder $path/gVAE/results/zinc_vae_str_L56_E100_val_decRes.txt zinc
The results will be reported in the folder $path/_analysis/models/[bias, generation]/$my_folder/.
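To locate the output programmatically, the path pattern above can be expressed as a small Python helper. The name `results_dir` is hypothetical, and the sketch assumes that "[bias, generation]" stands for one subfolder per analysis type, with "bias" used by model_results_bias.py and "generation" by model_results_generation.py.

```python
from pathlib import Path

ANALYSES = ("bias", "generation")  # the two subfolders named in the path pattern


def results_dir(path: str, analysis: str, my_folder: str) -> Path:
    """Build $path/_analysis/models/<analysis>/$my_folder for a given analysis type."""
    if analysis not in ANALYSES:
        raise ValueError("analysis must be one of %s" % (ANALYSES,))
    return Path(path) / "_analysis" / "models" / analysis / my_folder
```

For instance, `results_dir(path, "bias", my_folder)` would point at the folder written by the two commands in the example, under the assumption stated above.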
For any questions and comments, contact Davide Rigoni.
MIT