This repository hosts Molecular Crystal Representation from Transformers (MCRT), a transformer-based model designed for property prediction of molecular crystals. Pre-trained on over 700,000 experimental structures from the Cambridge Crystallographic Data Centre (CCDC), MCRT extracts both local and global representations of crystals using multi-modal features, achieving state-of-the-art performance on various property prediction tasks with minimal fine-tuning. Explore this repository to accelerate your research in molecular crystal discovery and functionality prediction.
It is easier to use Apptainer (install from here), because you avoid unexpected errors when setting up the environment. You only need to install Apptainer; we provide the pre-defined images here.
- Create a Conda environment with Python 3.8:
conda create -y -n MCRT python=3.8
conda activate MCRT
- Install PyTorch 2.1.1 (from here) and DGL (from here) according to your CUDA version and OS.
- Install other packages and MCRT:
cd /path/to/MCRT
pip install -r requirements.txt
pip install MCRT-tools
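After installation, you can run a quick sanity check to confirm that PyTorch and DGL import correctly and that CUDA is visible (a minimal sketch):
import torch
import dgl

# Verify the installed versions and CUDA availability
print("PyTorch:", torch.__version__)  # expected: 2.1.1
print("DGL:", dgl.__version__)
print("CUDA available:", torch.cuda.is_available())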
The persistence images are generated with an adapted version of moleculetda. We provide two options for installing it.
Option 1: download the pre-defined image here. Usage:
apptainer exec /path/to/moleculetda_container.sif python3 /path/to/cif_to_image.py --cif_path /path/to/cif_path --paral 16
Option 2: create a Conda environment with Python 3.11:
conda create -y -n persistent python=3.11
conda activate persistent
pip install moleculetda tqdm numpy
Usage:
conda activate persistent
python /path/to/cif_to_image.py --cif_path ../cifs/your_cif_folder --paral 16
You can parallelize the generation by setting --paral.
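If you want to confirm that every CIF produced a persistence image, you can compare the CIF folder against the image folder. This is only a hedged sketch: the image directory and file extension below are assumptions, so adjust them to wherever cif_to_image.py actually writes its output.
import os

cif_dir = "../cifs/your_cif_folder"        # the folder passed via --cif_path
img_dir = "../cifs/your_cif_folder/imgs"   # assumed output folder; adjust to your setup
img_ext = ".npy"                           # assumed file extension; adjust to your setup

cif_stems = {os.path.splitext(f)[0] for f in os.listdir(cif_dir) if f.endswith(".cif")}
img_stems = {os.path.splitext(f)[0] for f in os.listdir(img_dir) if f.endswith(img_ext)}

missing = sorted(cif_stems - img_stems)
print(f"{len(missing)} CIFs without persistence images")
print(missing[:10])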
Pickles contain the pre-calculated positional embedding matrix and the pre-training labels. Generating them is now optional for fine-tuning, because graphs can be generated in real time during fine-tuning. For pre-training, however, the task labels are time-consuming to generate, so the pickles should be generated like this:
conda activate MCRT
python /path/to/cif_to_dataset.py --cif_path /path/to/cif_path --paral 16 --type pretrain
You can also generate pickles for fine-tuning, which may be slightly faster than generating them in real time. Whether it helps depends on your hardware: the generation runs on the CPU, so if your CPU is fast or your GPU is slow there will be no difference, because the bottleneck is model training on the GPU. If you want to generate pickles for fine-tuning:
conda activate MCRT
python /path/to/cif_to_dataset.py --cif_path /path/to/cif_path --paral 16 --type finetune
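To inspect what a generated pickle contains, you can load one directly. A minimal sketch, assuming the pickles are standard Python pickle files written to a pickles/ folder; the glob pattern is an assumption, so adjust it to the actual output path.
import glob
import pickle

# Adjust the pattern to the actual output folder of cif_to_dataset.py
pickle_files = sorted(glob.glob("/path/to/cif_path/pickles/*"))
with open(pickle_files[0], "rb") as f:
    data = pickle.load(f)

print(pickle_files[0], type(data))
if isinstance(data, dict):
    print(list(data.keys()))  # e.g. graph data, positional embedding matrix, labels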
The dataset split is defined by a JSON file named dataset_split.json:
{
  "train": ["SUYYIV", "UYUGED"],
  "val": ["GASVUR", "IHOZAH"],
  "test": ["LUMSER", "DUGXUY"]
}
You can generate it yourself (see the sketch after the command below) or use the provided split_dataset.py:
python /path/to/split_dataset.py --cif /path/to/cif_path --split 0.8 0.1 0.1
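If you build the split yourself, writing the JSON file is straightforward; a minimal sketch (the refcodes are placeholders to replace with your own CIF names):
import json

split = {
    "train": ["SUYYIV", "UYUGED"],
    "val": ["GASVUR", "IHOZAH"],
    "test": ["LUMSER", "DUGXUY"],
}

with open("your_dataset/dataset_split.json", "w") as f:
    json.dump(split, f, indent=2)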
When you have finished the generation steps above, make sure the dataset is structured like this:
your_dataset/
├── cifs/                 # contains the CIF files
├── imgs/                 # contains the persistence images
├── pickles/              # (optional for fine-tuning) contains the pickles
├── dataset_split.json
└── downstream.csv
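A quick way to verify the layout before training (a minimal sketch based on the structure above; pickles/ is optional for fine-tuning):
import os

root = "your_dataset"
expected = ["cifs", "imgs", "dataset_split.json", "downstream.csv"]

for name in expected:
    path = os.path.join(root, name)
    status = "OK" if os.path.exists(path) else "MISSING"
    print(f"{status:8s} {path}")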
You can download the pre-trained MCRT model and the fine-tuned models from the paper via figshare here.
import MCRT
import os

__root_dir__ = os.path.dirname(__file__)
root_dataset = os.path.join(__root_dir__, "cifs", "your_dataset")
log_dir = "./logs/your_dataset_logs"
downstream = "downstream"  # name of downstream.csv
loss_names = {"classification": 0, "regression": 1}  # for regression
max_epochs = 50  # training epochs
batch_size = 32  # desired batch size; used for gradient accumulation
per_gpu_batchsize = 8  # batch size per step
num_workers = 12  # number of CPU workers
mean = 0  # mean value of your dataset
std = 1  # standard deviation of your dataset
test_to_csv = True  # if True, save test set results
load_path = "/path/to/pretrained.ckpt"

if __name__ == "__main__":
    MCRT.run(
        root_dataset,
        downstream,
        log_dir=log_dir,
        max_epochs=max_epochs,
        loss_names=loss_names,
        batch_size=batch_size,
        per_gpu_batchsize=per_gpu_batchsize,
        num_workers=num_workers,
        load_path=load_path,
        test_to_csv=test_to_csv,
        mean=mean,
        std=std,
    )
Usage: create a Python file named finetune.py and run it:
- With Apptainer:
apptainer exec /path/to/MCRT_container.sif python /path/to/finetune.py
- Run directly:
conda activate MCRT
python /path/to/finetune.py
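The mean and std passed to MCRT.run should describe your dataset's target values (ideally computed on the training split). A hedged sketch for computing them, assuming downstream.csv stores the target in its last column; adjust the column selection to your file.
import os
import pandas as pd

root_dataset = "cifs/your_dataset"
df = pd.read_csv(os.path.join(root_dataset, "downstream.csv"))

targets = df.iloc[:, -1]  # assumption: target values are in the last column
print("mean =", targets.mean())
print("std  =", targets.std())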
Set test_only to True to test the model; also set test_to_csv to True if you want to save the test results:
import MCRT
import os

__root_dir__ = os.path.dirname(__file__)
root_dataset = os.path.join(__root_dir__, "cifs", "your_dataset")
log_dir = "./logs/your_dataset_logs"
downstream = "downstream"  # name of downstream.csv
loss_names = {"classification": 0, "regression": 1}  # for regression
max_epochs = 50  # training epochs
batch_size = 32  # desired batch size; used for gradient accumulation
per_gpu_batchsize = 8  # batch size per step
num_workers = 12  # number of CPU workers
mean = 0  # mean value of your dataset
std = 1  # standard deviation of your dataset
test_only = True  # test the model
test_to_csv = True  # if True, save test set results
load_path = "/path/to/finetuned.ckpt"

if __name__ == "__main__":
    MCRT.run(
        root_dataset,
        downstream,
        log_dir=log_dir,
        max_epochs=max_epochs,
        loss_names=loss_names,
        batch_size=batch_size,
        per_gpu_batchsize=per_gpu_batchsize,
        num_workers=num_workers,
        load_path=load_path,
        test_only=test_only,
        test_to_csv=test_to_csv,
        mean=mean,
        std=std,
    )
Usage: create a Python file named test_model.py and run it:
- With Apptainer:
apptainer exec /path/to/MCRT_container.sif python /path/to/test_model.py
- Run directly:
conda activate MCRT
python /path/to/test_model.py
MCRT takes an atomic graph (local) and persistence image patches (global) as input; the model architecture is shown below. The attention score on each atom and patch can be visualized as follows:
from MCRT.visualize import PatchVisualizer
import os

__root_dir__ = os.path.dirname(__file__)
model_path = "path/to/finetuned model"
data_path = "path/to/dataset containing the crystal"  # the pickles have to be prepared
cifname = "crystal name"  # make sure it is in the test split and its pickle exists

vis = PatchVisualizer.from_cifname(cifname, model_path, data_path, save_heatmap=True)
vis.draw_graph()
vis.draw_image_1d(top_n=10)
vis.draw_image_2d(top_n=10)
Usage: create a Python file named visual.py and run it:
- With Apptainer:
apptainer exec /path/to/MCRT_container.sif python /path/to/visual.py
- Run directly:
conda activate MCRT
python /path/to/visual.py
This repo is built upon the codebase of the previous work MOFTransformer. Many thanks to its authors for the excellent codebase.