
Molecular Crystal Representation from Transformer

fmggggg/MCRT

🔥 MCRT: A Universal Foundation Model for Transfer Learning in Molecular Crystals 🚀

This repository hosts Molecular Crystal Representation from Transformers (MCRT), a transformer-based model designed for property prediction of molecular crystals. Pre-trained on over 700,000 experimental structures from the Cambridge Crystallographic Data Centre (CCDC), MCRT extracts both local and global representations of crystals using multi-modal features, achieving state-of-the-art performance on various property prediction tasks with minimal fine-tuning. Explore this repository to accelerate your research in molecular crystal discovery and functionality prediction.

Option 1: Via Apptainer (easier and faster) 🚀

Using Apptainer (install from here) is the easier route, because it avoids unexpected errors while setting up the environment. You only need to install Apptainer; we provide pre-defined images here.

Option 2: Directly install

  1. Create a Conda environment with Python 3.8:
conda create -y -n MCRT python=3.8
conda activate MCRT
  2. Install PyTorch 2.1.1 (from here) and DGL (from here) matching your CUDA version and OS.
  3. Install the other packages and MCRT:
cd /path/to/MCRT
pip install -r requirements.txt
pip install MCRT-tools
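
After a direct install, a quick check that the key packages are importable can save a failed run later. The sketch below is not part of MCRT; it only assumes the usual import names `torch` and `dgl`, plus `MCRT` as used by the scripts in this README, and uses the standard library to look them up without importing them.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python", ".".join(map(str, sys.version_info[:3])))
    # "torch" and "dgl" are the import names of PyTorch and DGL;
    # "MCRT" is the import name used by the example scripts in this README.
    missing = missing_packages(["torch", "dgl", "MCRT"])
    print("missing:", missing if missing else "none")
```

Run it inside the activated MCRT environment; anything listed as missing needs to be installed before finetuning.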

Prepare persistence images

The persistence images are generated with an adapted version of moleculetda. There are again two ways to install it.

Option 1: Via Apptainer (easier and faster) 🚀

Download the pre-defined image here. Usage:

apptainer exec /path/to/moleculetda_container.sif python3 /path/to/cif_to_image.py --cif_path /path/to/cif_path --paral 16

Option 2: Directly install

Create a Conda environment with Python 3.11:

conda create -y -n persistent python=3.11
conda activate persistent
pip install moleculetda tqdm numpy

Usage:

conda activate persistent
python /path/to/cif_to_image.py --cif_path ../cifs/your_cif_folder --paral 16

You can parallelize the generation by setting --paral.

Prepare pickles (optional)

Pickles contain the pre-calculated positional embedding matrix and the pre-training labels. This step is now optional for finetuning, because graphs are generated in real time during finetuning. For pretraining, however, the task labels are time-consuming to generate, so they should be pre-computed like this:

conda activate MCRT
python /path/to/cif_to_dataset.py --cif_path /path/to/cif_path --paral 16 --type pretrain 

You can also generate pickles for finetuning, which may be slightly faster than generating the graphs in real time. Whether it helps depends on your CPU and GPU: generation runs on the CPU, so if your CPU is fast or your GPU is slow, there is no difference, because the bottleneck is then model training on the GPU. To generate pickles for finetuning:

conda activate MCRT
python /path/to/cif_to_dataset.py --cif_path /path/to/cif_path --paral 16 --type finetune 

Dataset split

The dataset split is defined by a json file named dataset_split.json:

{
  "train": ["SUYYIV","UYUGED"],
  "val": ["GASVUR","IHOZAH"],
  "test": ["LUMSER","DUGXUY"]
}

You can write this file yourself or generate it with the provided split_dataset.py:

python /path/to/split_dataset.py --cif /path/to/cif_path --split 0.8 0.1 0.1
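
If you prefer to write dataset_split.json yourself, the following is a minimal sketch of a random split. It is not the provided split_dataset.py and may differ from it; it assumes that each ID is the CIF file name without the `.cif` extension, and the function names `make_split` and `write_split` are made up for this example.

```python
import json
import random
from pathlib import Path


def make_split(cif_dir, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle CIF IDs (file names without .cif) and cut them into train/val/test."""
    ids = sorted(p.stem for p in Path(cif_dir).glob("*.cif"))
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],  # the remainder goes to test
    }


def write_split(cif_dir, out_path, **kwargs):
    """Write the split as JSON; json.dump never emits trailing commas."""
    split = make_split(cif_dir, **kwargs)
    with open(out_path, "w") as f:
        json.dump(split, f, indent=2)
    return split
```

For example, `write_split("your_dataset/cifs", "your_dataset/dataset_split.json")` would write the file next to the cifs folder.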

Dataset structure

Once the generation steps above are finished, make sure the dataset is organized like this:

your_dataset/
├── cifs/ containing cif files
├── imgs/ containing persistence images
├── pickles/ (optional for finetuning) containing pickles
├── dataset_split.json
└── downstream.csv

You can download the pre-trained MCRT and the finetuned models from the paper via figshare here.

import MCRT
import os

__root_dir__ = os.path.dirname(__file__)
root_dataset = os.path.join(__root_dir__,"cifs","your_dataset")
log_dir = './logs/your_dataset_logs'
downstream = "downstream" # name of downstream.csv

loss_names = {"classification": 0,"regression": 1,} # for regression
max_epochs = 50 # training epochs
batch_size = 32  # desired batch size; for gradient accumulation
per_gpu_batchsize = 8 # batch size per step
num_workers = 12 # num of CPU workers
mean = 0 # mean value of your dataset
std = 1 # standard deviation of your dataset

test_to_csv = True # if True, save test set results
load_path  = "/path/to/pretrained.ckpt" 

if __name__ == '__main__':
    MCRT.run(root_dataset, downstream, log_dir=log_dir,
             max_epochs=max_epochs,
             loss_names=loss_names,
             batch_size=batch_size,
             per_gpu_batchsize=per_gpu_batchsize,
             num_workers=num_workers,
             load_path=load_path,
             test_to_csv=test_to_csv,
             mean=mean, std=std)
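
The `mean` and `std` values in the script above are the statistics of your target property. One way to obtain them is sketched below. The column layout of downstream.csv is not described here, so the column names are passed in as arguments and are purely hypothetical; restricting the statistics to the training IDs is also a choice made for this example, not something the README prescribes.

```python
import csv
import statistics


def target_stats(csv_path, id_column, target_column, train_ids):
    """Mean and population std of `target_column` over rows whose ID is in `train_ids`."""
    train_ids = set(train_ids)
    with open(csv_path, newline="") as f:
        values = [
            float(row[target_column])
            for row in csv.DictReader(f)  # expects a header row
            if row[id_column] in train_ids
        ]
    return statistics.fmean(values), statistics.pstdev(values)
```

The returned pair can then be assigned to `mean` and `std` before calling `MCRT.run`.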

Usage: save the script above as finetune.py and run it:

  1. With Apptainer:
apptainer exec /path/to/MCRT_container.sif python /path/to/finetune.py
  2. Directly run:
conda activate MCRT
python /path/to/finetune.py
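
The comments in finetune.py describe `batch_size` as the desired batch size reached through gradient accumulation and `per_gpu_batchsize` as the batch size per step. The arithmetic below illustrates the usual relationship between the two. It is an assumption about how such settings are commonly combined, not a statement of how MCRT computes it internally, and `accumulation_steps` is a hypothetical helper.

```python
def accumulation_steps(batch_size, per_gpu_batchsize, num_gpus=1):
    """Number of forward/backward steps accumulated before one optimizer update."""
    per_step = per_gpu_batchsize * num_gpus  # samples processed in one step
    return max(batch_size // per_step, 1)    # never fewer than one step


# With the values from finetune.py on a single GPU:
print(accumulation_steps(32, 8))  # 4
```

If you run out of GPU memory, lowering `per_gpu_batchsize` while keeping `batch_size` fixed keeps the effective batch size the same under this assumption.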

To test a finetuned model, set test_only to True; also set test_to_csv to True if you want to save the test results:

import MCRT
import os

__root_dir__ = os.path.dirname(__file__)
root_dataset = os.path.join(__root_dir__,"cifs","your_dataset")
log_dir = './logs/your_dataset_logs'
downstream = "downstream" # name of downstream.csv

loss_names = {"classification": 0,"regression": 1,} # for regression
max_epochs = 50 # training epochs
batch_size = 32  # desired batch size; for gradient accumulation
per_gpu_batchsize = 8 # batch size per step
num_workers = 12 # num of CPU workers
mean = 0
std = 1

test_only=True # test the model
test_to_csv = True # if True, save test set results
load_path  = "/path/to/finetuned.ckpt" 

if __name__ == '__main__':
    MCRT.run(root_dataset, downstream, log_dir=log_dir,
             max_epochs=max_epochs,
             loss_names=loss_names,
             batch_size=batch_size,
             per_gpu_batchsize=per_gpu_batchsize,
             num_workers=num_workers,
             load_path=load_path,
             test_only=test_only,
             test_to_csv=test_to_csv,
             mean=mean, std=std)

Usage: save the script above as test_model.py and run it:

  1. With Apptainer:
apptainer exec /path/to/MCRT_container.sif python /path/to/test_model.py
  2. Directly run:
conda activate MCRT
python /path/to/test_model.py

MCRT takes an atomic graph (local) and persistence image patches (global) as input. The model structure is shown below:

[Figure: MCRT model structure]

The attention score on each atom and patch can be visualized as follows:

from MCRT.visualize import PatchVisualizer
import os
__root_dir__ = os.path.dirname(__file__)
model_path = "path/to/finetuned model"
data_path = "path/to/dataset containing the crystal" # have to prepare pickles
cifname = 'crystal name' # make sure it's in the test split, and its pickle exists

vis = PatchVisualizer.from_cifname(cifname, model_path, data_path,save_heatmap=True)
vis.draw_graph()
vis.draw_image_1d(top_n=10)
vis.draw_image_2d(top_n=10)

Usage: save the script above as visual.py and run it:

  1. With Apptainer:
apptainer exec /path/to/MCRT_container.sif python /path/to/visual.py
  2. Directly run:
conda activate MCRT
python /path/to/visual.py
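
The comments in visual.py require the crystal to be in the test split. A small pre-check such as the one below can confirm that before running the visualizer. It assumes `data_path` is a dataset root that contains dataset_split.json in the format shown earlier; `in_test_split` is a hypothetical helper and it does not verify that the pickle exists.

```python
import json
from pathlib import Path


def in_test_split(data_path, cifname):
    """True if `cifname` is listed under "test" in <data_path>/dataset_split.json."""
    split = json.loads((Path(data_path) / "dataset_split.json").read_text())
    return cifname in split.get("test", [])
```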

[Figures: atomic attention; 1D persistence image attention; 2D persistence image attention]

This repo is built upon the codebase of the previous work MOFTransformer. Many thanks to its authors for the excellent codebase.
