Multimodal Pretraining for Unsupervised Protein Representation Learning

Published in the journal Biology Methods & Protocols (Oxford University Press):

https://doi.org/10.1093/biomethods/bpae043

Preprint:

https://doi.org/10.1101/2023.11.29.569288

Contributors

Viet Thanh Duy Nguyen
Truong Son Hy (Corresponding author / PI)

Environment Setup

git clone https://github.com/HySonLab/Protein_Pretrain.git
cd Protein_Pretrain
conda env create -f environment.yml
conda activate MPRL

Data Downloading

For pretraining:
- Create the /pretrain/data/swissprot directory.
- Download the data from Swiss-Prot.
- Move the downloaded files into the /pretrain/data/swissprot directory and extract them.

For downstream tasks:
- Create a downstreamtasks/data/{dataset_name} directory for each dataset, where dataset_name is one of {Atom3D_MSP, DAVIS, KIBA, PDBbind, SCOPe1.75, D&D}.
- Download the data for each dataset: Atom3D_MSP, DAVIS, KIBA, PDBbind, SCOPe1.75, D&D.
- Move the downloaded files into the matching downstreamtasks/data/{dataset_name} directory and extract them.
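The directory layout described above could be created with a short script like the following. This is a minimal sketch, not part of the repository: the helper name make_data_dirs is hypothetical, and it assumes that both data paths are meant to be relative to the root of the cloned repository.

```python
from pathlib import Path

# Dataset names exactly as listed in this README.
DOWNSTREAM_DATASETS = ["Atom3D_MSP", "DAVIS", "KIBA", "PDBbind", "SCOPe1.75", "D&D"]


def make_data_dirs(repo_root):
    """Create the pretraining and downstream data directories under repo_root.

    Assumption: the paths in the README are relative to the repository root.
    Returns the list of directories, pretraining directory first.
    """
    root = Path(repo_root)
    targets = [root / "pretrain" / "data" / "swissprot"]
    targets += [root / "downstreamtasks" / "data" / name for name in DOWNSTREAM_DATASETS]
    for path in targets:
        # parents=True creates intermediate folders; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
    return targets


if __name__ == "__main__":
    for created in make_data_dirs("."):
        print(created)
```

Run it from the Protein_Pretrain directory, then place each downloaded archive in its folder and extract it there.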

Data Preprocessing

For pretraining, run the following command:

python ./pretrain/data/{model_name}.py

Replace {model_name} with the specific model identifier:

VGAE: Variational Graph Autoencoder
PAE: PointNet Autoencoder
Auto-Fusion

For downstream tasks, run the following command:

python ./downstreamtasks/data/{task_name}.py

Replace {task_name} with the specific task identifier:

PLA: Protein-ligand Binding Affinity
PFC: Protein Fold Classification
EI: Enzyme Identification
MSP: Mutation Stability Prediction
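To preprocess everything in one pass, the two command templates above can be expanded over all identifiers. The sketch below only builds and prints the commands. It assumes that each script file is named exactly after its identifier (for example, Auto-Fusion.py), which this README does not confirm, and the function name preprocessing_commands is hypothetical.

```python
import sys

# Identifiers as listed in this README; matching file names are an assumption.
PRETRAIN_MODELS = ["VGAE", "PAE", "Auto-Fusion"]
DOWNSTREAM_TASKS = ["PLA", "PFC", "EI", "MSP"]


def preprocessing_commands(python=sys.executable):
    """Return one argv list per preprocessing script, pretraining scripts first."""
    commands = [[python, f"./pretrain/data/{name}.py"] for name in PRETRAIN_MODELS]
    commands += [[python, f"./downstreamtasks/data/{name}.py"] for name in DOWNSTREAM_TASKS]
    return commands


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    for argv in preprocessing_commands():
        print(" ".join(argv))
```

Each returned list can be passed to subprocess.run(argv, check=True) from the repository root once the data has been downloaded and extracted.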

Pretraining

There are two ways to obtain the pretrained models: train them from scratch or download our pretrained model checkpoints.

Training from Scratch

If you prefer to train the model yourself, run the following command:

python ./pretrain/{model_name}.py --mode your_mode

Command-line Arguments
- --mode: Select the mode (train or test).

Using Pre-trained Model Checkpoints

To skip the training phase, download the pretrained model checkpoints from our shared drive (checkpoints) and use them directly in your projects.

Downstream Tasks

Run the following command:

python ./downstreamtasks/{task_name}.py --mode your_mode --modal your_modal

Command-line Arguments
- --modal: Select the modality (sequence, graph, point_cloud or multimodal).
- --mode: Select the mode (train or test).
- --test_dataset (Only available for PFC task): Select the test dataset for testing (test_family, test_fold, or test_superfamily).
- --dataset (Only available for PLA task): Select the dataset (DAVIS, KIBA or PDBBind).
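The argument rules above can be checked before launching a run. The following is a minimal sketch of a command builder that enforces them. It is not the repository's own argument parser, the function name downstream_command is hypothetical, and the accepted values are copied from the list above.

```python
# Allowed values, copied from the argument list in this README.
TASKS = {"PLA", "PFC", "EI", "MSP"}
MODES = {"train", "test"}
MODALS = {"sequence", "graph", "point_cloud", "multimodal"}
PFC_TEST_DATASETS = {"test_family", "test_fold", "test_superfamily"}
PLA_DATASETS = {"DAVIS", "KIBA", "PDBBind"}


def _check(value, allowed, flag):
    """Raise ValueError if value is not one of the allowed choices."""
    if value not in allowed:
        raise ValueError(f"{flag} must be one of {sorted(allowed)}, got {value!r}")


def downstream_command(task, mode, modal, test_dataset=None, dataset=None):
    """Build the argv list for one downstream run, validating each option."""
    _check(task, TASKS, "task")
    _check(mode, MODES, "--mode")
    _check(modal, MODALS, "--modal")
    argv = ["python", f"./downstreamtasks/{task}.py", "--mode", mode, "--modal", modal]
    if test_dataset is not None:
        # --test_dataset is documented as available only for the PFC task.
        if task != "PFC":
            raise ValueError("--test_dataset is only available for the PFC task")
        _check(test_dataset, PFC_TEST_DATASETS, "--test_dataset")
        argv += ["--test_dataset", test_dataset]
    if dataset is not None:
        # --dataset is documented as available only for the PLA task.
        if task != "PLA":
            raise ValueError("--dataset is only available for the PLA task")
        _check(dataset, PLA_DATASETS, "--dataset")
        argv += ["--dataset", dataset]
    return argv


if __name__ == "__main__":
    print(" ".join(downstream_command("PLA", "train", "multimodal", dataset="KIBA")))
```

Building the command as a list and rejecting unsupported combinations early avoids starting a long run with an option the selected task does not accept.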

Please cite our paper!

@article{10.1093/biomethods/bpae043,
    author = {Duy Nguyen, Viet Thanh and Son Hy, Truong},
    title = "{Multimodal pretraining for unsupervised protein representation learning}",
    journal = {Biology Methods and Protocols},
    pages = {bpae043},
    year = {2024},
    month = {06},
    abstract = "{Proteins are complex biomolecules essential for numerous biological processes, making them crucial targets for advancements in molecular biology, medical research, and drug design. Understanding their intricate, hierarchical structures and functions is vital for progress in these fields. To capture this complexity, we introduce MPRL—Multimodal Protein Representation Learning, a novel framework for symmetry-preserving multimodal pretraining that learns unified, unsupervised protein representations by integrating primary and tertiary structures. MPRL employs Evolutionary Scale Modeling (ESM-2) for sequence analysis, Variational Graph Auto-Encoders (VGAE) for residue-level graphs, and PointNet Autoencoder (PAE) for 3D point clouds of atoms, each designed to capture the spatial and evolutionary intricacies of proteins while preserving critical symmetries. By leveraging Auto-Fusion to synthesize joint representations from these pretrained models, MPRL ensures robust and comprehensive protein representations. Our extensive evaluation demonstrates that MPRL significantly enhances performance in various tasks such as protein-ligand binding affinity prediction, protein fold classification, enzyme activity identification, and mutation stability prediction. This framework advances the understanding of protein dynamics and facilitates future research in the field. Our source code is publicly available at https://github.com/HySonLab/Protein\_Pretrain.}",
    issn = {2396-8923},
    doi = {10.1093/biomethods/bpae043},
    url = {https://doi.org/10.1093/biomethods/bpae043},
    eprint = {https://academic.oup.com/biomethods/advance-article-pdf/doi/10.1093/biomethods/bpae043/58272243/bpae043.pdf},
}

@article{Nguyen2023.11.29.569288,
	author = {Viet Thanh Duy Nguyen and Truong Son Hy},
	title = {Multimodal Pretraining for Unsupervised Protein Representation Learning},
	elocation-id = {2023.11.29.569288},
	year = {2023},
	doi = {10.1101/2023.11.29.569288},
	publisher = {Cold Spring Harbor Laboratory},
	abstract = {In this paper, we introduce a framework of symmetry-preserving multimodal pretraining to learn a unified representation on proteins in an unsupervised manner that can take into account primary and tertiary structures. For each structure, we propose the corresponding pretraining method on sequence, graph and 3D point clouds based on large language models and generative models. We present a novel way to combining representations from multiple sources of information into a single global representation for proteins. We carefully analyze the performance of our framework in the pretraining tasks. For the fine-tuning tasks, our experiments have shown that our new multimodal representation can achieve competitive results in protein-ligand binding affinity prediction, protein fold classification, enzyme identification and mutation stability prediction. We expect that this work will accelerate future research in proteins. Our source code in PyTorch deep learning framework is publicly available at https://github.com/HySonLab/Protein_Pretrain. Competing Interest Statement: The authors have declared no competing interest.},
	URL = {https://www.biorxiv.org/content/early/2023/12/02/2023.11.29.569288},
	eprint = {https://www.biorxiv.org/content/early/2023/12/02/2023.11.29.569288.full.pdf},
	journal = {bioRxiv}
}
