Published in Biology Methods & Protocols (Oxford University Press):
https://doi.org/10.1093/biomethods/bpae043
Preprint:
https://doi.org/10.1101/2023.11.29.569288
- Viet Thanh Duy Nguyen
- Truong Son Hy (Corresponding author / PI)
git clone https://github.com/HySonLab/Protein_Pretrain.git
cd Protein_Pretrain
conda env create -f environment.yml
conda activate MPRL
- For pretraining:
  - Create the `/pretrain/data/swissprot` directory.
  - Download the data from Swiss-Prot.
  - Move the downloaded files into the `/pretrain/data/swissprot` directory and extract them.
- Create
- For downstream tasks:
- For pretraining, run the following command:
python ./pretrain/data/{model_name}.py
Replace {model_name} with the specific model identifier:
- VGAE: Variational Graph Autoencoder
- PAE: PointNet Autoencoder
- Auto-Fusion
- For downstream tasks, run the following command:
python ./downstreamtasks/data/{task_name}.py
Replace {task_name} with the specific task identifier:
- PLA: Protein-ligand Binding Affinity
- PFC: Protein Fold Classification
- EI: Enzyme Identification
- MSP: Mutation Stability Prediction
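The two preprocessing commands above differ only in the folder and the identifier, so they can be generated programmatically. The following is a minimal sketch, not part of the repository: `preprocessing_command` is a hypothetical helper, and it assumes that each identifier listed above matches a script filename exactly and that the commands are run from the repository root.

```python
import sys
from pathlib import Path

# Identifiers as listed in this README. Assumption: each one is also the
# exact filename (without ".py") of the corresponding preprocessing script.
PRETRAIN_MODELS = ("VGAE", "PAE", "Auto-Fusion")
DOWNSTREAM_TASKS = ("PLA", "PFC", "EI", "MSP")


def preprocessing_command(kind, name, python=sys.executable):
    """Return the argv list for one preprocessing script (hypothetical helper)."""
    if kind == "pretrain":
        allowed, folder = PRETRAIN_MODELS, Path("pretrain") / "data"
    elif kind == "downstream":
        allowed, folder = DOWNSTREAM_TASKS, Path("downstreamtasks") / "data"
    else:
        raise ValueError(f"unknown kind: {kind!r}")
    if name not in allowed:
        raise ValueError(f"{name!r} is not one of {allowed}")
    # as_posix() keeps forward slashes so the path matches the README commands.
    return [python, (folder / f"{name}.py").as_posix()]


if __name__ == "__main__":
    # Print the preprocessing command for every downstream task.
    for task in DOWNSTREAM_TASKS:
        print(" ".join(preprocessing_command("downstream", task, python="python")))
```

To execute a command rather than print it, the returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.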
There are two ways to obtain the pretrained models: train them from scratch or use our pretrained model checkpoints.
If you prefer to train the model yourself, run the following command:
python ./pretrain/{model_name}.py --mode your_mode
- Command-line arguments:
  - `--mode`: Select the mode (`train` or `test`).
For those who want to bypass the training phase, we provide pretrained model checkpoints that you can use directly in your projects. Download and integrate the pretrained model checkpoints from our shared drive: checkpoints
- Run the following command:
python ./downstreamtasks/{task_name}.py --mode your_mode --modal your_modal
- Command-line arguments:
  - `--modal`: Select the modality (`sequence`, `graph`, `point_cloud`, or `multimodal`).
  - `--mode`: Select the mode (`train` or `test`).
  - `--test_dataset` (only available for the PFC task): Select the test dataset (`test_family`, `test_fold`, or `test_superfamily`).
  - `--dataset` (only available for the PLA task): Select the dataset (`DAVIS`, `KIBA`, or `PDBBind`).
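The argument rules above, including the two task-specific flags, can be illustrated with a small `argparse` sketch. This is a hypothetical reconstruction based only on the documented flags and values; the repository's scripts define their own parsers, which may differ in defaults and validation. `build_parser` is an illustrative name, not a function from the code base.

```python
import argparse


def build_parser(task):
    """Build a parser that mirrors the documented flags for one task.

    Hypothetical sketch: flag names and choices come from this README;
    everything else (no defaults, per-task registration) is an assumption.
    """
    parser = argparse.ArgumentParser(prog=f"{task}.py")
    parser.add_argument("--mode", choices=["train", "test"])
    parser.add_argument(
        "--modal",
        choices=["sequence", "graph", "point_cloud", "multimodal"],
    )
    if task == "PFC":
        # Documented as available only for the PFC task.
        parser.add_argument(
            "--test_dataset",
            choices=["test_family", "test_fold", "test_superfamily"],
        )
    if task == "PLA":
        # Documented as available only for the PLA task.
        parser.add_argument("--dataset", choices=["DAVIS", "KIBA", "PDBBind"])
    return parser


if __name__ == "__main__":
    args = build_parser("PLA").parse_args(
        ["--mode", "test", "--modal", "graph", "--dataset", "KIBA"]
    )
    print(args.mode, args.modal, args.dataset)
```

Registering the task-specific flags conditionally means that passing `--dataset` to a non-PLA task, or `--test_dataset` to a non-PFC task, is rejected as an unknown argument in this sketch.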
@article{10.1093/biomethods/bpae043,
author = {Duy Nguyen, Viet Thanh and Son Hy, Truong},
title = "{Multimodal pretraining for unsupervised protein representation learning}",
journal = {Biology Methods and Protocols},
pages = {bpae043},
year = {2024},
month = {06},
abstract = "{Proteins are complex biomolecules essential for numerous biological processes, making them crucial targets for advancements in molecular biology, medical research, and drug design. Understanding their intricate, hierarchical structures and functions is vital for progress in these fields. To capture this complexity, we introduce MPRL—Multimodal Protein Representation Learning, a novel framework for symmetry-preserving multimodal pretraining that learns unified, unsupervised protein representations by integrating primary and tertiary structures. MPRL employs Evolutionary Scale Modeling (ESM-2) for sequence analysis, Variational Graph Auto-Encoders (VGAE) for residue-level graphs, and PointNet Autoencoder (PAE) for 3D point clouds of atoms, each designed to capture the spatial and evolutionary intricacies of proteins while preserving critical symmetries. By leveraging Auto-Fusion to synthesize joint representations from these pretrained models, MPRL ensures robust and comprehensive protein representations. Our extensive evaluation demonstrates that MPRL significantly enhances performance in various tasks such as protein-ligand binding affinity prediction, protein fold classification, enzyme activity identification, and mutation stability prediction. This framework advances the understanding of protein dynamics and facilitates future research in the field. Our source code is publicly available at https://github.com/HySonLab/Protein\_Pretrain.}",
issn = {2396-8923},
doi = {10.1093/biomethods/bpae043},
url = {https://doi.org/10.1093/biomethods/bpae043},
eprint = {https://academic.oup.com/biomethods/advance-article-pdf/doi/10.1093/biomethods/bpae043/58272243/bpae043.pdf},
}
@article{Nguyen2023.11.29.569288,
author = {Viet Thanh Duy Nguyen and Truong Son Hy},
title = {Multimodal Pretraining for Unsupervised Protein Representation Learning},
elocation-id = {2023.11.29.569288},
year = {2023},
doi = {10.1101/2023.11.29.569288},
publisher = {Cold Spring Harbor Laboratory},
abstract = {In this paper, we introduce a framework of symmetry-preserving multimodal pretraining to learn a unified representation on proteins in an unsupervised manner that can take into account primary and tertiary structures. For each structure, we propose the corresponding pretraining method on sequence, graph and 3D point clouds based on large language models and generative models. We present a novel way to combining representations from multiple sources of information into a single global representation for proteins. We carefully analyze the performance of our framework in the pretraining tasks. For the fine-tuning tasks, our experiments have shown that our new multimodal representation can achieve competitive results in protein-ligand binding affinity prediction, protein fold classification, enzyme identification and mutation stability prediction. We expect that this work will accelerate future research in proteins. Our source code in PyTorch deep learning framework is publicly available at https://github.com/HySonLab/Protein_Pretrain. Competing Interest Statement: The authors have declared no competing interest.},
URL = {https://www.biorxiv.org/content/early/2023/12/02/2023.11.29.569288},
eprint = {https://www.biorxiv.org/content/early/2023/12/02/2023.11.29.569288.full.pdf},
journal = {bioRxiv}
}