Skip to content

Latest commit

 

History

History
244 lines (179 loc) · 14.2 KB

README.md

File metadata and controls

244 lines (179 loc) · 14.2 KB

Protein Stability Prediction with a Gaussian Network Model
(PSP-GNM)

PSP-GNM outline

Table of Contents

  1. About The Project
  2. Repository Contents
  3. Getting Started
  4. Usage
  5. Use cases
  6. License
  7. Contact
  8. Citation

About The Project

Protein stability prediction upon point mutation is a key biological problem. Single point mutations can alter protein function resulting in disease incidence. A very significant example is that of sickle cell anemia, wherein a single genomic mutation results in a single amino acid change and impairs the function of hemoglobin. It is therefore essential to develop methods that can predict the impact of point mutations on protein stability. More specifically, such methods should enable the estimation of free energy change (ddG) upon point mutation and be able to tell us whether a mutation reduces, increases or doesn't change the thermodynamic stability of proteins.

In this project, we introduce a novel approach to estimate the changes in free energy (ΔΔG) associated with point mutations. We refer to our approach as Protein Stability Prediction using Gaussian Network Model (PSP-GNM). For a given wildtype-mutant pair, PSP-GNM utilizes the Gaussian Network Model (GNM) to identify putative contacts and the order in which they are broken during simulated partial protein unfolding. We then use the knowledge of these broken contacts to estimate the ΔΔG.

Built With

Repository Contents

The repository includes 4 directories containing the relevant datasets and scripts for running PSP-GNM. It also includes a conda environment file (psp_gnm_env.yaml) that can be used to re-create the environment where the scripts can be run.

  1. datasets/ - Contains the 3 benchmark datasets on which PSP-GNM was assessed.
  2. predictions/ - Contains predictions made using different methods and PSP-GNM for different datasets
  3. scripts/ - Includes the script: psp_gnm.py to run calculations
  4. test_data/ - Includes test datasets to perform a test run of the 2 scripts

Getting Started

You will need access to a Linux terminal (or a MacOS, or even a Windows 10 having Windows Subsystem for Linux enabled and Ubuntu installed). The specific instructions below work best in a Linux (Ubuntu 18.02/Ubuntu20.04) platform.

To get a local copy up and execute the scripts, follow these simple steps:

  • Install miniconda:
wget https://repo.anaconda.com/miniconda/Miniconda3-py38_4.11.0-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh
  • Clone the repository with git clone https://github.com/sambitmishra0628/PSP-GNM
  • On a terminal window, go to the PSP-GNM directory
  • Create the conda environment using the provided yaml file: conda env create -f psp_gnm_env.yaml
  • Activate the environment: conda activate psp_gnm
  • Checkout the command line arguments needed to run psp_gnm: python scripts/psp_gnm.py --help

The following help manual should be displayed

Usage: psp_gnm.py [OPTIONS]

Options:
  --data_file TEXT    Name of the .csv file containing the information on ddG
                      for the mutants  [required]
  --outfile TEXT      Name of the file to which the PSP-GNM-calculated
                      energies and experimental energies will be written
                      [required]
  --outdir TEXT       Name of the directory to which the intermittent result
                      files will be written to  [required]
  --wt_pdb_dir TEXT   Directory containing the wild type atomic pdb files. For
                      a reverse mutant, the wildtype is the forward mutant.
                      [required]
  --num_jobs TEXT     Maximum number of jobs to be run in parallel  [required]
  --dist_cutoff TEXT  Distance cutoff for interactions in GNM  [default: 9;
                      required]
  --num_modes TEXT    Number of modes to be used  [default: 10; required]
  --rev_mut_pdb       Set this option if data_file includes reverse mutants
                      and all the reverse mutants have the corresponding pdb
                      files of the forward mutant in wt_pdb_dir.  [default:
                      False]
  --help              Show this message and exit.

Prerequisites

You will need access to a Linux machine (or a MacOS or Windows 10 with Windows Subsystem for Linux enabled). You will need to have conda installed.

Usage

Before running make sure that your current working directory is the PSP_GNM/scripts/ directory.

python psp_gnm.py --data_file ../test_data/S350_test_benchmark_run.csv --outdir ../S350_test_run_output --outfile ../S350_test_benchmark_run_out.csv --wt_pdb_dir ../test_data/pdb_test --num_jobs 4 --dist_cutoff 9 --num_modes 10

In the above:

  • num_jobs is the number of parallel jobs you intend to run. Ideally, it should be set to the number of cores in your machine (N) - 1.

  • The above run will create an output directory S350_test_run_output, where all the intermediate files containing information on the contacts broken during partial unfolding will be stored.

  • The data_file should have comma separated values and at least include the following columns (the column names should exactly match as given below)

Column Name Expected value
PDB_CHAIN The 4-lettered PDB ID + Chain ID (e.g., 1AJ3A, 1AONU) (Case-sensitive)
WILD_RES The single amino acid alphabet of the wildtype residue in the PDB file (Case-sensitive)
RES_NUM_PDB The PDB residue number for the mutation position
MUTANT_RES The single amino acid alphabet of the variant/mutant residue (Case-sensitive)
Category Should be one of Forward or Reverse (case-sensitive). If unsure, then use Forward and then use the value under column ddG_PSP_GNM_unscaled in the output file for your analysis.

In the above, it is expected that position RES_NUM_PDB in the PDB file has the residue given by WILD_RES.

  • The output file S350_test_benchmark_run_out.csv will include the calculated ddG. Note that this output file will include all the columns present in the data file. Additionally, it will have the columns corresponding to calculations made by PSP-GNM. Explanation of the different output columns are as follows.
Column Name Explanation
RES_IND_SEQ The serial index of the mutation position. Starts from 0
Calc_ddG The raw calculated energy difference between wildtype and mutant
Calc_ddI The raw calculated entropy difference between wildtype and mutant. The entropy difference is measured as the difference in mean-squared fluctuation in distance.
Calc_ddG_mean The average calculated energy difference between wildtype and mutant. Averaged across all residues considered for calculations.
Calc_ddI_mean The average calculated entropy difference between wildtype and mutant. Averaged across all residues considered for calculations.
Num_contacts Total contacts broken during partial unfolding involving the mutation residue and considered for calculations. We suggest considering only those calculations having Num_contacts > 0.
Calc_Energy_scaled The scaled values for Calc_ddG
Calc_Entropy_scaled The scaled values for Calc_ddI
ddG_PSP_GNM_unscaled The unscaled prediction for ddG that incorporates both energy and entropy changes
ddG_PSP_GNM_scaled The scaled prediction for ddG that incorporates both energy and entropy changes. This value is meaningless if the user is not sure of the category of mutation. If that's the case, then use the unscaled values.
  • The output directory will include *_contact_breaks.csv files. These files include information on the contacts broken during the partial unfolding simulation. A separate contact_break file will be created for each mutant listed in the S350_test_benchmark_run.csv file and its corresponding wildtype. For example, the 2nd row of the S350_test_benchmark_run.csv includes the following mutant.
PDB_CHAIN WILD_RES RES_NUM_PDB MUTANT_RES Category
1AJ3A I 23 A Forward

In the above case, the PDB ID is 1AJ3, the chain is A, wildtype residue is I, mutant residue is A and the position of mutation (according to the PDB file) is 23. The resulting contact_break file created during the unfolding simulation of the mutant form is 1AJ3A_I23A_contact_breaks.csv and the wildtype is 1AJ3A_wt_contact_breaks.csv. The first few rows for 1AJ3A_wt_contact_breaks.csv are shown below.

PDB_ID WT_or_Mut Mutation_position Contact_Position Res_at_Mut_Position Res_at_Contact_Pos Energy_MJ Int_dist_change Contact_break_rank
1AJ3A wt 17 88 K L -3.37 2.355 1
1AJ3A wt 88 17 L K -3.37 2.355 1
1AJ3A wt 46 89 L K -3.37 2.301 2
1AJ3A wt 89 46 K L -3.37 2.301 2
1AJ3A wt 39 92 H A -2.41 2.305 3
1AJ3A wt 92 39 A H -2.41 2.305 3
1AJ3A wt 39 93 H A -2.41 2.308 4
1AJ3A wt 93 39 A H -2.41 2.308 4

Note that there are duplicate rows for the same broken contact (e.g., 17 and 88) in this file. However, PSP-GNM only utilizes information on the unique broken contacts for ΔΔG calculations. A newer version of the code will be released in the future to output only the unique broken contacts in the *_contact_breaks file.

An explanation of each column in the table above is as follows.

Column Name Explanation
PDB_ID Name of the 4-lettered PDB ID with chain
WT_of_Mut Whether the simulation was carried out for the wildtype (wt) or the mutant form
Mutation_postion This is the sequential index (not the PDB residue number) of the first residue in the contact pair
Contact_position This is just the sequential index (not the PDB residue number) of the second residue in the contact pair
Res_at_Mut_Position Residue present at Mutation_position
Res_at_Contact_Position Residue present at the Contact_Position
Energy_MJ Interaction energy between Res_at_Mut_Position and Res_at_Contact_Position obtained from the MJ potential
Int_dist_change Mean-squared fluctuation in distance, which we use to estimate entropy, between the contact pair
Contact_break_rank The rank/order based on when this contact was broken during the simulation

For example, in the first row, residues at position 17 (K) and 88 (L) were originally in contact. That is, the spatial distance between their C-alpha atoms in the native structure was <= 9 Angstroms. During the unfolding simulation, this was the first residue pair that was broken as it had the highest mean-squared fluctuation in distance. Therefore, we assign the residue pair a Contact_break_rank of 1.

Use cases

1. Predicting ddG for reverse mutants using the structure of the native wildtype

For PDB ID 1AJ3 and chain A (wildtype PDB), the wildtype residue at position 18 is ASP. Let us assume a theoretical mutant form of this protein that has PHE at position 18. The forward mutant then is ASP18 -> PHE18 and the reverse mutant is PHE18 -> ASP18. If you want to test the antisymmetric property (ΔΔG_forward mutant = -ΔΔG_reverse mutant) of PSP-GNM, then use the following instance of the --data_file to calculate ΔΔG of the forward mutant.

PDB_CHAIN WILD_RES RES_NUM_PDB MUTANT_RES Category
1AJ3A D 18 F Forward

You can then calculate the ΔΔG of the reverse mutant using the same PDB file having the following content in the --data_file

PDB_CHAIN WILD_RES RES_NUM_PDB MUTANT_RES Category
1AJ3A F 18 D Reverse

2. Predicting ddG for reverse mutants using the structure of the forward mutant For PDB ID 1AJ3 and chain A, the wildtype residue at position 18 is ASP. Let us assume a theoretical mutant form of this protein that has PHE at position 18. Let us say you have the PDB structure of the forward mutant i.e, position 18 in your PDB contains PHE. Now to predict ΔΔG for the reverse mutant use the following instance of `--data_file`.
PDB_CHAIN WILD_RES RES_NUM_PDB MUTANT_RES Category
1AJ3A F 18 D Reverse

You will also need to include the --rev_mut_pdb runtime argument while running PSP-GNM.

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Sambit Mishra - sambitmishra0628@gmail.com

Project Link: PSP-GNM

Citation

Mishra, S.K. PSP-GNM: Predicting Protein Stability Changes upon Point Mutations with a Gaussian Network Model. Int. J. Mol. Sci. 2022, 23, 10711. https://doi.org/10.3390/ijms231810711