Protein Stability Prediction with a Gaussian Network Model
(PSP-GNM)

About The Project

Protein stability prediction upon point mutation is a key biological problem. Single point mutations can alter protein function resulting in disease incidence. A very significant example is that of sickle cell anemia, wherein a single genomic mutation results in a single amino acid change and impairs the function of hemoglobin. It is therefore essential to develop methods that can predict the impact of point mutations on protein stability. More specifically, such methods should enable the estimation of free energy change (ddG) upon point mutation and be able to tell us whether a mutation reduces, increases or doesn't change the thermodynamic stability of proteins.

In this project, we introduce a novel approach to estimate the changes in free energy (ΔΔG) associated with point mutations. We refer to our approach as Protein Stability Prediction using Gaussian Network Model (PSP-GNM). For a given wildtype-mutant pair, PSP-GNM utilizes the Gaussian Network Model (GNM) to identify putative contacts and the order in which they are broken during simulated partial protein unfolding. We then use the knowledge of these broken contacts to estimate the ΔΔG.

Built With

Repository Contents

The repository includes 4 directories containing the relevant datasets and scripts for running PSP-GNM. It also includes a conda environment file (psp_gnm_env.yaml) that can be used to re-create the environment where the scripts can be run.

datasets/ - Contains the 3 benchmark datasets on which PSP-GNM was assessed.
predictions/ - Contains predictions made using different methods and PSP-GNM for different datasets
scripts/ - Includes the script: psp_gnm.py to run calculations
test_data/ - Includes test datasets to perform a test run of the 2 scripts

Getting Started

You will need access to a Linux terminal (or a MacOS, or even a Windows 10 having Windows Subsystem for Linux enabled and Ubuntu installed). The specific instructions below work best in a Linux (Ubuntu 18.02/Ubuntu20.04) platform.

To get a local copy up and execute the scripts, follow these simple steps:

Install miniconda:

wget https://repo.anaconda.com/miniconda/Miniconda3-py38_4.11.0-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh

Clone the repository with git clone https://github.com/sambitmishra0628/PSP-GNM
On a terminal window, go to the PSP-GNM directory
Create the conda environment using the provided yaml file: conda env create -f psp_gnm_env.yaml
Activate the environment: conda activate psp_gnm
Checkout the command line arguments needed to run psp_gnm: python scripts/psp_gnm.py --help

The following help manual should be displayed

Usage: psp_gnm.py [OPTIONS]

Options:
  --data_file TEXT    Name of the .csv file containing the information on ddG
                      for the mutants  [required]
  --outfile TEXT      Name of the file to which the PSP-GNM-calculated
                      energies and experimental energies will be written
                      [required]
  --outdir TEXT       Name of the directory to which the intermittent result
                      files will be written to  [required]
  --wt_pdb_dir TEXT   Directory containing the wild type atomic pdb files. For
                      a reverse mutant, the wildtype is the forward mutant.
                      [required]
  --num_jobs TEXT     Maximum number of jobs to be run in parallel  [required]
  --dist_cutoff TEXT  Distance cutoff for interactions in GNM  [default: 9;
                      required]
  --num_modes TEXT    Number of modes to be used  [default: 10; required]
  --rev_mut_pdb       Set this option if data_file includes reverse mutants
                      and all the reverse mutants have the corresponding pdb
                      files of the forward mutant in wt_pdb_dir.  [default:
                      False]
  --help              Show this message and exit.

Prerequisites

You will need access to a Linux machine (or a MacOS or Windows 10 with Windows Subsystem for Linux enabled). You will need to have conda installed.

Usage

Before running make sure that your current working directory is the PSP_GNM/scripts/ directory.

python psp_gnm.py --data_file ../test_data/S350_test_benchmark_run.csv --outdir ../S350_test_run_output --outfile ../S350_test_benchmark_run_out.csv --wt_pdb_dir ../test_data/pdb_test --num_jobs 4 --dist_cutoff 9 --num_modes 10

In the above:

num_jobs is the number of parallel jobs you intend to run. Ideally, it should be set to the number of cores in your machine (N) - 1.
The above run will create an output directory S350_test_run_output, where all the intermediate files containing information on the contacts broken during partial unfolding will be stored.
The data_file should have comma separated values and at least include the following columns (the column names should exactly match as given below)

Column Name	Expected value
PDB_CHAIN	The 4-lettered PDB ID + Chain ID (e.g., 1AJ3A, 1AONU) (Case-sensitive)
WILD_RES	The single amino acid alphabet of the wildtype residue in the PDB file (Case-sensitive)
RES_NUM_PDB	The PDB residue number for the mutation position
MUTANT_RES	The single amino acid alphabet of the variant/mutant residue (Case-sensitive)
Category	Should be one of Forward or Reverse (case-sensitive). If unsure, then use Forward and then use the value under column ddG_PSP_GNM_unscaled in the output file for your analysis.

In the above, it is expected that position RES_NUM_PDB in the PDB file has the residue given by WILD_RES.

The output file S350_test_benchmark_run_out.csv will include the calculated ddG. Note that this output file will include all the columns present in the data file. Additionally, it will have the columns corresponding to calculations made by PSP-GNM. Explanation of the different output columns are as follows.

Column Name	Explanation
RES_IND_SEQ	The serial index of the mutation position. Starts from 0
Calc_ddG	The raw calculated energy difference between wildtype and mutant
Calc_ddI	The raw calculated entropy difference between wildtype and mutant. The entropy difference is measured as the difference in mean-squared fluctuation in distance.
Calc_ddG_mean	The average calculated energy difference between wildtype and mutant. Averaged across all residues considered for calculations.
Calc_ddI_mean	The average calculated entropy difference between wildtype and mutant. Averaged across all residues considered for calculations.
Num_contacts	Total contacts broken during partial unfolding involving the mutation residue and considered for calculations. We suggest considering only those calculations having Num_contacts > 0.
Calc_Energy_scaled	The scaled values for Calc_ddG
Calc_Entropy_scaled	The scaled values for Calc_ddI
ddG_PSP_GNM_unscaled	The unscaled prediction for ddG that incorporates both energy and entropy changes
ddG_PSP_GNM_scaled	The scaled prediction for ddG that incorporates both energy and entropy changes. This value is meaningless if the user is not sure of the category of mutation. If that's the case, then use the unscaled values.

The output directory will include *_contact_breaks.csv files. These files include information on the contacts broken during the partial unfolding simulation. A separate contact_break file will be created for each mutant listed in the S350_test_benchmark_run.csv file and its corresponding wildtype. For example, the 2nd row of the S350_test_benchmark_run.csv includes the following mutant.

PDB_CHAIN	WILD_RES	RES_NUM_PDB	MUTANT_RES	Category
1AJ3A	I	23	A	Forward

In the above case, the PDB ID is 1AJ3, the chain is A, wildtype residue is I, mutant residue is A and the position of mutation (according to the PDB file) is 23. The resulting contact_break file created during the unfolding simulation of the mutant form is 1AJ3A_I23A_contact_breaks.csv and the wildtype is 1AJ3A_wt_contact_breaks.csv. The first few rows for 1AJ3A_wt_contact_breaks.csv are shown below.

PDB_ID	WT_or_Mut	Mutation_position	Contact_Position	Res_at_Mut_Position	Res_at_Contact_Pos	Energy_MJ	Int_dist_change	Contact_break_rank
1AJ3A	wt	17	88	K	L	-3.37	2.355	1
1AJ3A	wt	88	17	L	K	-3.37	2.355	1
1AJ3A	wt	46	89	L	K	-3.37	2.301	2
1AJ3A	wt	89	46	K	L	-3.37	2.301	2
1AJ3A	wt	39	92	H	A	-2.41	2.305	3
1AJ3A	wt	92	39	A	H	-2.41	2.305	3
1AJ3A	wt	39	93	H	A	-2.41	2.308	4
1AJ3A	wt	93	39	A	H	-2.41	2.308	4

Note that there are duplicate rows for the same broken contact (e.g., 17 and 88) in this file. However, PSP-GNM only utilizes information on the unique broken contacts for ΔΔG calculations. A newer version of the code will be released in the future to output only the unique broken contacts in the *_contact_breaks file.

An explanation of each column in the table above is as follows.

Column Name	Explanation
PDB_ID	Name of the 4-lettered PDB ID with chain
WT_of_Mut	Whether the simulation was carried out for the wildtype (wt) or the mutant form
Mutation_postion	This is the sequential index (not the PDB residue number) of the first residue in the contact pair
Contact_position	This is just the sequential index (not the PDB residue number) of the second residue in the contact pair
Res_at_Mut_Position	Residue present at Mutation_position
Res_at_Contact_Position	Residue present at the Contact_Position
Energy_MJ	Interaction energy between Res_at_Mut_Position and Res_at_Contact_Position obtained from the MJ potential
Int_dist_change	Mean-squared fluctuation in distance, which we use to estimate entropy, between the contact pair
Contact_break_rank	The rank/order based on when this contact was broken during the simulation

For example, in the first row, residues at position 17 (K) and 88 (L) were originally in contact. That is, the spatial distance between their C-alpha atoms in the native structure was <= 9 Angstroms. During the unfolding simulation, this was the first residue pair that was broken as it had the highest mean-squared fluctuation in distance. Therefore, we assign the residue pair a Contact_break_rank of 1.

Use cases

1. Predicting ddG for reverse mutants using the structure of the native wildtype

For PDB ID 1AJ3 and chain A (wildtype PDB), the wildtype residue at position 18 is ASP. Let us assume a theoretical mutant form of this protein that has PHE at position 18. The forward mutant then is ASP18 -> PHE18 and the reverse mutant is PHE18 -> ASP18. If you want to test the antisymmetric property (ΔΔG_forward mutant = -ΔΔG_reverse mutant) of PSP-GNM, then use the following instance of the --data_file to calculate ΔΔG of the forward mutant.

PDB_CHAIN	WILD_RES	RES_NUM_PDB	MUTANT_RES	Category
1AJ3A	D	18	F	Forward

You can then calculate the ΔΔG of the reverse mutant using the same PDB file having the following content in the --data_file

PDB_CHAIN	WILD_RES	RES_NUM_PDB	MUTANT_RES	Category
1AJ3A	F	18	D	Reverse

2. Predicting ddG for reverse mutants using the structure of the forward mutant For PDB ID 1AJ3 and chain A, the wildtype residue at position 18 is ASP. Let us assume a theoretical mutant form of this protein that has PHE at position 18. Let us say you have the PDB structure of the forward mutant i.e, position 18 in your PDB contains PHE. Now to predict ΔΔG for the reverse mutant use the following instance of `--data_file`.

PDB_CHAIN	WILD_RES	RES_NUM_PDB	MUTANT_RES	Category
1AJ3A	F	18	D	Reverse

You will also need to include the --rev_mut_pdb runtime argument while running PSP-GNM.

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Sambit Mishra - sambitmishra0628@gmail.com

Project Link: PSP-GNM

Citation

Mishra, S.K. PSP-GNM: Predicting Protein Stability Changes upon Point Mutations with a Gaussian Network Model. Int. J. Mol. Sci. 2022, 23, 10711. https://doi.org/10.3390/ijms231810711

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Protein Stability Prediction with a Gaussian Network Model
(PSP-GNM)

Table of Contents

About The Project

Built With

Repository Contents

Getting Started

Prerequisites

Usage

Use cases

License

Contact

Citation

About

Releases 1

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 85 Commits
datasets		datasets
predictions		predictions
scripts		scripts
test_data		test_data
Github_image.png		Github_image.png
LICENSE.txt		LICENSE.txt
README.md		README.md
psp_gnm_env.yaml		psp_gnm_env.yaml

License

sambitmishra0628/PSP-GNM

Folders and files

Latest commit

History

Repository files navigation

Protein Stability Prediction with a Gaussian Network Model (PSP-GNM)

Table of Contents

About The Project

Built With

Repository Contents

Getting Started

Prerequisites

Usage

Use cases

License

Contact

Citation

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Protein Stability Prediction with a Gaussian Network Model
(PSP-GNM)

Packages