Create a combined score with only human evidence #2

Open
dhimmel opened this issue Apr 13, 2020 · 4 comments

@dhimmel
Member

dhimmel commented Apr 13, 2020

For some applications, it might be nice to avoid any scores transferred from other species. Will look into creating a human-evidence-only combined score.

See this notebook for stats on the score distribution for each channel for human proteins. Note that the neighborhood and database_transferred channels are all zero for human proteins.

@dhimmel
Member Author

dhimmel commented Apr 13, 2020

Background quotes

Quotes from

STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets
Damian Szklarczyk, Annika L Gable, David Lyon, Alexander Junge, Stefan Wyder, Jaime Huerta-Cepas, Milan Simonovic, Nadezhda T Doncheva, John H Morris, Peer Bork, … Christian von Mering
Nucleic Acids Research (2018-11-22) https://doi.org/gfz2jr
DOI: 10.1093/nar/gky1131 · PMID: 30476243 · PMCID: PMC6323986

Within each channel, the evidence is further subdivided into two sub-scores, one of which represents evidence stemming from the organism itself, and the other represents evidence transferred from other organisms. For the latter transfer, the 'interolog' concept is applied (42,43); STRING uses hierarchically arranged orthologous group relations as defined in eggNOG (32), in order to transfer associations between organisms where applicable (described in (29)).

Quotes from https://string-db.org/help/faq/

The combined score is computed by combining the probabilities from the different evidence channels and corrected for the probability of randomly observing an interaction. For a more detailed description please see von Mering, et al. Nucleic Acids Res. 2005

From von Mering, et al. Nucleic Acids Res. 2005:

After assignment of association scores and transfer between species, we compute a final ‘combined score’ between any pair of proteins (or pair of COGs). This score is often higher than the individual sub-scores, expressing increased confidence when an association is supported by several types of evidence ( Table 1 ). It is computed under the assumption of independence for the various sources, in a naïve Bayesian fashion. It is thus a simple expression of the individual scores:
$$S = 1 - \prod_i (1 - S_i)$$

Also see the python script combine_subscores.py.
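A minimal sketch of the combination described in the quotes above, assuming the usual naive Bayesian form `1 - prod(1 - S_i)` and a prior correction applied per channel. The function names and the prior value of 0.041 are assumptions for illustration, not taken from this thread; check them against combine_subscores.py before relying on them. The sketch also omits the homology correction mentioned below.

```python
PRIOR = 0.041  # assumed value; verify against combine_subscores.py


def remove_prior(score, prior=PRIOR):
    """Strip the prior from one channel score (a probability in [0, 1])."""
    score = max(score, prior)
    return (score - prior) / (1.0 - prior)


def combine_scores(scores, prior=PRIOR):
    """Combine channel probabilities as 1 - prod(1 - s_i), adding the prior back once."""
    one_minus = 1.0
    for s in scores:
        one_minus *= 1.0 - remove_prior(s, prior)
    return (1.0 - one_minus) * (1.0 - prior) + prior
```

STRING download files store scores as integers from 0 to 1000, so divide by 1000 before calling `combine_scores`. A human-evidence-only score would pass only the non-transferred sub-scores.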

From FAQ "How to retrieve only the direct evidence in human, not transferred":

You need the file: protein.links.full.txt.gz, from which you can retrieve the columns like above and write it to a file.

zgrep ^"9606\." protein.links.full.txt.gz  | awk '($16 > 700) { print $1, $2, $3, $5, $6, $7, $8, $10, $12, $14, $16 }' > PPI_700_human.txt
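The same filter can be sketched in Python by selecting columns by name rather than by position. This assumes the file is space-delimited with a header row containing `protein1`, `combined_score`, and `*_transferred` columns; `direct_evidence_rows` is a hypothetical helper, not part of STRING.

```python
import csv


def direct_evidence_rows(lines, taxon="9606", min_combined=700):
    """Yield rows for one taxon above a score threshold, dropping *_transferred columns."""
    reader = csv.DictReader(lines, delimiter=" ")
    for row in reader:
        # Keep only proteins of the requested taxon (9606 is human).
        if not row["protein1"].startswith(taxon + "."):
            continue
        # Same threshold as the awk command: combined_score > 700.
        if int(row["combined_score"]) <= min_combined:
            continue
        yield {k: v for k, v in row.items() if not k.endswith("_transferred")}
```

To run it on the real download, pass a text-mode handle such as `gzip.open("protein.links.full.txt.gz", "rt")`. As in the awk command, the threshold is applied to the original `combined_score` column, which still includes transferred evidence.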

Homology correction described in 2008 blog post:

In order to avoid that gene duplications lead spurious functional associations, homologous proteins are down-weighed in the co-occurrence and text-mining channels.

@dhimmel
Member Author

dhimmel commented Apr 14, 2020

Study that addresses why one might want to exclude transferred interactions:

  1. What Evidence Is There for the Homology of Protein-Protein Interactions?
    Anna C. F. Lewis, Nick S. Jones, Mason A. Porter, Charlotte M. Deane
    PLoS Computational Biology (2012-09-20) https://doi.org/ggh6zz
    DOI: 10.1371/journal.pcbi.1002645 · PMID: 23028270 · PMCID: PMC3447968

Main takeaway:

Our results imply that, unless using strict definitions of homology, interactions rewire at a rate too fast to allow reliable transfer across species.

@dhimmel
Member Author

dhimmel commented Apr 14, 2020

Another question is whether to include the "genomic context prediction" channels. From the v11.0 paper:

The three genomic context prediction channels (neighborhood, fusion, gene co-occurrence) are the result of systematic all-against-all genome comparisons, aiming to assess the consequences of past genome rearrangements, gene gains and losses, as well as gene fusion events. These evolutionary events are known to be retained non-randomly with respect to the functional roles of genes, and thus allow the inference of functional associations between genes even for otherwise rarely studied organisms (genomic context techniques are reviewed in (44,45)).

dhimmel added a commit that referenced this issue Apr 14, 2020
@dhimmel
Member Author

dhimmel commented Apr 14, 2020

See the 05.combine-subscores.ipynb notebook. It does look like excluding non-human evidence channels will cause a widespread drop in scores. Whether this is beneficial for a given application is another matter.
