GitHub - panyuwen/mGSEA: modified gene set enrichment analysis (with a fuzzy ranked gene list)

mGSEA (modified gene set enrichment analysis)

To perform GSEA with a fuzzy ranked gene list

If you use this script in your research work, please cite at least one of the following paper(s):

Genomic diversity and post-admixture adaptation in the Uyghurs (National Science Review, 2021)

Dependencies:

python2.7
argparse
numpy
pandas
concurrent

Usage:

python mGSEA.py -h

python mGSEA.py \
	--qrank genelist.rnk \
	--gset kegg.gmt \
	--qsize 0.005 \
	--perm 2000 \
	--threads 50 \
	--out output

Details about the arguments

--qrank (required): file format: , tab-delimited, no header, no duplicated genes, better to keep only genes in the background gene set, e.g., KEGG genes. “quantile value” ranges from 0 to 1.0 (or something will go wrong)

--gset (required): Priori target gene list, all of the genes should be in the ranked genelist, gmt format

--qsize (required): Window size of the non-overlapping quantile regions, from 1.0 to 0.0, e.g., 0.005

--perm (optional): permutation, Default: 2000

--threads (optional): Number of threads, Default: 50

--out (required): prefix name for output file

Output:

*.enrichment_score_quarter.txt.gz: enrichment scores after permutation

*.cumulative_score.txt.gz: cumulative scores for the observed data

*.enrichment_Pvalue.txt: combined summary statistics

Rationale:

Gene Set Enrichment Analysis (GSEA) is a statistical method developed based on the Kolmogorove-Smirnov test to identify the enrichment of biologically functional categories in a ranked gene list. But the gene list was not strictly ranked in our study due to the resolution of the empirical P-values. We modified the traditional GSEA by sorting genes according to their quantiles (1.0 - P-values) into 200 bins of size 0.005.

For a given gene set $S$ (a pathway or a priori functional gene set) and a quantile bin $B$, we respectively defined the “hit” score $P_{hit}(S, B)$ and “miss” score $P_{miss}(S,B)$ as:

$$ P_{hit}(S, B)=\frac{n_{BS}}{N_S};P_{miss}(S, B)=\frac{n_B-n_{BS}}{N_{tot}-N_S}, $$

in which the $N_{tot}$ stands for the total number of genes (across all the quantile bins), $N_S$ stands for the number of genes in $S$, $n_B$ stands for the number of genes in $B$, and $n_{BS}$ stands for the number of genes in in $B$ belonging to $S$. Thus, we defined the net score $P_{net}(S,B)$ for a given bin as:

$$ P_{net}(S,B)=P_{hit}(S,B)-P_{miss}(S,B). $$

The cumulative score $P_{net}(S)$ was calculated as

$$ P_{net}(S)=\sum P_{net}(S, B_i), $$

in which the $B_i$ was the ith bin in the ranking list. In addition, the enrichment score $ES(S)$ was defined as the maximum deviation of $P_{net}(S)$ from zero. The significance of an observed $ES(S)$ was estimated by comparing it with the NULL distribution of $ES$ ($ES_{NULL}$), which was generated by permuting the gene-pathway relationship 20,000 times. Finally, the nominal empirical P-value for the gene set $S$ was calculated with one-tailed significance test using either the positive or negative portion of the NULL distribution according to the sign of the observed $ES(S)$.

By: Yuwen Pan, 2021
Contact: panyuwen.x@gmail.com

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
LICENSE		LICENSE
README.md		README.md
mGSEA.py		mGSEA.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Dependencies:

Usage:

Output:

Rationale:

About

Releases

Packages

Languages

License

panyuwen/mGSEA

Folders and files

Latest commit

History

Repository files navigation

Dependencies:

Usage:

Output:

Rationale:

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages