TCGA data mining for establishing the relationship between sub-branches of Wnt signalling in colorectal cancer (CRC)

Project completed during my BSc in Biochemistry at the University of Southampton, 2018-2019.

DATA EXTRACTION
- produce a library of Wnt signalling components and their associated genes or proteins
- search for CRC gene expression data (e.g. RNA-seq) in publicly available datasets (e.g. TCGA)

- ensure a fair comparison between samples (data were normalised using RSEM, i.e. RPKM-style values modelled to account for isoform abundances)
- address the problem of missing data
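
The outline does not say how missing values were handled. As a minimal sketch, assuming a filter-then-impute strategy, one option is to drop genes that are missing in too many samples and fill the remaining gaps with the per-gene median. The function name `filter_and_impute`, the threshold `max_missing_frac` and the toy values are hypothetical and are not taken from the project scripts.

```python
from statistics import median

def filter_and_impute(expr, max_missing_frac=0.2):
    """expr maps gene -> list of expression values; None marks a missing value."""
    cleaned = {}
    for gene, values in expr.items():
        observed = [v for v in values if v is not None]
        missing_frac = 1 - len(observed) / len(values)
        if not observed or missing_frac > max_missing_frac:
            continue  # assumed rule: drop genes that are too sparse
        fill = median(observed)  # assumed rule: impute with the per-gene median
        cleaned[gene] = [fill if v is None else v for v in values]
    return cleaned

# Toy input, made up for illustration only
toy = {
    "CTNNB1": [10.0, 12.0, None, 11.0, 13.0],
    "WNT5A": [None, None, None, 2.0, 3.0],
    "APC": [5.0, 6.0, 7.0, 8.0, 9.0],
}
result = filter_and_impute(toy)
print(sorted(result))
```

Median imputation is only one choice; removing incomplete samples or genes entirely is an equally plausible reading of the step above.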

EXPLORATORY DATA ANALYSIS
- check distribution of data
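
A distribution check can be as simple as comparing summary statistics before and after a transformation. The sketch below assumes a log2(x + 1) transform, which is common for RNA-seq values but is not stated in the outline, and uses made-up numbers; `skewness` and `summarise` are illustrative helper names.

```python
import math
from statistics import mean, median, pstdev

def skewness(values):
    """Population skewness: third central moment divided by the sd cubed."""
    m, s = mean(values), pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)

def summarise(values):
    # Assumption: log2(x + 1) is used to tame the right tail of expression data
    logged = [math.log2(v + 1) for v in values]
    return {
        "mean": mean(values),
        "median": median(values),
        "raw_skew": skewness(values),
        "log_skew": skewness(logged),
    }

# Toy expression values for one gene, made up for illustration only
toy_expression = [0, 3, 5, 8, 12, 20, 35, 60, 150, 900]
summary = summarise(toy_expression)
print(summary)
```

A mean far above the median and a large positive skew point to a long right tail, which is one reason a rank-based test such as Kruskal-Wallis may be preferred over a parametric one.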
UNSUPERVISED MACHINE LEARNING
- hierarchical clustering (no prior information was known about the groups)
- Pearson correlation heatmap (correlation-based distances, clustered using complete linkage)
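
The clustering step above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's R code: the distance is taken to be 1 - r (a common convention that the outline does not confirm), the tree is cut at a fixed number of clusters, and the sample names and values are made up.

```python
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def complete_linkage(profiles, n_clusters):
    """profiles maps name -> list of values; returns a list of sets of names."""
    names = list(profiles)
    # Assumed distance: 1 - r, so perfectly correlated profiles are 0 apart
    dist = {
        frozenset((a, b)): 1 - pearson(profiles[a], profiles[b])
        for a, b in combinations(names, 2)
    }

    def linkage(pair):
        # Complete linkage: cluster distance is the largest member-to-member distance
        c1, c2 = pair
        return max(dist[frozenset((a, b))] for a in c1 for b in c2)

    clusters = [{n} for n in names]
    while len(clusters) > n_clusters:
        c1, c2 = min(combinations(clusters, 2), key=linkage)
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 | c2)  # merge the two closest clusters
    return clusters

# Toy profiles, made up for illustration only
profiles = {
    "S1": [1.0, 2.0, 3.0, 4.0],
    "S2": [2.0, 4.1, 6.0, 8.2],
    "S3": [4.0, 3.0, 2.0, 1.0],
    "S4": [8.1, 6.0, 4.2, 2.0],
}
clusters = complete_linkage(profiles, n_clusters=2)
print(clusters)
```

In the toy data S1 and S2 rise together while S3 and S4 fall together, so the two pairs end up in separate clusters.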
CLUSTER ANALYSIS FOR SIGNIFICANT GROUPS
- Kruskal-Wallis test for comparing groups
- post-hoc Dunn's test with Benjamini-Hochberg p-value correction to identify which groups differed
- hypergeometric test for the probability of picking a cluster's samples from a given category by chance
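
The statistics behind these steps can be illustrated with a standard-library sketch. It is not the project's R code: Dunn's test itself is not implemented, the p-values passed to the Benjamini-Hochberg step are placeholders, the Kruskal-Wallis p-value uses the chi-square approximation for exactly three groups (2 degrees of freedom), and all function names and numbers are made up for illustration.

```python
from math import comb, exp

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with tie correction; groups is a list of lists."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    rank_of, tie_term, i = {}, 0, 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # tied values share the mean rank
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    rank_part = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_part - 3 * (n + 1)
    return h / (1 - tie_term / (n ** 3 - n))

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted, running_min = [0.0] * m, 1.0
    for steps_from_top, i in enumerate(order):
        rank = m - steps_from_top  # 1-based rank in ascending order
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when drawing `draws` items without replacement."""
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, draws)

# Toy expression values for one gene in three clusters
groups = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
h = kruskal_wallis_h(groups)
p_kw = exp(-h / 2)  # chi-square survival function, valid for 2 degrees of freedom only

# Placeholder pairwise p-values standing in for Dunn's test output
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])

# 20 samples, 8 in the category of interest, cluster of 5 containing 4 of them
p_enrich = hypergeom_sf(4, population=20, successes=8, draws=5)
print(h, p_kw, adjusted, p_enrich)
```

With more than three groups the p-value needs the general chi-square survival function, which a statistics package would normally supply.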
FURTHER ANALYSIS
- gene set enrichment (PANTHER)
- protein-protein interaction (PPI) network (STRING)

REPOSITORY FILES
- Data_pre-processing_distribution_clustering.R
- Extract_cluster_without_sample_type.sh
- Kruskal-Wallis_statistical_significance.R
- README.md