TCGA data mining for establishing the relationship between sub-branches of Wnt signalling in colorectal cancer (CRC)
Project completed during my BSc in Biochemistry at the University of Southampton, 2018-2019.
- produce a library of Wnt signalling components and their associated genes or proteins
- search for CRC gene expression data (e.g. RNA-seq) in publicly available datasets (e.g. TCGA)
- ensure a fair data comparison (data were normalised using RSEM, i.e. RPKM modelled to account for isoform abundances)
- address the problem of missing data
- use the gplots, RColorBrewer and dunn.test R packages
- EXPLORATORY DATA ANALYSIS
- check distribution of data
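
A minimal sketch of what a distribution check could look like in base R. The data are simulated (not TCGA values), and the use of a log2 transform and the Shapiro-Wilk test are assumptions for illustration, not necessarily what was done in the project.

```r
# Simulated, right-skewed "expression" values for one gene (hypothetical data)
set.seed(7)
raw_expr <- rlnorm(200, meanlog = 3, sdlog = 1)

# Five-number summary plus mean: a large gap between mean and median hints at skew
print(summary(raw_expr))

# Shapiro-Wilk normality test on the raw values (assumed choice of test)
raw_normality <- shapiro.test(raw_expr)
print(raw_normality$p.value)

# The same check after a log2(x + 1) transform, a common step for expression data
log_expr <- log2(raw_expr + 1)
log_normality <- shapiro.test(log_expr)
print(log_normality$p.value)
```

If the values stay clearly non-normal, rank-based methods such as the Kruskal-Wallis test are the safer choice.
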
- UNSUPERVISED MACHINE LEARNING
- hierarchical clustering (no prior information was known about the groups)
- Pearson correlation heatmap (distances computed from Pearson correlation, clustered using complete linkage)
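
The clustering step can be sketched in base R as follows. The expression matrix is simulated, the gene names are hypothetical, and the `1 - r` conversion from correlation to distance is an assumption (one common choice), not a confirmed detail of the project.

```r
# Simulated expression matrix: rows = genes, columns = samples (hypothetical data)
set.seed(1)
n_samples <- 10
signal_a <- rnorm(n_samples)  # shared pattern for the "A" genes
signal_b <- rnorm(n_samples)  # shared pattern for the "B" genes
expr <- rbind(
  geneA1 = signal_a + rnorm(n_samples, sd = 0.1),
  geneA2 = signal_a + rnorm(n_samples, sd = 0.1),
  geneA3 = signal_a + rnorm(n_samples, sd = 0.1),
  geneB1 = signal_b + rnorm(n_samples, sd = 0.1),
  geneB2 = signal_b + rnorm(n_samples, sd = 0.1),
  geneB3 = signal_b + rnorm(n_samples, sd = 0.1)
)
colnames(expr) <- paste0("sample", seq_len(n_samples))

# Pearson correlation between genes; cor() correlates columns, hence t()
gene_cor <- cor(t(expr), method = "pearson")

# Turn correlation into a distance (assumed form: 1 - r)
gene_dist <- as.dist(1 - gene_cor)

# Agglomerative hierarchical clustering with complete linkage
gene_tree <- hclust(gene_dist, method = "complete")

# Cut the tree into two groups and show the assignment
clusters <- cutree(gene_tree, k = 2)
print(clusters)
```

The correlation matrix `gene_cor` can then be drawn as a heatmap, for example with `gplots::heatmap.2` and an RColorBrewer palette, using `as.dendrogram(gene_tree)` for the row and column ordering.
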
- CLUSTER ANALYSIS FOR SIGNIFICANT GROUPS
- Kruskal-Wallis test for comparing groups
- post-hoc Dunn's test with Benjamini-Hochberg p-value correction to identify which groups differed
- hypergeometric test for the probability of drawing the observed number of samples from a given category by chance
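
A hedged sketch of the three tests listed above, on simulated data. The cluster labels, sample counts and category sizes are all made up for illustration; the Dunn's test step only runs if the dunn.test package is installed.

```r
# Simulated expression of one gene across three clusters (hypothetical values)
set.seed(42)
expression <- c(rnorm(20, mean = 5), rnorm(20, mean = 5), rnorm(20, mean = 8))
cluster <- factor(rep(c("C1", "C2", "C3"), each = 20))

# Kruskal-Wallis: do the clusters differ in expression at all?
kw <- kruskal.test(expression ~ cluster)
print(kw$p.value)

# Post-hoc Dunn's test with Benjamini-Hochberg correction: which pairs differ?
if (requireNamespace("dunn.test", quietly = TRUE)) {
  dunn <- dunn.test::dunn.test(expression, cluster, method = "bh")
  print(data.frame(comparison = dunn$comparisons, p_adjusted = dunn$P.adjusted))
} else {
  message("dunn.test is not installed; skipping the post-hoc step")
}

# Hypergeometric test: is one category over-represented in a cluster?
# Hypothetical counts: 60 samples in total, 15 of them in the category;
# the cluster holds 20 samples, 10 of which are in the category.
total_samples <- 60
category_total <- 15
cluster_size <- 20
category_in_cluster <- 10

# P(X >= category_in_cluster) when drawing cluster_size samples at random
p_enrich <- phyper(category_in_cluster - 1,
                   category_total,
                   total_samples - category_total,
                   cluster_size,
                   lower.tail = FALSE)
print(p_enrich)
```

Subtracting 1 from the observed count matters: with `lower.tail = FALSE`, `phyper` returns P(X > q), so `q = k - 1` gives the probability of seeing at least `k` samples from the category.
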
- FURTHER ANALYSIS
- gene set enrichment (PANTHER)
- protein-protein interaction (PPI) network (STRING)