Application of machine learning approaches towards the prediction of gene expression levels in Streptococcus thermophilus
This project was developed to practice implementing computational tools and to consolidate the knowledge built in the curricular units of the Master's Degree in Bioinformatics at the University of Minho. It was supervised by Martin Rau (Discovery, Chr. Hansen A/S, Hørsholm, Denmark).
The main goal of this project is to build a workflow that predicts gene expression levels in Streptococcus thermophilus from promoter sequences and to compare the prediction accuracy of different machine learning approaches. The algorithms chosen were Extreme Gradient Boosting (XGBoost), Random Forest Regressor, Support Vector Regression, and Linear Regression.
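Regression models such as the ones listed above need numeric input, so promoter sequences have to be turned into feature vectors first. The snippet below is a minimal sketch of one common option, one-hot encoding. It is an assumption for illustration and not necessarily the encoding used in this project. The helper name `one_hot_encode` and the toy sequences are hypothetical.

```python
# Hypothetical helper for illustration; not taken from the project's code.
BASES = "ACGT"


def one_hot_encode(sequence):
    """Return a flat list with four 0/1 values per nucleotide (order A, C, G, T)."""
    encoded = []
    for base in sequence.upper():
        # Bases outside A/C/G/T (for example N) become an all-zero block.
        encoded.extend(1 if base == known else 0 for known in BASES)
    return encoded


# Made-up toy sequences; real promoters would be much longer and of equal length.
promoters = ["ACGT", "TTGN"]
features = [one_hot_encode(p) for p in promoters]
print(features[0])
```

Each promoter of length n becomes a vector of 4n values, so all sequences must share the same length before the vectors can be stacked into a feature matrix for a regressor.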
The dataset was extracted from a published research article (https://www.frontiersin.org/articles/10.3389/fmicb.2018.00445/full). The full genome files for the corresponding Streptococcus thermophilus strain (ASCC 1275) were downloaded from the NCBI database (accession number CP006819.1).
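Once the genome sequence and gene coordinates are available, promoter regions can be cut out as the bases immediately upstream of each gene. The sketch below shows the idea under several assumptions that are not stated in this README: coordinates are 0-based and half-open, the window length is arbitrary, and wrap-around at the ends of a circular chromosome is ignored. The function names and the toy genome string are hypothetical.

```python
# Hypothetical helpers for illustration; not taken from the project's code.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA string."""
    return sequence.translate(COMPLEMENT)[::-1]


def upstream_region(genome, start, end, strand, length):
    """Return up to `length` bases upstream of a gene spanning genome[start:end].

    Assumes 0-based, half-open coordinates and a linear sequence
    (no wrap-around for circular chromosomes).
    """
    if strand == "+":
        # Upstream of a forward-strand gene lies before its start.
        return genome[max(0, start - length):start]
    if strand == "-":
        # Upstream of a reverse-strand gene lies after its end, read on the other strand.
        return reverse_complement(genome[end:end + length])
    raise ValueError("strand must be '+' or '-'")


toy_genome = "ACGTTGCAAGCT"  # made-up sequence, not from CP006819.1
print(upstream_region(toy_genome, 4, 8, "+", 3))
print(upstream_region(toy_genome, 4, 8, "-", 3))
```

Returning the reverse-strand promoter as its reverse complement keeps every extracted sequence in the same 5' to 3' orientation relative to its gene, which matters if the sequences are later compared position by position.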