Predicting gene expression levels from upstream promoter regions using deep learning.
- `scripts/`: directory for production code
  - `0-data-loading-processing/`
    - `01-gene-expression.py`: downloads and processes gene expression data and saves it into "B73_genex.txt"; may be skipped
    - `02-download-process-db-data.py`: downloads and processes gene sequences from a specified database ('Ensembl', 'Maize', 'Maize_addition', 'Refseq'); consider this the actual first step
    - `03-combine-databases.py`: combines all the downloaded sequences across the databases; to be executed after obtaining all processed sequences from the databases using `02-download-process-db-data.py`
    - `04-process-genex-nam.py`: pipeline task for processing gene expression data for the NAM lines and splitting the data for transformer modelling; to be executed after obtaining data from the usegalaxy website and after `06-train-test-split.py`
    - `05-cluster-maize_seq.sh`: clusters the promoter sequences into groups with up to 80% sequence identity, which may be interpreted as paralogs; to be executed after `03-combine-databases.py` on the Maize NAM data
    - `06-train-test-split.py`: divides the promoter sequences into train and test sets, avoiding a set of pairs that indicate close relations ("paralogs"); to be executed after `05-cluster-maize_seq.sh`
    - `07_train_tokenizer.py`: trains the byte-level BPE tokenizer for the RoBERTa model; combines all the sequences (excluding Maize), train-test splits the combined data (unrelated to steps 04-06 above; uses the data from step 03), and trains the tokenizer on the train split (a minimal sketch of this step follows the `scripts/` listing below)
    - `08-train-preprocessor.py`: trains a custom preprocessor for gene expression prediction; may be skipped if no unique preprocessor is required
    - `09-prepare-nam_metadata.py`: prepares cultivar and maize subpopulation data; can be created by obtaining the data directly from the internet
  - `1-modeling/`
    - `pretrain.py`: trains the FLORABERT base model using a masked language modeling task. Type `python scripts/1-modeling/pretrain.py --help` to see command line options, including choice of dataset and whether to warmstart from a partially trained model. Note: not all options will be used by this script.
    - `finetune.py`: trains the FLORABERT regression model (including a newly initialized regression head) on multitask regression for gene expression in all 10 tissues. Type `python scripts/1-modeling/finetune.py --help` to see command line options; these are mainly for specifying data inputs and the output directory for saving model weights.
    - `evaluate.py`: computes metrics for the trained FLORABERT model
  - `2-feature-visualization/`
    - `embedding_vis.py`: computes a sample of BERT embeddings for the test data and saves them to a TensorBoard log. The number of embeddings to sample can be specified with `--num-embeddings XX`, where `XX` is the number of embeddings (must be an integer).
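As a rough illustration of the tokenizer-training step (`07_train_tokenizer.py`), the following minimal sketch uses Huggingface's `tokenizers` library; the input path, special tokens, and output directory are placeholders, and the vocab_size of 5256 follows the observation noted later in this document.

```python
# Minimal sketch of byte-level BPE training; not the actual 07_train_tokenizer.py.
import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["data/combined/train_sequences.txt"],  # hypothetical path to the train split from step 03
    vocab_size=5256,                              # 256 byte-level tokens plus learned merges
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
os.makedirs("models/byte-level-bpe-tokenizer", exist_ok=True)
tokenizer.save_model("models/byte-level-bpe-tokenizer")  # writes the vocab and merges files
```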
- `module/`: directory for our customized modules
  - `module/`: our main module, named `florabert`, which packages the customized functions
    - `config.py`: project-wide configuration settings and absolute paths to important directories/files
    - `dataio.py`: utilities for performing I/O operations (reading and writing to/from files)
    - `gene_db_io.py`: helper functions to download and process gene sequences
    - `metrics.py`: functions for evaluating models
    - `nlp.py`: custom classes and functions for working with text/sequences
    - `training.py`: helper functions that make it easier to train models in PyTorch and with Huggingface's Trainer API, as well as custom optimizers and schedulers
    - `transformers.py`: implementation of the RoBERTa model with mean-pooling of the final token embeddings, as well as functions for loading and working with Huggingface's transformers library (see the sketch after this listing)
    - `utils.py`: general-purpose functions and code
    - `visualization.py`: helper functions to perform random k-mer flips during data processing and to make model predictions
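To illustrate what the mean-pooling described for `transformers.py` amounts to, here is a hedged, standalone sketch of a RoBERTa regression head that averages the final token embeddings over non-padding positions. It is not the actual `RobertaForSequenceClassificationMeanPool` implementation; the class and argument names are illustrative.

```python
import torch
import torch.nn as nn
from transformers import RobertaModel

class MeanPoolRegressor(nn.Module):
    """Illustrative stand-in for a mean-pooled RoBERTa regression head."""

    def __init__(self, model_name_or_path: str = "roberta-base", num_tissues: int = 10):
        super().__init__()
        self.roberta = RobertaModel.from_pretrained(model_name_or_path)
        self.regressor = nn.Linear(self.roberta.config.hidden_size, num_tissues)

    def forward(self, input_ids, attention_mask):
        hidden = self.roberta(input_ids, attention_mask=attention_mask).last_hidden_state
        # Mean-pool the final token embeddings, ignoring padding positions.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.regressor(pooled)  # one predicted expression value per tissue
```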
If you wish to experiment with our pre-trained FLORABERT models, you can find the saved PyTorch models and the Huggingface tokenizer files here.

Contents:
- `byte-level-bpe-tokenizer`: files expected by a Huggingface `transformers.PretrainedTokenizer`
  - `merges.txt`
  - `vocab.txt`
- `transformer`: both language models can instantiate any RoBERTa model from Huggingface's `transformers` library. The prediction model should instantiate our custom `RobertaForSequenceClassificationMeanPool` model class (a loading sketch follows this list).
  - `language-model`: trained on all plant promoter sequences
  - `language-model-finetuned`: further trained on just the maize promoter sequences
  - `prediction-model`: fine-tuned on the multitask regression problem
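A minimal sketch of loading the released files with Huggingface's `transformers`, assuming the directory layout listed above has been downloaded locally (the paths and the toy sequence are placeholders):

```python
from transformers import RobertaTokenizerFast, RobertaForMaskedLM

tokenizer = RobertaTokenizerFast.from_pretrained("byte-level-bpe-tokenizer")         # placeholder local path
model = RobertaForMaskedLM.from_pretrained("transformer/language-model-finetuned")   # placeholder local path

inputs = tokenizer("ACGTACGTACGT", return_tensors="pt", truncation=True, max_length=512)  # toy promoter fragment
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch, sequence_length, vocab_size)
```

For `prediction-model`, the custom `RobertaForSequenceClassificationMeanPool` class from `module/florabert/transformers.py` should be used instead of a stock Huggingface class, as noted above.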
Kindly refer to `gz_link` for the raw data and the links used for data collection and preprocessing, to `research_papers` for the 26 NAM lines and the references used throughout the project, and to `config.py` and the corresponding `config.yaml` for the configurations used (the tissues array in `config.py` corresponds to the values in the labels column of the data). The majority of the execution has been carried out in Colab Notebook and Kaggle Notebook.

The first module has been completed; all data/outputs are under `data` or `models`. Moving to the second module, the following updates have been carried out using the Python scripts under `3-RNAseq-quantification/`:
- The scripts under this module require a lot of resources and time (patience). We opted to use the bioinformatics website Galaxy, which gives every user 250 GB of storage and access to a number of very useful and important bioinformatics tools.
- The scripts under this module deal with 26 NAM lines / cultivars of Maize. We replicated the entire process of this module on the website, with some minor changes (none affecting the output). The first step was to get all the runs corresponding to each cultivar and the unique organism parts for each, to avoid repetition.
- This was achieved by getting the base data from EBI and searching for the two BioProjects mentioned in the supplementary material (under `research_papers`). This data was then used alongside the `helper_codes` scripts to get the file `unique_orgs_runs.tsv`, which contains the runs corresponding to the unique organism parts of each cultivar.
- A workflow was then created / implemented / configured (the base workflow was created by user vasquex11 on the mentioned website) to align with the scripts. The runs were first uploaded per cultivar to the website (after logging in) in txt format, one per line (see the sketch after this list). Next, the fasterq-dump tool was used with the --split-files option selected to get the fastq files corresponding to the runs.
- The created workflow `FloraBERT Test (Trimmomatic + HISAT2 + featureCounts)` was used to perform all the actions mentioned in the module. The final outputs are the featureCounts files corresponding to each run (extending to the unique organism part of each cultivar). The steps are self-explanatory (using the research papers).
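As a hedged sketch of preparing the per-cultivar run lists mentioned above (one run accession per line, ready for upload to Galaxy); the column names in `unique_orgs_runs.tsv` are assumptions, since the actual file comes from the `helper_codes` scripts:

```python
# Illustrative only: split unique_orgs_runs.tsv into one run-list txt file per cultivar.
import csv
from collections import defaultdict
from pathlib import Path

runs_by_cultivar = defaultdict(list)
with open("unique_orgs_runs.tsv", newline="") as fh:
    for row in csv.DictReader(fh, delimiter="\t"):
        runs_by_cultivar[row["cultivar"]].append(row["run"])  # assumed column names

out_dir = Path("galaxy_run_lists")
out_dir.mkdir(exist_ok=True)
for cultivar, runs in runs_by_cultivar.items():
    # One run accession per line, matching the upload format described above.
    (out_dir / f"{cultivar}.txt").write_text("\n".join(runs) + "\n")
```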
The full train-test data for pretraining is available at Hugging Face Dataset For Pretraining, the data for finetuning on the Maize NAM data at Hugging Face Dataset For Finetuning, and the data for finetuning on the downstream task at Hugging Face Dataset for Downstream task (a loading sketch follows).
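The linked datasets can be pulled with Huggingface's `datasets` library; the repository id below is a placeholder, so substitute the ids behind the links above.

```python
from datasets import load_dataset

# "user/florabert-pretraining" is a placeholder repo id, not the actual dataset name.
pretraining_data = load_dataset("user/florabert-pretraining")
print(pretraining_data)
```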
Evaluation results are also under `outputs/`.
Some observations:
- Using different transformations to handle the highly right-skewed TPM values (during the finetuning stage):
  - natural log transformation gave an MSE of 2.4
  - Box-Cox transformation gave an MSE of 12.9
  - log10 transformation gave an MSE of 0.4441
  - log10 also improved R² by 0.01 (from 0.077 to 0.087), which is considerable here (no other changes)
  - this means that handling the highly right-skewed targets is pivotal for model performance (a minimal sketch of the log10 transform follows this list)
- TPU usage with PyTorch is a herculean task for our use-case
- the custom finetuning script might need some tweaking
- Python dependencies are problematic at times
- DNABERT-1 uses k-mers, meaning the tokenizer files it produces are essentially the 4^k permutations of 'A', 'T', 'G', 'C'; this is (now) not great for tasks dealing with DNA
- DNABERT-2 is an obvious improvement, as it uses actual tokenization, preventing overlap and allowing better identification of important parts of DNA sequences
- the byte-level BPE tokenizer uses 256 basic (byte-level) tokens; taking this into consideration, the new vocab_size is set to 5256
- padding HAS TO be True or "max_length" when training on TPU (as a parameter to the tokenizer; used in preprocess_fn in dataio.py in this project); see the TPU sketch after this list
- torch, torch_xla, and torch_xla.core.xla_model (as xm) have to be imported to make sure training with Trainer works properly on TPU (tested on Kaggle)
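To make the transformation comparison above concrete, here is a minimal sketch of the log10 transform on TPM targets; the +1 offset (to handle zero TPM values) and the toy numbers are assumptions, not the project's exact preprocessing.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

tpm = np.array([0.0, 3.2, 15.7, 842.1, 12000.0])  # toy TPM values, highly right-skewed
targets = np.log10(tpm + 1.0)                      # compresses the long right tail

preds = targets + np.random.normal(0, 0.1, size=targets.shape)  # stand-in for model output
print("MSE:", mean_squared_error(targets, preds))
print("R2 :", r2_score(targets, preds))
```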
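And a hedged sketch of the two TPU-related points: fixed-length padding in the tokenizer call and the `torch_xla` imports needed before the Huggingface Trainer will run on TPU. The column name and max length are illustrative; the real logic lives in `preprocess_fn` in `dataio.py`.

```python
import torch                            # noqa: F401
import torch_xla                        # noqa: F401
import torch_xla.core.xla_model as xm   # noqa: F401  (required for Trainer to work on TPU)

def preprocess_fn(examples, tokenizer, max_length=512):
    # padding must be True or "max_length" when training on TPU;
    # "max_length" yields fixed-shape batches, which XLA compiles once and reuses.
    return tokenizer(
        examples["sequence"],          # assumed column name
        padding="max_length",
        truncation=True,
        max_length=max_length,
    )
```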
Citations:
- Base papers for FloraBERT are available under `research_papers`
- The Galaxy Community. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2022 update, Nucleic Acids Research, Volume 50, Issue W1, 5 July 2022, Pages W345–W351, doi:10.1093/nar/gkac247
- National Center for Biotechnology Information (NCBI)[Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; [1988] – [cited 2024 Jan 26]. Available from: https://www.ncbi.nlm.nih.gov/
- Project: PRJEB35943 at EBI
- Project: PRJEB36014 at EBI
- Andrew D. Yates, James Allen, Ridwan M. Amode, Andrey G. Azov, Matthieu Barba, Andrés Becerra, Jyothish Bhai, Lahcen I. Campbell, Manuel Carbajo Martinez, Marc Chakiachvili, Kapeel Chougule, Mikkel Christensen, Bruno Contreras-Moreira, Alayne Cuzick, Luca Da Rin Fioretto, Paul Davis, Nishadi H. De Silva, Stavros Diamantakis, Sarah Dyer, Justin Elser, Carla V. Filippi, Astrid Gall, Dionysios Grigoriadis, Cristina Guijarro-Clarke, Parul Gupta, Kim E. Hammond-Kosack, Kevin L. Howe, Pankaj Jaiswal, Vinay Kaikala, Vivek Kumar, Sunita Kumari, Nick Langridge, Tuan Le, Manuel Luypaert, Gareth L. Maslen, Thomas Maurel, Benjamin Moore, Matthieu Muffato, Aleena Mushtaq, Guy Naamati, Sushma Naithani, Andrew Olson, Anne Parker, Michael Paulini, Helder Pedro, Emily Perry, Justin Preece, Mark Quinton-Tulloch, Faye Rodgers, Marc Rosello, Magali Ruffier, James Seager, Vasily Sitnik, Michal Szpak, John Tate, Marcela K. Tello-Ruiz, Stephen J. Trevanion, Martin Urban, Doreen Ware, Sharon Wei, Gary Williams, Andrea Winterbottom, Magdalena Zarowiecki, Robert D. Finn and Paul Flicek. Ensembl Genomes 2022: an expanding genome resource for non-vertebrates., Nucleic Acids Research 2022, https://doi.org/10.1093/nar/gkab1007
- MaizeGDB: Woodhouse MR, Cannon EK, Portwood JL, Harper LC, Gardiner JM, Schaeffer ML, Andorf CM. (2021) A pan-genomic approach to genome databases using maize as a model system. BMC Plant Biol 21, 385. doi: https://doi.org/10.1186/s12870-021-03173-5.
- O'Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, Robbertse B, Smith-White B, Ako-Adjei D, Astashyn A, Badretdin A, Bao Y, Blinkova O, Brover V, Chetvernin V, Choi J, Cox E, Ermolaeva O, Farrell CM, Goldfarb T, Gupta T, Haft D, Hatcher E, Hlavina W, Joardar VS, Kodali VK, Li W, Maglott D, Masterson P, McGarvey KM, Murphy MR, O'Neill K, Pujar S, Rangwala SH, Rausch D, Riddick LD, Schoch C, Shkeda A, Storz SS, Sun H, Thibaud-Nissen F, Tolstoy I, Tully RE, Vatsan AR, Wallin C, Webb D, Wu W, Landrum MJ, Kimchi A, Tatusova T, DiCuccio M, Kitts P, Murphy TD, Pruitt KD. Reference sequence RefSeq database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016 Jan 4;44(D1):D733-45.