Strainflow

This repository contains code for the paper Genomic Surveillance of COVID-19 variants with Language Models and Machine Learning.

About The Project

We developed Strainflow, a model for SARS-CoV-2 genome sequences that treats each sequence as a document whose words are codons, and used the skip-gram algorithm to learn the codon context of 0.9 million spike genes.
Time series analysis of the information content (entropy) of the latent dimensions learnt by Strainflow shows a leading relationship with the monthly COVID-19 cases for 7 countries (e.g. USA, Japan, India and others).
Machine learning modeling of the entropy of the latent dimensions helped us develop an epidemiological early warning system for COVID-19 caseloads.
The top codons associated with the most relevant latent dimensions (DoCs) were linked to SARS-CoV-2 variants, and DoCs may be used as a surrogate to track the country-specific spread of the variants.
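
As an illustration of the "sequences as documents, codons as words" idea, the following is a minimal sketch and not the repository's preprocessing code. It assumes the spike gene is read in frame from its first nucleotide, and the function names, the toy sequence, and the window size are hypothetical. It splits a sequence into codons and lists the (target, context) pairs that a skip-gram model would be trained on.

```python
def sequence_to_codons(sequence):
    """Split a nucleotide sequence into non-overlapping codons (3-mers).

    Assumes the sequence is already in frame; trailing bases that do not
    fill a complete codon are dropped.
    """
    seq = "".join(sequence.split()).upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def skipgram_pairs(codons, window=2):
    """Return (target, context) codon pairs within a symmetric window.

    The window size here is an illustrative choice, not a value taken
    from the paper or the repository.
    """
    pairs = []
    for i, target in enumerate(codons):
        start = max(0, i - window)
        stop = min(len(codons), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, codons[j]))
    return pairs


# Toy sequence for illustration only.
toy_codons = sequence_to_codons("ATGTTTGTTTTTCT")
toy_pairs = skipgram_pairs(toy_codons, window=1)
```

In practice the list of codons for each sequence would be passed as one "sentence" to a word2vec implementation trained in skip-gram mode; the pair generation above only shows what context the model sees.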

File Details

File	Description	Stage
01-A1-train-preprocessing.py	Use CoV-Seq script to identify different regions in genome sequences contained in Fasta files	Train data preprocessing
01-A2-train-preprocessing.py	Generate country-wise CSVs for spike gene sequences and associated metadata	Train data preprocessing
02-A-train-w2v.py	Train word2vec skip-gram model on spike gene sequences	word2vec training
02-B-pip-loss.py	Calculate and plot the cumulative PIP loss to select the dimension size of the word embeddings	word2vec model selection
03-A1-augur2covseq.R	GISAID augur format to CoV-Seq annotation	Prediction preprocessing
03-A2-run-2021-04-augur2fa.sh	April 2021	Prediction preprocessing
03-A2-run-2021-05-augur2fa.sh	May 2021	Prediction preprocessing
03-A2-run-2021-06-augur2fa.sh	June 2021	Prediction preprocessing
03-B-prediction-data-preprocess-fasta2w2v.R	Use CoV-Seq annotation file, metadata and fasta to generate input for w2v model	Prediction preprocessing
03-C-run-2021-04-fa2w2v.sh	April 2021	Prediction preprocessing
03-C-run-2021-05-fa2w2v.sh	May 2021	Prediction preprocessing
03-C-run-2021-06-fa2w2v.sh	June 2021	Prediction preprocessing
04-A-predict-w2v.py	Apr-Jun 2021	Embedding generation for test data
04-B-combine-embeddings.py	Combine all predictions into a single CSV	Embedding generation for test data
05-A-LD-train-prediction-sfv1.R	Combine the train and prediction latent dimension (LD) of strainflow (sf) version 1 (v1)	Analysis
05-B-Fig-01BC-tSNE-viz.R	Generate t-SNE visualisation of the train LD from 09-2020 to 03-2021 for all (World), UK, China, USA, India	Analysis
05-C-Fig-01D-entropy.R	Generate t-SNE visualisation of the prediction LD from 04-2021 to 06-2021 for all (World), UK, China, USA, India	Analysis
06-Fig-02-dendrogram.py	Calculate cosine similarities between random sequences for plotting dendrogram (phylogenetic tree) on iTol tool	Analysis
07-entropy-calculation.R	Calculate monthly sample entropy for each spike gene latent dimension for a given country	Feature generation
08-new-cases.py	Calculate monthly new cases for each country using data from the JHU COVID-19 cases repository	Outcome variable data collection
09-DCCA-Fig-04-A.R	Calculate and plot the DCCA coefficient between each entropy dimension and new cases for different lead and lag periods. (Figure 4A)	Analysis
10-A-Boruta-RF.R	Entropy dimensions are modeled to predict new COVID-19 cases two months ahead. Features are selected with the Boruta algorithm and a Random Forest regression model is trained. Model inference is performed and the correlation between the actual and predicted values is calculated. (Figure 4C, Table 1)	Predictive Modeling
10-B-Fig-03-05-06-plots.py	Figures 3, 5, 6
11-w2v-weights.py	Top codons for each latent dimension are found using embeddings from the trained word2vec model	Interpretability
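
The feature generation step in the table (07-entropy-calculation.R) computes a monthly sample entropy for each latent dimension. The sketch below shows one common definition of sample entropy in Python, as an aid to reading that step. It is not a port of the R script: the function name, the embedding length `m`, and the tolerance factor `r` are assumptions chosen for illustration, and the repository may use different settings or a library routine.

```python
import math


def sample_entropy(series, m=2, r=0.2):
    """Sample entropy of a 1-D series (illustrative implementation).

    m: template length; r: tolerance as a fraction of the series'
    standard deviation. Both defaults are assumptions, not values
    taken from the repository.
    """
    n = len(series)
    if n <= m + 1:
        raise ValueError("series is too short for the chosen m")

    mean = sum(series) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    tol = r * sd

    def count_matches(length):
        # Use the same number of templates (n - m) for both lengths.
        templates = [series[i:i + length] for i in range(n - m)]
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                # Chebyshev distance between the two templates.
                dist = max(abs(a - b) for a, b in zip(templates[i], templates[j]))
                if dist <= tol:
                    count += 1
        return count

    b = count_matches(m)      # matches of length m
    a = count_matches(m + 1)  # matches that persist to length m + 1
    if a == 0 or b == 0:
        return float("inf")   # undefined when no matches are found
    return -math.log(a / b)


# Toy inputs: a perfectly regular series and a less regular one.
regular = sample_entropy([0.0, 1.0] * 10)
irregular = sample_entropy([0, 0, 1, 0, 0, 1, 0, 0, 0, 1])
```

A perfectly periodic series yields a sample entropy of zero, and less predictable series yield larger values, which is the sense in which entropy summarises the information content of a latent dimension over a month.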

Python Packages Installation

Use the package manager pip to install the required Python packages.

pip install -r requirements.txt

Authors

Contributor names and contact info

Sargun Nagpal - sargunnagpal@gmail.com
Ridam Pal - ridamp@iiitd.ac.in
Ashima - ashima19031@iiitd.ac.in
Ananya Tyagi - ananya19114@iiitd.ac.in
Sadhana Tripathi - sadhana20214@iiitd.ac.in
Aditya Nagori - aditya.kumar@igib.in
Saad Ahmad - saad18409@iiitd.ac.in
Hara Prasad Mishra - harap@iiitd.ac.in

Corresponding Authors

Tavpritesh Sethi - tavpriteshsethi@iiitd.ac.in
Rintu Kutum - rintuk@iiitd.ac.in

License

MIT

rintukutum/strainflow_manuscript
