Our group (Martin Kaminskyj, Sami Elkhayri, Muzznah Ansari, Femi Adeleke, and Patricia Lan) chose to perform data analysis on coronavirus disease of 2019 (COVID-19). The topic of COVID-19 was chosen because it is a current, global issue with extensive socioeconomic impact, and public data is available. Thus, many approaches to analysis are possible.
This project evaluates government measures to contain the spread of COVID-19, starting from January 1, 2020 to August 31, 2020. The dataset from Oxford university's Blavatnik School of Government contained indices that categorized government responses to COVID-19, Google Community Mobility Reports recorded changes in people's mobility, and Our World in Data provided medical data on COVID-19 (e.g. number of confirmed cases and deaths).
Questions the project hoped to answer were the following:
- Are there general correlations between feature vs. target variables?
- Government measures: a) Which measures resulted in desirable outcomes? b) Which stringency measures are associated with lower rate of increase of total cases? Total deaths?
- Can an unsupervised machine learning (ML) model group countries by all variables to create meaningful group profiles? (Do group profiles correlate with desirable outcomes?)
- Supervised machine learning (ML): a) Can a model accurately predict future total cases/deaths based on feature variables? b) Which feature variables contribute the most to the model?
- High Risk Continent: Which continent poses the most risk based on age variation, total cases and extreme poverty rating.
View the Google Slides presentation for descriptions of the data exploration and analysis.
To view the code used to find correlations between feature and target variables, and to create an unsupervised ML model, visit the analysis_correlations_unsupervisedML directory. Jupyter notebooks are numbered from 1 to 11, which follows the progression of the analysis. Except for files containing raw data (see "Resources" below), the CSV files in the folder are intermediate files necessary for the progression of the analysis. Image files are byproducts of the notebooks and were used in Google Slides presentation.
To view the code used to analyse the correlation between the timing and severity of each countries initial stringency efforts and their outcome of total cases and deaths as a percentage of population, visit the analysis_gov_regulation_impact directory.
To view the code used to determine the continent that poses the highest risk based on age variation, total cases and poverty level, visit the analysis_median_age_per_continent directory.
Data was stored in a SQLite database (covid_db.db) to be queried for input into machine learning models. Figure 1 shows an entity relationship diagram (ERD) of the database.
Supervised machine learning(ML) models were created to predict target variables (total cases and total deaths) based on the following feature variables :
- Population
- Population density
- Median age
- Current total cases
- Current total death
- School Closing
- Workplace Closing
- Cancel public events
- Restriction on gatherings
- Close public transport
- Stay at home requirement
- Restriction on internal movement
- International travel controls
- Public information Campaigns
- Income support
Models included neural network, Support Vector Regression (SVR), and Random Forest Regressor. R-squared values were used to assess for the strength of correlations between feature and target variables. Model accuracy was assessed by comparing R-squared values between training and testing sets.
For selection and preprocessing of features, their data types and correlations were checked. Once the features were selected, both the feature variables, and targets were scaled.
Feature selection was an iterative process whereby the initial selected features were adjusted according to the outputted feature importance /contribution by the RandomForestRegressor model. For more detail on the decision-making process check the following notebooks:
-
FeatureSelection_FeatureImportance_totaldeaths.ipynb The above notebooks helped us determine the best features and model for further analysis.
-
Stringency_Component_Feature_Importance.ipynb The above notebooks were used to determine the importance of the individual components of the Stringency Index, HealthSupport Index, and EconomicSupport Index. Based on the findings, stay at home requirement, a restriction on internal movement and international travel control from the Stringency Index, in addition to income support (EconomicSupport Index) were the most significant features for the selected ML model.
The original dataset contained, total deaths and total cases for a given date. To predict future total cases and deaths, 12 new columns were created, three each for 30,45,60 and 75 days out. The columns were date plus a time-delta e.g. 30 days, total deaths at 30 days and total cases at 30 days. For coding detail check ML Notebook
The 95% of the data was used for training and 5% of the data was kept for testing purposes. For details check ML Notebook
The models tried were, SVR, RandomForestRegressor and Neural Networks. Out of the three we decided to use RandomForestRegressor for understanding feature contribution/importance and Neural Networks for predicting future target values.
- High R2 values
- Ability to extrapolate from a given dataset
- Neural Network is a flexible model that adapts itself to the shape of the data and can dynamically pick the best regression (i.e. logistic, linear, polynomial etc.) for a particular data.
- Neural Networks require specialized software to run and are computationally intensive.
- Black box: It does not explain how the results were achieved.
See Technology.md for a description of the technologies, languages, tools, and algorithms used in this project. Requirements.txt contains a list of the packages/modules/libraries needed to run the Jupyter Notebooks (in "analysis_gov_regulation_impact") for the supervised machine learning model.
Click here to view the dashboard. Files used to create the dashboard are available in the dashboard repo.
- regional_mobility.csv (Google Community Mobility Reports)
- owid-covid-data(Aug31,2020).csv (Our World in Data - COVID-19 database)
- OxCGRT_latest(Aug31,2020).csv (University of Oxford - COVID-19 Government Response Tracker)
See Technology.md and Requirements.txt.
- The dashboard
- Dashboard repo (Contains files used to create the dashboard.)
- Google Slides presentation (Provides detailed information about the project.)