nihadtp/covid19AnlaysisSpark

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

42 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

COVID-19 Case Data Analysis (Indian States)

This Spark app analyses COVID-19 case data for Indian states and lets you define custom mathematical insights using a unified data structure and a trait method. After processing the data, it writes the results to Cassandra, which then serves as the primary source for data visualization. Some analyses produced with this app as a demonstration:

  • Maximum number of deaths reported among the states up to Aug 29, 2020 (Image 4)

  • Maximum number of recoveries reported among the states up to Aug 29, 2020 (Image 5)

  • Effective increase in COVID-19 cases per day for all states (Image 1)

  • Minimum effective increase among the states up to Aug 29, 2020 (Image 6)

  • Effective increase in COVID-19 cases per total tests, per day, for all states (Image 2)

  • Effective increase for the state of Kerala (Image 3)

  • Effective increase per million for the state of Kerala (Image 4)

Primary data source

The app currently uses two APIs maintained by covid19india.

Installation

To run this app on a local system, the following prerequisites are required at the listed versions.

Prerequisites

  • Spark 2.4.6, compiled with Scala 2.12
  • Scala 2.12
  • SBT 1.3.13 or higher
  • Cassandra 4.0
  • cqlsh 5.0.1

Download and set up Cassandra and cqlsh on your local system, following the Apache Cassandra documentation.

Start the Cassandra service:

$ sudo service cassandra start

Set up the Cassandra keyspace and table.

$ cqlsh

This opens a cqlsh session in your terminal. Now create a keyspace named exactly as below. The keyspace and table names are hard-coded in the driver script, so any change would cause the DataStax driver to throw NoNodeFoundException.

cqlsh> CREATE KEYSPACE covid19 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;

Switch to the keyspace:

cqlsh> USE covid19;

Create the table with the appropriate primary key. Here property is the partition key, and state_code and date are clustering columns.

cqlsh:covid19> CREATE TABLE state_data(property text, state_code text, state_value float, date date, PRIMARY KEY (property, state_code, date));
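
The keyspace and table setup can also be scripted so that it is repeatable. The following is a minimal sketch and not part of the repository: the function name covid19_schema_cql is hypothetical, and the IF NOT EXISTS clauses are an addition so that running the script a second time does no harm. The function only prints the CQL; passing it to cqlsh is left to the caller.

```shell
# Sketch: emit the CQL for the covid19 keyspace and the state_data table.
# covid19_schema_cql is a hypothetical helper name, not part of the repository.
covid19_schema_cql() {
  cat <<'CQL'
CREATE KEYSPACE IF NOT EXISTS covid19
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}
  AND durable_writes = true;

-- property is the partition key; state_code and date are clustering columns.
CREATE TABLE IF NOT EXISTS covid19.state_data (
  property    text,
  state_code  text,
  state_value float,
  date        date,
  PRIMARY KEY (property, state_code, date)
);
CQL
}
```

One way to apply it is to write the output to a file with `covid19_schema_cql > schema.cql` and then run `cqlsh -f schema.cql` while the Cassandra service is up.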

Installation is complete. You can now stop the Cassandra service.

$ sudo service cassandra stop

Running locally

  • Clone the repository from the master branch.
  • Rename sample-cassandra.conf inside the src/main/resources folder to application.conf, then update the values under the local_cassandra object.
  • Start the Cassandra service.
  • Compile, package, and run the app in local mode.
$ sbt compile
$ sbt package
$ sbt run local

The app now runs on the local machine: it first fetches data from the API, then processes it, and finally writes the results to Cassandra. You can verify this by logging into cqlsh and executing the following:

cqlsh> SELECT * FROM covid19.state_data LIMIT 100;
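
Because property is the partition key of state_data, a more targeted check is to list the distinct partition keys first and then read a single partition. This is a minimal sketch under that assumption: both helper names are hypothetical, and the property value you pass in must be one that actually appears in your table.

```shell
# Sketch: build verification queries for covid19.state_data.
# Both helper names are hypothetical, not part of the repository.

# Print a query listing every distinct property (partition key) in the table.
list_properties_query() {
  printf '%s\n' "SELECT DISTINCT property FROM covid19.state_data;"
}

# Print a query reading up to $2 rows (default 20) from the partition named $1.
# The value is inserted as-is, so it must not contain a single quote.
partition_query() {
  printf "SELECT state_code, date, state_value FROM covid19.state_data WHERE property = '%s' LIMIT %d;\n" "$1" "${2:-20}"
}
```

With Cassandra running, `cqlsh -e "$(list_properties_query)"` shows which analyses were written, and `cqlsh -e "$(partition_query <property> 10)"` reads the first rows of one of them.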

Running on an Amazon EMR Cluster with Amazon Keyspaces

  • Create an AWS account and an EMR instance, following the AWS documentation.
  • Set up Amazon Keyspaces, following the AWS documentation.
  • Rename sample-cassandra.conf inside the src/main/resources folder to application.conf, then update the values under the amazon_cassandra object.
  • Go to the project folder on your local system and build the JAR file.
sbt assembly

This generates a covid19-assembly-0.1.0-SNAPSHOT.jar file in the src/target folder.

  • Create an Amazon S3 bucket, following the AWS documentation.
  • Upload covid19-assembly-0.1.0-SNAPSHOT.jar to the S3 bucket.
  • Start the EMR instance, SSH into its master node, and set up the Cassandra truststore file.
  • Download the JAR file from the S3 bucket onto the EMR instance.
aws s3 cp your_s3_path ./
  • Execute the spark-submit command.
spark-submit covid19-assembly-0.1.0-SNAPSHOT.jar aws

This runs the Spark app and writes the data to Amazon Keyspaces.
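
If the download step was skipped or the JAR landed in a different directory, spark-submit fails with a less obvious error. A small guard can catch that first. This is a sketch only: submit_covid19_job is a hypothetical wrapper, not part of the repository, and it assumes the JAR sits in the current directory unless a path is given.

```shell
# Sketch: check that the assembly JAR exists before calling spark-submit.
# submit_covid19_job is a hypothetical wrapper, not part of the repository.
# $1 (optional) is the path to the JAR; it defaults to the current directory.
submit_covid19_job() {
  jar="${1:-covid19-assembly-0.1.0-SNAPSHOT.jar}"
  if [ ! -f "$jar" ]; then
    echo "JAR not found: $jar" >&2
    return 1
  fi
  # "aws" is the run-mode argument used in the spark-submit step above.
  spark-submit "$jar" aws
}
```

On the EMR master node, calling `submit_covid19_job` after the `aws s3 cp` step either submits the job or reports the missing file.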
