GitHub - nogibjj/atreya-tadepalli-miniproject10: Rust-based Ruff Linter

Mini Project 10 - Pyspark Usage

Introduction

In this project, I worked on harnessing PySpark to execute SQL operations. In this project, I reviewed records associated with health failure, including blood pressure status, creatinine levels, and smoking habits, to better understand biological indicators of heart failure, segmented by age.

Steps

The first step involved downloading pyspark and java appropriately within Codespaces. Once these were downloaded, the proper packages could be imported for use, such as SparkSession and requests.
The second step was to revise the methods originally used to perform SQL operations on Databricks to utilize Pyspark. I was able to revise methods such as extract and transform_load to utilize PySpark. Specifically with transform_load, I have specified the schema to upload the data into a dataframe.
Finally, I have prepared new methods to query the data and transform the data by adding new columns. These are ultimately executed in the main file, after creating a temporary view of the dataframe.

a. Query - This method involved gathering the average values for health metrics for patients of each age. Specifically, I sought to understand the average creatinine_phosphokinase, serum_creatinine, serum_sodium, and age values, for each age value represented within the dataset. Additionally, I sought to gather these average values for females within the dataset, so I specified that in the SQL command below.

'SELECT AVG(creatinine_phosphokinase), AVG(serum_creatinine), AVG(serum_sodium), AVG(age) FROM table_query WHERE sex=1 GROUP BY age ORDER BY age DESC;'

b. Add_Column' - This method involved adding a new column to the dataset, conditional on another column. In this case, I performed a filter and set up a new variable to indicate if a patient had a low platelet count, or under 150,000 platelets (1), or if they exceeded that threshold. The below image shows the addition of one more column, low_platelet_count`, within the dataset.

Results

The below image displays the results of the SQL operation. Additionally, this is also shared within the markdown file generated.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
.devcontainer		.devcontainer
.github/workflows		.github/workflows
mylib		mylib
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
main.py		main.py
repeat.sh		repeat.sh
requirements.txt		requirements.txt
setup.sh		setup.sh
spark-3.4.4-bin-hadoop3.tgz		spark-3.4.4-bin-hadoop3.tgz
test_main.py		test_main.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Mini Project 10 - Pyspark Usage

Introduction

Steps

Results

About

Releases

Packages

Languages

License

nogibjj/atreya-tadepalli-miniproject10

Folders and files

Latest commit

History

Repository files navigation

Mini Project 10 - Pyspark Usage

Introduction

Steps

Results

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages