Data rarely arrives ready for analysis or machine learning, so this package aims to speed up data cleaning and initial exploratory data analysis (EDA). It focuses on four tasks: handling missing values, handling outliers, scaling, and visualizing correlations.
$ pip install -i https://test.pypi.org/simple/ eda_utils_py
The four functions contained in this package are as follows:
- imputer: A function to impute missing values
- outlier_identifier: A function to identify and deal with outliers
- cor_map: A function to plot a correlation matrix of numeric columns in the dataframe
- scale: A function to scale numerical values in the dataset
While Python packages with similar functionality exist, this package aims to reduce the amount of code needed to run these functions and produce their outputs. Packages with similar functionality are as follows:
- Please see a list of dependencies here.
The eda_utils_py package helps with the exploratory data analysis portion of your work.
eda_utils_py includes several custom functions that perform initial exploratory analysis on any input data, describing its structure and the relationships present in it. Depending on the function, the output is returned as an object or as a plot.
import pandas as pd
from eda_utils_py import eda_utils_py
data = pd.DataFrame({
'SepalLengthCm':[5.1, 4.9, 4.7],
'SepalWidthCm':[1.4, 1.4, 1.3],
'PetalWidthCm':[0.2, 0.1, 0.2],
'Species':['Iris-setosa','Iris-virginica', 'Iris-germanica']
})
data_with_NA = pd.DataFrame({
'SepalLengthCm':[5.1, 4.9, 4.7],
'SepalWidthCm':[1.4, 1.4, 1.3],
'PetalWidthCm':[0.2, 0.1, None]
})
data_with_outlier = pd.DataFrame({
'SepalLengthCm':[5.1, 4.9, 4.7, 5.2, 5.1, 5.2, 5.1, 4.8],
'SepalWidthCm':[1.4, 1.4, 1.3, 1.2, 1.2, 1.3, 1.6, 1.3],
'PetalWidthCm':[0.2, 0.1, 30, 0.2, 0.3, 0.1, 0.4, 0.5]
})
data_with_scale = pd.DataFrame({
'SepalLengthCm':[1, 0, 0, 3, 4],
'SepalWidthCm':[4, 1, 1, 0, 1],
'PetalWidthCm':[2, 0, 0, 2, 1],
'Species':['Iris-setosa','Iris-virginica', 'Iris-germanica', 'Iris-virginica','Iris-germanica']
})
The eda_utils_py package contains functions that will help you to:
- Impute: Identify missing values in the data and replace them with imputed values.
eda_utils_py.imputer(data_with_NA)
[Figure: output of imputer()]
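To make the idea of imputation concrete, here is a minimal pandas-only sketch. It is not the package's implementation: filling with the column mean is an assumed strategy chosen for illustration, and the variable names `column_means` and `imputed` are hypothetical.

```python
import pandas as pd

data_with_NA = pd.DataFrame({
    'SepalLengthCm': [5.1, 4.9, 4.7],
    'SepalWidthCm': [1.4, 1.4, 1.3],
    'PetalWidthCm': [0.2, 0.1, None],
})

# Assumption: fill each missing value with the mean of its own column.
# DataFrame.mean() skips NaN by default, so the mean uses observed values only.
column_means = data_with_NA.mean()
imputed = data_with_NA.fillna(column_means)

print(imputed)
```

The missing `PetalWidthCm` entry is filled from the two observed values in that column, and the other columns are left unchanged.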
- Identify Outliers: Identify and deal with outliers in the dataset.
eda_utils_py.outlier_identifier(data_with_outlier, method = "median")
[Figure: output of outlier_identifier()]
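As a rough illustration of what identifying and replacing an outlier can look like, the sketch below uses plain pandas on a single column. The detection rule (1.5 times the interquartile range) is an assumption made for this example and may differ from the rule used inside `outlier_identifier`; only the idea of replacing flagged values with the median mirrors the `method = "median"` call above.

```python
import pandas as pd

petal_width = pd.Series(
    [0.2, 0.1, 30, 0.2, 0.3, 0.1, 0.4, 0.5], name='PetalWidthCm'
)

# Assumed detection rule: flag values beyond 1.5 * IQR from the quartiles.
q1 = petal_width.quantile(0.25)
q3 = petal_width.quantile(0.75)
iqr = q3 - q1
is_outlier = (petal_width < q1 - 1.5 * iqr) | (petal_width > q3 + 1.5 * iqr)

# Replace flagged values with the column median; other values are kept.
cleaned = petal_width.mask(is_outlier, petal_width.median())

print(cleaned)
```

Here only the value 30 is flagged, and it is replaced by the median of the column.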
- Correlation Heatmap Plotting: Easily plot a correlation matrix along with its values to help explore data.
numerical_columns = ['SepalLengthCm','SepalWidthCm','PetalWidthCm']
eda_utils_py.cor_map(data, numerical_columns, col_scheme = 'purpleorange')
[Figure: output of cor_map()]
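The numbers behind a correlation heatmap can be inspected directly with pandas. The sketch below computes only the correlation matrix for the example columns; it does not reproduce the plot, and it assumes Pearson correlation (the pandas default), which may or may not match what `cor_map` uses internally.

```python
import pandas as pd

data = pd.DataFrame({
    'SepalLengthCm': [5.1, 4.9, 4.7],
    'SepalWidthCm': [1.4, 1.4, 1.3],
    'PetalWidthCm': [0.2, 0.1, 0.2],
    'Species': ['Iris-setosa', 'Iris-virginica', 'Iris-germanica'],
})
numerical_columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalWidthCm']

# Pairwise Pearson correlation between the selected numeric columns.
corr_matrix = data[numerical_columns].corr()

print(corr_matrix.round(2))
```

The result is a symmetric 3 x 3 table with ones on the diagonal, which is the kind of data a heatmap visualizes.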
- Scaling: Scale the data in preparation for future use in machine learning projects.
numerical_columns = ['SepalLengthCm','SepalWidthCm','PetalWidthCm']
eda_utils_py.scale(data, numerical_columns, scaler="minmax")
[Figure: output of scale()]
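For reference, min-max scaling maps each column onto the range [0, 1] using (x - min) / (max - min). The following pandas-only sketch applies that formula to the `data_with_scale` example; it illustrates the transformation itself and is not taken from the package's source.

```python
import pandas as pd

data_with_scale = pd.DataFrame({
    'SepalLengthCm': [1, 0, 0, 3, 4],
    'SepalWidthCm': [4, 1, 1, 0, 1],
    'PetalWidthCm': [2, 0, 0, 2, 1],
    'Species': ['Iris-setosa', 'Iris-virginica', 'Iris-germanica',
                'Iris-virginica', 'Iris-germanica'],
})
numerical_columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalWidthCm']

# Min-max scaling, column by column: (x - min) / (max - min).
numeric = data_with_scale[numerical_columns]
scaled = (numeric - numeric.min()) / (numeric.max() - numeric.min())

print(scaled)
```

After scaling, every numeric column has a minimum of 0 and a maximum of 1. The non-numeric `Species` column is left out of the calculation.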
The official documentation is hosted on Read the Docs: https://eda_utils_py.readthedocs.io/en/latest/
This package is authored by Chuang Wang, Fatime Selimi, Jiacheng Wang, and Micah Kwok as part of the course project in DSCI-524 (UBC-MDS program). You can see the list of all contributors in the contributors tab.
We welcome and recognize all contributions. If you wish to participate, please review our contributing guidelines.
This package was created with Cookiecutter and the UBC-MDS/cookiecutter-ubc-mds project template, modified from the pyOpenSci/cookiecutter-pyopensci project template and audreyr/cookiecutter-pypackage.