Presentation

The main purpose of this package is to provide a Python AutoML class named AML that covers the complete pipeline of a binary classification project, from a raw dataset to a deployable model. It can also be used as a catalogue of functions and classes to ease and speed up repetitive development tasks for Data Scientists.

The full code documentation is available on the MLBG59 Read the Docs site.

Getting Started

Prerequisites

Python 3.7
pandas 1.0.1
torch
xgboost

Installation

Since this package is published on PyPI, it can be installed with pip from a terminal:

$ pip install AutoMxL
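
After installing, you may want to confirm that the prerequisites are visible to your interpreter. The snippet below is a minimal sketch that only uses the standard library; the helper name missing_packages is hypothetical and not part of AutoMxL, and the import names are assumed to match the package names listed above.

```python
import importlib.util
import sys

# Import names assumed from the Prerequisites list, plus the package itself.
REQUIRED = ["pandas", "torch", "xgboost", "AutoMxL"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    print("Missing packages:", missing_packages(REQUIRED) or "none")
```

An empty list means every package can be located; it does not check the installed versions.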

AML class tutorial

AML is built as a class that inherits from the pandas DataFrame. Each Machine Learning step corresponds to a method that can be called with default or user-supplied parameters.

Note: for each method, the verbose parameter lets you display logging information.

Import and target encoding

If needed, the Start sub-package provides functions that facilitate data loading and target encoding.

# import package
from AutoMxL import *

# import data into DataFrame with delimiter auto-detection for csv and txt files
df_raw = import_data('data/bank-additional-full.csv', verbose=False)

# set "yes" category from variable "y" as the classification target.
# => get modified dataset and new target name
df, target = category_to_target(df_raw, var='y', cat='yes')

# instantiate AML object with dataset and target name
auto_df = AML(df, target=target)
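
To make the two ideas above more concrete (delimiter auto-detection and turning one category into a binary target), here is a minimal standard-library sketch. It is not the code of import_data or category_to_target: the functions read_rows and category_to_binary, the sample data, and the "y_yes" naming scheme are all assumptions made for illustration.

```python
import csv
import io

# Made-up sample standing in for a semicolon-separated file.
SAMPLE = "age;job;y\n30;admin;yes\n45;technician;no\n28;services;yes\n"


def read_rows(text):
    """Guess the delimiter from the text, then parse it into dict rows."""
    dialect = csv.Sniffer().sniff(text, delimiters=",;\t|")
    return list(csv.DictReader(io.StringIO(text), dialect=dialect))


def category_to_binary(rows, var, cat):
    """Replace column `var` by a 0/1 column flagging rows equal to `cat`."""
    target = f"{var}_{cat}"  # hypothetical naming scheme
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != var}
        new_row[target] = int(row[var] == cat)
        encoded.append(new_row)
    return encoded, target


rows = read_rows(SAMPLE)
encoded_rows, target_name = category_to_binary(rows, var="y", cat="yes")
print(target_name, [row[target_name] for row in encoded_rows])
```

Like the call shown above, the sketch returns both the modified data and the name of the new target column.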

Explore

The explore method gives you global information about the dataset and automatically identifies feature types (booleans, dates, verbatims, categoricals, numericals). This information is stored in the d_features attribute.

auto_df.explore(verbose=False)

print(auto_df.d_features.keys())
> output : dict_keys(['date', 'identifier', 'verbatim', 'boolean', 'categorical', 'numerical', 'NA', 'low_variance'])
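
The kind of rule that can sort features into such groups is sketched below in plain Python. The thresholds and the function infer_feature_types are assumptions for illustration only; the actual rules used by explore are not described here and may differ (for example, this sketch ignores dates, identifiers and verbatims).

```python
def infer_feature_types(columns):
    """Group column names by a rough type guess.

    `columns` maps a column name to its list of values (None = missing).
    """
    d_features = {"boolean": [], "numerical": [], "categorical": [], "low_variance": []}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        distinct = set(present)
        if len(distinct) <= 1:
            # A single observed value carries no information.
            d_features["low_variance"].append(name)
        elif len(distinct) == 2:
            d_features["boolean"].append(name)
        elif all(isinstance(v, (int, float)) for v in present):
            d_features["numerical"].append(name)
        else:
            d_features["categorical"].append(name)
    return d_features


# Made-up columns for the example.
columns = {
    "age": [30, 45, 28, 51],
    "job": ["admin", "technician", "services", "admin"],
    "has_loan": ["yes", "no", "no", "yes"],
    "country": ["PT", "PT", "PT", "PT"],
}
print(infer_feature_types(columns))
```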

Preprocess

The preprocess method prepares the data before feeding it to the model:

removes features with low variance and features identified as verbatims and identifiers
transforms date features to numeric data (timedelta, ...)
fills missing values
processes categorical data (using one hot encoding or PyTorch NN embedding encoder)
processes outliers (optional)

auto_df.preprocess(process_outliers=False, cat_method='encoder', verbose=False)
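
Two of the steps listed above, filling missing values and one hot encoding, can be sketched in a few lines of plain Python. The fill strategies chosen here (median for numbers, a "missing" placeholder for categories) and the helper names are assumptions; the README does not state which strategies preprocess uses.

```python
from statistics import median


def fill_missing(values, fill_value):
    """Replace None entries by `fill_value`."""
    return [fill_value if v is None else v for v in values]


def one_hot(values, prefix):
    """Return one 0/1 list per category, keyed by '<prefix>_<category>'."""
    categories = sorted(set(values))
    return {f"{prefix}_{c}": [int(v == c) for v in values] for c in categories}


# Made-up columns with gaps.
ages = [30, None, 28, 51]
jobs = ["admin", None, "services", "admin"]

ages_filled = fill_missing(ages, median(v for v in ages if v is not None))
jobs_filled = fill_missing(jobs, "missing")
jobs_encoded = one_hot(jobs_filled, "job")

print(ages_filled)
print(jobs_encoded)
```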

Select Features (optional)

The select_features method reduces the number of features to speed up modelling (it may also improve model performance).

auto_df.select_features(verbose=False)
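
The README does not say which technique select_features relies on. As a generic illustration of feature reduction, and not a description of the package's method, the sketch below drops any feature that is almost perfectly correlated with one already kept. The function names, the 0.95 threshold and the data are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.95):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up features: duration_min is an exact multiple of duration_sec.
features = {
    "duration_sec": [60, 120, 180, 240],
    "duration_min": [1, 2, 3, 4],
    "age": [40, 25, 52, 31],
}
print(drop_correlated(features))
```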

Model Train Test

The model_train_test method trains and tests models with random search:

creates models with random hyper-parameter combinations drawn from the HP grid
splits the data (random 80/20) into train/test sets to fit/apply the models
identifies valid models: |auc(train) - auc(test)| < 0.03
gets the best model among the valid models according to a selected metric

Available classifiers: Random Forest, XGBoost (and bagging).
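
The first two steps of the list above (random hyper-parameter combinations and a random 80/20 split) can be sketched with the standard library alone. The grid values, function names and seeds below are made up for the example; the grid actually used by the package is not reproduced here.

```python
import random

# Hypothetical grid, for illustration only.
HP_GRID = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 8],
    "learning_rate": [0.01, 0.1, 0.3],
}


def sample_hyperparameters(grid, n_models, seed=0):
    """Draw `n_models` random combinations, one value per hyper-parameter."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in grid.items()}
            for _ in range(n_models)]


def train_test_split_indices(n_rows, test_share=0.2, seed=0):
    """Shuffle row indices and cut them into train and test parts."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(round(n_rows * test_share))
    return indices[n_test:], indices[:n_test]


combinations = sample_hyperparameters(HP_GRID, n_models=4)
train_idx, test_idx = train_test_split_indices(100)
print(len(combinations), len(train_idx), len(test_idx))
```

Values are drawn independently for each model, so the same combination can appear more than once.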

d_fitted_models, l_valid_models, best_model_idx, df_model_res = auto_df.model_train_test(verbose=False)

Output:

d_fitted_models: dict containing models and information on test set
l_valid_models: valid model indexes
best_model_idx: best model index
df_model_res: models information and metrics stored in DataFrame
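
The validity rule and the choice of the best model can be sketched as follows. The AUC values are invented, and select_models is a hypothetical helper, not a function of the package; only the 0.03 threshold comes from the description above.

```python
def select_models(results, metric="auc_test", delta_auc=0.03):
    """Return the indexes of valid models and the index of the best one.

    A model is valid when its train and test AUC differ by less than `delta_auc`.
    """
    valid = [idx for idx, res in results.items()
             if abs(res["auc_train"] - res["auc_test"]) < delta_auc]
    best = max(valid, key=lambda idx: results[idx][metric]) if valid else None
    return valid, best


# Invented scores for three models.
results = {
    0: {"auc_train": 0.95, "auc_test": 0.88},  # gap of 0.07: rejected as overfitted
    1: {"auc_train": 0.90, "auc_test": 0.89},
    2: {"auc_train": 0.87, "auc_test": 0.86},
}
valid_models, best_idx = select_models(results)
print(valid_models, best_idx)
```

Model 0 has the highest train AUC but is discarded because its train/test gap exceeds the threshold, so the best model is picked among models 1 and 2 only.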

Note: if you prefer to train and test your models separately, you can use the following modelling methods instead:

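auto_df.model_train(verbose=False)
# df_sel: dataset prepared with preprocess_apply and select_features_apply (see Application methods)
d_fitted_models, l_valid_models, best_model_idx, df_model_res = auto_df.model_apply(df_sel, verbose=False)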

Application methods

Once you have applied preprocess and select_features, you can apply the same transformations to any dataset with the same structure using the following methods:

df_prep = auto_df.preprocess_apply(df, verbose=False)
df_sel = auto_df.select_features_apply(df_prep, verbose=False)
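
The idea behind these methods is a fit-then-apply pattern: parameters are learned once on the original data and reused unchanged on new data. The toy class below illustrates that pattern for a single categorical column; the class FillAndEncode and its behaviour (most frequent value as filler, all zeros for unseen categories) are assumptions for the example, not the package's implementation.

```python
class FillAndEncode:
    """Learn a fill value and a category list once, then reuse them."""

    def fit(self, values):
        present = [v for v in values if v is not None]
        # Most frequent category is used to fill gaps (an illustrative choice).
        self.fill_value = max(set(present), key=present.count)
        self.categories = sorted(set(present))
        return self

    def apply(self, values):
        """Encode `values` with the parameters learned in fit()."""
        filled = [self.fill_value if v is None else v for v in values]
        # Categories unseen during fit() end up as a row of zeros.
        return [[int(v == c) for c in self.categories] for v in filled]


train_jobs = ["admin", "services", "admin", None]
new_jobs = [None, "services", "student"]

encoder = FillAndEncode().fit(train_jobs)
print(encoder.categories)
print(encoder.apply(new_jobs))
```

Because the columns are fixed at fit time, the new dataset comes out with exactly the same layout as the training data.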

Other methods

Since AML inherits from the pandas DataFrame, you can call any DataFrame method on it.

Note: calling copy() on an AML object returns a DataFrame. To copy an AML object, use the duplicate() method instead.
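
The same pitfall exists with Python built-ins, which makes it easy to reproduce without pandas. The snippet below is only an analogy: Tagged is a made-up dict subclass, and its duplicate() is one possible way to keep the subclass and its attributes, not necessarily how AML implements it.

```python
import copy


class Tagged(dict):
    """A dict subclass carrying an extra attribute, loosely mirroring AML."""

    def __init__(self, *args, target=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.target = target

    def duplicate(self):
        """Return a deep copy that keeps the subclass and its attributes."""
        return copy.deepcopy(self)


data = Tagged({"y": [1, 0, 1]}, target="y")

plain = data.copy()       # inherited method: returns a plain dict
clone = data.duplicate()  # keeps the Tagged type and the target attribute

print(type(plain).__name__, type(clone).__name__, clone.target)
```

The inherited copy() loses both the subclass and the extra attribute, which is why a dedicated method is needed.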

Information

Release History

1.0.0 : First proper release

Next steps

Regression and multi-class classification

Licence

Distributed under the MIT license. See License.txt for more information.

Author

Maxence Labesse - maxence.labesse@yahoo.fr

https://github.com/Maxence-Labesse/AutoMxL
