🖥 View the live application
A project that showcases how machine learning projects can be streamlined and automated. It integrates modern MLOps practices, including continuous integration (CI), continuous deployment (CD), and automated model training, evaluation, and deployment.
- Machine Learning Pipeline: A scikit-learn pipeline for training a Random Forest Classifier, including custom feature engineering steps (a rough sketch follows this list).
- Model Evaluation and Deployment: Automates model evaluation against predefined metrics and deploys the model and application to Google Cloud Run if performance thresholds are met.
- Frontend Application: A Streamlit app for uploading files and displaying prediction results.
- Automated CI/CD Pipelines: Built with GitHub Actions and Google Cloud Build.
- Pre-commit Hooks: Automatically run a series of checks before each commit to catch common issues early and keep the code quality consistent.
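The real pipeline is defined in `ml/train.py` and `ml/feature_engineering.py`; as a rough illustration of the approach, a scikit-learn pipeline with a custom feature-engineering step and a Random Forest might look like the sketch below (the transformer and its features are hypothetical, not the project's exact implementation).

```python
# Illustrative sketch only; the actual pipeline lives in ml/train.py and
# ml/feature_engineering.py. The transformer below is a hypothetical stand-in.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline


class RowStatsFeatures(BaseEstimator, TransformerMixin):
    """Append simple per-row aggregate features to the raw columns."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        stats = np.column_stack([X.mean(axis=1), X.std(axis=1)])
        return np.hstack([X, stats])


pipeline = Pipeline([
    ("features", RowStatsFeatures()),
    ("model", RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)),
])
# pipeline.fit(X_train, y_train) followed by joblib.dump(pipeline, "model.joblib")
# would then train and serialize it.
```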
The project is structured as follows:
```
santander-mlops/
├── backend/
│   ├── tests/
│   │   ├── __init__.py
│   │   ├── conftest.py
│   │   ├── test_api.py
│   │   └── test_utils.py
│   ├── __init__.py
│   ├── api.py
│   ├── config.py
│   ├── requirements.txt
│   └── utils.py
├── deployments/
│   ├── cloud-build/
│   │   ├── santander-backend.yaml
│   │   └── santander-frontend.yaml
│   └── cloud-run/
│       ├── santander-backend.yaml
│       └── santander-frontend.yaml
├── frontend/
│   ├── __init__.py
│   ├── app.py
│   └── requirements.txt
├── img/
│   ├── api_docs.png
│   └── frontend.png
├── ml/
│   ├── data/
│   ├── models/
│   │   └── model.joblib
│   ├── __init__.py
│   ├── evaluate.py
│   ├── feature_engineering.py
│   ├── requirements.txt
│   ├── train.py
│   └── utils.py
├── notebooks/
│   └── random_forest.ipynb
├── scripts/
├── README.md
├── backend.dockerfile
├── docker-compose.yml
├── frontend.dockerfile
├── pyproject.toml
└── requirements-dev.txt
```
- Clone the repository:

  ```bash
  git clone https://github.com/shaleenb/santander-mlops.git
  cd santander-mlops
  ```

- Download the dataset from Kaggle and place the extracted files in the `ml/data` directory. This can also be done using the Kaggle API:

  ```bash
  # Install the Kaggle API
  pip install kaggle

  # Download the dataset
  kaggle competitions download -c santander-customer-transaction-prediction
  ```

  NOTE:
  - You will need to accept the competition rules on the Kaggle website to download the dataset.
  - If you are using the Kaggle API, you will also need to set up your Kaggle API credentials by following the instructions.

- Set up the Machine Learning Environment:

  ```bash
  pip install -r ml/requirements.txt
  ```

  It is recommended to use a Python virtual environment to avoid conflicts with system packages. You can create a virtual environment using the following commands:

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  ```

- Build the Docker Images:

  ```bash
  docker-compose build
  ```

- Launch the Docker Containers:

  ```bash
  docker-compose up
  ```

- The frontend application will be available at http://localhost:8501.
- The backend API will be available at http://localhost:8000.
- You can access the API documentation at http://localhost:8000/docs.
To train the model, run:

```bash
# Navigate to the ml directory
cd ml

# Run the training script
python train.py --data-file-path data/train.csv --model-file-path models/model.joblib --id-column ID_code
```
You can modify the training script to include additional preprocessing steps, feature engineering, and hyperparameter tuning.
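For instance, a simple grid search over the Random Forest's hyperparameters could be bolted on as in the sketch below; the column names (`ID_code`, `target`) and the hold-out split are assumptions about the Kaggle data, not code taken from `train.py`.

```python
# Illustrative sketch of adding hyperparameter tuning; column names and the
# train/validation split are assumptions, not the project's actual code.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv("data/train.csv")
X = df.drop(columns=["ID_code", "target"])
y = df["target"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}
search = GridSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```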
To evaluate a trained model, run:

```bash
# Navigate to the ml directory
cd ml

# Run the evaluation script
python evaluate.py --data-file-path data/test.csv \
    --model-file-path models/model.joblib \
    --id-column ID_code
```
This script will output the model's F1 Score and AUC-ROC score on the given dataset.
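For reference, these scores boil down to standard scikit-learn metrics; a minimal sketch of the computation (assuming the serialized pipeline and a labelled CSV with `ID_code` and `target` columns) is:

```python
# Minimal sketch of the two reported metrics; the column names are assumptions.
import joblib
import pandas as pd
from sklearn.metrics import f1_score, roc_auc_score

model = joblib.load("models/model.joblib")
df = pd.read_csv("data/test.csv")
X = df.drop(columns=["ID_code", "target"])
y = df["target"]

print("F1:", f1_score(y, model.predict(X)))
print("AUC-ROC:", roc_auc_score(y, model.predict_proba(X)[:, 1]))
```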
The frontend application is a Streamlit app that allows users to upload a CSV file and receive predictions from the trained model.
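A stripped-down sketch of that upload-and-predict flow is shown below; `frontend/app.py` is the real implementation, and the local backend URL here is just a placeholder.

```python
# Simplified sketch of the frontend flow; frontend/app.py is the actual app
# and the backend URL below is a placeholder.
import pandas as pd
import requests
import streamlit as st

st.title("Santander Customer Transaction Prediction")

uploaded = st.file_uploader("Upload a CSV file", type="csv")
if uploaded is not None:
    response = requests.post(
        "http://localhost:8000/predict?response_format=json",
        files={"file": uploaded},
    )
    st.dataframe(pd.DataFrame(response.json()))
```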
The backend API provides a single endpoint for making predictions using the trained model.
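The endpoint is implemented in `backend/api.py`; a minimal sketch of how a FastAPI prediction endpoint of this kind can be wired up (not the project's exact code, and the model path and ID column are assumptions) looks like this:

```python
# Minimal sketch of a file-upload prediction endpoint; backend/api.py is the
# real implementation. The model path and ID column are assumptions.
import io

import joblib
import pandas as pd
from fastapi import FastAPI, File, UploadFile

app = FastAPI()
model = joblib.load("ml/models/model.joblib")


@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    df = pd.read_csv(io.BytesIO(await file.read()))
    features = df.drop(columns=["ID_code"], errors="ignore")
    return {"predictions": model.predict(features).tolist()}
```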
The API documentation is available at the `/docs` endpoint.

The API can also be accessed using command-line tools like `curl`:
```bash
curl -k -X 'POST' \
  'https://santander-backend-jlgkdezfva-em.a.run.app/predict?response_format=csv' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@<FILE_PATH>;type=text/csv'
```
It can also be accessed using Python's `requests` library:
```python
import requests

with open(file_path, 'rb') as file:
    response = requests.post(
        'https://santander-backend-jlgkdezfva-em.a.run.app/predict?response_format=json',
        files={'file': file},
    )

predictions = response.json()
```
This project uses GitHub Actions and Google Cloud Build for CI/CD. The workflows are defined in `.github/workflows/`, with separate workflows for continuous integration and continuous deployment.
- CI Workflow: Runs on every push to main and on pull requests, executing linting, testing, and Docker image builds.
- CD Workflow: Triggers when a new tag is pushed to the repository, evaluating the model and deploying the application to Google Cloud Run if the model meets predefined performance thresholds.
- FastAPI
  - Minimal boilerplate and very quick to set up.
  - Quite fast for a Python framework.
  - It's asynchronous, which may come in handy later in the project.
- Streamlit
  - The easiest and fastest way to build a simple UI for someone who doesn't know how to build a UI.
- Google Cloud Run
  - Can deploy containerised applications with minimal extra effort.
  - Serverless, so it saves costs when not running.
  - Supports concurrent requests and can autoscale to thousands of instances.
  - Makes continuous deployment easy with Cloud Build Triggers.
- Typer
  - It's like FastAPI, but for CLIs (a small sketch follows this list).
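As an illustration of how the CLI flags used earlier map onto Typer, a sketch (not the actual `ml/train.py`) might look like:

```python
# Sketch of a Typer CLI mirroring the flags used in the training command;
# the real entry point is ml/train.py.
import typer

app = typer.Typer()


@app.command()
def train(
    data_file_path: str = typer.Option(..., "--data-file-path"),
    model_file_path: str = typer.Option(..., "--model-file-path"),
    id_column: str = typer.Option("ID_code", "--id-column"),
):
    typer.echo(f"Training on {data_file_path}, saving model to {model_file_path}")
    # ...load the data, drop id_column, fit the pipeline, dump it with joblib...


if __name__ == "__main__":
    app()
```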
- Add MLflow for model tracking and experiment management
- Add model monitoring and alerting using Prometheus, Grafana, and Evidently
- Use monitoring metrics to trigger retraining and redeployment of the model
- Add API authentication
- Store the model binary in a cloud storage bucket and load it from there
- Use Poetry for dependency management
- Improve CI/CD
  - Fail pipelines if tests fail
- I referred to the EDA from gpreda's notebook to save time.
- I considered using pandas-profiling, but given the number of columns, it would have been too slow.