🖥 View the live application
A project that showcases how machine learning projects can be streamlined and automated. It integrates modern MLOps practices, including continuous integration (CI), continuous deployment (CD), and automated model training, evaluation, and deployment.
- Machine Learning Pipeline: A scikit-learn pipeline for training a Random Forest Classifier, including custom feature engineering steps (a rough sketch follows this list).
- Model Evaluation and Deployment: Automates model evaluation against predefined metrics and deploys the model and application to Google Cloud Run if performance thresholds are met.
- Frontend Application: A Streamlit app for uploading files and displaying prediction results.
- Automated CI/CD Pipelines: Built with GitHub Actions and Google Cloud Build.
- Pre-commit Hooks: Automatically run a series of checks before each commit to catch common issues early and keep the code quality consistent.
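The real pipeline is defined in `ml/train.py` and `ml/feature_engineering.py`; as a rough illustration of the approach, a scikit-learn pipeline with a custom feature-engineering step and a Random Forest might look like the sketch below (the transformer and its features are hypothetical, not the project's exact implementation).

```python
# Illustrative sketch only; the actual pipeline lives in ml/train.py and
# ml/feature_engineering.py. The transformer below is a hypothetical stand-in.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline


class RowStatsFeatures(BaseEstimator, TransformerMixin):
    """Append simple per-row aggregate features to the raw columns."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        stats = np.column_stack([X.mean(axis=1), X.std(axis=1)])
        return np.hstack([X, stats])


pipeline = Pipeline([
    ("features", RowStatsFeatures()),
    ("model", RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)),
])
# pipeline.fit(X_train, y_train) followed by joblib.dump(pipeline, "model.joblib")
# would then train and serialize it.
```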
The project is structured as follows:
```
santander-mlops/
├── backend/
│   ├── tests/
│   │   ├── __init__.py
│   │   ├── conftest.py
│   │   ├── test_api.py
│   │   └── test_utils.py
│   ├── __init__.py
│   ├── api.py
│   ├── config.py
│   ├── requirements.txt
│   └── utils.py
├── deployments/
│   ├── cloud-build/
│   │   ├── santander-backend.yaml
│   │   └── santander-frontend.yaml
│   └── cloud-run/
│       ├── santander-backend.yaml
│       └── santander-frontend.yaml
├── frontend/
│   ├── __init__.py
│   ├── app.py
│   └── requirements.txt
├── img/
│   ├── api_docs.png
│   └── frontend.png
├── ml/
│   ├── data/
│   ├── models/
│   │   └── model.joblib
│   ├── __init__.py
│   ├── evaluate.py
│   ├── feature_engineering.py
│   ├── requirements.txt
│   ├── train.py
│   └── utils.py
├── notebooks/
│   └── random_forest.ipynb
├── scripts/
├── README.md
├── backend.dockerfile
├── docker-compose.yml
├── frontend.dockerfile
├── pyproject.toml
└── requirements-dev.txt
```
- Clone the repository:

  ```bash
  git clone https://github.com/shaleenb/santander-mlops.git
  cd santander-mlops
  ```

- Download the dataset from Kaggle and place the extracted files in the `ml/data` directory. This can also be done using the Kaggle API:

  ```bash
  # Install the Kaggle API
  pip install kaggle

  # Download the dataset
  kaggle competitions download -c santander-customer-transaction-prediction
  ```

  NOTE:
  - You will need to accept the competition rules on the Kaggle website to download the dataset.
  - If you are using the Kaggle API, you will also need to set up your Kaggle API credentials by following the instructions.

- Set up the Machine Learning Environment:

  ```bash
  pip install -r ml/requirements.txt
  ```

  It is recommended to use a Python virtual environment to avoid conflicts with system packages. You can create a virtual environment using the following commands:

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  ```

- Build the Docker Images:

  ```bash
  docker-compose build
  ```

- Launch the Docker Containers:

  ```bash
  docker-compose up
  ```

- The frontend application will be available at http://localhost:8501.
- The backend API will be available at http://localhost:8000.
- You can access the API documentation at http://localhost:8000/docs.
To train the model, run:

```bash
# Navigate to the ml directory
cd ml

# Run the training script
python train.py --data-file-path data/train.csv --model-file-path models/model.joblib --id-column ID_code
```
You can modify the training script to include additional preprocessing steps, feature engineering, and hyperparameter tuning.
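For instance, a simple grid search over the Random Forest's hyperparameters could be bolted on as in the sketch below; the column names (`ID_code`, `target`) and the hold-out split are assumptions about the Kaggle data, not code taken from `train.py`.

```python
# Illustrative sketch of adding hyperparameter tuning; column names and the
# train/validation split are assumptions, not the project's actual code.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv("data/train.csv")
X = df.drop(columns=["ID_code", "target"])
y = df["target"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}
search = GridSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```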
To evaluate a trained model, run:

```bash
# Navigate to the ml directory
cd ml

# Run the evaluation script
python evaluate.py --data-file-path data/test.csv \
    --model-file-path models/model.joblib \
    --id-column ID_code
```
This script will output the model's F1 Score and AUC-ROC score on the given dataset.
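For reference, these scores boil down to standard scikit-learn metrics; a minimal sketch of the computation (assuming the serialized pipeline and a labelled CSV with `ID_code` and `target` columns) is:

```python
# Minimal sketch of the two reported metrics; the column names are assumptions.
import joblib
import pandas as pd
from sklearn.metrics import f1_score, roc_auc_score

model = joblib.load("models/model.joblib")
df = pd.read_csv("data/test.csv")
X = df.drop(columns=["ID_code", "target"])
y = df["target"]

print("F1:", f1_score(y, model.predict(X)))
print("AUC-ROC:", roc_auc_score(y, model.predict_proba(X)[:, 1]))
```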
The frontend application is a Streamlit app that allows users to upload a CSV file and receive predictions from the trained model.
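A stripped-down sketch of that upload-and-predict flow is shown below; `frontend/app.py` is the real implementation, and the local backend URL here is just a placeholder.

```python
# Simplified sketch of the frontend flow; frontend/app.py is the actual app
# and the backend URL below is a placeholder.
import pandas as pd
import requests
import streamlit as st

st.title("Santander Customer Transaction Prediction")

uploaded = st.file_uploader("Upload a CSV file", type="csv")
if uploaded is not None:
    response = requests.post(
        "http://localhost:8000/predict?response_format=json",
        files={"file": uploaded},
    )
    st.dataframe(pd.DataFrame(response.json()))
```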
The backend API provides a single endpoint for making predictions using the trained model.
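The endpoint is implemented in `backend/api.py`; a minimal sketch of how a FastAPI prediction endpoint of this kind can be wired up (not the project's exact code, and the model path and ID column are assumptions) looks like this:

```python
# Minimal sketch of a file-upload prediction endpoint; backend/api.py is the
# real implementation. The model path and ID column are assumptions.
import io

import joblib
import pandas as pd
from fastapi import FastAPI, File, UploadFile

app = FastAPI()
model = joblib.load("ml/models/model.joblib")


@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    df = pd.read_csv(io.BytesIO(await file.read()))
    features = df.drop(columns=["ID_code"], errors="ignore")
    return {"predictions": model.predict(features).tolist()}
```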
The API documentation is available at the `/docs` endpoint.

The API can also be accessed using command-line tools like `curl`:
```bash
curl -k -X 'POST' \
  'https://santander-backend-jlgkdezfva-em.a.run.app/predict?response_format=csv' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@<FILE_PATH>;type=text/csv'
```
It can also be accessed using Python's `requests` library:
```python
import requests

with open(file_path, 'rb') as file:
    response = requests.post(
        'https://santander-backend-jlgkdezfva-em.a.run.app/predict?response_format=json',
        files={'file': file},
    )

predictions = response.json()
```
This project uses GitHub Actions and Google Cloud Build for CI/CD. The workflows are defined in `.github/workflows/`, with separate workflows for continuous integration and continuous deployment.
- CI Workflow: Runs on every push to main and on pull requests, executing linting, testing, and Docker image builds.
- CD Workflow: Triggers when a new tag is pushed to the repository, evaluating the model and deploying the application to Google Cloud Run if the model meets predefined performance thresholds.
- FastAPI
  - Minimal boilerplate and very quick to set up.
  - Quite fast for a Python framework.
  - It's asynchronous, which may come in handy later in the project.
- Streamlit
  - The easiest and fastest way to build a simple UI for someone who doesn't know how to build a UI.
- Google Cloud Run
  - Can deploy containerised applications with minimal extra effort.
  - Serverless, so it saves costs when not running.
  - Supports concurrent requests and can autoscale to thousands of instances.
  - Makes continuous deployment easy with Cloud Build Triggers.
- Typer
  - It's like FastAPI, but for CLIs (a small sketch follows this list).
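As an illustration of how the CLI flags used earlier map onto Typer, a sketch (not the actual `ml/train.py`) might look like:

```python
# Sketch of a Typer CLI mirroring the flags used in the training command;
# the real entry point is ml/train.py.
import typer

app = typer.Typer()


@app.command()
def train(
    data_file_path: str = typer.Option(..., "--data-file-path"),
    model_file_path: str = typer.Option(..., "--model-file-path"),
    id_column: str = typer.Option("ID_code", "--id-column"),
):
    typer.echo(f"Training on {data_file_path}, saving model to {model_file_path}")
    # ...load the data, drop id_column, fit the pipeline, dump it with joblib...


if __name__ == "__main__":
    app()
```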
- Add MLflow for model tracking and experiment management
- Add model monitoring and alerting using Prometheus, Grafana, and Evidently
- Use monitoring metrics to trigger retraining and redeployment of the model
- Add API authentication
- Store the model binary in a cloud storage bucket and load it from there
- Use Poetry for dependency management
- Improve CI/CD
  - Fail pipelines if tests fail
- I referred to the EDA from gpreda's notebook to save time.
- I considered using pandas-profiling, but given the number of columns, it would have been too slow.