Machine learning model development for a transport company; the objective is to predict whether an order will arrive on time or not.
We are part of a logistics company that works for a major e-commerce portal, and our team leader has given us the task of implementing a model that predicts whether a shipment will arrive on time, based on the information contained in the dataset.
The main dataset is a version of the Kaggle E-Commerce Shipping Data and contains the following fields (a short loading sketch follows the list below):
- ID: Customer ID number.
- Warehouse block: The company has a large warehouse divided into blocks such as A, B, C, D and E.
- Mode of shipment: The company ships products in multiple ways: Ship, Flight and Road.
- Customer care calls: The number of calls made to customer care to enquire about the shipment.
- Customer rating: The rating given by each customer; 1 is the lowest (worst), 5 is the highest (best).
- Cost of the product: Cost of the product in US dollars.
- Prior purchases: The number of prior purchases.
- Product importance: The company has categorized products into importance levels: low, medium and high.
- Gender: Male or Female.
- Discount offered: Discount offered on that specific product.
- Weight in gms: The weight of the product in grams.
- Reached on time: The target variable, where 1 indicates that the product did NOT arrive on time and 0 indicates that it did.
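As a quick orientation, the snippet below is a minimal sketch of loading the dataset with pandas and checking the target balance. The file name `Train.csv` and the column name `Reached.on.Time_Y.N` are assumptions based on the original Kaggle dataset and may differ in this repository.

```python
import pandas as pd

# Load the dataset; "Train.csv" is an assumed file name from the Kaggle version
df = pd.read_csv("Train.csv")

# Quick sanity checks: dimensions, column types and class balance of the target
print(df.shape)
print(df.dtypes)
# "Reached.on.Time_Y.N" is the Kaggle column name; adjust if your copy differs
print(df["Reached.on.Time_Y.N"].value_counts(normalize=True))
```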
The recall of the confusion matrix will be used as the method for evaluating model performance. Our main interest is to find those shipments that will not arrive on time. Recall answers the question: what percentage of the shipments that do not arrive on time are we able to identify?
$$Recall = \frac{TP}{TP + FN}$$

where $TP$ (true positives) is the number of late shipments correctly identified and $FN$ (false negatives) is the number of late shipments wrongly classified as on time.
Accuracy is another metric based on the confusion matrix. In this case we use it to evaluate classification performance for both class 1 and class 0 of our target variable. Note that in this exercise the positive class is class 1, i.e. the shipments that do not arrive on time.
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TN$ (true negatives) and $FP$ (false positives) are the on-time shipments classified correctly and incorrectly, respectively.
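As a minimal sketch of how both metrics can be computed with scikit-learn, the labels below are placeholders for illustration only:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Placeholder labels: 1 = did NOT arrive on time, 0 = arrived on time
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Recall of the positive class (1): share of late shipments we actually catch
print(recall_score(y_true, y_pred, pos_label=1))  # TP / (TP + FN)

# Accuracy: share of all shipments, in both classes, classified correctly
print(accuracy_score(y_true, y_pred))             # (TP + TN) / total

# Full confusion matrix for reference
print(confusion_matrix(y_true, y_pred))
```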
- Exploratory Data Analysis (EDA)
- Data Preprocessing
- First Modeling Batch (Working with raw data)
- Second Modeling Batch (Applying One-Hot Encoding; a combined pipeline sketch follows this list)
- Third Modeling Batch (Evaluating StandardScaler)
- Fourth Modeling Batch (Evaluating Dimension Reduction using PCA)
- Final model selection and searching for best hyperparameters with GridSearchCV
- Conclusions
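To make the modeling batches above concrete, here is a hedged sketch that chains One-Hot Encoding, StandardScaler and PCA into a single scikit-learn pipeline and tunes it with GridSearchCV scored on recall, our primary metric. The column names, parameter grid and the LogisticRegression estimator are illustrative assumptions, not necessarily the notebook's actual choices:

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column names based on the dataset description above
categorical = ["Warehouse_block", "Mode_of_Shipment", "Product_importance", "Gender"]
numeric = ["Customer_care_calls", "Customer_rating", "Cost_of_the_Product",
           "Prior_purchases", "Discount_offered", "Weight_in_gms"]

# Second and third batches: one-hot encode categoricals, scale numerics
# (sparse_output requires scikit-learn >= 1.2)
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical),
    ("scale", StandardScaler(), numeric),
])

# Fourth batch: PCA on the preprocessed features; LogisticRegression is a
# placeholder estimator, not the model necessarily chosen in the notebook
pipe = Pipeline([
    ("preprocess", preprocess),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Final step: hyperparameter search, scoring on recall of class 1
param_grid = {
    "pca__n_components": [5, 10, 15],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, scoring="recall", cv=5)
# search.fit(X_train, y_train)  # X_train, y_train come from a prior train/test split
# print(search.best_params_, search.best_score_)
```

Wrapping the preprocessing inside the pipeline means each GridSearchCV fold fits the encoder, scaler and PCA on its own training split, which avoids data leakage during the search.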
For more in-depth information, please don't hesitate to open main.ipynb.
- Scikit-Learn Documentation
- StandardScaler vs MinMaxScaler
- Video: Scaling, Normalization and Standardization (Spanish)
- Video: How to implement One Hot Encoding
Regards, Jean Paul Fabra Ruiz: jeanfabra11@gmail.com
LinkedIn: https://www.linkedin.com/in/jeanfabra/