This project focuses on building a regression model to predict crop yield ('yield') using a dataset with various agricultural metrics. We employ extensive data analysis, feature engineering, and model tuning to minimize the model's error. The final model is stored as a serialized object for easy reuse in production.
- Custom Preprocessing Pipelines: Includes transformers for feature selection, feature engineering, and outlier handling.
- Extensive Feature Engineering: New features are crafted based on domain knowledge to improve model performance.
- Pipeline Integration: A single unified pipeline to streamline preprocessing, feature engineering, and model training.
- Hyperparameter Tuning: Hyperparameters for the RandomForestRegressor are optimized with Optuna.
```
M5-H2-Regression-Competition/
├── notebooks/
│   ├── EDA.ipynb             # Exploratory data analysis
│   ├── Model.ipynb           # Model training and evaluation
│   └── model_explain.ipynb   # Model explainability and interpretation
├── data/
│   ├── train.csv             # Training dataset
│   └── test.csv              # Test dataset
├── README.md                 # Project documentation
├── LICENSE                   # Project license (MIT)
├── requirements.txt          # Required Python packages
└── model.pkl                 # Final trained model
```
- Clone the repository:
  ```sh
  git clone https://github.com/UznetDev/Wild-Blueberry-Prediction.git
  ```
- Navigate to the project directory:
  ```sh
  cd Wild-Blueberry-Prediction
  ```
- Install the dependencies:
  ```sh
  pip install -r requirements.txt
  ```
To use the model stored in `model.pkl`, follow these steps:

```python
import dill as pickle
import pandas as pd

# Load the trained model (dill is used because the pipeline contains
# custom transformer classes)
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

# Prepare your test data: it must match the training data format and
# include the columns ['seeds', 'fruitmass', 'fruitset', 'AverageOfUpperTRange']
test_data = pd.read_csv('data/test.csv')

# Make predictions
predictions = model.predict(test_data)
print(predictions)
```
The model pipeline integrates several custom transformers and a tuned RandomForestRegressor:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

class ColumnSelector(BaseEstimator, TransformerMixin):
    ...  # keeps only the feature columns the model was trained on

class FeatureEngineer(BaseEstimator, TransformerMixin):
    ...  # derives new features (ratios and interactions)

class OutlierReplacer(BaseEstimator, TransformerMixin):
    ...  # replaces outlier values before modelling

model = Pipeline([
    ('column_selector', ColumnSelector(columns=['seeds', 'fruitmass', 'fruitset', 'AverageOfUpperTRange'])),
    ('outlier_replacer', OutlierReplacer()),
    ('feature_engineer', FeatureEngineer()),
    ('model', RandomForestRegressor(...))  # tuned hyperparameters listed below
])
```
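For context, here is a minimal sketch of how this pipeline could be fit and saved. The target column `yield` comes from the problem statement, and `dill` mirrors the loading code above; the exact training code lives in `notebooks/Model.ipynb`.

```python
# Sketch only: assumes the `model` pipeline defined above
import dill
import pandas as pd

train = pd.read_csv('data/train.csv')
X_train = train.drop(columns=['yield'])  # 'yield' is the target column
y_train = train['yield']

model.fit(X_train, y_train)  # preprocessing and model training in one call

with open('model.pkl', 'wb') as f:
    dill.dump(model, f)
```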
The dataset is loaded from `train.csv` and includes features related to crop characteristics and environmental conditions. Detailed exploratory data analysis (EDA) is documented in `notebooks/EDA.ipynb`.
The project includes custom feature engineering steps, such as creating ratios and interactions between features (e.g., `FruitToSeedRatio`, `fruitset_seeds`). These are implemented within the `FeatureEngineer` class; a sketch is shown below.
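As an illustration, such a transformer might look like this. The exact feature formulas here are assumptions; the real definitions live in the notebooks.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureEngineer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Stateless transformer: nothing is learned from the data
        return self

    def transform(self, X):
        X = X.copy()
        X['FruitToSeedRatio'] = X['fruitmass'] / X['seeds']  # assumed ratio
        X['fruitset_seeds'] = X['fruitset'] * X['seeds']     # assumed interaction
        return X
```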
The primary model used is a RandomForestRegressor with custom hyperparameters. The pipeline approach allows easy modification and extension of the model, making it robust for handling diverse datasets.
Hyperparameters for the RandomForestRegressor were optimized with settings such as:

```python
RandomForestRegressor(
    max_depth=9,
    n_estimators=497,
    max_features=0.809,
    min_samples_split=10,
    min_samples_leaf=4,
    criterion='absolute_error',
)
```
These values were chosen to maximize model performance while preventing overfitting.
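A search like this is typically expressed as an Optuna objective. The sketch below is a hedged reconstruction: the search ranges, trial count, and the `X_train`/`y_train` variables are assumptions, not the project's exact setup.

```python
import optuna
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Assumed search space; the actual ranges used are in the notebooks
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 15),
        'n_estimators': trial.suggest_int('n_estimators', 100, 600),
        'max_features': trial.suggest_float('max_features', 0.3, 1.0),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
        'criterion': 'absolute_error',
    }
    model = RandomForestRegressor(**params, random_state=42)
    # cross_val_score returns the negative MAE for this scoring string
    score = cross_val_score(model, X_train, y_train,
                            scoring='neg_mean_absolute_error', cv=5).mean()
    return -score  # Optuna minimizes the mean absolute error

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100)
print(study.best_params)
```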
The model was evaluated using standard regression metrics, including RMSE, R², and MAE. Details on evaluation and insights are in `notebooks/Model.ipynb`.
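For reference, these metrics can be computed with scikit-learn; in this sketch, `y_true` and `y_pred` are placeholders for held-out targets and model predictions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# y_true: actual yields from a validation set; y_pred: model.predict(...) output
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root mean squared error
mae = mean_absolute_error(y_true, y_pred)           # mean absolute error
r2 = r2_score(y_true, y_pred)                       # coefficient of determination
print(f'RMSE={rmse:.3f}  MAE={mae:.3f}  R²={r2:.3f}')
```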
To understand this model, we use two powerful model explainers: SHAP and Permutation Importance.
- Permutation Importance:
  - Purpose: Permutation importance measures the impact of each feature on the model's accuracy. It works by shuffling each feature and observing how much the model's accuracy decreases, which identifies the features most crucial to overall performance.
  - Usage: We calculate the importance of each feature using `permutation_importance` from `sklearn.inspection`.
  - Plot: The permutation importance plot ranks features by their influence on model accuracy, making it easy to see which features are essential for the model's performance.
- SHAP (SHapley Additive exPlanations):
  - Purpose: SHAP values explain individual predictions by showing the impact of each feature on the model's output, highlighting how each feature contributes to specific predictions.
  - Usage: We use `shap.TreeExplainer` to analyze our model, showing the effect each feature has on the model output.
  - Plot: The SHAP summary plot provides a bar chart showing the average importance of each feature across all predictions, offering insight into which features are most influential.
The full explainability analysis is in `notebooks/model_explain.ipynb`; a condensed sketch of both explainers follows.
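This sketch assumes the fitted pipeline `model` from above and raw features `X` with target `y`; the variable names are illustrative.

```python
import shap
from sklearn.inspection import permutation_importance

# Permutation importance: shuffle each feature and measure the score drop
result = permutation_importance(model, X, y, n_repeats=10, random_state=42)
for name, importance in zip(X.columns, result.importances_mean):
    print(f'{name}: {importance:.4f}')

# SHAP: TreeExplainer operates on the forest itself, so first run the
# features through the preprocessing steps of the pipeline
X_transformed = model[:-1].transform(X)
explainer = shap.TreeExplainer(model.named_steps['model'])
shap_values = explainer.shap_values(X_transformed)
shap.summary_plot(shap_values, X_transformed, plot_type='bar')
```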
The model is pre-trained and saved as `model.pkl`. Load and run it directly to make predictions on new data without retraining.
The final model achieved strong results on the provided dataset, making it suitable for practical yield predictions in agricultural applications.
Contributions are welcome! If you'd like to improve this project, please fork the repository and make a pull request.
- Fork the repository.
- Create a new branch for your feature or bug fix: `git checkout -b feature-name`
- Commit your changes: `git commit -m "Add a new feature"`
- Push to your branch: `git push origin feature-name`
- Open a pull request.
This project is licensed under the MIT License - see the LICENSE file for details.
If you have any questions or suggestions, please contact:
- Email: uznetdev@gmail.com
- GitHub Issues: Issues section
- GitHub Profile: UznetDev
- Telegram: UZNet_Dev
- LinkedIn: Abdurakhmon Niyozaliev
Thank you for your interest in this project. We hope it helps in your journey to understand and predict crop yields using data science!