Data centers can benefit greatly from a service that uses data mining to predict hard drive failures. Although hard drive failures are rare, they are costly: a failure can cause temporary system unavailability, data loss, or both. Hard drive manufacturers use Self-Monitoring, Analysis and Reporting Technology (SMART) attributes collected during normal operation to predict failures. These SMART attributes report daily diagnostics for each drive, such as read/write error rates, spin retry count, and power cycle count. We used publicly available data from Backblaze, which records daily statistics for a large number of hard drives (~47000) in its own data center. In this project, we analyze and compare the performance of several machine learning algorithms (Linear Regression, Decision Tree, AdaBoost, XGBoost, Gradient Boosting, k-Nearest Neighbors and Random Forest) at predicting hard drive failures on the Backblaze data from 2018.
The initial data cleaning and filtering was done in Spark. The Python notebooks for analysis, data preparation, model assessment, and cross-validation are also included.
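
To make the cleaning and filtering step more concrete, here is a minimal sketch in plain Python (not the Spark job used in this project). It assumes the Backblaze daily-snapshot layout, with one row per drive per day, a `failure` flag, and `smart_N_raw` columns. The sample rows, the `FEATURES` selection, and the helper names `load_rows` and `failure_rate` are hypothetical and only serve as illustration.

```python
import csv
import io

# Made-up sample in the assumed Backblaze daily-snapshot layout:
# one row per drive per day; `failure` is 1 on the day the drive fails.
SAMPLE_CSV = """date,serial_number,model,failure,smart_1_raw,smart_10_raw,smart_12_raw
2018-01-01,SN001,MODEL_A,0,100,0,5
2018-01-02,SN001,MODEL_A,1,250,3,6
2018-01-01,SN002,MODEL_A,0,90,0,12
2018-01-02,SN002,MODEL_A,0,,0,12
2018-01-01,SN003,MODEL_B,0,80,0,7
"""

# Hypothetical feature choice: read error rate (1), spin retry count (10),
# power cycle count (12). The project may use a different set.
FEATURES = ["smart_1_raw", "smart_10_raw", "smart_12_raw"]


def load_rows(text, features=FEATURES):
    """Return (feature_vector, label) pairs, skipping rows with missing values."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        values = [record[name] for name in features]
        if any(v == "" for v in values):
            continue  # drop incomplete diagnostics instead of guessing a value
        rows.append(([float(v) for v in values], int(record["failure"])))
    return rows


def failure_rate(rows):
    """Fraction of drive-days labelled as failures (0.0 for an empty input)."""
    return sum(label for _, label in rows) / len(rows) if rows else 0.0


rows = load_rows(SAMPLE_CSV)
print(len(rows), "usable drive-days, failure rate", failure_rate(rows))
```

On real data the failure rate is far lower than in this toy sample, so it is worth checking the class balance in this way before training or comparing any of the models listed above.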