geekidharsh/predicting-harddrive-failures-using-ml

Predicting Hard Drive Failures Using ML

Using the Backblaze dataset on Kaggle.

About this project:

This was a data science case study. The dataset used for this project is private, but a similar dataset and project can be found on Kaggle.com.

Disclaimer: This case study is based on a sample subset of a larger dataset and does not accurately solve the problem. It is meant to demonstrate the use of different tools and libraries in ML, how to present reports, and how to use Python for ML.

Sample Dataset:

A sample SMART hard drive dataset can be downloaded at: https://www.kaggle.com/backblaze/hard-drive-test-data

What are SMART systems?

SMART, or S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology), is a monitoring system for hard drives. SMART generates a collection of different metrics that help evaluate the overall health of a hard drive.

A single metric may not reliably predict a failure on its own, but SMART metrics are commonly accepted as a way to identify imminent failures and to handle backup and restore in time.
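As a rough illustration of why several metrics are usually read together, here is a minimal sketch that counts how many watched attributes are non-zero in one drive reading. The column names assume the Backblaze `smart_<id>_raw` naming convention, and both the attribute selection and the `warning_signs` helper are hypothetical; they are not taken from the case study notebook.

```python
# Assumption: column names follow the Backblaze CSV convention (smart_<id>_raw).
# The choice of attributes is illustrative only.
WATCHED_ATTRIBUTES = {
    "smart_5_raw": "Reallocated Sectors Count",
    "smart_187_raw": "Reported Uncorrectable Errors",
    "smart_188_raw": "Command Timeout",
    "smart_197_raw": "Current Pending Sector Count",
    "smart_198_raw": "Uncorrectable Sector Count",
}


def warning_signs(reading):
    """Return the names of watched attributes that are non-zero in one reading."""
    return [
        name
        for column, name in WATCHED_ATTRIBUTES.items()
        # Missing or None values are treated as zero.
        if (reading.get(column) or 0) > 0
    ]


# Hypothetical daily reading for a single drive.
reading = {"smart_5_raw": 8, "smart_187_raw": 0, "smart_197_raw": 2, "smart_198_raw": None}
signs = warning_signs(reading)
print(f"{len(signs)} warning sign(s): {signs}")
```

A count like this is only a heuristic. It shows how individual readings can be combined, not how failures should be predicted.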

About this case study:

This case study relies on a data stream provided for this purpose. The goal is to analyze the given data and extract meaningful information that can help determine drive failure trends and the factors that may indicate whether a drive will fail, and to propose a more data-driven answer to future failures based on SMART metrics.

The study concludes by discussing opportunities and challenges with the existing model, along with features that could help design a better predictive model in the future.


Solution:

Full Analysis in Jupyter Notebook

To access the entire analysis code in the Jupyter notebook, go to: Predicting Hard drive failure

Overview of the approach

Here's a quick overview of how this problem has been approached:

Extraction and Load

  1. Connect to the Postgres server.
  2. Download the dataset for offline use.
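The two steps above could be sketched as follows. This is an assumption-laden example, not the notebook's code: the `drive_stats` table, its columns, and the `export_query_to_csv` helper are hypothetical, and an in-memory SQLite database stands in for the Postgres server so the sketch runs without credentials.

```python
import csv
import os
import sqlite3
import tempfile


def export_query_to_csv(conn, query, csv_path):
    """Run a query on a DB-API connection and write the result to a CSV file."""
    cursor = conn.cursor()
    cursor.execute(query)
    header = [column[0] for column in cursor.description]
    row_count = 0
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for row in cursor:
            writer.writerow(row)
            row_count += 1
    cursor.close()
    return row_count


# Stand-in for the Postgres server so the sketch runs anywhere. With psycopg2
# installed, the connection would instead come from something like
# psycopg2.connect(host=..., dbname=..., user=..., password=...).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE drive_stats (date TEXT, serial_number TEXT, model TEXT, failure INTEGER)"
)
conn.executemany(
    "INSERT INTO drive_stats VALUES (?, ?, ?, ?)",
    [("2016-01-01", "A1", "ModelX", 0), ("2016-01-02", "A1", "ModelX", 1)],
)

# Keep an offline copy of the extract for the later steps.
csv_path = os.path.join(tempfile.mkdtemp(), "drive_stats.csv")
exported = export_query_to_csv(conn, "SELECT * FROM drive_stats", csv_path)
conn.close()
print(f"Exported {exported} rows to {csv_path}")
```

Because the helper only relies on the standard DB-API cursor interface, the same function should work with a Postgres connection once the stand-in is swapped out.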

Transform

  1. Wrangle and explore
  2. Change dimensions, clean, and slice and dice
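A minimal pandas sketch of what such a transform step might look like is shown below. The data is made up, the column names assume the Backblaze layout, and the specific cleaning choices (dropping empty columns, forward-filling per drive) are assumptions rather than the steps used in the notebook.

```python
import pandas as pd

# Hypothetical raw extract; column names assume the Backblaze layout.
raw = pd.DataFrame(
    {
        "date": ["2016-01-01", "2016-01-01", "2016-01-02", "2016-01-02"],
        "serial_number": ["A1", "B7", "A1", "B7"],
        "model": ["ModelX", "ModelY", "ModelX", "ModelY"],
        "failure": [0, 0, 1, 0],
        "smart_5_raw": [0.0, 3.0, 8.0, None],
        "smart_9_raw": [None, None, None, None],  # never reported
    }
)

clean = (
    raw.assign(date=pd.to_datetime(raw["date"]))
    .dropna(axis="columns", how="all")  # drop metrics that no drive reports
    .drop_duplicates(subset=["date", "serial_number"])
    .reset_index(drop=True)
)

# Fill remaining gaps per drive with that drive's last known value.
smart_columns = [c for c in clean.columns if c.startswith("smart_")]
clean[smart_columns] = clean.groupby("serial_number")[smart_columns].ffill()

# Slice: one model, indexed by drive and day.
model_x = clean[clean["model"] == "ModelX"].set_index(["serial_number", "date"])
print(model_x)
```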

Analyze

  1. Analyze dataset, plot most significant trends

Predict

  1. Feature Selection
  2. Model and predict
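The README does not say which selection method or model the notebook uses, so the following is only a sketch under assumptions: univariate feature selection (`SelectKBest` with `f_classif`) feeding a random forest, trained on synthetic data in which only two of six made-up features carry signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data: 1000 drive-days, 6 made-up SMART-like features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
# Hypothetical rule: failures depend only on features 0 and 3, plus noise.
y = (X[:, 0] + X[:, 3] + rng.normal(scale=0.5, size=1000) > 2.0).astype(int)

# Stratify so the rare failure class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = make_pipeline(
    SelectKBest(score_func=f_classif, k=2),  # keep the 2 highest-scoring features
    RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0),
)
model.fit(X_train, y_train)

selected = model.named_steps["selectkbest"].get_support(indices=True)
predictions = model.predict(X_test)
print("Selected feature indices:", selected)
print("Predicted failures in test set:", int(predictions.sum()))
```

Drive failures are rare events, so plain accuracy can look high even for a model that never predicts a failure. Precision and recall on the failure class are usually more informative.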

Sample report overview:

(This is optional)

  1. Number of hard drives per model
  2. Number of positive failures by model
  3. Failure trend over time
  4. Daily failure trend to determine missing failure data pattern

and more...
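The report items listed above can be approximated with a few pandas aggregations. The sketch below uses hypothetical snapshot data with Backblaze-style column names. Treating days without any rows as missing data is an assumption about what "missing failure data pattern" means here.

```python
import pandas as pd

# Hypothetical daily snapshots, one row per drive per day.
snapshots = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2016-01-01", "2016-01-01", "2016-01-01", "2016-01-02", "2016-01-02", "2016-01-04"]
        ),
        "serial_number": ["A1", "B7", "C3", "A1", "B7", "C3"],
        "model": ["ModelX", "ModelY", "ModelX", "ModelX", "ModelY", "ModelX"],
        "failure": [0, 0, 0, 1, 0, 1],
    }
)

# Number of distinct hard drives per model.
drives_per_model = snapshots.groupby("model")["serial_number"].nunique()

# Number of recorded failures per model.
failures_per_model = snapshots.groupby("model")["failure"].sum()

# Daily failure counts on a continuous calendar; days with no rows become NaN.
daily = snapshots.groupby("date")["failure"].sum()
all_days = pd.date_range(daily.index.min(), daily.index.max(), freq="D")
daily_failures = daily.reindex(all_days)
missing_days = daily_failures[daily_failures.isna()].index

print(drives_per_model, failures_per_model, daily_failures, sep="\n\n")
```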


Conclusion and Improvement Ideas:

  1. Conclusion
  2. Challenges with the current dataset and ways to improve it

Tech stack:

Python, SQL, pandas, scikit-learn and other machine learning libraries, Postgres

@ author:

@geekidharsh: I am a Data Engineer with 4+ years of experience in E-commerce and Digital Acquisition, analyzing swiftly changing user behaviors to make data-driven decisions at scale. Currently, I work at Merck KGaA.