Pump It Up: Data Mining The Water Tables

CS 6375 Machine Learning Project

Team members

  • Ankita Patil
  • Abhilash Gudasi
  • Shiva Chawala

Project Source: DrivenData

Challenge Name: Pump It Up: Data Mining the Water Table

Goal: To predict the operating condition of waterpoints in Tanzania, i.e. to determine whether each water pump is functional, non-functional, or in need of repair.


Workflow

  1. Data Exploration
    • Univariate Analysis
    • Correlation Graph
  2. Pre-processing/Feature Engineering
  3. Algorithm Implementation
    • Model training and parameter tuning using GridSearchCV
    • With train-test split
    • With k-fold cross validation
  4. With the trained model, predict on the test data and measure accuracy
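
A minimal sketch of steps 3 and 4, assuming the preprocessed features and the provided labels are loaded from the Datasets folder; the label column name (status_group), the parameter grid, and the split sizes are illustrative assumptions, not the exact values used in the project:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Preprocessed features and the provided labels (paths follow the folder structure below)
    X = pd.read_csv("Datasets/train_values_processed.csv")
    y = pd.read_csv("Datasets/Pump_it_Up_Data_Mining_the_Water_Table_-_Training_set_labels.csv")["status_group"]

    # Hold out a test split, then tune parameters on the training split with GridSearchCV
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    grid = GridSearchCV(RandomForestClassifier(random_state=42),
                        param_grid={"n_estimators": [100, 300], "max_depth": [None, 20]},
                        cv=5)
    grid.fit(X_train, y_train)
    print("Best parameters:", grid.best_params_)
    print("Best cross-validated accuracy:", grid.best_score_)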

Algorithms Implemented

Justification for selecting supervised machine learning algorithms:
Exploring the dataset shows that a label is provided for each instance. Since labelled data is available, supervised machine learning algorithms can be applied.
The following five algorithms are used to build models for the Pump It Up: Data Mining the Water Table dataset:

  1. Logistic Regression
  2. Support Vector Machine
  3. AdaBoost
  4. Neural Net
  5. Random Forest
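
For reference, a minimal sketch of how the five models might be instantiated with scikit-learn. The parameter values are illustrative assumptions rather than the tuned values in the BestModel scripts, and MLPClassifier only stands in for the project's neural net here:

    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC

    # Illustrative defaults only; the tuned parameters live in the BestModel scripts
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Support Vector Machine": SVC(kernel="rbf"),
        "AdaBoost": AdaBoostClassifier(n_estimators=200),
        "Neural Net": MLPClassifier(hidden_layer_sizes=(100,), max_iter=500),
        "Random Forest": RandomForestClassifier(n_estimators=300),
    }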

Folder structure as submitted:

Root (Pumpitup)
|
|---Code
|      |---Final
|             |---BestModel
|             |---GridSearch
|             |---Pre-process
|             |---ROC
|
|---Datasets
       |---Pump_it_Up_Data_Mining_the_Water_Table_-_Training_set_labels.csv
       |---Pump_it_Up_Data_Mining_the_Water_Table_-_Training_set_values.csv
       |---test_values_processed.csv
       |---train_values_processed.csv
       |---heights.csv
  1. The root folder (Pumpitup) is divided into two folders, one for code and another for datasets.
    Code --> Final:
    Inside the Final folder:
    - BestModel contains the five best models (one per technique) that we achieved in this project.
    - GridSearch contains our initial exploration for finding the best models by trying different parameters.
    - Pre-process contains one Python file for generating the preprocessed data and one for generating the missing gps_height values (gps_height is one of the attributes in the given dataset). The preprocessing file writes the processed csv files into the Datasets folder; the other file computes the missing gps_height values and writes heights.csv into the Datasets folder.
    - ROC contains the Python code that generates ROC curves for all of the best models.

Datasets: This folder contains the dataset (values and labels) we got from the DrivenData competition website for the Pump It Up problem, along with the preprocessed datasets.
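
A minimal sketch of the pre-processing idea described above. The gps_height and region column names come from the public DrivenData dataset, and the region-level median fill is a simplified assumption standing in for the project's own heights.csv computation:

    import numpy as np
    import pandas as pd

    values = pd.read_csv("Datasets/Pump_it_Up_Data_Mining_the_Water_Table_-_Training_set_values.csv")

    # Treat gps_height == 0 as missing and fill with a region-level median,
    # falling back to the overall median (a simplified stand-in for heights.csv)
    values["gps_height"] = values["gps_height"].replace(0, np.nan)
    region_median = values.groupby("region")["gps_height"].transform("median")
    values["gps_height"] = values["gps_height"].fillna(region_median)
    values["gps_height"] = values["gps_height"].fillna(values["gps_height"].median())

    # Encode categorical columns numerically so the models can consume them
    for col in values.select_dtypes(include="object").columns:
        values[col] = values[col].astype("category").cat.codes

    values.to_csv("Datasets/train_values_processed.csv", index=False)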

  2. To run the code (assuming the same folder structure as above is maintained):

    • Preprocessing: Since various plots are displayed in this step, it cannot be run from the command line, so we recommend running it in Jupyter using the PumpItUpPreprocessing.ipynb file.

    • Running the best models: Go inside the BestModel folder and run the command below:

     Syntax: python Modelname.py

     Ex: python AdaBoostBest.py
         python DeepLearningBest.py
         python LogisticRegressionBest.py
         python RandomForestBest.py
         python SVMBest.py


    After running the above command from the command line you will see:
    --> Confusion matrix
    --> Classification report
    --> Accuracy of the model using the train-test split
    --> Accuracy of the model using k-fold cross validation
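
A hedged sketch of how these four outputs map to scikit-learn calls, reusing X, y and the tuned model from the workflow sketch earlier in this README; the actual BestModel scripts may differ in detail:

    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
    from sklearn.model_selection import cross_val_score

    best_model = grid.best_estimator_          # the tuned model from the earlier sketch
    y_pred = best_model.predict(X_test)

    print(confusion_matrix(y_test, y_pred))                               # Confusion matrix
    print(classification_report(y_test, y_pred))                          # Classification report
    print("Train-test split accuracy:", accuracy_score(y_test, y_pred))   # Accuracy (train-test split)
    print("K-fold CV accuracy:", cross_val_score(best_model, X, y, cv=10).mean())  # Accuracy (k-fold)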
