A Final Project for Purwadhika Data Scientist Course
This Repository is about exploring and predicting red wine quality. The dataset used in this project can be found here
The dataset consist of 11 input variables based on physicochmical analysis of red wines and 1 output variable.
The input variables are:
- Fixed acidity (g tartaric acid/dm3) : Non volatile acid. Tartaric, malic, citric, and succinic are the most common to be found in wines
- Volatile acidity (g acetic acid/dm3) : Volatile acid. Acetic acid is one of them, and is unwanted in wine due to unpleasant odors
- citric acid (g/dm3) : One of the fixed acid in wine, commonly found in citrus. Has a refreshing taste
- residual sugar (g/dm3) : The remaining sugar content in wine after fermentation. less than 1 g. liter is uncommon
- chlorides (g sodium chloride/dm3) : indicator for the amount of salt in wine. Too high of chloride usually means salty taste than can deter consumers
- free sulfur dioxide (mg/dm3) : Free form of SO2. SO2 concentration more that 50ppm can affect wine taste as a rotten egg smell is a signature of SO2 gas
- total sulfur dioxide (mg/dm3) : The total amount of SO2, bound and free. A small amount of sulfur dioxide is important as antimicrobial agent in wine
- density (g/dm3) : the density of the liquid. The density of water is close to one. Alcohol concentration, sugar content and otther dissolved compounds affect density
- pH : The power of hydrogen, commonly used as acidity scale, with 7 as neutral, below 7 indicate acidity and above 7 indicate basicity. Most wines pH are between 3-4 on the pH scale
- sulphates (g potassium sulphate/dm3): Sulfur based additive used as antimicrobial and anti-oxidant agent. Contribute to SO2 levels
- alcohol (%) : The alcohol content in wine. It level depends of the initial sudar content before fermentation and the yeast used for fermentation.
The output variable is:
12. quality : Based on sensory test. The score is between 0 and 10.
The steps I took to do the predictions are:
- Exploratory data analysis
- Preprocessing. Which mainly about outliers handling
- Model selections. The models used are all regression and regressor models
- Deployment with flask.
The best model evaluation (least RMSE and highest R square) was actually with Random Forest Regressor, but Random Forest result cannot be extrapolated, therefore, I use Ridge regression instead, a model known for handling multicoliearity, which is prevalent in my data. The best data set was after dropping outliers, with initial RMSE 10.62% and R square 0.35.
I did Hyperparameter tunning, with result as the following:
{'alpha': 0.5, 'fit_intercept': True, 'max_iter': None, 'solver': 'sparse_cg'}
The tunning results improved the model to RMSE 10.57 and R square0.36.
Flask was used to create the dashboard. There are 5 pages: