School of Computer Science & Engineering
Module Leader: Dr. V.S. Kontogiannis
Academic Year: 2021/22
This repository contains the coursework implementation for the module 5DATA002W: Machine Learning & Data Mining. The project focuses on clustering and regression techniques using real-world datasets and is implemented in the R programming environment.
- Clustering Analysis
  - Perform k-means clustering on a wine dataset with pre-processing steps.
  - Apply Principal Component Analysis (PCA) for dimensionality reduction.
  - Compare clustering performance with and without PCA.
- Energy Forecasting
  - Use a Multi-Layer Perceptron (MLP) Neural Network to predict electricity consumption based on time-series data.
  - Evaluate models using statistical indices such as RMSE, MAE, and MAPE.
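The three indices named above have standard definitions, sketched here in base R. The function names and the toy vectors are illustrative assumptions, not code taken from the coursework scripts.

```r
# Forecast error indices; `actual` and `predicted` are numeric vectors
# of equal length.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

# MAPE is reported in percent and is undefined when any actual value is zero.
mape <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual)) * 100
}

# Toy values for illustration only (not taken from the coursework data).
actual    <- c(100, 200, 300, 400)
predicted <- c(110, 190, 310, 380)

indices <- c(RMSE = rmse(actual, predicted),
             MAE  = mae(actual, predicted),
             MAPE = mape(actual, predicted))
print(round(indices, 3))
```

RMSE penalizes large errors more heavily than MAE, while MAPE is scale-free, which is one reason to report all three side by side.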
This coursework demonstrates:
- Preparation of realistic datasets for machine learning and data mining.
- Evaluation, validation, and optimization of models.
- Effective communication of models and analyses to diverse audiences.
- Dataset: White wine dataset containing 4710 samples with chemical properties and quality ratings.
- Tasks:
  - Pre-process the dataset (scaling, outlier removal).
  - Determine the optimal number of clusters using several methods (e.g., Elbow, Gap statistic, Silhouette).
  - Perform k-means clustering for k = 2, 3, 4 and evaluate the results.
  - Use PCA to reduce dimensions and repeat k-means clustering.
  - Compare clustering results before and after PCA.
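The clustering tasks can be sketched end to end in base R as follows. This is a minimal sketch under stated assumptions, not the code in src/clustering_analysis.R: the data are synthetic stand-ins for the wine features, the 1.5 * IQR outlier rule and the 85% cumulative-variance cut-off are illustrative choices, and only the Elbow method is shown.

```r
set.seed(42)

# Synthetic stand-in for the wine features: 3 groups, 4 numeric columns.
# (Hypothetical data; the real script would read datasets/whitewine_v2.xls.)
centres <- rbind(c(0, 0, 0, 0), c(4, 4, 0, 0), c(0, 4, 4, 0))
features <- do.call(rbind, lapply(1:3, function(g) {
  matrix(rnorm(50 * 4, sd = 0.6), ncol = 4) +
    matrix(centres[g, ], nrow = 50, ncol = 4, byrow = TRUE)
}))
colnames(features) <- paste0("x", 1:4)

# 1. Outlier removal: drop rows with any value outside the 1.5 * IQR fences
#    (an assumed rule; the coursework may use a different criterion).
inside_fences <- function(v) {
  q <- unname(quantile(v, c(0.25, 0.75)))
  fence <- 1.5 * (q[2] - q[1])
  v >= q[1] - fence & v <= q[2] + fence
}
keep  <- apply(apply(features, 2, inside_fences), 1, all)
clean <- features[keep, , drop = FALSE]

# 2. Scaling: centre each column to mean 0 and scale to unit variance.
scaled <- scale(clean)

# 3. Elbow method: total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) {
  kmeans(scaled, centers = k, nstart = 25)$tot.withinss
})

# 4. k-means for k = 2, 3, 4, compared by the BSS/TSS ratio.
fits <- lapply(2:4, function(k) kmeans(scaled, centers = k, nstart = 25))
names(fits) <- paste0("k", 2:4)
bss_ratio <- sapply(fits, function(f) f$betweenss / f$totss)

# 5. PCA: keep the leading components that explain at least 85% of the
#    variance (assumed threshold), then repeat k-means on the scores.
pca     <- prcomp(scaled)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_pc    <- which(cum_var >= 0.85)[1]
scores  <- pca$x[, seq_len(n_pc), drop = FALSE]
fit_pca <- kmeans(scores, centers = 3, nstart = 25)

print(round(wss, 1))
print(round(bss_ratio, 3))
# Cross-tabulate cluster assignments before and after PCA.
print(table(original = fits$k3$cluster, pca = fit_pca$cluster))
```

The Gap statistic and Silhouette methods are not shown; the cluster package provides clusGap() and silhouette() for them. Cluster ids are arbitrary, so the before/after comparison reads the cross-table by its dominant cells rather than by its diagonal.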
- Deliverables:
  - R scripts and outputs for k-means clustering.
  - Confusion matrix and metrics: accuracy, precision, recall.
  - PCA analysis and transformed dataset clustering.
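Because k-means cluster ids carry no class meaning, a confusion matrix first needs a mapping from clusters to classes. The sketch below assumes a majority-vote mapping, which is one common convention and not necessarily the one used in the report; the label vectors are hypothetical.

```r
# Hypothetical labels for illustration: true quality classes and the
# cluster id that k-means assigned to each sample.
truth   <- factor(c(5, 5, 5, 6, 6, 6, 6, 7, 7, 7))
cluster <- c(1, 1, 2, 2, 2, 2, 3, 3, 3, 1)

# Map each cluster to the class that occurs most often inside it.
counts    <- table(cluster = cluster, truth = truth)
majority  <- colnames(counts)[apply(counts, 1, which.max)]
predicted <- factor(majority[match(as.character(cluster), rownames(counts))],
                    levels = levels(truth))

# Confusion matrix: rows are predicted classes, columns are true classes.
confusion <- table(predicted = predicted, truth = truth)

accuracy  <- sum(diag(confusion)) / sum(confusion)
# Per-class precision and recall; precision is NaN for a class that no
# cluster was mapped to.
precision <- diag(confusion) / rowSums(confusion)
recall    <- diag(confusion) / colSums(confusion)

print(confusion)
print(round(c(accuracy = accuracy), 3))
print(round(rbind(precision, recall), 3))
```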
- Dataset: Daily electricity consumption data for the University Building at 115 New Cavendish Street (2018-2019).
- Tasks:
  - Implement MLP Neural Networks using autoregressive (AR) and nonlinear autoregressive with exogenous inputs (NARX) approaches.
  - Normalize input/output matrices.
  - Experiment with different network structures (hidden layers, nodes, activation functions).
  - Evaluate models using RMSE, MAE, and MAPE indices.
  - Visualize prediction results and compare efficiency of different models.
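The AR input/output matrix and the normalization step can be sketched in base R as below. This is an assumption-laden illustration: the series is synthetic, and the lag order p = 3, the 80/20 chronological split, and min-max scaling are hypothetical choices rather than the settings used in src/energy_forecasting.R.

```r
# Hypothetical daily load series; the real script would read
# datasets/UoW_load.xlsx instead.
set.seed(1)
load <- 500 + 50 * sin(2 * pi * (1:60) / 7) + rnorm(60, sd = 10)

# AR input/output matrix: the target load[t] is paired with the inputs
# load[t-1], ..., load[t-p].
make_ar_matrix <- function(series, p) {
  lagged <- embed(series, p + 1)  # column 1 = time t, column j + 1 = t - j
  colnames(lagged) <- c("target", paste0("lag", seq_len(p)))
  as.data.frame(lagged)
}
io <- make_ar_matrix(load, p = 3)

# Chronological split: the first 80% of rows train, the remainder test.
n_train <- floor(0.8 * nrow(io))
train   <- io[seq_len(n_train), ]
test    <- io[-seq_len(n_train), ]

# Min-max normalization using the training range only, so that no
# information from the test period leaks into the scaling.
lo <- min(as.matrix(train))
hi <- max(as.matrix(train))
normalize   <- function(x) (x - lo) / (hi - lo)
denormalize <- function(z) z * (hi - lo) + lo

train_n <- as.data.frame(lapply(train, normalize))
test_n  <- as.data.frame(lapply(test, normalize))

print(head(round(train_n, 3)))
```

A NARX variant would append lagged columns of an exogenous variable to the same matrix. The normalized frames could then be passed to an MLP, for example neuralnet::neuralnet(target ~ lag1 + lag2 + lag3, data = train_n, hidden = 5, linear.output = TRUE), where the hidden-layer size is again an assumption; predictions are mapped back with denormalize() before computing RMSE, MAE, and MAPE.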
- Deliverables:
  - R scripts for MLP implementation.
  - Performance comparison tables for various models.
  - Graphical plots of predictions vs actual data.
|-- datasets/
|   |-- whitewine_v2.xls
|   |-- UoW_load.xlsx
|-- src/
|   |-- clustering_analysis.R
|   |-- energy_forecasting.R
|-- results/
|   |-- clustering_outputs/
|   |-- forecasting_outputs/
|-- docs/
|   |-- coursework_report.pdf
|   |-- appendices/
|       |-- full_code.R
|-- README.md
- Software: R version 4.0+ and RStudio.
- R Libraries:
  - ggplot2
  - cluster
  - factoextra
  - NbClust
  - neuralnet
- Clone this repository:
    git clone https://github.com/your-username/ml-datamining-coursework.git
- Navigate to the repository directory:
    cd ml-datamining-coursework
- Install required R libraries using the provided script:
    source("src/install_packages.R")
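For orientation, an install script of this kind might look like the sketch below. The helper names missing_packages() and install_missing() are hypothetical and the actual contents of src/install_packages.R may differ; the package list is the one given in this README.

```r
# Packages listed in this README.
required <- c("ggplot2", "cluster", "factoextra", "NbClust", "neuralnet")

# Return the subset of `pkgs` that is not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install whatever is missing (needs network access to a CRAN mirror).
install_missing <- function(pkgs = required) {
  todo <- missing_packages(pkgs)
  if (length(todo) > 0) install.packages(todo)
  invisible(todo)
}

# install_missing()  # uncomment to install the missing packages
```

Checking before installing keeps repeated runs fast and avoids reinstalling packages that are already present.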
- Open src/clustering_analysis.R in RStudio.
- Run the script to:
  - Pre-process the white wine dataset.
  - Perform k-means clustering.
  - Apply PCA and re-run clustering.
- View outputs in the results/clustering_outputs/ folder.
- Open src/energy_forecasting.R in RStudio.
- Run the script to:
  - Train and test MLP models using AR and NARX approaches.
  - Generate statistical performance indices.
- View outputs in the results/forecasting_outputs/ folder.
The coursework will be evaluated based on:
- Clustering implementation and results.
- MLP model development and testing.
- Discussion and justification of methodological decisions.
- Presentation of findings in the coursework report.
- Relevant literature and resources are cited within the report and code comments.
- Dataset references: Provided by University of Westminster Estates Planning & Services Department.
This project is for academic use only and is subject to University of Westminster assessment regulations.
For queries, please contact the module leader or teaching assistant via the University of Westminster Blackboard portal.