
Autoencoder-based Feature Selection for Diabetic Retinopathy Risk Factors

This project was done under the guidance of Dr. Sundaresan Raman, BITS Pilani, as part of the course CS F376 (Design Oriented Project) in the Second Semester of AY 21-22.

Brief Description

We explored three methods to classify risk factors for Diabetic Retinopathy (DR) as being of primary or secondary importance. Of these three approaches, we found the most success with the autoencoder, using the SN-DREAMS dataset for DR.

Dataset

The SN-DREAMS dataset (Dataset Link) contains 13 risk factors (columns), with a 14th column as an indicator of DR. Of the 13 factors, 4 are categorical and 9 are continuous. The data is available for 1555 patients (rows). However, it is imbalanced, since rows with DR = 1 are sparse; to combat this, SMOTE-ENN resampling and standardization are used.

Furthermore, we use expert-labeled Primary/Secondary clusters (File Link) as the ground truth to evaluate the results of each approach.

Attempt 1: Clustering

The first approach was simply K-Means clustering (k = 2, initialization = k-means++) from the sklearn library, with t-SNE used for visualization (see below). However, nearly half the predictions were wrong, and the clusters did not match the true labels.
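A minimal sketch of that clustering step, using the named sklearn components on synthetic stand-in data (the t-SNE embedding is computed purely for 2-D plotting and plays no role in the cluster assignment):

```python
# K-Means (k=2, k-means++ init) with a t-SNE embedding for visualization.
# Synthetic data stands in for the SN-DREAMS feature matrix.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)

# 2-D embedding of the 13-dimensional points, used only for plotting
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(labels[:10], emb.shape)
```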

K-Means Clustering and the resulting Confusion Matrix

Attempt 2: Classification

We used a 70:30 train-test split and KNN classification (k = 5, Minkowski distance). The ROC-AUC score of each of the 13 features was used as a metric: if a feature's score lay above a threshold, it was classified as primary; otherwise, secondary. However, 8 of the 13 predictions were incorrect.
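One plausible reading of this scoring scheme, sketched below: fit a KNN classifier on each feature individually and use that feature's test-set ROC-AUC as its score. The data, the per-feature-KNN interpretation, and the 0.6 threshold are all assumptions for illustration; the report does not state its actual threshold.

```python
# Per-feature KNN (k=5, Minkowski distance) scored by test-set ROC-AUC.
# Synthetic data with signal planted in the first two features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 13))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = []
for j in range(13):
    knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski")
    knn.fit(X_tr[:, [j]], y_tr)  # fit on this single feature only
    scores.append(roc_auc_score(y_te, knn.predict_proba(X_te[:, [j]])[:, 1]))

threshold = 0.6  # illustrative assumption, not the report's value
primary = [j for j, s in enumerate(scores) if s > threshold]
print([round(s, 3) for s in scores], primary)
```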

KNN Classification and the resulting Confusion Matrix

Attempt 3: Autoencoder

Again, we used a 70:30 train-test split and a standard scaler. The autoencoder had 2 fully connected (Dense) layers: the code layer with 7 neurons and the output layer with 14 neurons, giving the network 217 parameters in total. The autoencoder was trained using the Adam optimizer for 15 epochs with mean absolute error as the metric.
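This architecture can be sketched in Keras (an assumption; the original framework is not stated). A 14-unit input feeding a 7-neuron code layer and a 14-neuron output layer gives exactly 14·7 + 7 + 7·14 + 14 = 217 parameters, matching the count above. The activation choices are also assumptions.

```python
# 14 -> 7 -> 14 autoencoder: code layer (7 neurons) + output layer (14),
# trained with Adam on mean absolute error. Synthetic stand-in data.
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(14,)),
    keras.layers.Dense(7, activation="relu", name="code"),    # code layer
    keras.layers.Dense(14, activation="linear", name="out"),  # reconstruction
])
model.compile(optimizer="adam", loss="mae")

# 15 epochs on standardized data, reconstructing the input from itself
X = np.random.default_rng(0).normal(size=(256, 14)).astype("float32")
model.fit(X, X, epochs=15, batch_size=32, verbose=0)
print(model.count_params())
```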


Autoencoder Structure and Parameters


The weights learnt by the hidden (code) layer were used to assign a score to each of the 13 risk factors. The median of these scores was used as a threshold: factors scoring above it were labeled "Primary" and the rest "Secondary", as shown below. Since the training process is non-deterministic, the results varied between runs; on a good run, 9 or 10 of the 13 risk factors are classified correctly. The learned weights can be saved and reloaded, rather than retraining the neural network each time.
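A hedged sketch of that scoring step: sum the absolute input-to-code weights attached to each feature, then split at the median. The weight matrix here is random; in practice it would come from the trained code layer (e.g. via `layer.get_weights()` in Keras), and the absolute-sum aggregation is an assumption about how the scores were derived.

```python
# Score features by their absolute code-layer weights, split at the median.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(14, 7))  # input-to-code weights: 14 inputs x 7 code units

# Score each of the 13 risk factors (the 14th input is the DR indicator)
scores = np.abs(W[:13]).sum(axis=1)
threshold = np.median(scores)
labels = ["Primary" if s > threshold else "Secondary" for s in scores]

# Weights can be persisted and reloaded instead of retraining each run
np.save("code_weights.npy", W)
print(list(zip(range(13), labels)))
```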


Autoencoder Feature Selection (Incorrect classifications in red)

