
Autoencoder-based Feature Selection for Diabetic Retinopathy Risk Factors

This project was done under the guidance of Dr. Sundaresan Raman, BITS Pilani, as part of the course CS F376 (Design Oriented Project) in the Second Semester of AY 21-22.

Brief Description

We explored three methods to classify risk factors for Diabetic Retinopathy (DR) as being of primary or secondary importance. Of these three approaches, we found the most success with the autoencoder, using the SN-DREAMS dataset for DR.

Dataset

The SN-DREAMS dataset (Dataset Link) contains 13 risk factors (columns), with a 14th column as an indicator of DR. Of the 13 factors, 4 are categorical and 9 are continuous. The data is available for 1555 patients (rows). However, it is imbalanced, since rows with DR = 1 are sparse; to combat this, SMOTE-ENN resampling and standardization are used.

Furthermore, we use expert-labeled Primary/Secondary clusters (File Link) as the ground truth to evaluate the results of each approach.

Attempt 1: Clustering

The first approach was simply K-Means clustering (k = 2, initialization = k-means++) from the sklearn library, with t-SNE used for visualization (see below). However, nearly half the predictions were wrong, and the clusters did not match the true labels.
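A minimal sketch of that clustering step, using the named sklearn components on synthetic stand-in data (the t-SNE embedding is computed purely for 2-D plotting and plays no role in the cluster assignment):

```python
# K-Means (k=2, k-means++ init) with a t-SNE embedding for visualization.
# Synthetic data stands in for the SN-DREAMS feature matrix.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)

# 2-D embedding of the 13-dimensional points, used only for plotting
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(labels[:10], emb.shape)
```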

K-Means Clustering and the resulting Confusion Matrix

Attempt 2: Classification

We used a 70:30 train-test split and KNN classification (k = 5, Minkowski distance). The ROC-AUC score of each of the 13 features was used as a metric: if a feature's score lay above a threshold, it was classified as primary; otherwise, secondary. However, 8 of the 13 predictions were incorrect.
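One plausible reading of this scoring scheme, sketched below: fit a KNN classifier on each feature individually and use that feature's test-set ROC-AUC as its score. The data, the per-feature-KNN interpretation, and the 0.6 threshold are all assumptions for illustration; the report does not state its actual threshold.

```python
# Per-feature KNN (k=5, Minkowski distance) scored by test-set ROC-AUC.
# Synthetic data with signal planted in the first two features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 13))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = []
for j in range(13):
    knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski")
    knn.fit(X_tr[:, [j]], y_tr)  # fit on this single feature only
    scores.append(roc_auc_score(y_te, knn.predict_proba(X_te[:, [j]])[:, 1]))

threshold = 0.6  # illustrative assumption, not the report's value
primary = [j for j, s in enumerate(scores) if s > threshold]
print([round(s, 3) for s in scores], primary)
```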

KNN Classification and the resulting Confusion Matrix

Attempt 3: Autoencoder

Again, we used a 70:30 train-test split and a standard scaler. The autoencoder had 2 fully connected (Dense) layers: the code layer with 7 neurons and the output layer with 14 neurons, giving the network 217 parameters in total. The autoencoder was trained using the Adam optimizer for 15 epochs with mean absolute error as the metric.
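This architecture can be sketched in Keras (an assumption; the original framework is not stated). A 14-unit input feeding a 7-neuron code layer and a 14-neuron output layer gives exactly 14·7 + 7 + 7·14 + 14 = 217 parameters, matching the count above. The activation choices are also assumptions.

```python
# 14 -> 7 -> 14 autoencoder: code layer (7 neurons) + output layer (14),
# trained with Adam on mean absolute error. Synthetic stand-in data.
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(14,)),
    keras.layers.Dense(7, activation="relu", name="code"),    # code layer
    keras.layers.Dense(14, activation="linear", name="out"),  # reconstruction
])
model.compile(optimizer="adam", loss="mae")

# 15 epochs on standardized data, reconstructing the input from itself
X = np.random.default_rng(0).normal(size=(256, 14)).astype("float32")
model.fit(X, X, epochs=15, batch_size=32, verbose=0)
print(model.count_params())
```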


Autoencoder Structure and Parameters


The weights learnt by the hidden (code) layer were used to assign a score to each of the 13 risk factors. The median of these scores was used as a threshold: factors scoring above it were labeled "Primary" and the rest "Secondary", as shown below. Since the training process is non-deterministic, the results varied between runs; on a good run, 9 or 10 of the 13 risk factors are classified correctly. The learned weights can be saved and reloaded, rather than retraining the neural network each time.
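A hedged sketch of that scoring step: sum the absolute input-to-code weights attached to each feature, then split at the median. The weight matrix here is random; in practice it would come from the trained code layer (e.g. via `layer.get_weights()` in Keras), and the absolute-sum aggregation is an assumption about how the scores were derived.

```python
# Score features by their absolute code-layer weights, split at the median.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(14, 7))  # input-to-code weights: 14 inputs x 7 code units

# Score each of the 13 risk factors (the 14th input is the DR indicator)
scores = np.abs(W[:13]).sum(axis=1)
threshold = np.median(scores)
labels = ["Primary" if s > threshold else "Secondary" for s in scores]

# Weights can be persisted and reloaded instead of retraining each run
np.save("code_weights.npy", W)
print(list(zip(range(13), labels)))
```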


Autoencoder Feature Selection (Incorrect classifications in red)

