- Jonathan Kim
- Brendan Cheng
- Rithvik Vukka
Our team is excited to go into more depth on the ethical aspects of data science. We are especially interested in analyzing differential privacy.
Project
- Differential Privacy and Machine Learning: a Survey and Review - Literature Review
- A Novel Differential Privacy Approach that Enhances Classification Accuracy - Literature Review
- Effects of Noise on Machine Learning Algorithms Using Local Differential Privacy Techniques - Literature Review
- Differential Privacy Made Easy - Literature Review
- Signal Processing and Machine Learning with Differential Privacy - Literature Review
Our project will involve researching differential privacy and its implementation in various data analysis techniques. The concept of differential privacy is fairly recent and can be implemented using a variety of mechanisms, depending on the desired effects on data analysis. We seek to learn some of these mechanisms, their benefits, their potential downsides, and the data analysis techniques that would benefit most from them. We would also like to learn how the additional noise introduced by differential privacy might compound pre-existing noise in data, and how the downsides of such noise might be mitigated. Finally, we will attempt to implement a differential privacy algorithm ourselves to better understand the real-world implications of differential privacy and how they might differ from the theoretical aspects discussed in academic papers.
- We have used the Iris dataset for our project. It contains 150 samples of 3 different species of iris flowers, with 4 features per sample.
- The features are sepal length, sepal width, petal length, and petal width; the target variable is the species of the flower.
- The dataset is available on the UCI Machine Learning Repository. We chose this dataset because it is small and easy to visualize before and after adding noise to the data.
- We have implemented simple machine learning models to predict the species of the flower from its features.
- To add noise to the dataset, we implemented an algorithm that adds either Gaussian noise or categorical noise depending on the feature type (a sketch of this approach appears after this list). The algorithm is inspired by the paper Effects of Noise on Machine Learning Algorithms Using Local Differential Privacy Techniques.
- We have used the K-Nearest Neighbors, Logistic Regression, and Random Forest Classifier algorithms to predict the species of the flower.
- We have used the scikit-learn library to implement the algorithms and the plotly library to visualize the data.
- We have trained each of these models before and after adding noise and found the following results:
- The accuracy of the models is largely unaffected by the noise added to the data.
- When the training and test sets are mismatched (i.e., trained on noisy data and tested on clean data, or vice versa), the models still perform reasonably, but accuracy is lower than when training and testing on the same kind of data.
- The Random Forest Classifier performs best on this dataset, and its accuracy is not affected by the added noise.
- The Pearson correlation coefficients are also unaffected by the added noise: the same features are highly correlated with the target variable before and after adding noise.
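As a rough illustration of the approach described above, here is a minimal sketch (not the exact code in the repository; the `add_noise` helper, its `noise_factor` parameter, and the train/test settings are illustrative assumptions). Since all four Iris features are numeric, the species label is the only categorical column, so the categorical noise here amounts to randomly resampling a fraction of the labels:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def add_noise(X, y, noise_factor=0.4, seed=None):
    """Gaussian noise for the numeric features, categorical noise for the labels."""
    rng = np.random.default_rng(seed)
    # Gaussian noise on the numeric features, scaled by each feature's std.
    X_noisy = X + rng.normal(0.0, noise_factor * X.std(axis=0), size=X.shape)
    # Categorical noise: resample about a noise_factor fraction of the labels.
    y_noisy = y.copy()
    flip = rng.random(len(y)) < noise_factor
    y_noisy[flip] = rng.integers(0, len(np.unique(y)), size=flip.sum())
    return X_noisy, y_noisy

X, y = load_iris(return_X_y=True)
X_noisy, y_noisy = add_noise(X, y, noise_factor=0.4, seed=0)

# Train and evaluate on clean vs. noisy data under the same split.
for name, (Xd, yd) in {"clean": (X, y), "noisy": (X_noisy, y_noisy)}.items():
    X_tr, X_te, y_tr, y_te = train_test_split(Xd, yd, test_size=0.3, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))
```

The same pattern extends to K-Nearest Neighbors and Logistic Regression by swapping the estimator, and the Pearson correlations can be compared before and after the noise with `np.corrcoef`.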
To run this notebook on Jupyter or Google Colab, please follow these steps:
- Clone the repository or download the notebook file.
- Upload the notebook to your Jupyter or Google Colab workspace.
- Install any required packages by running the following command in a code cell:
```
!pip install pandas numpy scikit-learn plotly
```
- Run the code cells in the notebook to reproduce the results.
For the other Python script:
- Clone the repository or download the Python script file.
- Run the script in your terminal or command prompt:

```sh
git clone https://github.com/CS-UCR/final-project-rbj.git
cd final-project-rbj/src
python3 file.py
```
Slides used for the presentation can be found here.
- We have been asked why we took only a noise factor of 0.4 into account when calculating the accuracy of the model. We have added a script that calculates the accuracy of the model with various noise factors. The results are not consistent and vary from run to run; the results of one instance are shown below.
```
{
    0.1: 0.96,
    0.2: 0.9466666666666667,
    0.3: 0.94,
    0.4: 0.94,
    0.5: 0.9666666666666667,
    0.6: 0.6066666666666667,
    0.7: 0.9533333333333334,
    0.8: 0.7466666666666667,
    0.9: 0.62
}
```
Over multiple runs of the algorithm, we found that a noise factor of 0.4 achieves the highest accuracy compared to the other values. The results above are the best scores over 1000 runs of the algorithm. We used this procedure to choose the noise factor for our algorithm before the presentation, but did not include it in the presentation slides or the notebook; we have now added the script to the repository.
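A minimal sketch of such a sweep, assuming the illustrative `add_noise` helper from the earlier sketch (the grid of noise factors and the run count follow the description above; the actual script in the repository may differ):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Uses the add_noise helper sketched earlier in this README.
X, y = load_iris(return_X_y=True)
best = {}
for noise_factor in np.round(np.arange(0.1, 1.0, 0.1), 1):
    scores = []
    for _ in range(1000):  # best score over 1000 runs; reduce for a quick check
        X_noisy, y_noisy = add_noise(X, y, noise_factor=noise_factor)
        X_tr, X_te, y_tr, y_te = train_test_split(X_noisy, y_noisy, test_size=0.3)
        model = RandomForestClassifier().fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, model.predict(X_te)))
    best[float(noise_factor)] = max(scores)
print(best)
```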
- We have learned about the concept of differential privacy and its implementation in various data analysis techniques.
- We have also learned about its benefits, its potential downsides, and the data analysis techniques that would benefit most from it.
- We have also learned how the additional noise generated by differential privacy might compound pre-existing noise in data, and how the downsides of such noise might be mitigated.
- From our implemented algorithm, we found that the noise added to the data is not a major factor in the accuracy of the model, and that accuracy is largely insensitive to the noise factor.
- We have also learned about the real-world implications of differential privacy, and how they might differ from the theoretical aspects discussed in academic papers.