- Jonathan Kim
- Brendan Cheng
- Rithvik Vukka
Our team is excited to go into more depth on the ethical aspects of data science. We are especially interested in analyzing differential privacy.
Project
- Differential Privacy and Machine Learning: a Survey and Review - Literature Review
- A Novel Differential Privacy Approach that Enhances Classification Accuracy - Literature Review
- Effects of Noise on Machine Learning Algorithms Using Local Differential Privacy Techniques - Literature Review
- Differential Privacy Made Easy - Literature Review
- Signal Processing and Machine Learning with Differential Privacy - Literature Review
Our project will involve researching differential privacy and its implementation in various data analysis techniques. The concept of differential privacy is fairly recent and can be implemented using a variety of mechanisms, depending on the desired effects on data analysis. We seek to learn some of these mechanisms, their benefits, their potential downsides, and the data analysis techniques that would benefit most from them. We would also like to learn how the additional noise introduced by differential privacy might compound pre-existing noise in data, and how the downsides of such noise might be mitigated. Finally, we will attempt to implement a differential privacy algorithm ourselves to better understand the real-world implications of differential privacy and how they might differ from the theoretical aspects discussed in academic papers.
- We have used the Iris dataset for our project. It contains 150 samples of 3 different species of iris flowers, with 4 features per sample.
- The features are sepal length, sepal width, petal length, and petal width; the target variable is the species of the flower.
- The dataset is available on the UCI Machine Learning Repository. We chose this dataset because it is small and easy to visualize before and after adding noise to the data.
- We have implemented simple machine learning models to predict the species of the flower from its features.
- To add noise to the dataset, we implemented an algorithm that adds either Gaussian noise or categorical noise depending on the feature type (a sketch of this approach appears after this list). The algorithm is inspired by the paper Effects of Noise on Machine Learning Algorithms Using Local Differential Privacy Techniques.
- We have used the K-Nearest Neighbors, Logistic Regression, and Random Forest Classifier algorithms to predict the species of the flower.
- We have used the scikit-learn library to implement the algorithms and the plotly library to visualize the data.
- We have trained each of these models before and after adding noise and found the following results:
- The accuracy of the models is largely unaffected by the noise added to the data.
- When the training and test sets are mismatched (i.e., trained on noisy data and tested on clean data, or vice versa), the models still perform reasonably, but accuracy is lower than when training and testing on the same kind of data.
- The Random Forest Classifier performs best on this dataset, and its accuracy is not affected by the added noise.
- The Pearson correlation coefficients are also unaffected by the added noise: the same features are highly correlated with the target variable before and after adding noise.
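As a rough illustration of the approach described above, here is a minimal sketch (not the exact code in the repository; the `add_noise` helper, its `noise_factor` parameter, and the train/test settings are illustrative assumptions). Since all four Iris features are numeric, the species label is the only categorical column, so the categorical noise here amounts to randomly resampling a fraction of the labels:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def add_noise(X, y, noise_factor=0.4, seed=None):
    """Gaussian noise for the numeric features, categorical noise for the labels."""
    rng = np.random.default_rng(seed)
    # Gaussian noise on the numeric features, scaled by each feature's std.
    X_noisy = X + rng.normal(0.0, noise_factor * X.std(axis=0), size=X.shape)
    # Categorical noise: resample about a noise_factor fraction of the labels.
    y_noisy = y.copy()
    flip = rng.random(len(y)) < noise_factor
    y_noisy[flip] = rng.integers(0, len(np.unique(y)), size=flip.sum())
    return X_noisy, y_noisy

X, y = load_iris(return_X_y=True)
X_noisy, y_noisy = add_noise(X, y, noise_factor=0.4, seed=0)

# Train and evaluate on clean vs. noisy data under the same split.
for name, (Xd, yd) in {"clean": (X, y), "noisy": (X_noisy, y_noisy)}.items():
    X_tr, X_te, y_tr, y_te = train_test_split(Xd, yd, test_size=0.3, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))
```

The same pattern extends to K-Nearest Neighbors and Logistic Regression by swapping the estimator, and the Pearson correlations can be compared before and after the noise with `np.corrcoef`.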
To run this notebook on Jupyter or Google Colab, please follow these steps:
- Clone the repository or download the notebook file.
- Upload the notebook to your Jupyter or Google Colab workspace.
- Install any required packages by running the following command in a code cell:
```
!pip install pandas numpy scikit-learn plotly
```
- Run the code cells in the notebook to reproduce the results.
For the other Python script:
- Clone the repository or download the Python script file.
- Run the script in your terminal or command prompt:

```sh
git clone https://github.com/CS-UCR/final-project-rbj.git
cd final-project-rbj/src
python3 file.py
```
Slides used for the presentation can be found here.
- We have been asked why we took only a noise factor of 0.4 into account when calculating the accuracy of the model. We have added a script that calculates the accuracy of the model with various noise factors. The results are not consistent and vary from run to run; the results of one instance are shown below.
```
{
    0.1: 0.96,
    0.2: 0.9466666666666667,
    0.3: 0.94,
    0.4: 0.94,
    0.5: 0.9666666666666667,
    0.6: 0.6066666666666667,
    0.7: 0.9533333333333334,
    0.8: 0.7466666666666667,
    0.9: 0.62
}
```
Over multiple runs of the algorithm, we found that a noise factor of 0.4 achieves the highest accuracy compared to the other values. The results above are the best scores over 1000 runs of the algorithm. We used this procedure to choose the noise factor for our algorithm before the presentation, but did not include it in the presentation slides or the notebook; we have now added the script to the repository.
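A minimal sketch of such a sweep, assuming the illustrative `add_noise` helper from the earlier sketch (the grid of noise factors and the run count follow the description above; the actual script in the repository may differ):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Uses the add_noise helper sketched earlier in this README.
X, y = load_iris(return_X_y=True)
best = {}
for noise_factor in np.round(np.arange(0.1, 1.0, 0.1), 1):
    scores = []
    for _ in range(1000):  # best score over 1000 runs; reduce for a quick check
        X_noisy, y_noisy = add_noise(X, y, noise_factor=noise_factor)
        X_tr, X_te, y_tr, y_te = train_test_split(X_noisy, y_noisy, test_size=0.3)
        model = RandomForestClassifier().fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, model.predict(X_te)))
    best[float(noise_factor)] = max(scores)
print(best)
```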
- We have learned about the concept of differential privacy and its implementation in various data analysis techniques.
- We have also learned about its benefits, its potential downsides, and the data analysis techniques that would benefit most from it.
- We have also learned how the additional noise generated by differential privacy might compound pre-existing noise in data, and how the downsides of such noise might be mitigated.
- From our implemented algorithm, we found that the noise added to the data is not a major factor in the accuracy of the model, and that accuracy is largely insensitive to the noise factor.
- We have also learned about the real-world implications of differential privacy, and how they might differ from the theoretical aspects discussed in academic papers.