Skip to content

jasonli19/cs660-HW3

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 

Repository files navigation

CS660 HW 3

Group Member

  • Yan Li
  • Smith Amornsaensuk
  • Chengyi Wang

Spam Classifier using PySpark

Usage

Open

PySpark.ipynb 

in Jupyter Notebook or Google Colab and run all the cells to see the results.

Results

Laplace smoothing Value Accuracy Precision Recall F-measure Run Time
0 0.9052173913043479 0.9423076923076923 0.3161290322580645 0.47342995169082125 0:00:01.300303
0.25 0.9686956521739131 0.8287292817679558 0.967741935483871 0.8928571428571429 0:00:01.130442
0.5 0.971304347826087 0.8426966292134831 0.967741935483871 0.9009009009009008 0:00:01.052726
0.75 0.9730434782608696 0.8563218390804598 0.9612903225806452 0.905775075987842 0:00:01.095822
1 0.9756521739130435 0.8713450292397661 0.9612903225806452 0.9141104294478527 0:00:01.055738
2 0.9791304347826087 0.9171974522292994 0.9290322580645162 0.9230769230769231 0:00:01.047157

Conclusion

We can see that without smoothing factor i.e. smotting = 0 the model does a bad job on recall. The reason that model yeild high accuracy with smooth = 0 is that the unblance of classes in the data set i.e. the class prior is 0.87 and 0.13. As we increase the laplace smoothing value, the accuracy is also increase. However the recall does not change a lot with the laplace smoothing value from 0.25 to 1. But as you can see the run time did not change a lot.

Spam Classifier using MapReduce

Preprocessing

Run the following code to separate the spam data into two files named train.txt and test.txt under Data folder.

python3 preprocess.py

Training

Run the following code to get the training data which we can use in the classify step. The script will save the probability from training data in model.json file

!python3 train_mrjob.py ./data/train.txt > model.json

The script will save the probability from training data in model.json file

Classify

Run the following code to get the predicted label for the test data.

Note: By changing the parameter --laplace to see how the laplace smoothing value affect the accuracy.

!python3 classify_mrjob.py data/test.txt --model=model.json --laplace=0.1 > prediction.txt

Evaluation

Run the following code to see the results.

python3 evaluation.py

Results

Laplace smoothing Value Accuracy Precision Recall F-measure
0.1 0.9856 0.9216 0.9724 0.9463
0.25 0.9820 0.8981 0.9724 0.9338
0.5 0.9749 0.8545 0.9724 0.9097
0.75 0.9677 0.8150 0.9724 0.8868
1 0.9668 0.8068 0.9793 0.8847

Conclusion

We can get the best scores for accuracy and F-measure when laplace smoothing value equals to 0.1 using the MapReduce. However, we can get best scores when laplace smoothing value equals to 2 using PySpark. And we also found the range for laplace smoothing value is greater than 0 and less than or equal to 1 using MapReduce. But this codition does not apply to PySpark. Overall, we can get best scores by using MapReduce.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published