Hamas-ur-Rehman/Email_Spam_Classifier

Email_Spam_Classifier

This project was done for the Skill4U Machine Learning program. It is a spam email classifier built with machine learning; the model is trained with the Gaussian Naïve Bayes (GaussianNB) algorithm.

Problem Statement

Spamming is one of the most common attacks: it compromises large numbers of machines by sending unwanted messages, viruses, and phishing attempts through email. We chose this project because many people now try to deceive recipients simply by sending fake emails.

According to recent figures, 40% of all email is spam, which amounts to about 15.4 billion emails per day and costs Internet users about $355 million per year. Automatic email filtering is currently the most effective way to deal with spam.

Proposed Solution

The proposed solution is a Gaussian Naïve Bayes classifier with two classes: spam and ham. GaussianNB assumes that the data for each label is drawn from a simple Gaussian distribution. The scikit-learn library provides an implementation of the Gaussian Naïve Bayes algorithm for classification.
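To make the Gaussian assumption concrete, here is a minimal plain-Python sketch of the idea behind Gaussian Naïve Bayes. It is an illustration only, not this project's training code: the project uses scikit-learn's `GaussianNB`, while the function names `fit_gaussian_nb` and `predict_gaussian_nb`, the `eps` constant, and the toy word counts below are all made up for this example.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate a log prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps keeps the variance strictly positive (assumed constant, for stability).
        variances = [
            sum((v - m) ** 2 for v in col) / n + eps
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one sample."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        # Sum of per-feature Gaussian log densities (the "naive" independence step).
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(row, means, variances)
        )
        return log_prior + log_likelihood

    return max(model, key=log_posterior)


# Toy word-count vectors (made up): columns count the words "free" and "meeting".
X_train = [[3, 0], [2, 0], [4, 1], [0, 2], [0, 3], [1, 2]]
y_train = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

model = fit_gaussian_nb(X_train, y_train)
spam_guess = predict_gaussian_nb(model, [3, 0])
ham_guess = predict_gaussian_nb(model, [0, 3])
```

Each class is summarized by one mean and one variance per feature, and a new email is assigned to whichever class gives its feature vector the higher posterior.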

Execution Plan

We propose the following technique to classify emails:

[image: execution plan]

Dataset

The dataset used to train our model was taken from Kaggle: https://www.kaggle.com/datasets/nitishabharathi/email-spam-dataset

  • The dataset contains 3 CSV files, each with 2 columns
  • The first column is the body of the email
  • The second column contains the label: 0 for not spam, 1 for spam
  • The 3 files together contain 18650 entries
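As a rough sketch of how three two-column files like these could be read into one labelled dataset and then balanced, the example below uses only the standard library. The column names `body` and `label`, the in-memory CSV strings, and the choice of random undersampling are assumptions for illustration; the README does not state the real headers or how the balancing was done.

```python
import csv
import io
import random

# Stand-ins for the three CSV files; the headers "body" and "label"
# are assumed names, not the dataset's actual column headers.
csv_texts = [
    "body,label\nWin a free prize now,1\nMeeting moved to 3pm,0\n",
    "body,label\nLunch tomorrow?,0\nCheap loans available,1\n",
    "body,label\nProject update attached,0\nSee you at the gym,0\n",
]

# Combine all files into one list of (body, label) rows.
rows = []
for text in csv_texts:
    for record in csv.DictReader(io.StringIO(text)):
        rows.append((record["body"], int(record["label"])))

# Balance the classes by randomly undersampling the larger one
# (one possible strategy, assumed here).
spam = [r for r in rows if r[1] == 1]
ham = [r for r in rows if r[1] == 0]
n = min(len(spam), len(ham))
rng = random.Random(0)  # fixed seed so the sample is reproducible
balanced = rng.sample(spam, n) + rng.sample(ham, n)
rng.shuffle(balanced)
```

With real files, `io.StringIO(text)` would be replaced by an opened file handle for each CSV.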

How the data was cleaned

We cleaned the data using the NLTK library for Python together with plain Python functions.

  • Balanced the dataset
  • Combined the 3 CSV files into 1 dataset
  • Removed links from the body column
  • Removed unnecessary symbols from the body column
  • Converted all text to lower case
  • Performed word tokenization
  • Used lemmatization to reduce different forms of the same word to one form
  • Removed stop words from the data
  • Vectorized the data using the bag-of-words method
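The text-cleaning and vectorization steps above can be sketched roughly as follows. This is a simplified stand-in, not the project's code: the regular expressions, the tiny `STOP_WORDS` set, and the helper names `clean` and `bag_of_words` are assumptions, whitespace splitting replaces NLTK's tokenizer, and lemmatization is left out because NLTK's lemmatizer needs a corpus download.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set; NLTK's English stop-word list is much larger.
STOP_WORDS = {"a", "an", "the", "to", "is", "at", "you", "your", "and", "of", "for"}


def clean(text):
    """Remove links and symbols, lowercase, tokenize, and drop stop words."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove links
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # remove symbols and digits
    tokens = text.lower().split()  # lowercase, then whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]


def bag_of_words(token_lists):
    """Build a sorted vocabulary and one count vector per document."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


emails = [
    "Claim your FREE prize at http://example.com now!!!",
    "The meeting is at 3pm, see you there.",
]
tokens = [clean(e) for e in emails]
vocab, vectors = bag_of_words(tokens)
```

Each email becomes a vector of word counts over the shared vocabulary, which is the numeric input a classifier such as GaussianNB expects.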

Algorithm

Algorithm comparison graph: [image]

Details: We use the Gaussian NB algorithm for classification. We tested several classification algorithms, and GaussianNB gave the best results on the test data.

Metrics

ROC Curve: [image]

Model Evaluation: After training and finding the best parameters, we achieved 90.07% accuracy on our test data.

Confusion Matrix: [image]

Classification Report: [image]
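For readers unfamiliar with these metrics, the sketch below shows how accuracy, precision, and recall follow from a binary confusion matrix. The labels and predictions are made up for illustration and are not the project's results; the helper name `confusion_counts` is likewise hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 = spam, 0 = ham."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1  # spam correctly flagged
        elif truth == 1 and pred == 0:
            fn += 1  # spam that slipped through
        elif truth == 0 and pred == 1:
            fp += 1  # ham wrongly flagged as spam
        else:
            tn += 1  # ham correctly passed
    return tn, fp, fn, tp


# Made-up labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # share of flagged emails that really are spam
recall = tp / (tp + fn)  # share of spam emails that were flagged
```

For a spam filter, false positives (legitimate email flagged as spam) are usually the more costly error, so precision is worth reading alongside accuracy.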

Target Audience

About 14.5 billion spam emails are sent every day, almost 45 percent of the world's email traffic. Internet Service Providers (ISPs) use spam filters to ensure they do not deliver malicious emails or links to recipients.

Demonstration

The demo shows how this model works. You can also try it yourself by scanning the QR code below.

Demo: [image: demo]

Scan to see for yourself: [image: QR code]

Advantages

No more Spam: [image: Spare_a_thought_for_your_Email_Spam_Filter]

Benefits:

  • Very effective and adaptive, so it is hard to fool
  • Based on text classification methods
  • Phenomenally accurate
  • Learns new spammer tactics automatically
  • Adapts to changing spam
  • Protects you
