mansidhamne/llm_phishing

Mail Phishing Detection

Using an LLM and several baseline models, we classify incoming emails as spam or not spam.

This project trains a few baseline models and a RoBERTa model on a binary classification task over short texts. It is structured into a training pipeline and an inference pipeline.
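To make the split between training and inference concrete, here is a minimal sketch of a Naive Bayes baseline in pure Python. It is not the code in `src/train.py` or `src/infer.py`: the function names `train_nb` and `infer_nb`, the whitespace tokenizer, and the toy messages are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()


def train_nb(texts, labels):
    """Training step: count words per class and return a plain-dict model."""
    word_counts = defaultdict(Counter)  # class label -> word -> count
    class_counts = Counter(labels)      # class label -> number of documents
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"word_counts": word_counts, "class_counts": class_counts, "vocab": vocab}


def infer_nb(model, text):
    """Inference step: return the class with the highest log-posterior."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, doc_count in model["class_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(doc_count / total_docs)  # log prior
        for token in tokenize(text):
            # Laplace (add-one) smoothing so unseen words do not zero out a class.
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data, made up for this example (1 = spam, 0 = not spam).
train_texts = [
    "win a free prize now",
    "claim your free reward today",
    "meeting moved to monday",
    "please review the attached report",
]
train_labels = [1, 1, 0, 0]

model = train_nb(train_texts, train_labels)
prediction = infer_nb(model, "free prize waiting for you")
print(prediction)  # prints 1 (spam) for this toy data
```

Keeping the trained model as a separate object that `infer_nb` only reads mirrors the two-pipeline layout: training can run once, and inference can be repeated on new messages.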

Dataset Used:

Models Used:

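Baseline Models: Naive Bayes (NB), Logistic Regression (LR), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), XGBoost

LLM Model: RoBERTa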

Results:

Baseline Models:

| Model | f1 | precision | recall | accuracy | training_time | inference_time |
|---|---|---|---|---|---|---|
| NB | 0.9031007751937980 | 0.8503649635036500 | 0.962809917355372 | 0.9561018437225640 | 0.0048120021820068 | 0.0026149749755859 |
| LR | 0.9444444444444440 | 0.9306569343065690 | 0.9586466165413530 | 0.9736611062335380 | 0.0553030967712402 | 0.0003492832183837 |
| KNN | 0.8576512455516010 | 0.8795620437956200 | 0.8368055555555560 | 0.929762949956102 | 0.0007669925689697 | 0.1529159545898440 |
| SVM | 0.9441441441441440 | 0.9562043795620440 | 0.9323843416370110 | 0.9727831431079890 | 0.6286451816558840 | 0.121741771697998 |
| XGBoost | 0.8783783783783780 | 0.948905109489051 | 0.8176100628930820 | 0.9367866549604920 | 1.163405179977420 | 0.0014581680297851600 |

RoBERTa:

| Model | f1 | precision | recall | accuracy | training_time | inference_time |
|---|---|---|---|---|---|---|
| RoBERTa | 0.9686924493554330 | 0.9776951672862450 | 0.9598540145985400 | 0.9850746268656720 | 14531.401168823200 | 293.3671679496770 |

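The metrics above show that, among the baseline models, Logistic Regression performs best on this dataset, with an F1 score of 94.44% and an accuracy of 97.37%. The RoBERTa model improves on this with an F1 score of 96.87% and an accuracy of 98.51%, at the cost of far longer training and inference times.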
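Because F1 and accuracy are separate columns in the tables, it can help to see how the four metrics relate. The sketch below computes them from confusion-matrix counts and re-derives the Logistic Regression F1 from its reported precision and recall. The function names and the toy counts are illustrative assumptions, not taken from the repository.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy counts, made up for illustration: on imbalanced data, accuracy
# can be much higher than F1 because true negatives dominate.
toy = classification_metrics(tp=8, fp=2, fn=2, tn=88)
print(round(toy["f1"], 2), round(toy["accuracy"], 2))  # prints 0.8 0.96

# Re-derive the LR row's F1 from its precision and recall in the table.
lr_f1 = f1_from(0.9306569343065690, 0.9586466165413530)
print(round(lr_f1, 4))  # prints 0.9444
```

The second result matches the `f1` column of the LR row, which confirms that 94.44% is an F1 score and not an accuracy figure.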

Examples:

Testing the best baseline model: [screenshot]

Testing the LLM model: [screenshot]

Project Structure:

llm_phishing
├───data
│   └───emails.csv
├───src
│   ├───get_data.py
│   ├───infer.py
│   ├───main.py
│   ├───preprocess.py
│   ├───train.py
│   └───utils.py
├───.gitignore
└───requirements.txt

Execute the training pipeline:

python src\main.py -t train -mt <model_type> -c <dataset_name> -l <label_col_name> -n <text_col_name>

Execute the inference pipeline:

python src\main.py -t infer -mt <model_type> -c <dataset_name> -l <label_col_name> -n <text_col_name>
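The flags in the two commands above could be parsed along the following lines. This is a sketch using Python's standard `argparse` module, not the repository's actual `src/main.py`. Only the short flags and the `train`/`infer` task values come from the commands; the long option names, the help strings, and the sample values `roberta`, `label`, and `text` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the short flags used in the commands (long names are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Run the training or inference pipeline (illustrative sketch)."
    )
    parser.add_argument("-t", "--task", choices=["train", "infer"], required=True,
                        help="which pipeline to run")
    parser.add_argument("-mt", "--model-type", required=True,
                        help="model to train or load")
    parser.add_argument("-c", "--csv", dest="dataset_name", required=True,
                        help="name of the dataset file")
    parser.add_argument("-l", "--label-col", required=True,
                        help="name of the label column")
    parser.add_argument("-n", "--text-col", required=True,
                        help="name of the text column")
    return parser


# Parse a sample command line; the values here are placeholders, not documented options.
args = build_parser().parse_args(
    ["-t", "train", "-mt", "roberta", "-c", "emails.csv", "-l", "label", "-n", "text"]
)
print(args.task, args.model_type, args.dataset_name, args.label_col, args.text_col)
```

Restricting `-t` with `choices` makes an unsupported task fail at parse time with a usage message, before any data is loaded.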
