This project contains all code produced for my master's thesis: Detection of Clone-Phishing using Machine Learning.
The most important directory. It contains the code to analyze a URL using the multifilter approach evaluated in the master's thesis. A URL is classified by:
- blacklist to check whether the entry is still in the blacklist
- lexical filter with Random Forest
- content filter with Random Forest
- signature filter to check the signature using Bing
- score fusion to predict the final classification using a Decision Tree
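The flow of the steps above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the names `classify_url`, `example_filters` and `threshold_fusion` are hypothetical, the filter scores are made-up constants, and a simple mean threshold stands in for the trained Decision Tree.

```python
from typing import Callable, Dict

# Hypothetical filter signature: takes a URL, returns a phishing score in [0, 1].
Filter = Callable[[str], float]


def classify_url(url: str, filters: Dict[str, Filter],
                 fuse: Callable[[Dict[str, float]], int]) -> int:
    """Run every filter on the URL and fuse the scores into one label."""
    scores = {name: run_filter(url) for name, run_filter in filters.items()}
    return fuse(scores)


# Placeholder filters standing in for the blacklist, the two Random Forests
# and the signature check; the constants are invented for illustration.
example_filters: Dict[str, Filter] = {
    "blacklist": lambda url: 0.0,
    "lexical": lambda url: 0.9,
    "content": lambda url: 0.8,
    "signature": lambda url: 1.0,
}


def threshold_fusion(scores: Dict[str, float]) -> int:
    """Stand-in for the Decision Tree: 1 (phishing) if the mean score exceeds 0.5."""
    return int(sum(scores.values()) / len(scores) > 0.5)


label = classify_url("http://example.test/login", example_filters, threshold_fusion)
print(label)
```

Passing the fusion step in as a callable keeps the sketch independent of the model that is actually used for score fusion.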
+++ MENU FOR COMMANDS +++
- database --> print menu for database creation.
- features --> print menu for feature extraction.
- filter --> print menu for all filter.
- predict [url] --> predict url with final multi filter approach using two Random Forests and Decision Tree for score fusion
- test --> run test code from /testing.
- config --> print configuration from definitions file.
- exit --> exit the system.
Type in the displayed command to go further into the menu structure, or predict a URL with the final multi filter approach (98% accuracy).
[SYSTEM] | [INFO] | [15/04/2021 15:02:25] | [Function get_f1] Precision: 0.982420554428668
[SYSTEM] | [INFO] | [15/04/2021 15:02:25] | [Function get_f1] Recall: 0.9764784946236559
[SYSTEM] | [INFO] | [15/04/2021 15:02:25] | [Function get_f1] F1: 0.9794405123019886
Type in a command:
predict https://github.com/newH1VE/Fish4Phish/edit/main/README.md
or
filter
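The F1 value in the log output above is consistent with the harmonic mean of the logged precision and recall. The following sketch recomputes it; the function name `f1_from_precision_recall` is an assumption and is not taken from the project's `get_f1`.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the log lines above.
precision = 0.982420554428668
recall = 0.9764784946236559

f1 = f1_from_precision_recall(precision, recall)
print(f"F1: {f1}")
```

The result agrees with the logged F1 of 0.9794405123019886 to at least six decimal places.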
Components contain the workflows for the different tasks.
- comp_database: workflow to create databases for all filters
- comp_feature_extraction: workflow to create lexical/content/signature features from the created database file
- comp_feature_selection: workflow to select extracted features for lexical or content based analysis using NECGT-MI
Modules are used by their corresponding components. They contain the methods that implement the workflows of the components.
- mod_database: all methods to delete, open, write files containing data
- mod_feature_extraction: all methods to extract features for the lexical/content/signature filter (methods for lists and for single entries are separated)
- mod_feature_selection: all methods to select features using the neighborhood-entropy based cooperative game theory
This directory contains the configuration, including the definitions of paths for data and main files and the parameters of the implemented logger.
- configuration: all path/file and logging parameter definitions
- program_config: all methods to save configs to fish4phish.ini
- fish4phish.ini: configuration file that contains the date of the last update for the blacklist database
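Storing and reading back the date of the last blacklist update could look roughly like this. The sketch uses Python's standard `configparser`; the section name `blacklist`, the key `last_update`, the ISO date format and both function names are assumptions, not taken from program_config. An in-memory buffer replaces the real fish4phish.ini file.

```python
import configparser
import io
from datetime import date

SECTION = "blacklist"   # assumed section name
KEY = "last_update"     # assumed key name


def save_last_update(config: configparser.ConfigParser, day: date) -> None:
    """Store the date of the last blacklist update in ISO format."""
    if not config.has_section(SECTION):
        config.add_section(SECTION)
    config.set(SECTION, KEY, day.isoformat())


def load_last_update(config: configparser.ConfigParser) -> date:
    """Read the stored date back as a date object."""
    return date.fromisoformat(config.get(SECTION, KEY))


config = configparser.ConfigParser()
save_last_update(config, date(2021, 4, 15))

# A real run would write to fish4phish.ini; a buffer keeps the sketch self-contained.
buffer = io.StringIO()
config.write(buffer)

restored = configparser.ConfigParser()
restored.read_string(buffer.getvalue())
last_update = load_last_update(restored)
print(last_update)
```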
This directory implements all needed classes and enums. Classes contain variables to save all features as well as the URL and label for all filters. Enums specify logging actions such as informative, warning and error.
All classes for the blacklist, content filter, lexical filter, signature filter, letter frequencies and logging color, and for saving the redirects followed on a website.
The enums define the different actions of the logger. Three actions are implemented:
- Informative: [INFO]
- Warning: [WARN]
- Error: [ERR]
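The three actions above map naturally onto a Python enum. The sketch below is an assumption about how such an enum might be combined with the log line layout shown in the sample output earlier in this README; the names `LogAction` and `format_log` are hypothetical.

```python
from datetime import datetime
from enum import Enum
from typing import Optional


class LogAction(Enum):
    """Logger actions and the tag printed for each of them."""
    INFORMATIVE = "[INFO]"
    WARNING = "[WARN]"
    ERROR = "[ERR]"


def format_log(action: LogAction, function_name: str, message: str,
               now: Optional[datetime] = None) -> str:
    """Build one log line in the layout of the sample output (day/month/year)."""
    stamp = (now or datetime.now()).strftime("%d/%m/%Y %H:%M:%S")
    return f"[SYSTEM] | {action.value} | [{stamp}] | [Function {function_name}] {message}"


line = format_log(LogAction.INFORMATIVE, "get_f1", "F1: 0.98",
                  datetime(2021, 4, 15, 15, 2, 25))
print(line)
```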
Code moved out of main.py to make the main file slightly smaller. The main files call the workflows of the components.
- main_config: main file for menu item config
- main_databse: main file for menu item database (comp_database)
- main_features: main file for the menu item features; contains functions for feature extraction and selection (comp_feature_extraction, comp_feature_selection)
- main_content: main file for content based filter (phishing_filter/ml_content)
- main_lexical: main file for lexical based filter (phishing_filter/ml_lexical)
Helpers are all functions that cannot be clearly assigned to one module or were removed from a module to make the code smaller.
All methods that help with feature extraction
All methods that help modules other than feature extraction
All methods to log function prints and typed commands.
Contains all print statements for menus.
This directory contains all files for the filters.
Tested single filter approach.
Implements all actions to update the blacklist and to add, remove and check entries.
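The add, remove and check actions can be pictured with a small in-memory sketch. This is only an illustration under assumptions: the class name `Blacklist`, its method names and the whitespace trimming are hypothetical, and the project stores its blacklist in database files rather than in a Python set.

```python
class Blacklist:
    """Minimal in-memory blacklist supporting add, remove and check."""

    def __init__(self) -> None:
        self._entries = set()

    @staticmethod
    def _normalize(url: str) -> str:
        # Assumption: only surrounding whitespace is trimmed before comparison.
        return url.strip()

    def add(self, url: str) -> None:
        self._entries.add(self._normalize(url))

    def remove(self, url: str) -> None:
        # discard() does not raise if the entry is absent.
        self._entries.discard(self._normalize(url))

    def contains(self, url: str) -> bool:
        return self._normalize(url) in self._entries


blacklist = Blacklist()
blacklist.add(" http://phish.example.test/login ")
was_listed = blacklist.contains("http://phish.example.test/login")
blacklist.remove("http://phish.example.test/login")
still_listed = blacklist.contains("http://phish.example.test/login")
print(was_listed, still_listed)
```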
ML_Lexical and ML_content have the same structure:
- files for each machine learning model (Random Forest: rf.py, Extreme Gradient Boosting: xgb.py, K-Nearest Neighbor: knn.py, Logistic Regression: lr.py, Support Vector Machine: svm.py, Decision Tree: dt.py, Adaptive Boosting: ab.py)
The structure for each model is identical:
- train_model: train the model on the data passed as function parameter
- optimize: optimize model hyperparameters using randomized search
- print_scores: run cross-validation with 5 splits and print the resulting scores
- transform_data: delete columns that don't contain features or labels (ID, URL, Final URL)
- save_last_score: save produced score to file in folder saved_scores
- load_last_score: load score from file in folder saved_scores
- save_model: save model to file in folder saved_models
- load_model: load model from file in folder saved_models
- predict_url: classify a URL with the model
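Two of the listed steps, dropping non-feature columns and saving or loading a model, can be sketched with the standard library alone. This is a sketch under assumptions: rows are plain dictionaries instead of the project's data structures, `pickle` is used as the storage format, the file name `rf.pkl` and the columns `url_length` and `Label` are invented, and a dictionary stands in for a trained model.

```python
import pickle
import tempfile
from pathlib import Path

# Columns named in the list above that hold neither features nor labels.
NON_FEATURE_COLUMNS = {"ID", "URL", "Final URL"}


def transform_data(rows):
    """Return copies of the rows without the non-feature columns."""
    return [{key: value for key, value in row.items()
             if key not in NON_FEATURE_COLUMNS} for row in rows]


def save_model(model, folder: Path, name: str = "rf.pkl") -> Path:
    """Serialize the model into the given folder and return the file path."""
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / name
    with path.open("wb") as handle:
        pickle.dump(model, handle)
    return path


def load_model(path: Path):
    """Load a previously saved model from disk."""
    with path.open("rb") as handle:
        return pickle.load(handle)


rows = [{"ID": 1, "URL": "http://example.test", "Final URL": "http://example.test/",
         "url_length": 19, "Label": 0}]
cleaned = transform_data(rows)

with tempfile.TemporaryDirectory() as tmp:
    stored = save_model({"kind": "placeholder"}, Path(tmp) / "saved_models")
    restored = load_model(stored)

print(cleaned, restored)
```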
Contains the Decision Tree, with the structure explained above, and the fusion code implementing majority vote and weighted majority vote.
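The two voting schemes can be illustrated for binary labels (1 = phishing, 0 = legitimate). The function names, the tie handling (a tie counts as legitimate) and the example weights are assumptions made for this sketch, not the project's implementation.

```python
from typing import Sequence


def majority_vote(votes: Sequence[int]) -> int:
    """Return 1 if strictly more than half of the filters vote phishing."""
    return int(sum(votes) * 2 > len(votes))


def weighted_majority_vote(votes: Sequence[int], weights: Sequence[float]) -> int:
    """Return 1 if the phishing votes carry more than half of the total weight."""
    if len(votes) != len(weights):
        raise ValueError("votes and weights must have the same length")
    phishing_weight = sum(w for vote, w in zip(votes, weights) if vote == 1)
    return int(phishing_weight * 2 > sum(weights))


votes = [1, 0, 0]            # only the first filter votes phishing
plain = majority_vote(votes)
weighted = weighted_majority_vote(votes, [3, 1, 1])  # first filter trusted most
print(plain, weighted)
```

With these example weights the single phishing vote outweighs the two legitimate votes, so the two schemes disagree on the same input.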
The file signature_check contains the signature based filter. It is implemented as a class that inherits from the Classifier class provided by sklearn, so that the custom classifier is compatible with sklearn functions.
The main file of the project. It starts first and calls all of the functionality explained above.