GitHub - MoSadri/Thesis_2023: Thesis-McGill

Note

This is an extension of the speech classifier program developed by Thomas Davidson et al. As Thomas Davidson's repository is no longer maintained, we created our own, added modifications, and tested it with different datasets, including speech data from Berkeley.

There are 7 files in the "speech_classifier" folder:

generate_group_csv.py
count_groups.py
generate_trained_model.ipynb
speech_classifier.py
generate_cv_data.py
run_cross_validation.py
run_all_scenarios.py

The "data" folder referenced below is available in data.zip v1.0.0
Note that the files in the input folder are aligned with the original analysis data set of TDavidson (i.e., labeled_data.csv). In addition, the data entries are labeled with target groups in the files in the input folder.

All .py files can be executed from the speech_classifier folder with a simple command like this: python program_name.py

The first program: generate_group_csv.py reads the speech data available in the "data" folder (provided by Berkeley) and selects the desired number of groups to be analyzed. Different scenarios can be created for testing by altering the number of each targeted group in the file according to the "data_name" variable specified. For simplicity in our description, we will use the "balanced" data name as an example.

Input: ../data/berkeley_speech_dataset.csv
Output: ../data/balanced_dataset.csv
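
The selection step can be pictured as drawing up to a requested number of rows for each targeted group. The following is a minimal sketch under stated assumptions, not the actual code of generate_group_csv.py: the function name `select_groups` and the flag columns `target_black` and `target_women` are hypothetical, and rows are assumed to be dictionaries such as those produced by `csv.DictReader`.

```python
import random


def select_groups(rows, quotas, seed=0):
    """Pick up to quotas[column] rows flagged for each target-group column.

    rows   -- list of dicts, e.g. as produced by csv.DictReader
    quotas -- mapping of (hypothetical) flag column -> desired row count
    A row flagged for several groups is kept only once.
    """
    rng = random.Random(seed)
    chosen = {}  # row index -> row, so overlapping rows are not duplicated
    for column, wanted in quotas.items():
        candidates = [i for i, row in enumerate(rows) if row.get(column) == "1"]
        rng.shuffle(candidates)
        for i in candidates[:wanted]:
            chosen[i] = rows[i]
    # Return the selected rows in their original file order.
    return [chosen[i] for i in sorted(chosen)]


if __name__ == "__main__":
    demo = [
        {"text": "a", "target_black": "1", "target_women": "0"},
        {"text": "b", "target_black": "1", "target_women": "1"},
        {"text": "c", "target_black": "0", "target_women": "1"},
        {"text": "d", "target_black": "0", "target_women": "0"},
    ]
    picked = select_groups(demo, {"target_black": 1, "target_women": 2})
    print(len(picked), "rows selected")
```

Because a row can belong to more than one targeted group, the number of rows that end up in each group may differ from the number requested.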

The second program: count_groups.py is a simple script that prints the actual number of entries in each targeted group resulting from the first program.
Because different targeted groups overlap, it is useful to obtain the percentage of each targeted group produced.

Input: ../data/balanced_dataset.csv
Output: ../output/balanced_groups_counts.txt
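
A count of this kind might look like the sketch below. It is an illustration only, not the code of count_groups.py: the function name `count_groups` and the flag column names are hypothetical, and each row is assumed to mark group membership with the string "1".

```python
from collections import Counter


def count_groups(rows, group_columns):
    """Return {column: (count, percent_of_rows)} for each flag column."""
    counts = Counter()
    for row in rows:
        for column in group_columns:
            if row.get(column) == "1":
                counts[column] += 1
    total = len(rows)
    return {
        column: (counts[column], 100.0 * counts[column] / total if total else 0.0)
        for column in group_columns
    }


if __name__ == "__main__":
    demo = [
        {"target_black": "1", "target_women": "1"},
        {"target_black": "1", "target_women": "0"},
    ]
    for name, (count, percent) in count_groups(demo, ["target_black", "target_women"]).items():
        print(f"{name}: {count} rows ({percent:.1f}%)")
```

Since one row can count toward several groups, the percentages do not have to add up to 100.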

The third program: generate_trained_model.ipynb is a Jupyter Notebook script to generate the trained model that needs to be passed into the speech classifier program.
This program will produce 5 files with a .pkl extension, which need to be passed to speech_classifier.py.

Data folder: Should be placed in the top-level directory (root directory)
Input: ../data/balanced_dataset.csv
Output:

../data/balanced_model.pkl
../data/balanced_tfidf.pkl
../data/balanced_idf.pkl
../data/balanced_pos.pkl
../data/balanced_oth.pkl
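
The five output names follow the pattern data name plus suffix, so they can be derived from the "data_name" value. The sketch below is hedged: the helper names `artifact_paths` and `load_artifacts` are hypothetical, and loading with the standard `pickle` module is an assumption (the notebook may use a different serializer).

```python
import pickle
from pathlib import Path

# Suffixes taken from the five output file names listed above.
SUFFIXES = ("model", "tfidf", "idf", "pos", "oth")


def artifact_paths(data_name, data_dir="../data"):
    """Map each suffix to its expected .pkl path for the given data name."""
    return {s: Path(data_dir) / f"{data_name}_{s}.pkl" for s in SUFFIXES}


def load_artifacts(data_name, data_dir="../data"):
    """Unpickle all five files (assumption: they were written with pickle)."""
    loaded = {}
    for suffix, path in artifact_paths(data_name, data_dir).items():
        with open(path, "rb") as handle:
            loaded[suffix] = pickle.load(handle)
    return loaded


if __name__ == "__main__":
    for suffix, path in artifact_paths("balanced").items():
        print(suffix, "->", path.as_posix())
```

Unpickling executes code stored in the file, so only load .pkl files from a source you trust.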

The file: speech_classifier.py is the actual speech classifier program. It analyzes all the speech files in the input folder and produces the number of hate speech instances detected in the input files. It is currently also set to analyze a pre-labeled data file named "labeled_data.csv."
This file was created by TDavidson to test the program's performance. We use this same file to determine the accuracy, precision, recall, and F1 score of our program.

Trained model files:

../data/balanced_model.pkl
../data/balanced_tfidf.pkl
../data/balanced_idf.pkl
../data/balanced_pos.pkl
../data/balanced_oth.pkl

Input CSV files to be analyzed: all files located in ../input
Output: The output is located in ../output. The program will produce one output file for every file it finds in the input directory, listing the predicted class for each tweet/text within each file.
It will also generate two PDF files and two TXT files. The PDF files are named "original_hate_vs_balanced_hate.pdf" and "original_hate+offensive_vs_balanced_hate.pdf" and contain confusion matrices based on the analysis of the original analysis data set "labeled_data.csv." The TXT files are named "original_hate_vs_balanced_hate.txt" and "original_hate+offensive_vs_balanced_hate.txt" and contain quality scores of the classifier program based on the analysis of the input CSV files.

We concentrate on just two classes, "hate" or "not hate" in our program (and designed our trained dataset accordingly). Therefore, we produce the file "original_hate_vs_balanced_hate.pdf" by considering all "Offensive" class instances in "labeled_data.csv" as incorrectly classified. The second file, "original_hate+offensive_vs_balanced_hate.pdf," treats all "Hate" and "Offensive" class instances the same as "Hate," resulting in higher accuracy.
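
The difference between the two evaluations can be sketched as two ways of collapsing the three original labels into a binary "hate" target before scoring. This is one reading of the description above, not the code of speech_classifier.py: the function names are hypothetical, and treating "Offensive" as "not hate" in the first evaluation is an assumption about what "incorrectly classified" means there.

```python
def binary_scores(gold, predicted):
    """Accuracy, precision, recall and F1 for the positive class (True = hate)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    tn = sum(1 for g, p in pairs if not g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def evaluate(original_labels, predicted_hate, offensive_counts_as_hate):
    """Score binary predictions against the three-class original labels.

    original_labels -- "Hate", "Offensive" or "Neither" per entry
    predicted_hate  -- True where the classifier predicted hate
    """
    positive = {"Hate", "Offensive"} if offensive_counts_as_hate else {"Hate"}
    gold = [label in positive for label in original_labels]
    return binary_scores(gold, predicted_hate)


if __name__ == "__main__":
    labels = ["Hate", "Offensive", "Neither", "Offensive"]
    predictions = [True, True, False, False]
    print("hate only:       ", evaluate(labels, predictions, False))
    print("hate + offensive:", evaluate(labels, predictions, True))
```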

The file run_all_scenarios.py automates the execution of speech_classifier.py for the 4 pre-configured scenarios: black, women, lgbt, and balanced. This program outputs a table in CSV format containing the quality scores of these scenarios.

Input: speech_classifier.py
Output: ../output/full_table.csv
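
The automation amounts to a loop over the scenario names that collects one row of scores per run. The sketch below shows that shape only. The function `build_table`, the `score_fn` callable standing in for one classifier run, and the column names are all hypothetical; the real layout of full_table.csv may differ.

```python
import csv
import io

# Scenario names as listed above.
SCENARIOS = ("black", "women", "lgbt", "balanced")


def build_table(score_fn, scenarios=SCENARIOS):
    """Call score_fn(scenario) for each scenario and return the table as CSV text.

    score_fn is a hypothetical stand-in for one classifier run; it must
    return a dict with the keys accuracy, precision, recall and f1.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["scenario", "accuracy", "precision", "recall", "f1"],
        lineterminator="\n",
    )
    writer.writeheader()
    for scenario in scenarios:
        row = {"scenario": scenario}
        row.update(score_fn(scenario))
        writer.writerow(row)
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder scores, used only to show the output format.
    def placeholder(_scenario):
        return {"accuracy": 0.0, "precision": 0.0, "recall": 0.0, "f1": 0.0}

    print(build_table(placeholder))
```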

Cross Validation: There are two files

generate_cv_data.py and run_cross_validation.py.

These two files are used for performing k-fold cross-validation.
Currently, k is set to 5, but users can set it to different numbers to perform k-fold cross-validation.

The steps are:

python generate_group_csv.py
python generate_cv_data.py
python run_cross_validation.py

The first program: generate_group_csv.py is used to produce training data with the desired number for each targeted group. This is the same program as before.
- Input: ../data/berkeley_speech_dataset.csv
- Output: ../data/balanced_dataset.csv
The second program: generate_cv_data.py splits the file into k+1 pieces in preparation for k-fold validation, with one piece used as the analysis dataset and the 5 other pieces used as training datasets. This step is repeated k times so that each piece of data gets a chance to be the analysis dataset. The program also generates the .pkl files needed for each training dataset.
- Input: ../data/balanced_dataset.csv
- Output:
  Analysis sets: ../cv_data/balanced_cvanalysis_fold1.csv, ../cv_data/balanced_cvanalysis_fold2.csv, ... ../cv_data/balanced_cvanalysis_foldk.csv
  Training sets: balanced_cvtrain_fold1.csv, balanced_cvtrain_fold2.csv, ... balanced_cvtrain_foldk.csv
  Pkl files: balanced_cvtrain_fold1_idf.pkl, balanced_cvtrain_fold1_model.pkl, balanced_cvtrain_fold1_oth.pkl, balanced_cvtrain_fold1_pos.pkl, balanced_cvtrain_fold1_tfidf.pkl, ... balanced_cvtrain_foldk_tfidf.pkl
Program for running the actual cross-validation:
run_cross_validation.py will run through all the analysis sets and training sets for each fold and generate quality scores for each file. These quality scores include accuracy, precision, recall, and F1 score.
Input: "Analysis sets" and "Training sets" generated by generate_cv_data.py
Output: ../cv_output/balanced_cvresults_fold1.txt, ../cv_output/balanced_cvresults_fold2.txt, ... ../cv_output/balanced_cvresults_foldk.txt
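
For readers unfamiliar with the technique, the sketch below shows textbook k-fold splitting: the data is cut into k pieces, and each fold holds one piece out for analysis while training on the rest. It is an illustration of the general scheme and may differ in detail from what generate_cv_data.py does; the function name `k_fold_splits` is hypothetical.

```python
import random


def k_fold_splits(rows, k=5, seed=0):
    """Yield (fold_number, training_rows, analysis_rows) for standard k-fold CV."""
    if k < 2 or k > len(rows):
        raise ValueError("k must be between 2 and the number of rows")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k pieces of near-equal size.
    pieces = [indices[i::k] for i in range(k)]
    for fold, held_out in enumerate(pieces, start=1):
        held = set(held_out)
        training = [rows[i] for i in indices if i not in held]
        analysis = [rows[i] for i in held_out]
        yield fold, training, analysis


if __name__ == "__main__":
    data = [f"row{i}" for i in range(10)]
    for fold, training, analysis in k_fold_splits(data, k=5):
        print(f"fold{fold}: {len(training)} training rows, {len(analysis)} analysis rows")
```

Each row lands in exactly one analysis set, so every entry is scored once across the k folds.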

Important

Data Used

Our program uses the newest Python version (3.11 at the time of our testing), which is an update from version 2.7 in the original TDavidson run. We obtained the trained data files (the .pkl files) from TDavidson's repository and repickled them so they can be used in our program. These files are prefixed with "original_" and located in the data folder. Unfortunately, these files are trained models, and the CSV files used to generate them aren't available, so it is not possible to modify them.

TDavidson also uses an analysis set named "labeled_data.csv," which is a set of tweets with manually labeled classes ("Hate," "Offensive," or "Neither").

Important

To test with different data, we downloaded a new set of data from Berkeley researchers (link) and named it "berkeley_speech_dataset.csv"; it is placed in the data folder as well.

We use "berkeley_speech_dataset.csv" to create four different scenarios to observe the effectiveness of this speech classifier program.

Scenario 1: csv files with tweets targeting mostly black
Scenario 2: csv files with tweets targeting mostly women
Scenario 3: csv files with tweets targeting mostly LGBT
Scenario 4: csv files with tweets targeting a balanced group (including black, women and LGBT group)

Note

Different scenarios can be created by setting different numbers when running the program "generate_group_csv.py". Here are our configurations:

Scenario	Configuration
Configuration for scenario 1	9000 black, 500 women, 200 trans, 150 gay, 150 lesbian
Configuration for scenario 2	9000 women, 500 black, 200 trans, 150 gay, 150 lesbian
Configuration for scenario 3	15000 LGBT, 3300 black, 3300 women
Configuration for scenario 4	3300 black, 3300 women, 2800 trans, 100 gay, 500 lesbian

Results

The results table presented summarizes the performance metrics obtained after running cross-validation on different target groups.

Execute run_all_scenarios.py to get the results table below

Input: Consists of four sets of scenarios as training datasets, followed by the analysis datasets
Output: Includes the number of hate speech instances detected, among other things, plus a compiled table in CSV format for the four scenarios

Scenario	Target Group	Accuracy	Precision (Hate)	Recall (Hate)	F1 Score (Hate)
Black	Black	67%	91%	65%	76%
Black	Women	74%	94%	71%	81%
Black	LGBT	84%	95%	86%	90%
Women	Black	62%	96%	55%	70%
Women	Women	75%	96%	71%	81%
Women	LGBT	70%	96%	68%	76%
LGBT	Black	70%	87%	72%	79%
LGBT	Women	77%	87%	83%	85%
LGBT	LGBT	87%	92%	93%	93%
Balanced	Black	68%	94%	64%	76%
Balanced	Women	83%	94%	84%	89%
Balanced	LGBT	85%	95%	88%	91%
