The goal of this project is to improve abusive language detection, with a focus on implicit abuse. Python was used for data preprocessing, dataset builds, and SVM training. R was used to verify dataset properties (e.g., length, headers). The paper was written and compiled in LaTeX.
This repository contains all the resources you will need to replicate the results. Boosting data takes a long time depending on the lexicon used; I recommend using a computer with at least 4c/4t and 16GB of memory. This is merely a recommendation, not a requirement.
This repo was originally created on October 30, 2019. I had to delete and recreate it to remove stranded Git LFS objects. RIP to 537 commits.
The publication can be viewed on the ACL Anthology website or in the `Paper/` directory.
The paper was presented at the Fourth Workshop on Online Abuse and Harms (WOAH), co-located with the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020).
- Update code
  ```sh
  git pull
  git submodule update --init
  git pull --recurse-submodules
  ```
- Install dependencies
  - Vanilla Python:
    ```sh
    pip3 install -r requirements.txt
    ```
  - Conda:
    - Select the desired conda env before installing (see `nlpGPU_env.yml` for my NLP-focused conda env)
    ```sh
    conda install --file requirements.txt
    ```
- Clone
  ```sh
  git clone https://github.com/danterazo/abusive-language-detection.git
  ```
- Configure
  - See `main()` in `kaggle_train.py`; an example configuration is sketched below

| Variable | Data Type | Default Value | Possible Values | Purpose |
| --- | --- | --- | --- | --- |
| `samples` | str | `"all"` | `"random"`, `"boosted_topic"`, `"boosted_wordbank"`, `"all"` | Lets the user choose which sample types to train on. |
| `analyzer` | str | `"word"` | `"char"`, `"word"` | Toggles the n-gram analyzer. |
| `ngram_range` | (int, int) | `(1, 3)` | {i \| i ∈ Z+} | Couple (2-tuple) of lower and upper n-gram boundaries. |
| `manual_boost` | [str] | `["trump"]` | a list of strings or `None` | If not `None`, overrides the predefined wordbanks when boosting. |
| `rebuild` | bool | `False` | `True`, `False` | If `True`, resamples and rebuilds the training data and lexicons. The former is computationally expensive. |
| `per_sample` | int | `3` | {i \| i ∈ Z+, i > 0} | Sets the number of each sample type to build and train. Ignored if `rebuild` is `False`. |
| `sample_size` | int | `20000` | {i \| i ∈ Z+, i > 0} | Sets the size of each dataset when building. If any set has fewer than 2000 examples, the others will be trimmed to match it. Ignored if `rebuild` is `False`. |
| `verbose` | bool | `True` | `True`, `False` | Controls verbose print statements. Passed to other functions like a React prop. |
| `calc_pct` | bool | `True` | `True`, `False` | If `True`, calculates the percentage of abusive words in each sample using the manual, Wiegand Base, and Wiegand Extended lexicons. Very computationally expensive. |
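For instance, a quick character-n-gram run without rebuilding might be configured like this (a sketch; the variable names match the table above, but the exact shape of `main()` may differ):

```python
# Example configuration for main() in kaggle_train.py (sketch, not verbatim)
samples = "all"            # train on every sample type
analyzer = "char"          # character n-grams instead of word n-grams
ngram_range = (2, 5)       # lower and upper n-gram boundaries
manual_boost = None        # keep the predefined wordbanks
rebuild = False            # reuse the prebuilt datasets in data/
per_sample = 3             # datasets per sample type (ignored since rebuild=False)
sample_size = 20000        # dataset size when building (ignored since rebuild=False)
verbose = True             # chatty print statements
calc_pct = False           # skip the very expensive percentage calculation
```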
- Train
  - Once you've configured the script, simply run `kaggle_train.py`. No user input is required.
    ```sh
    python3 kaggle_train.py
    ```
- Wait patiently for results
  - Percentage calculation (see `calc_pct` above) is time-consuming due to the regex compilation and boosting steps
    - WIP: parallel / multithreaded calculation
  - Rebuilt datasets can be found in `data/`
  - Class predictions can be found in `output/pred/`
  - Classification reports can be found in `output/report/`
Throughout the code I refer to our manually-tagged lexicon, based on Wiegand's base lexicon, as either `manualLexicon` or `rds`; the latter comes from the initials of the contributors' last names (Dante Razo, DD, Leah Schaede).
I fit what I could into this repo. The untouched `train.csv` set is available upon request or here.
This file (`kaggle_build.py`) exports data for later use. Prebuilt data is included in the repo, so it's not necessary to run this script. To run it, set the `rebuild` flag in `KaggleSVM/kaggle_train.py` to `True`, then run the script.
Builds and exports sampled training sets from the large `train.csv` dataset.
- Params
  - `sample_type` (str): which sample types to build: `"random"`, `"boosted"`, or `"all"`
  - `boost_topic` ([str]): list of strings to boost on
  - `repeats` (int): number of datasets to build per sample type
  - `sample_size` (int): size of the sampled datasets. If set too high, the smaller size will be used
  - `verbose` (bool): verbosity flag; controls the logging level
- Return
  - None
- Write
  - None
Quick function to import `train.target+comments.tsv` and call `kaggle_preprocessing.read_data()` to format it.
- Params
  - None
- Return
  - Preprocessed full training dataset: (df)
- Write
  - None
Calls `kaggle_preprocessing.sample_data()` to shuffle `train.csv` and cut it down to the desired sample size.
- Params
  - `data` (df): full training data to sample from
  - `sample_size` (int): upper bound for cutting the result down to size
  - `repeats` (int): number of datasets to build per sample type
- Return
  - None
- Write
  - `data/train.random{i}.csv` for index `i`
Calls `kaggle_preprocessing.boost_data()` to boost `train.target+comments.tsv` on the built-in wordbank or a user-defined wordbank (passed as the `manual_boost` parameter).
- Params
  - `data` (df): full training data to sample from
  - `manual_boost` ([str]): list of strings to boost on
  - `sample_size` (int): upper bound for cutting the result down to size
  - `repeats` (int): number of datasets to build per sample type
- Return
  - None
- Write
  - `data/train.boosted{i}.csv` for index `i`
Wrapper function for importing lexicons. It reformats them accordingly as well; this could be considered processing, but I left it in `kaggle_build.py` because it also exports them.
- Params
  - None
- Return
  - None
- Write
  - `data/lexicon_wiegand/`
    - `lexicon.wiegand.base.csv`
    - `lexicon.wiegand.expanded.csv`
    - `lexicon.wiegand.base.explicit.csv`
    - `lexicon.wiegand.expanded.explicit.csv`
  - `data/lexicon_manual/`
    - `lexicon.manual.all.explicit.csv`
Another wrapper function. This one calls helper functions to import and process the manually-tagged lexicons, then combines them into one DataFrame and exports it.
- Params
  - None
- Return
  - None
- Write
  - `data/manual_lexicon/lexicon.manual.all`
Strips unnecessary columns from my manually-tagged lexicon.
- Params
  - `filename` (str): the name of the CSV to be read
- Return
  - Processed lexicon: (df)
- Write
  - None
Strips unnecessary columns from DD's manually-tagged lexicon (`.tsv`), then converts the text classes to ints.
- Params
  - `filename` (str): the name of the file to be read
- Return
  - Processed lexicon: (df)
- Write
  - None
Strips unnecessary columns from Schaede's manually-tagged lexicon (`.csv`), then converts the text classes to ints.
- Params
  - `filename` (str): the name of the CSV to be read
- Return
  - Processed lexicon: (df)
- Write
  - None
Writes the given DataFrame to storage.
- Params
  - `sample_name` (str): the name of the sample; used to construct the filename
  - `data` (df): the DataFrame to export
  - `extension` (str): the extension to save the df as; optional, defaults to `.csv`
- Return
  - None
- Write
  - `data/train.{sample_name}{i}{extension}` for index `i`
A more generalized version of `export_data()`. It doesn't prepend "train" to the filename and allows different filepaths.
- Params
  - `data` (df): the DataFrame to export
  - `sample` (str): the type of sample and part of the filename; can be blank
  - `i` (int): the index and part of the filename; can be blank
  - `path` (str): the path to save the file to; leave blank to save to the CWD
  - `prefix` (str): the prefix of the filename (e.g. "topic", "report", etc.); can be blank
  - `index` (bool): if `True`, write row names; defaults to `True`
- Return
  - None
- Write
  - `{path}/{prefix}.{sample}{i}.csv`
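For reference, the export pattern described above boils down to a `DataFrame.to_csv()` call with a constructed path. A minimal sketch (the helper name is hypothetical; assumes pandas):

```python
import os
import pandas as pd

def export_df_sketch(data: pd.DataFrame, sample="", i="", path=".",
                     prefix="report", index=True) -> None:
    """Write data to {path}/{prefix}.{sample}{i}.csv, creating the directory if needed."""
    os.makedirs(path, exist_ok=True)
    data.to_csv(f"{path}/{prefix}.{sample}{i}.csv", index=index)
```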
This file reformats and cleans the data from `kaggle_build.py` into something the SVM can use.
Reads the given dataset line by line. Some comments contain tabs or commas, which can cause issues depending on the file delimiter. Also removes entries with missing values (there's only one in `train.target+comments.tsv` without a score).
- Params
  - `dataset` (str): filename of the dataset to import
  - `verbose` (bool): toggles print statements; defaults to `True`
- Return
  - Clean delimited data: (df)
- Write
  - None
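Reading line by line and splitting on the first delimiter only is one way to keep stray tabs and commas inside comments intact. A sketch of that idea (assuming a tab-delimited file with the score in the first column; the real `read_data()` may differ):

```python
import pandas as pd

rows = []
with open("data/train.target+comments.tsv", encoding="utf-8") as f:
    next(f)  # skip the header row
    for line in f:
        target, _, comment = line.rstrip("\n").partition("\t")  # split on first tab only
        if target:  # drop entries with a missing score
            rows.append((float(target), comment))

data = pd.DataFrame(rows, columns=["target", "comment_text"])
```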
Reads the `kaggle_toxic` training file, cleans it up, and applies the correct header names.
- Params
  - None
- Return
  - None
- Write
  - `data/src_new/kaggle-toxic_train-clean.csv`
Given a DataFrame, shuffles it and cuts it down to the given size.
- Params
  - `data` (df): data to sample
  - `size` (int): sample size
- Return
  - Sampled data: (df)
- Write
  - None
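In pandas, the shuffle-then-truncate behavior can be sketched in a couple of lines (a hypothetical helper; the real `sample_data()` may differ):

```python
import pandas as pd

def sample_data_sketch(data: pd.DataFrame, size: int) -> pd.DataFrame:
    """Shuffle all rows, then keep at most `size` of them."""
    shuffled = data.sample(frac=1).reset_index(drop=True)  # full shuffle
    return shuffled.head(min(size, len(shuffled)))         # cut down to size
```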
Given data, returns only the rows containing predefined abusive words. Or, if given a wordbank, returns the rows containing any of those words instead.
- Params
  - `data` (df): DataFrame to boost
  - `data_name` (str): filename for print statements; ignored if `verbose=False`
  - `verbose` (bool): controls verbosity; defaults to `True`
  - `manual_boost` ([str], or None): user-defined wordbank to boost on; defaults to `None`
- Return
  - Boosted data: (df)
- Write
  - None
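The row-filtering step can be sketched with a single compiled regex over `comment_text` (a hypothetical helper; the real `boost_data()` also handles the built-in wordbanks and verbosity):

```python
import re
import pandas as pd

def boost_sketch(data: pd.DataFrame, wordbank: list) -> pd.DataFrame:
    """Keep only the rows whose comment mentions at least one wordbank term."""
    # One compiled alternation is much faster than looping over words per row
    pattern = re.compile("|".join(re.escape(w) for w in wordbank), re.IGNORECASE)
    return data[data["comment_text"].str.contains(pattern, na=False)]
```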
This file trains `n` SVMs for all three sample types, with `n` being the `repeats` flag.
This is where the magic happens. Fits the CountVectorizer, trains the SVM, and prints + exports results per dataset.
- Params
  - `rebuild` (bool): if `True`, rebuild and rewrite the following datasets
  - `samples` ([str]): three modes: `"random"`, `"boosted"`, or `"all"`
  - `analyzer` (str): either `"word"` or `"char"`; for the CountVectorizer
  - `ngram_range` ((int, int)): tuple containing the lower and upper n-gram bounds for the CountVectorizer
  - `manual_boost` ([str]): use the given list of strings for filtering instead of the built-in wordbanks, or pass `None`
  - `repeats` (int): controls the number of datasets built per sample type (if `rebuild` is `True`)
  - `verbose` (bool): toggles print statements
  - `sample_size` (int): size of the sampled datasets. If set too high, the smaller size will be used
  - `calc_pct` (bool): if `True`, calculate the percentage of explicitly abusive and implicitly abusive words in each sample
  - `decimals` (int): number of decimals to round percentages to
- Return
  - None
- Write
  - `output/pred/pred.{sample_type}{i}` for index `i` and string `sample_type`, both defined in-function
  - `output/stats/percent_abusive/percent.{sample_type}{i}` if `calc_pct` is `True`
  - `output/report/report.{sample_type}{i}`
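The CountVectorizer + SVM pairing at the heart of `fit_data()` can be sketched as a scikit-learn `Pipeline` (a toy example with made-up data, assuming a linear SVM; the real code adds cross-validation, exports, and reports):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for one sampled training set
train = pd.DataFrame({
    "comment_text": ["you are awful", "have a nice day", "what an awful take", "great point"],
    "target": [1, 0, 1, 0],
})

clf = Pipeline([
    ("vec", CountVectorizer(analyzer="word", ngram_range=(1, 3))),  # n-gram features
    ("svm", LinearSVC()),                                           # linear SVM
])
clf.fit(train["comment_text"], train["target"])
print(clf.predict(pd.Series(["have an awful day"])))
```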
Helper function that queues datasets to be trained per sample. It reads `n` sets for the given `sample_type`.
- Params
  - `sample_type` (str): part of the filename, used for reading it into memory
  - `n` (int): number of files per sample
- Return
  - List of DataFrames: ([df])
- Write
  - None
Helper function that checks for previously-computed `y_pred`. If it exists, print it; else, compute it.
- Params
  - `x` (df): data to predict
  - `y` (df): class vector
  - `clf` (sklearn.pipeline.Pipeline): CountVectorizer and SVM models
  - `k` (int): number of folds to be used in cross-validation
  - `sample_type` (str): name of the sample type; used for filename checks + exports
  - `i` (int): index; used for filename checks + exports
  - `verbose` (bool): used to control the verbosity of the import / fit steps
- Return
  - None
- Write
  - `output/pred/pred.{sample_type}{i}.csv` if `y_pred` doesn't already exist for the sample type and index `i`
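The check-then-compute pattern can be sketched like this (a hypothetical helper; the real function also prints results and takes `verbose`):

```python
import os
import pandas as pd
from sklearn.model_selection import cross_val_predict

def cached_y_pred(x, y, clf, k, sample_type, i):
    """Load saved predictions if present; otherwise compute and save them."""
    path = f"output/pred/pred.{sample_type}{i}.csv"
    if os.path.exists(path):
        return pd.read_csv(path)["y_pred"]
    y_pred = cross_val_predict(clf, x, y, cv=k)  # k-fold cross-validated predictions
    os.makedirs(os.path.dirname(path), exist_ok=True)
    pd.DataFrame({"y_pred": y_pred}).to_csv(path, index=False)
    return y_pred
```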
Helper function that checks for previously-computed abusive-content percentages. If they exist, print them; else, compute them.
- Params
  - `data` (df): data to compute percentages for
  - `sample_type` (str): name of the sample type; used for filename checks + exports
  - `i` (int): index; used for filename checks + exports
  - `verbose` (bool): used to control the verbosity of the import / fit steps
- Return
  - None
- Write
  - `output/stats/percent_abusive/percent.{sample_type}{i}.csv` if the percentage doesn't already exist for the sample type and index `i`
Wrapper, called from the real main. Protects inner-scope variables in `fit_data()`.
- Params
  - None
- Return
  - None
- Write
  - None
If "postprocessing" wasn't already a word, it is now. This file contains helper functions that work with data that has already been trained on or processed.
This computes how much of `data` is considered abusive. It uses all three lexicons: manual, Wiegand Base, and Wiegand Extended. Returns a DataFrame with a column of lexicon names and the calculated percentages.
- Params
  - `data` (df): DataFrame to calculate the abusive contents of
- Return
  - DataFrame of results: (df)
- Write
  - None
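One way to sketch the per-lexicon calculation (a hypothetical helper; the real function uses the three lexicons loaded by `kaggle_build.py`):

```python
import pandas as pd

def percent_abusive_sketch(data: pd.DataFrame, lexicons: dict) -> pd.DataFrame:
    """Percentage of tokens in comment_text that appear in each lexicon."""
    tokens = data["comment_text"].str.lower().str.split().explode()
    return pd.DataFrame(
        {"lexicon": name, "percent": 100 * tokens.isin(words).mean()}
        for name, words in lexicons.items()
    )
```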
TODO: Utilize Python's `multiprocessing` library for parallel boosts. I got it to work, but it hung when joining jobs due to the Queue object.
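If the hang came from hand-managed Queues, a `multiprocessing.Pool` sketch that lets `map()` handle the joins might be an alternative (untested against this codebase; `boost_one` is a hypothetical per-wordbank worker):

```python
from multiprocessing import Pool

def boost_one(wordbank):
    """Hypothetical worker: boost the training data on a single wordbank."""
    ...  # e.g. filter rows containing any word in `wordbank`

if __name__ == "__main__":
    wordbanks = [["trump"], ["example", "topic"]]  # placeholder inputs
    with Pool() as pool:
        results = pool.map(boost_one, wordbanks)   # Pool handles joining the jobs
```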
This simulates 5-fold cross-validation on the sampled datasets (9 total), then calculates the percentage of out-of-vocabulary (OOV) words per fold.
- Params
  - `k` (int): number of folds to use for cross-validation
  - `verbose` (bool): toggles the verbosity of the function
- Return
  - None
- Write
  - `stats/oov/oov.{sample_type}{i}` for index `i` and string `sample_type`, both defined in-function
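The per-fold OOV percentage itself reduces to a set difference (a sketch, assuming vocabularies are plain sets of strings):

```python
def oov_percent(train_vocab: set, test_vocab: set) -> float:
    """Percent of test-fold words never seen in the training fold."""
    if not test_vocab:
        return 0.0
    return 100 * len(test_vocab - train_vocab) / len(test_vocab)
```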
Given a DataFrame, splits it into `k` folds and returns a list of train + test splits.
- Params
  - `data` (df): the data to split
  - `k` (int): number of folds to use for cross-validation
  - `state` (int): controls the random state of `sklearn.model_selection.KFold`
- Return
  - List of lists of DataFrames: ([[train1, test1], [train2, test2]...])
    - each train/test pair is for one fold
- Write
  - None
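With scikit-learn's `KFold`, the described return shape looks roughly like this (the helper name is hypothetical):

```python
import pandas as pd
from sklearn.model_selection import KFold

def kfold_split_sketch(data: pd.DataFrame, k: int = 5, state: int = 42) -> list:
    """Return [[train, test], ...] DataFrame pairs, one pair per fold."""
    kf = KFold(n_splits=k, shuffle=True, random_state=state)
    return [[data.iloc[tr], data.iloc[te]] for tr, te in kf.split(data)]
```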
Given a DataFrame and a lexicon, returns two sets: the words in both the df and the lexicon (used words), and the words in the lexicon but not the df (unused words).
- Params
  - `df` (df): the DataFrame to compare against the lexicon
  - `lex` ([str]): the lexicon wordlist
- Return
  - Set of used words ({})
  - Set of unused words ({})
- Write
  - None
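Both sets fall out of basic set operations (a sketch; assumes the data's vocabulary has already been materialized as a set of words):

```python
def used_unused_sketch(df_words: set, lex: list) -> tuple:
    """Split a lexicon into (used, unused) relative to the words in the data."""
    lexicon = set(lex)
    return lexicon & df_words, lexicon - df_words
```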
Given a DataFrame of the correct format, creates a set of the words used in its `comment_text` feature.
- Params
  - `data` (df): the DataFrame to process
- Return
  - Set of strings ({str})
- Write
  - None
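A pandas one-liner covers the idea (a sketch assuming whitespace tokenization; the real function may tokenize differently):

```python
import pandas as pd

def comment_vocabulary(data: pd.DataFrame) -> set:
    """Set of lowercase whitespace-delimited tokens from comment_text."""
    return set(data["comment_text"].str.lower().str.split().explode().dropna())
```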
Thanks for reading, and happy training!