Introduction & Research Goals

The goal of this project is to improve abusive language detection with a focus on implicit abuse. Python was used for data preprocessing, dataset builds, and SVM training. R was used to verify dataset properties (e.g. length and headers). The paper was written in and compiled from LaTeX.

This repository contains all the resources you will need to replicate results. Boosting data can take a long time, depending on the lexicon used; I recommend a computer with at least 4 cores / 4 threads and 16 GB of memory. This is a recommendation, not a requirement.

The publication can be viewed on the ACL Anthology website or in the Paper/ directory.

The paper was presented at the Fourth Workshop on Online Abuse and Harms (WOAH), co-located with the 2020 conference on Empirical Methods in Natural Language Processing (EMNLP 2020).

Running the script

  1. Update code

    • git pull
    • git submodule update --init
    • git pull --recurse-submodules
  2. Install dependencies

    • Vanilla Python: pip3 install -r requirements.txt
    • Conda:
      1. Select the desired conda env before installing (see nlpGPU_env.yml for my NLP-focused conda env)
      2. conda install --file requirements.txt
  3. Clone

    • git clone
  4. Configure

    • See the main() in
      | Variable | Data Type | Default Value | Possible Values | Purpose |
      |---|---|---|---|---|
      | samples | str | "all" | "random", "boosted_topic", "boosted_wordbank", "all" | Lets the user choose which sample types to train on. |
      | analyzer | str | "word" | "char", "word" | Selects the n-gram analyzer. |
      | ngram_range | (int, int) | (1,3) | {i : i ∈ Z+} | Couple (2-tuple) of lower and upper n-gram boundaries. |
      | manual_boost | [str] | ["trump"] | a list of strings OR None | If not None, override predefined wordbanks when boosting. |
      | rebuild | bool | False | True, False | If True, resample + rebuild training data and lexicons. The former is computationally expensive. |
      | per_sample | int | 3 | {i : i ∈ Z+, i > 0} | Set the number of each sample type to build and train. Ignored if rebuild is False. |
      | sample_size | int | 20000 | {i : i ∈ Z+, i > 0} | Set the size of each dataset when building. If any set has <2000 examples, the others will be trimmed to match it. Ignored if rebuild is False. |
      | verbose | bool | True | True, False | Controls verbose print statements. Passed to other functions like a React prop. |
      | calc_pct | bool | True | True, False | If True, calculate the percentage of abusive words in each sample. Uses manual, Wiegand Base, and Wiegand Extended lexicons. Very computationally expensive. |
  5. Train

    • Once you've configured the script, simply run it. No user input is required.
    • python3
  6. Wait patiently for results

    • Percentage calculation (see calc_pct above) is time-consuming due to regex compilation and the boosting step
      • WIP: parallel / multithreaded calculation
    • Rebuilt datasets can be found in data/
    • Class predictions can be found in output/pred/
    • Classification reports can be found in output/report/
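
The options from the Configure step can be gathered into one dictionary and sanity-checked before starting a long run. The following is a minimal sketch under stated assumptions, not code from this repository: DEFAULTS and validate_config() are hypothetical names, and the checks simply mirror the "Possible Values" column of the configuration table.

```python
# Hypothetical helper: mirrors the defaults and allowed values listed in the
# configuration table. It is not part of the repository.
DEFAULTS = {
    "samples": "all",
    "analyzer": "word",
    "ngram_range": (1, 3),
    "manual_boost": ["trump"],
    "rebuild": False,
    "per_sample": 3,
    "sample_size": 20000,
    "verbose": True,
    "calc_pct": True,
}

ALLOWED_SAMPLES = {"random", "boosted_topic", "boosted_wordbank", "all"}
ALLOWED_ANALYZERS = {"char", "word"}


def validate_config(overrides=None):
    """Merge overrides into DEFAULTS and raise ValueError on bad values."""
    config = {**DEFAULTS, **(overrides or {})}
    unknown = set(config) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    if config["samples"] not in ALLOWED_SAMPLES:
        raise ValueError(f"samples must be one of {sorted(ALLOWED_SAMPLES)}")
    if config["analyzer"] not in ALLOWED_ANALYZERS:
        raise ValueError("analyzer must be 'char' or 'word'")
    low, high = config["ngram_range"]
    if not (isinstance(low, int) and isinstance(high, int) and 1 <= low <= high):
        raise ValueError("ngram_range must be (low, high) with 1 <= low <= high")
    for key in ("per_sample", "sample_size"):
        if not isinstance(config[key], int) or config[key] < 1:
            raise ValueError(f"{key} must be a positive integer")
    if config["manual_boost"] is not None and not all(
        isinstance(word, str) for word in config["manual_boost"]
    ):
        raise ValueError("manual_boost must be a list of strings or None")
    return config
```

Calling validate_config({"analyzer": "char"}) returns the full set of options with only the analyzer changed, so a typo in an option name or value fails fast instead of after an expensive rebuild.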


Throughout the code I refer to our manually-tagged lexicon, based on Wiegand's base lexicon, as either manualLexicon or rds, the latter being the initials of the contributors' last names (Dante Razo, DD, Leah Schaede).


I fit what I could into this repo. The untouched train.csv set is available upon request or here.


This file exports data for later use. The repo includes prebuilt data, so it's not necessary to run this script.

To run, set the rebuild flag in KaggleSVM/ to True, then run the script.


Builds and exports sampled training sets from the large train.CSV dataset.

  • Params
    • sample_type (str): choose which sample types to build. "random", "boosted", or "all"
    • boost_topic ([str]): list of strings to boost on
    • repeats (int): number of datasets to build per sample type
    • sample_size (int): size of sampled datasets. If set too high, the smaller size will be used
    • verbose (bool): verbosity flag. controls logging level
  • Return
    • None
  • Write
    • None


Quick function that imports the training data and calls kaggle_preprocessing.read_data() to format it.

  • Params
    • None
  • Return
    • Preprocessed full training dataset: (df)
  • Write
    • None


Call kaggle_preprocessing.sample_data() to shuffle + cut train.CSV down to desired sample size.

  • Params
    • data (df): full training data to sample from
    • sample_size (int): upper bound for cutting result down to size
    • repeats (int): number of datasets to build per sample type
  • Return
    • None
  • Write
    • data/train.random{i}.CSV for index i


Call kaggle_preprocessing.boost_data() to boost on built-in wordbank or user-defined wordbank (passed as param manual_boost).

  • Params
    • data (df): full training data to sample from
    • manual_boost ([str]): list of strings to boost on
    • sample_size (int): upper bound for cutting result down to size
    • repeats (int): number of datasets to build per sample type
  • Return
    • None
  • Write
    • data/train.boosted{i}.CSV for index i


Wrapper function for importing lexicons. Reformats them accordingly as well; this could be considered processing but I left it in because it also exports them.

  • Params
    • None
  • Return
    • None
  • Write
    • data/lexicon_wiegand/
      • lexicon.wiegand.base.CSV
      • lexicon.wiegand.expanded.CSV
      • lexicon.wiegand.base.explicit.CSV
      • lexicon.wiegand.expanded.explicit.CSV
    • data/lexicon_manual/
      • lexicon.manual.all.explicit.CSV


Another wrapper function. This calls helper functions to import and process the manually-tagged lexicons. Finally, it combines them into one DataFrame and exports it.

  • Params
    • None
  • Return
    • None
  • Write
    • data/manual_lexicon/lexicon.manual.all

Strips unnecessary columns from my manually-tagged lexicon.

  • Params
    • filename (str): the name of the csv to be read
  • Return
    • Processed lexicon: (df)
  • Write
    • None

Strips unnecessary columns from DD's manually-tagged lexicon (.TSV), then converts text classes to ints.

  • Params
    • filename (str): the name of the TSV file to be read
  • Return
    • Processed lexicon: (df)
  • Write
    • None

Strips unnecessary columns from Schaede's manually-tagged lexicon (.CSV), then converts text classes to ints.

  • Params
    • filename (str): the name of the csv to be read
  • Return
    • Processed lexicon: (df)
  • Write
    • None


Writes the given DataFrame to storage.

  • Params
    • sample_name (str): the name of the sample; used to construct filename
    • data (df): the DataFrame to export
    • extension (str): the extension to save the df as. optional; defaults to .CSV
  • Return
    • None
  • Write
    • data/train.{sample_name}{i}{extension} for index i


A more generalized version of export_data(). Doesn't prepend "train" to the filename and allows different filepaths.

  • Params
    • data (df): the DataFrame to export
    • sample (str): the type of sample + part of filename; can be blank
    • i (int): the index + part of filename; can be blank
    • path (str): the path to save the file; leave blank to save to CWD
    • prefix (str): the prefix of the filename (e.g. "topic", "report", etc.); can be blank
    • index (bool): if TRUE, write row names; default TRUE
  • Return
    • None
  • Write
    • {path}/{prefix}.{sample}{i}.CSV
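
The filename pattern above can be sketched as a small path builder. This is an illustration only: build_export_path() is a hypothetical helper, and the way blank parts are skipped is an assumption, since the parameter list only says that they may be blank.

```python
import os


def build_export_path(sample="", i="", path="", prefix=""):
    """Assemble '{path}/{prefix}.{sample}{i}.CSV', skipping blank parts.

    Hypothetical helper; how the repository handles blank parts is an
    assumption made for this sketch.
    """
    stem = ".".join(part for part in (prefix, f"{sample}{i}") if part)
    filename = f"{stem}.CSV"
    # os.path.join returns just the filename when path is empty (CWD).
    return os.path.join(path, filename)
```

For example, build_export_path("random", 1, "output/report", "report") yields a path ending in report.random1.CSV inside output/report.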

This file reformats and cleans the source data into something the SVM can use.


Reads the given dataset line by line. Some comments contain tabs or commas, which can cause issues depending on the file delimiter. Removes entries with missing values (only one entry is missing a score).

  • Params
    • dataset (str): filename of dataset to import
    • verbose (bool): toggles print statements; default TRUE
  • Return
    • Clean delimited data: (df)
  • Write
    • None
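
To illustrate why embedded tabs and commas matter, here is a dependency-free sketch that uses Python's csv module in place of a DataFrame. It is not the repository's read_data(): read_clean_rows() is a hypothetical name, the sample rows are invented, and treating empty strings as missing values is an assumption.

```python
import csv
import io


def read_clean_rows(handle, delimiter=","):
    """Parse delimited text and drop rows that have any missing value."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    clean = []
    for row in reader:
        # A None key means the line had more fields than the header.
        if None in row:
            continue
        # DictReader fills short rows with None; empty strings also count
        # as missing in this sketch.
        if any(value is None or value.strip() == "" for value in row.values()):
            continue
        clean.append(row)
    return clean


# Invented sample data: the quoted comment contains a comma and a tab.
sample = io.StringIO(
    'comment_text,score\n'
    '"well, this\tis fine",0.1\n'
    'no score here,\n'
)
rows = read_clean_rows(sample)
print(rows)
```

Because the first comment is quoted, the parser keeps its comma and tab inside one field, while the second row is dropped for lacking a score.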


Reads kaggle_toxic training file and cleans it up + applies the correct header names.

  • Params
    • None
  • Return
    • None
  • Write
    • data/src_new/kaggle-toxic_train-clean.CSV


Given a DataFrame, shuffle it and cut it down to the given size.

  • Params
    • data (df): data to sample
    • size (int): sample size
  • Return
    • Sampled data: (df)
  • Write
    • None
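
The shuffle-and-cut step can be sketched without any dependencies, with a plain list standing in for the DataFrame. sample_rows() and its seed parameter are hypothetical and only illustrate the idea.

```python
import random


def sample_rows(rows, size, seed=None):
    """Return a shuffled copy of rows, truncated to at most `size` items."""
    rng = random.Random(seed)
    shuffled = list(rows)  # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    return shuffled[:size]
```

If size exceeds the number of rows, the slice simply returns everything, which matches the "upper bound" wording used for sample_size elsewhere in this document. With pandas, data.sample(frac=1).head(size) would be a comparable one-liner.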


Given data, return only rows containing predefined abusive words. Or, if given a wordbank, return rows containing any of those words instead.

  • Params
    • data (df): DataFrame to boost
    • data_name (str): filename for print statements; ignored if verbose=FALSE
    • verbose (bool): controls verbosity; default TRUE
    • manual_boost ([str], or None): user-defined wordbank to boost on; default None
  • Return
    • Boosted data: (df)
  • Write
    • None
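
Boosting, as described above, amounts to keeping only the rows that mention a wordbank entry. A minimal sketch follows, assuming whole-word, case-insensitive matching; boost_rows() is a hypothetical name, and the matching rules used by the real boost_data() may differ.

```python
import re


def boost_rows(comments, wordbank):
    """Keep only comments containing at least one wordbank entry."""
    if not wordbank:
        return []
    # One alternation pattern; \b limits matches to whole words and
    # re.escape guards against regex metacharacters in the wordbank.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(word) for word in wordbank) + r")\b",
        re.IGNORECASE,
    )
    return [text for text in comments if pattern.search(text)]
```

Compiling a single pattern once per wordbank, rather than once per word or per comment, keeps the cost of regex compilation down on large datasets.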

This file trains n SVMs for all three sample types, with n being the repeats flag.


This is where the magic happens. Fits CountVectorizer, trains SVM, and prints + exports results per dataset.

  • Params
    • rebuild (bool): if TRUE, rebuild + rewrite the following datasets:
    • samples ([str]): three modes: "random", "boosted", or "all"
    • analyzer (str): either "word" or "char". for CountVectorizer
    • ngram_range ((int,int)): tuple containing lower and upper ngram bounds for CountVectorizer
    • manual_boost ([str]): use given list of strings for filtering instead of built-in wordbanks. Or pass None
    • repeats (int): controls the number of datasets built per sample type (if rebuild is TRUE)
    • verbose (bool): toggles print statements
    • sample_size (int): size of sampled datasets. If set too high, the smaller size will be used
    • calc_pct (bool): if TRUE, calculate percentage of explicitly abusive and implicitly abusive words in each sample
    • decimals (int): number of decimals to round percentages to
  • Return
    • None
  • Write
    • output/pred/pred.{sample_type}{i} for index i and string sample_type, both defined in-function
    • output/stats/percent_abusive/percent.{sample_type}{i} if calc_pct is TRUE
    • output/report/report.{sample_type}{i}
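
The CountVectorizer-plus-SVM flow described above can be sketched with scikit-learn as follows. This is not the repository's fit_data(): the toy texts and labels are invented, LinearSVC is an assumed choice of SVM class, and cv=2 is used only because the toy set is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data invented for this sketch (1 = abusive, 0 = not abusive).
texts = [
    "you are an idiot",
    "what a lovely day",
    "shut up you fool",
    "thanks for the help",
    "nobody likes you, loser",
    "see you at the meeting",
    "you are a complete moron",
    "the report looks good",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# analyzer and ngram_range correspond to the options of the same name.
clf = Pipeline([
    ("vect", CountVectorizer(analyzer="word", ngram_range=(1, 3))),
    ("svm", LinearSVC()),
])

# Each prediction comes from a model that never saw that example.
y_pred = cross_val_predict(clf, texts, labels, cv=2)
report = classification_report(labels, y_pred, zero_division=0)
print(report)
```

Putting the vectorizer inside the pipeline means the vocabulary is refit on each training fold, so no test-fold words leak into training.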


Helper function that queues datasets to be trained per sample. It reads n sets for the given sample_type

  • Params
    • sample_type (str): part of filename, used for reading it into memory
    • n (int): number of files per sample
  • Return
    • List of DataFrames: ([df])
  • Write
    • None


Helper function that checks for previously-computed y_pred. If it exists, print it; else, compute it.

  • Params
    • x (df): data to predict
    • y (df): class vector
    • clf (sklearn.pipeline.Pipeline): CountVectorizer and SVM models
    • k (int): number of folds to be used in cross-validation
    • sample_type (str): name of sample type; used for filename checks + exports
    • i (int): index; used for filename checks + exports
    • verbose (bool): used to control verbosity of import / fit steps
  • Return
    • None
  • Write
    • output/pred/pred.{sample_type}{i}.CSV if y_pred doesn't already exist for sample type and index i
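
The check-then-compute pattern used by this helper can be sketched generically. load_or_compute() is a hypothetical name, and storing one value per CSV row is an assumption about the file layout.

```python
import csv
import os


def load_or_compute(path, compute):
    """Return cached values from `path`, or call compute() and cache them."""
    if os.path.exists(path):
        with open(path, newline="") as handle:
            return [row[0] for row in csv.reader(handle) if row]
    # Values are stored and returned as strings so that cached and fresh
    # results have the same type.
    values = [str(value) for value in compute()]
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows([value] for value in values)
    return values
```

Passing the expensive step as a callable means it only runs when no cached file exists for that sample type and index.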


Helper function that checks for previously-computed abusive-content percentages. If it exists, print it; else, compute it.

  • Params
    • data (df): data to compute percentages for
    • sample_type (str): name of sample type; used for filename checks + exports
    • i (int): index; used for filename checks + exports
    • verbose (bool): used to control verbosity of import / fit steps
  • Return
    • None
  • Write
    • output/stats/percent_abusive/percent.{sample_type}{i}.CSV if percentage doesn't already exist for sample type and index i


Wrapper, called from real main. Protects inner-scope variables in fit_data()

  • Params
    • None
  • Return
    • None
  • Write
    • None

If postprocessing wasn't already a word, it is now. This contains helper functions that work with data that has already been trained or processed.


This computes how much of data is considered abusive. Uses all three lexicons: manual, Wiegand Base, and Wiegand Extended. Returns a DataFrame with a column of lexicon names and calculated percentages.

  • Params
    • data (df): DataFrame to calculate abusive contents of
  • Return
    • DataFrame of results: (df)
  • Write
    • None
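
One plausible way to compute such a percentage is the share of tokens that appear in each lexicon. The sketch below makes that assumption explicit; percent_abusive(), its simple tokenizer, and the dictionary return type are all hypothetical and may not match the repository's definition.

```python
import re

# Simple tokenizer assumed for this sketch: lowercase letters and apostrophes.
TOKEN = re.compile(r"[a-z']+")


def percent_abusive(comments, lexicons, decimals=2):
    """Map each lexicon name to the % of tokens found in that lexicon."""
    tokens = [tok for text in comments for tok in TOKEN.findall(text.lower())]
    results = {}
    for name, words in lexicons.items():
        vocab = {word.lower() for word in words}
        hits = sum(1 for tok in tokens if tok in vocab)
        results[name] = round(100 * hits / len(tokens), decimals) if tokens else 0.0
    return results
```

Converting each lexicon to a set first makes every token lookup constant-time, which matters when the lexicon and the sample are both large.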

TODO: Use Python's multiprocessing library for parallel boosts. I got it to work, but it hung when joining jobs because of the Queue object.


This simulates 5-fold cross validation on the sampled datasets (9 total), then calculates the percentage of out-of-vocabulary words per fold.

  • Params
    • k (int): number of folds to use for cross-validation
    • verbose (bool): toggle verbosity of function
  • Return
    • None
  • Write
    • stats/oov/oov.{sample_type}{i} for index i and string sample_type, both defined in-function
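
For a single train/test pair, the out-of-vocabulary percentage can be sketched as below. The definition used here (distinct test words never seen in training, with whitespace tokenization) is an assumption, and oov_percent() is a hypothetical name.

```python
def oov_percent(train_texts, test_texts, decimals=2):
    """Percentage of distinct test words that never appear in training."""
    train_vocab = {word for text in train_texts for word in text.lower().split()}
    test_vocab = {word for text in test_texts for word in text.lower().split()}
    if not test_vocab:
        return 0.0
    unseen = test_vocab - train_vocab
    return round(100 * len(unseen) / len(test_vocab), decimals)
```

Running this once per fold gives one percentage per train/test pair.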


Given a DataFrame, split into k folds and return list of train + test splits

  • Params
    • data (df): the data to split
    • k (int): number of folds to use for cross-validation
    • state (int): controls the random state of sklearn.model_selection.KFold
  • Return
    • List of lists of DataFrames: ([[train1, test1], [train2, test2]...])
      • each train/test pair is for one fold
  • Write
    • None
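
The shape of the return value can be illustrated with a dependency-free stand-in for sklearn.model_selection.KFold. k_fold_splits() is hypothetical, works on plain lists rather than DataFrames, and always shuffles, which KFold only does when asked.

```python
import random


def k_fold_splits(rows, k=5, state=None):
    """Return [[train, test], ...] with one pair per fold."""
    if not 2 <= k <= len(rows):
        raise ValueError("k must be between 2 and the number of rows")
    indices = list(range(len(rows)))
    random.Random(state).shuffle(indices)
    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [indices[start::k] for start in range(k)]
    splits = []
    for held_out in folds:
        test_ids = set(held_out)
        train = [rows[i] for i in indices if i not in test_ids]
        test = [rows[i] for i in held_out]
        splits.append([train, test])
    return splits
```

Every row lands in exactly one test set, and a fixed state reproduces the same folds across runs.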


Given a DataFrame and lexicon, return two sets: the words in both the df and lexicon (used words), and the words in the lexicon but not the df (unused words).

  • Params
    • df (df): the DataFrame to check against the lexicon
    • lex ([str]): the lexicon, as a list of words
  • Return
    • Set of used words ({})
    • Set of unused words ({})
  • Write
    • None
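
The used/unused split is a pair of set operations. A minimal sketch, with a list of comment strings standing in for the DataFrame; split_lexicon() and its whitespace tokenization are assumptions for illustration.

```python
def split_lexicon(comments, lex):
    """Return (used, unused): lexicon words that do / do not occur in comments."""
    vocab = {word for text in comments for word in text.lower().split()}
    lexicon = {word.lower() for word in lex}
    used = lexicon & vocab    # in both the data and the lexicon
    unused = lexicon - vocab  # in the lexicon only
    return used, unused
```

The two returned sets are disjoint and together make up the whole lexicon.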

Given a DataFrame of the correct format, create a set of words used in its comment_text feature.

  • Params
    • data (df): the DataFrame to process
  • Return
    • Set of strings ({str})
  • Write
    • None


Thanks for reading, and happy training!