We participated in the Text Classification Competition for Sarcasm Detection in Tweets. Our team beat the baseline (0.723) and achieved an F1 score of 0.7542963307013469.
The code can be used to preprocess the given dataset (train.jsonl and test.jsonl) and train a BERT model. The usage of our solution is described in the "Source Code Walkthrough" section.
- Suraj Bisht [email protected] (Team Leader)
- Sithira Serasinghe [email protected]
- Santosh Kore [email protected]
- Anaconda 1.9.12
- Python 3.8.3
- PyTorch 1.7.0
- Transformers 3.0.0
Make sure to run this program in an Anaconda environment (i.e. the Conda console). This has been tested on *nix and Windows systems.
1. Libs
pip install tweet-preprocessor textblob wordsegment contractions tqdm
2. Download TextBlob corpora
python -m textblob.download_corpora
3. Install PyTorch & Transformers
conda install pytorch torchvision torchaudio cpuonly -c pytorch transformers
If it complains that the transformers library is not installed, try this command:
conda install -c conda-forge transformers
First, cd src and run the following commands.
tl;dr
python clean.py && python train.py && python eval.py
This will preprocess the data, train the model, and generate the answer.txt file, which can then be submitted to the grader for evaluation.
Description of each step:
- Clean the dataset
python clean.py
- Train the model
python train.py
Once the model is trained it will create an input/model.bin file, which saves our model to a binary file. We can later load this file (in the evaluation step) to make predictions.
- Make predictions & create the answer.txt file
python eval.py
The answer.txt file is created in the output folder.
The following section describes each of these steps in-depth.
We perform data cleaning steps for both train.jsonl and test.jsonl so that they are normalized for training and evaluation purposes. The algorithm for cleaning the data is as follows:
For each tweet:
- Append all context to become one sentence and prefix it to the response.
- Fix the tweet if it has special characters to support better expansion of contractions.
- Remove all digits from the tweets.
- Remove <URL> and @USER as they do not add any value.
- Convert all tweets to lowercase.
- Use NLTK's tweet processor to remove emojis, URLs, smileys, and '@' mentions
- Do hashtag segmentation to expand any hashtags to words.
- Expand contracted words.
- Remove all special symbols.
- Perform lemmatization on the words.
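The steps above can be approximated with the standard library alone. The following is a minimal sketch, not the contents of clean.py: the function name clean_tweet and the tiny CONTRACTIONS map are hypothetical stand-ins, and hashtag segmentation and lemmatization are left out because they need the third-party packages installed earlier.

```python
import re

# Tiny illustrative contraction map (an assumption for this sketch); the
# project installs the `contractions` package, which covers far more cases.
CONTRACTIONS = {"can't": "cannot", "don't": "do not", "it's": "it is", "i'm": "i am"}


def clean_tweet(response, context):
    """Apply a simplified version of the cleaning steps to one record."""
    # Join all context turns into one sentence and prefix it to the response.
    text = " ".join(context) + " " + response
    # Normalize curly apostrophes so contractions can be matched later.
    text = text.replace("\u2019", "'")
    # Remove digits and the <URL> / @USER placeholders.
    text = re.sub(r"\d+", "", text)
    text = text.replace("<URL>", " ").replace("@USER", " ")
    # Convert to lowercase.
    text = text.lower()
    # Expand contractions word by word.
    text = " ".join(CONTRACTIONS.get(word, word) for word in text.split())
    # Drop remaining special symbols and collapse repeated whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```

For example, clean_tweet("@USER I can't wait 4 this <URL>", ["Great news!", "@USER really?"]) yields a single lowercase sentence with the placeholders, digit, and punctuation removed.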
A model can be built and trained with the provided parameters by issuing a python train.py command. The following steps are run in sequence during the model training.
- Read in the train.csv from the prior step.
- The training dataset (5000 records) is split into training and validation sets in an 80:20 ratio.
- Feed in the parameters to the model.
- Perform model training for the given number of epochs.
- Calculate the validation accuracy for each run and save the best model as a bin file.
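The split and the keep-the-best-model logic can be sketched with the standard library alone. Both helper names are hypothetical, and train.py itself may rely on a library splitter instead; valid_fraction=0.2 mirrors the 80:20 ratio stated above.

```python
import random


def split_train_valid(records, valid_fraction=0.2, seed=42):
    """Shuffle records and split them into training and validation lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    # The first n_valid shuffled items become the validation set.
    return shuffled[n_valid:], shuffled[:n_valid]


def best_epoch(accuracies):
    """Return (index, accuracy) of the epoch whose model would be kept."""
    best_index, best_accuracy = -1, float("-inf")
    for index, accuracy in enumerate(accuracies):
        # Only a strictly better accuracy replaces the saved model.
        if accuracy > best_accuracy:
            best_index, best_accuracy = index, accuracy
    return best_index, best_accuracy
```

With 5000 records and the default fraction, this gives 4000 training and 1000 validation records. In a real training loop, the point where best_epoch updates its running maximum is where the model would be written to input/model.bin.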
The following parameters could be tuned to achieve a better result.
src/config.py
DEVICE = "cpu" # If you have CUDA GPU, change this to 'cuda'
MAX_LEN = 256 # Max length of the tokens in a given document
EPOCHS = 5 # Number of epochs to train the model for
BERT_PATH = "bert-base-uncased" # Our base BERT model. Can plug in different models such as bert-large-uncased
TRAIN_BATCH_SIZE = 8 # Size of the training dataset batch
VALID_BATCH_SIZE = 4 # Size of the validation dataset batch
src/train.py
L25: test_size=0.15 # Size of the validation dataset
L69: optimizer = AdamW(optimizer_parameters, lr=2e-5) # A different optimizer can be plugged in or a different learning rate can be set here
L71: num_warmup_steps=2 # No. of warmup steps that need to run before the actual training step
src/model.py
L13: nn.Dropout(0.1) # Configure the dropout value
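To see how num_warmup_steps interacts with the batch size and epoch count, here is a small sketch of a learning-rate multiplier. It assumes a linear warmup followed by a linear decay to zero, which the text above does not confirm; both function names are hypothetical and the numbers in the usage note are only illustrative.

```python
import math


def total_training_steps(num_examples, batch_size, epochs):
    """Number of optimizer steps when every batch triggers one step."""
    return math.ceil(num_examples / batch_size) * epochs


def lr_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given step.

    Assumed schedule: ramp linearly from 0 to 1 during warmup, then
    decay linearly back to 0 by the last training step.
    """
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
```

For instance, 4000 training records with TRAIN_BATCH_SIZE = 8 and EPOCHS = 5 give 2500 steps, and with num_warmup_steps=2 the full learning rate of 2e-5 is reached at step 2. A larger warmup value would delay that point.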
A high-level view of the sequence of operations run during the evaluation step is as follows.
- Load the test.csv file from the data transformation step.
- Load the best performing model from the training step.
- Perform predictions for each test tweet (1800 total records).
- Generate answer.txt in the "output" folder; this file will be submitted to the grader.
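One way the last two steps could look is sketched below. It assumes the model emits one probability per tweet, that a 0.5 threshold separates the classes, and that each line of answer.txt has the form id,LABEL. The function names, label strings, ids, and line format are all illustrative assumptions and are not taken from eval.py.

```python
def format_answers(tweet_ids, probabilities, threshold=0.5):
    """Turn per-tweet probabilities into 'id,LABEL' lines (assumed format)."""
    lines = []
    for tweet_id, probability in zip(tweet_ids, probabilities):
        # Label strings are assumptions made for this sketch.
        label = "SARCASM" if probability >= threshold else "NOT_SARCASM"
        lines.append(f"{tweet_id},{label}")
    return lines


def write_answers(lines, path):
    """Write one prediction per line to the given file path."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
```

Keeping the formatting separate from the file writing makes it easy to check the predictions before anything is written to the output folder.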
- Suraj Bisht [email protected] (Team Leader)
- Improve the initial coding workflow (Google Colab, Local setup etc.).
- Investigating Sequential model, Logistic Regression, SVC etc.
- Investigating the bert-base-uncased model.
- Investigating data preprocessing options.
- Hyperparameter tuning to improve the current model.
- Sithira Serasinghe [email protected]
- Setting up the initial workflow.
- Investigating LSTM/BiDirectional LSTM, Random Forest etc.
- Investigating various data preprocessing options.
- Investigating the bert-base-uncased model.
- Hyperparameter tuning to improve the current model.
- Santosh Kore [email protected]
- Improve the initial coding workflow (Google Colab, Local setup etc.).
- Investigating Sequential models, SimpleRNN, CNN etc.
- Investigating the bert-large-uncased model.
- Investigating data preprocessing options.
- Hyperparameter tuning to improve the current model.
- Cleaning data further with different methods.
- Optimizing BERT model parameters and trying different BERT models (e.g. RoBERTa).
- Re-using some of the tried models and optimizing them to beat the current F1 score.
- Extracting emojis to add more meaning to the sentiments of the tweets.
- Data augmentation steps to prevent overfitting.
- Trying an ensemble of models (e.g. BERT + VLaD etc.).
- Running our model on different test data and comparing results against the state of the art.
The usage of the BERT model is inspired by https://github.com/abhishekkrthakur/bert-sentiment