- Kaushik Pillalamarri
- Tanisha Khurana
- Vikram Pande
Integrating Semantic, Syntactic, and Contextual Elements for Humor Classification
In this work, we formulate humor recognition as a binary classification task that distinguishes humorous from non-humorous instances. We capture syntactic structure in two ways: lexicons that count sentiment-bearing words in a sentence, and Statistics of Structural Elements (SSE), which summarize counts of noun phrases, word phrases, and similar units. We capture semantic structure with Word2Vec embeddings, analyzing incongruity, ambiguity, and phonetic style within sentences. Contextual information comes from ColBERT embeddings. For each latent structure, we design a set of features to capture the potential indicators of humor.
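As one illustration of a semantic feature, incongruity can be approximated from word vectors. The sketch below is not the code in semantic_features.py. It assumes incongruity is summarized by the largest and smallest pairwise cosine distance between the words of a sentence. The names `incongruity_features` and `toy_vectors`, and the three-dimensional vectors, are hypothetical stand-ins for trained Word2Vec embeddings.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def incongruity_features(tokens, vectors):
    """Return (disconnection, repetition) for one tokenized sentence.

    disconnection: largest cosine distance between any two known words.
    repetition:    smallest cosine distance between any two known words.
    Both definitions are assumptions made for this sketch.
    """
    # Keep each in-vocabulary word once, preserving order.
    known = [t for t in dict.fromkeys(tokens) if t in vectors]
    distances = [
        1.0 - cosine_similarity(vectors[a], vectors[b])
        for a, b in combinations(known, 2)
    ]
    if not distances:
        # Fewer than two known words: no pair to compare.
        return 0.0, 0.0
    return max(distances), min(distances)


# Hypothetical toy vectors; real Word2Vec vectors have far more dimensions.
toy_vectors = {
    "bank": [1.0, 0.0, 0.0],
    "money": [0.9, 0.1, 0.0],
    "river": [0.0, 1.0, 0.0],
}

disconnection, repetition = incongruity_features(
    ["the", "bank", "money", "river"], toy_vectors
)
print(disconnection, repetition)
```

With real embeddings, `vectors` could be any mapping from word to vector, for example the keyed vectors of a trained gensim Word2Vec model.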
Set up the following environment
- python
- tensorflow
- scikit-learn
- pandas
- numpy
- NLTK
- shap
- seaborn
- matplotlib
- graphviz
- pickle
- transformers
- regex
- nrclex
- tqdm
- scipy
- gensim
conda create -n lolgorithm python
conda activate lolgorithm
pip install tensorflow
pip install scikit-learn
pip install pandas
pip install numpy
pip install --user -U nltk
pip install shap
pip install seaborn
pip install -U matplotlib
pip install graphviz
pip install transformers
pip install NRCLex
pip install tqdm
pip install scipy
pip install --upgrade gensim
Then, in Python, download the required NLTK resources:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('cmudict')
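Once the installation finishes, a quick check can confirm that every package is importable from the active environment. This is a minimal sketch, not part of the repository: the helper `find_missing` is a hypothetical name, and the list uses import names, which sometimes differ from the pip package name (scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import names for the packages listed above (assumed; adjust if your
# installation exposes a different top-level module name).
REQUIRED_MODULES = [
    "tensorflow", "sklearn", "pandas", "numpy", "nltk", "shap",
    "seaborn", "matplotlib", "graphviz", "transformers", "regex",
    "nrclex", "tqdm", "scipy", "gensim",
]


def find_missing(modules):
    """Return the module names that cannot be located without importing them."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```

`importlib.util.find_spec` only locates a module, so the check stays fast even for heavy packages such as tensorflow.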
- Code: Contains modular Python files.
  - data_NRC.py: contains the functions to generate NRCLex features (part of the syntactic features).
  - SSE.py: contains the functions to generate Statistics of Structural Elements features (part of the syntactic features).
  - semantic_features.py: contains the functions to generate semantic features.
  - joke_scraper.py: script to scrape jokes as unseen data.
  - test_features.py: script to generate features for unseen data.
  - make_embed.py: script to generate combined features (NRCLex, SSE, semantic).
  - baseline_model.py: baseline Decision Tree model for feature engineering.
  - Colbert_training.py: script to train ColBERT with contextual embeddings only.
  - Colbert_w_training.py: script to train ColBERT with contextual embeddings and hand-crafted features.
- experiments: Contains notebooks of the experiments performed.
  - Colbert_dataset.ipynb: experiments with the ColBERT dataset to get contextual embeddings.
  - baseline_book.ipynb: Decision Tree analysis on NRCLex features.
  - feature_engg_notebook.ipynb: feature engineering with Decision Trees and Gradient Boosting with SHAP for all four feature sets (NRCLex, syntactic, semantic, combined).
  - final.ipynb: notebook for inference on unseen data.
  - semantic-word2vec-expts_v2.ipynb: semantic feature experiments, v2.
  - semantic-word2vec-expts.ipynb: semantic feature experiments.
  - sse_book.ipynb: baseline Decision Tree model on Statistics of Structural Elements (SSE) features.
  - Colbert_train.ipynb: experiments to train ColBERT with contextual embeddings and hand-crafted features.
- dataset: Contains data files.
  - combined-features.csv: combined NRCLex, SSE, and semantic features, with shape (200000, 33).
  - dataset.csv: ColBERT dataset containing jokes and labels.
  - nrclex-features.csv: NRCLex features.
  - syntactic-features.csv: Statistics of Structural Elements (SSE) features.
  - semantic-features.csv: semantic features (incongruity, ambiguity, phonetic style).
  - sample_input.csv: sample input in the format the ColBERT model accepts.
  - reddit_test_features.csv: unseen scraped dataset.
- figures: Contains figures and graphs.
  - Figures of Decision Trees and SHAP analysis from feature engineering on NRCLex, syntactic, semantic, and combined features, used to identify the features most important to the decision.
- models: Contains saved models.
  - Decision Tree and Gradient Boosting models from feature engineering on NRCLex, syntactic, semantic, and combined features.
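The combined feature file joins the three per-family feature tables column by column. The sketch below shows one way such a join could work using only the standard library; it is not the code in make_embed.py. It assumes the three CSV files are row-aligned (row i describes the same sentence in each file) and share no column names. `read_rows` and `combine_feature_tables` are hypothetical helper names.

```python
import csv


def read_rows(path):
    """Read a CSV file into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def combine_feature_tables(tables):
    """Join row-aligned tables column-wise.

    Raises ValueError if the tables differ in length or share a column
    name, since either would silently corrupt the combined features.
    """
    lengths = {len(table) for table in tables}
    if len(lengths) > 1:
        raise ValueError(f"tables have different row counts: {sorted(lengths)}")
    combined = []
    for rows in zip(*tables):
        merged = {}
        for row in rows:
            overlap = merged.keys() & row.keys()
            if overlap:
                raise ValueError(f"duplicate columns: {sorted(overlap)}")
            merged.update(row)
        combined.append(merged)
    return combined


# In-memory demonstration with made-up values and column names.
nrc_demo = [{"joy": "2"}, {"joy": "0"}]
sse_demo = [{"noun_count": "3"}, {"noun_count": "1"}]
sem_demo = [{"disconnection": "0.9"}, {"disconnection": "0.4"}]
demo = combine_feature_tables([nrc_demo, sse_demo, sem_demo])
print(len(demo), len(demo[0]))
```

On the real files, the call would look like `combine_feature_tables([read_rows("nrclex-features.csv"), read_rows("syntactic-features.csv"), read_rows("semantic-features.csv")])`, provided the assumptions above hold. With pandas, `pd.concat([...], axis=1)` performs the same column-wise join.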
- Download or clone the repository. The directory contains all the necessary files.
- Copy the data into the same directory.
- Run bert_combined_feats.ipynb to generate predictions on unseen test data using BERT combined with hand-crafted features.
- Run colbert_feature_pred.ipynb to generate predictions on unseen test data using ColBERT combined with hand-crafted features.