
Classification of reddit flairs using a dataset generated by scraping Reddit data with PRAW


11-aryan/Reddit-Flair-Classification


Reddit-Flair-Classification

Each Reddit post is tagged for filtering purposes. These tags are called flairs in the Reddit world.

This project provides a comparative analysis of existing Machine Learning and Natural Language Processing techniques for detecting the flair of a Reddit post, using a dataset generated by web-scraping Reddit with PRAW (Python Reddit API Wrapper).

Data analysis was performed on different features, and a pipeline of natural language processing techniques (Count Vectorization and TF-IDF transformation) combined with various machine learning models (Decision Tree, Support Vector Machine (SVM), Logistic Regression, Naive Bayes, Random Forest, and pretrained BERT) was used to study the data and classify the flairs.
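The vectorize-then-classify pipeline described above can be sketched with scikit-learn. The posts and flair labels below are toy examples for illustration, not from the project's dataset; any of the compared classifiers can be swapped into the final step.

```python
# Minimal sketch of the CountVectorizer -> TF-IDF -> classifier pipeline.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Toy posts and flairs, made up for illustration.
posts = [
    "Government announces new economic policy",
    "Team wins the championship final",
    "New budget increases infrastructure spending",
    "Star player scores a hat-trick in the derby",
]
flairs = ["Politics", "Sports", "Politics", "Sports"]

clf = Pipeline([
    ("vect", CountVectorizer()),      # raw text -> token counts
    ("tfidf", TfidfTransformer()),    # token counts -> TF-IDF weights
    ("model", LogisticRegression()),  # any of the compared models fits here
])
clf.fit(posts, flairs)

pred = clf.predict(["Parliament passes the new spending bill"])
print(pred[0])
```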

Link to the web app: https://reddit-flair-classification.onrender.com/
(If the page shows an error, wait for some time and reload the page.)

Requirements:

python 3.7.1
praw==6.5.1
pyqt5==5.7.1
nltk==3.4.5
Flask==1.0.2
numpy==1.16.4
gunicorn==19.9.0
Jinja2==2.11.3
Werkzeug==0.15.6
MarkupSafe==1.1.1
Click==7.0
itsdangerous==1.1.0
scikit-learn==0.22.2.post1

Description of Jupyter Notebooks

Flair_Classification_1_Generating_Data.ipynb : Contains the code to extract data from Reddit with PRAW and create the dataset.

Flair_Classification_2_Training.ipynb: Contains EDA and the comparison of the performances of different models on the data.

Flair_Classification_BERT.ipynb: Uses pretrained BERT from TensorFlow Hub.

Generating the dataset

The data was generated using PRAW (Python Reddit API Wrapper) to extract Reddit posts, followed by text preprocessing. Many flair types exist, but only the 13 most common flairs are used for training to avoid class imbalance. The final dataset has up to 100 samples per flair, for a total of 1120 records.
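A hedged sketch of the scraping step, in the spirit of Flair_Classification_1_Generating_Data.ipynb: the function name, credentials, and subreddit are placeholders, and the imports are deferred inside the function so nothing runs at import time.

```python
def scrape_flaired_posts(client_id, client_secret, user_agent,
                         subreddit_name="some_subreddit", limit=100):
    """Collect (title, body, flair) rows for flaired posts via PRAW.

    All arguments are placeholders; supply real Reddit API credentials
    from https://www.reddit.com/prefs/apps before calling.
    """
    import praw  # deferred so the sketch loads without praw installed

    reddit = praw.Reddit(client_id=client_id,
                         client_secret=client_secret,
                         user_agent=user_agent)
    rows = []
    for post in reddit.subreddit(subreddit_name).hot(limit=limit):
        if post.link_flair_text:  # keep only posts that carry a flair
            rows.append({"title": post.title,
                         "body": post.selftext,
                         "flair": post.link_flair_text})
    return rows
```

The returned list of dicts can then be written to CSV and run through the text-preprocessing step before training.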

Here's a sample of the data:

Training Results

The following accuracy was obtained with different models:

  • Logistic Regression: 0.5090
  • Support Vector Machine: 0.5060
  • Naive Bayes: 0.4818
  • Decision Tree: 0.4909
  • Random Forest: 0.5242

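The comparison above can be reproduced in outline: the same TF-IDF front end with each classifier swapped in. The corpus below is a tiny synthetic stand-in for the scraped Reddit dataset, and scores are training accuracy only, so the numbers will not match the table.

```python
# Sketch of the model comparison: one TF-IDF front end, several classifiers.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic corpus standing in for the scraped Reddit posts.
texts = [
    "election results announced today",
    "cricket match ends in a draw",
    "new tax policy debated in parliament",
    "football league kicks off this weekend",
    "minister resigns over budget row",
    "tennis star wins grand slam title",
]
labels = ["Politics", "Sports", "Politics", "Sports", "Politics", "Sports"]

models = {
    "Logistic Regression": LogisticRegression(),
    "Support Vector Machine": LinearSVC(),
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", model)])
    pipe.fit(texts, labels)
    scores[name] = pipe.score(texts, labels)  # training accuracy only
    print(f"{name}: {scores[name]:.4f}")
```

On the real dataset the notebook holds out a test split; here the fit/score loop just shows how the five models slot into the same pipeline.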
Random Forest performed slightly better than the others, but the accuracy is still quite low.
Using pretrained BERT to train on text embeddings obtained from the Universal Sentence Encoder (TensorFlow Hub) significantly improved the accuracy to 0.7095.