Each Reddit post is tagged for filtering purposes. These tags are called flairs in the Reddit world.
This project provides a comparative analysis of existing machine learning and natural language processing techniques for detecting the flair of a Reddit post, using data scraped from Reddit with PRAW (Python Reddit API Wrapper).
The analysis used different features and a pipeline of natural language processing techniques such as Count Vectorization and TF-IDF Transformation, together with machine learning models such as Decision Tree, Support Vector Machines (SVM), Logistic Regression, Naive Bayes, and pretrained BERT, to study the data and classify the flairs.
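As a rough illustration of the vectorization pipeline described above, the sketch below chains `CountVectorizer`, `TfidfTransformer`, and a classifier with scikit-learn. The helper name `build_flair_pipeline`, the hyperparameters, and the toy texts and flairs are assumptions made for this example; they are not taken from the notebooks.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_flair_pipeline():
    """Raw text -> token counts -> TF-IDF weights -> classifier (hypothetical helper)."""
    return Pipeline([
        ("counts", CountVectorizer(stop_words="english")),
        ("tfidf", TfidfTransformer()),
        # max_iter is raised only to avoid convergence warnings; an assumed setting.
        ("clf", LogisticRegression(max_iter=1000)),
    ])


# Toy data invented for illustration; not part of the project's dataset.
texts = [
    "election results announced by the commission",
    "parliament passes new election bill",
    "cricket team wins the final match",
    "football match ends in a draw",
]
flairs = ["Politics", "Politics", "Sports", "Sports"]

model = build_flair_pipeline()
model.fit(texts, flairs)
prediction = model.predict(["the match was postponed"])
```

Swapping the last step for another estimator (for example a decision tree) leaves the text processing unchanged, which is what makes a side-by-side comparison of models straightforward.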
Link to the web app: https://reddit-flair-classification.onrender.com/
(If the page shows an error, wait a while and reload the page.)
python 3.7.1
praw==6.5.1
pyqt5==5.7.1
nltk==3.4.5
Flask==1.0.2
numpy==1.16.4
gunicorn==19.9.0
Jinja2==2.11.3
Werkzeug==0.15.6
MarkupSafe==1.1.1
Click==7.0
itsdangerous==1.1.0
scikit-learn==0.22.2.post1
Flair_Classification_1_Generating_Data.ipynb: Contains the code to extract data from Reddit with PRAW and create the dataset.
Flair_Classification_2_Training.ipynb: Contains EDA and the comparison of the performances of different models on the data.
Flair_Classification_BERT.ipynb: Contains flair classification using pretrained BERT from TensorFlow Hub.
The data was generated using PRAW (Python Reddit API Wrapper) to extract data from Reddit posts, followed by text preprocessing. There may be many flairs, but only the 13 most common ones are used for training, to avoid class imbalance. The final data has 100 samples of each flair, and a total of 1120 records.
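A minimal sketch of how per-flair collection with PRAW might look is given below. It assumes the posts are found through Reddit's `flair:"..."` search syntax and that the title, body, and URL are kept; the function name `collect_posts`, the field names, and the placeholder credentials are hypothetical, and the actual notebook may work differently.

```python
def collect_posts(subreddit, flairs, per_flair=100):
    """Collect up to `per_flair` posts for each flair from a PRAW Subreddit object."""
    rows = []
    for flair in flairs:
        # Subreddit.search accepts a query string; `limit` caps the number of results.
        for post in subreddit.search(f'flair:"{flair}"', limit=per_flair):
            rows.append({
                "flair": flair,
                "title": post.title,
                "body": post.selftext,
                "url": post.url,
            })
    return rows


# Usage with PRAW (credentials and subreddit name are placeholders, not real values):
# import praw
# reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="flair-scraper")
# rows = collect_posts(reddit.subreddit("SUBREDDIT_NAME"), ["Politics", "Sports"])
```

Capping every flair at the same `per_flair` value is one simple way to keep the classes roughly balanced.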
The following accuracy was obtained with different models:
- Logistic Regression: 0.5090
- Support Vector Machine: 0.5060
- Naive Bayes: 0.4818
- Decision Tree: 0.4909
- Random Forest: 0.5242
Random Forest performed slightly better than the others, but the accuracy is still quite low.
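A comparison of this kind can be sketched by fitting each classifier on the same TF-IDF features and scoring it on a held-out set. The snippet below is only an illustration: the function name `compare_models`, the linear SVM kernel, the fixed random seeds, and the toy data are assumptions, and it will not reproduce the accuracies listed above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_models(train_texts, train_labels, test_texts, test_labels):
    """Fit each classifier on identical features and return its test accuracy."""
    classifiers = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Support Vector Machine": SVC(kernel="linear"),  # kernel choice is assumed
        "Naive Bayes": MultinomialNB(),
        "Decision Tree": DecisionTreeClassifier(random_state=0),
        "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    }
    scores = {}
    for name, clf in classifiers.items():
        pipeline = Pipeline([
            ("counts", CountVectorizer()),
            ("tfidf", TfidfTransformer()),
            ("clf", clf),
        ])
        pipeline.fit(train_texts, train_labels)
        scores[name] = accuracy_score(test_labels, pipeline.predict(test_texts))
    return scores


# Toy data invented for illustration; not part of the project's dataset.
train_texts = [
    "election results announced today",
    "parliament debates the new bill",
    "minister resigns after the vote",
    "cricket team wins the final match",
    "football match ends in a draw",
    "tennis player wins the tournament",
]
train_labels = ["Politics"] * 3 + ["Sports"] * 3
test_texts = ["the vote on the bill", "the match and the tournament"]
test_labels = ["Politics", "Sports"]

scores = compare_models(train_texts, train_labels, test_texts, test_labels)
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.4f}")
```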
Using BERT to train on text embeddings obtained from the Universal Sentence Encoder on TensorFlow Hub significantly improved the accuracy to 0.7095.