
Classification of reddit flairs using a dataset generated by scraping Reddit data with PRAW


11-aryan/Reddit-Flair-Classification


Reddit-Flair-Classification

Each Reddit post is tagged for filtering purposes. These tags are called flairs in the Reddit world.

This project provides a comparative analysis of existing Machine Learning and Natural Language Processing techniques for detecting the flair of a Reddit post, using a dataset generated by web-scraping Reddit with PRAW (Python Reddit API Wrapper).

Data analysis was performed on different features, and a pipeline of natural language processing techniques (Count Vectorization and TF-IDF transformation) combined with various machine learning models (Decision Tree, Support Vector Machine (SVM), Logistic Regression, Naive Bayes, Random Forest, and pretrained BERT) was used to study the data and classify the flairs.
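The vectorize-then-classify pipeline described above can be sketched with scikit-learn. The posts and flair labels below are toy examples for illustration, not from the project's dataset; any of the compared classifiers can be swapped into the final step.

```python
# Minimal sketch of the CountVectorizer -> TF-IDF -> classifier pipeline.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Toy posts and flairs, made up for illustration.
posts = [
    "Government announces new economic policy",
    "Team wins the championship final",
    "New budget increases infrastructure spending",
    "Star player scores a hat-trick in the derby",
]
flairs = ["Politics", "Sports", "Politics", "Sports"]

clf = Pipeline([
    ("vect", CountVectorizer()),      # raw text -> token counts
    ("tfidf", TfidfTransformer()),    # token counts -> TF-IDF weights
    ("model", LogisticRegression()),  # any of the compared models fits here
])
clf.fit(posts, flairs)

pred = clf.predict(["Parliament passes the new spending bill"])
print(pred[0])
```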

Link to the web app: https://reddit-flair-classification.onrender.com/
(If the page shows an error, wait for some time and reload the page.)

Requirements:

python 3.7.1
praw==6.5.1
pyqt5==5.7.1
nltk==3.4.5
Flask==1.0.2
numpy==1.16.4
gunicorn==19.9.0
Jinja2==2.11.3
Werkzeug==0.15.6
MarkupSafe==1.1.1
Click==7.0
itsdangerous==1.1.0
scikit-learn==0.22.2.post1

Description of Jupyter Notebooks

Flair_Classification_1_Generating_Data.ipynb : Contains the code to extract data from Reddit with PRAW and create the dataset.

Flair_Classification_2_Training.ipynb: Contains EDA and the comparison of the performances of different models on the data.

Flair_Classification_BERT.ipynb: Uses pretrained BERT from TensorFlow Hub.

Generating the dataset

The data was generated using PRAW (Python Reddit API Wrapper) to extract Reddit posts, followed by text preprocessing. Many flair types exist, but only the 13 most common flairs are used for training to avoid class imbalance. The final dataset has up to 100 samples per flair, for a total of 1120 records.
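A hedged sketch of the scraping step, in the spirit of Flair_Classification_1_Generating_Data.ipynb: the function name, credentials, and subreddit are placeholders, and the imports are deferred inside the function so nothing runs at import time.

```python
def scrape_flaired_posts(client_id, client_secret, user_agent,
                         subreddit_name="some_subreddit", limit=100):
    """Collect (title, body, flair) rows for flaired posts via PRAW.

    All arguments are placeholders; supply real Reddit API credentials
    from https://www.reddit.com/prefs/apps before calling.
    """
    import praw  # deferred so the sketch loads without praw installed

    reddit = praw.Reddit(client_id=client_id,
                         client_secret=client_secret,
                         user_agent=user_agent)
    rows = []
    for post in reddit.subreddit(subreddit_name).hot(limit=limit):
        if post.link_flair_text:  # keep only posts that carry a flair
            rows.append({"title": post.title,
                         "body": post.selftext,
                         "flair": post.link_flair_text})
    return rows
```

The returned list of dicts can then be written to CSV and run through the text-preprocessing step before training.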

Here's a sample of the data:

Training Results

The following accuracy was obtained with different models:

  • Logistic Regression: 0.5090
  • Support Vector Machine: 0.5060
  • Naive Bayes: 0.4818
  • Decision Tree: 0.4909
  • Random Forest: 0.5242

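The comparison above can be reproduced in outline: the same TF-IDF front end with each classifier swapped in. The corpus below is a tiny synthetic stand-in for the scraped Reddit dataset, and scores are training accuracy only, so the numbers will not match the table.

```python
# Sketch of the model comparison: one TF-IDF front end, several classifiers.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic corpus standing in for the scraped Reddit posts.
texts = [
    "election results announced today",
    "cricket match ends in a draw",
    "new tax policy debated in parliament",
    "football league kicks off this weekend",
    "minister resigns over budget row",
    "tennis star wins grand slam title",
]
labels = ["Politics", "Sports", "Politics", "Sports", "Politics", "Sports"]

models = {
    "Logistic Regression": LogisticRegression(),
    "Support Vector Machine": LinearSVC(),
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", model)])
    pipe.fit(texts, labels)
    scores[name] = pipe.score(texts, labels)  # training accuracy only
    print(f"{name}: {scores[name]:.4f}")
```

On the real dataset the notebook holds out a test split; here the fit/score loop just shows how the five models slot into the same pipeline.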
Random Forest performed slightly better than the others, but the accuracy is still quite low.
Using pretrained BERT to train on text embeddings obtained from the Universal Sentence Encoder (TensorFlow Hub) significantly improved the accuracy to 0.7095.