reddit-stance-classifier

A Flask webapp and Python scripts for predicting Reddit users' political leanings from their comment history. The backend is Flask with a PostgreSQL database, which is queried using SQLAlchemy.

Usage

View the live webapp

A model has already been trained and pickled in models/ensemble.pkl. If you wish to train your own model, first set up the Postgres database and add data to it using scraper.py.

train_model.py can then be run with various flags.
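To use the already-pickled model without retraining, it can be loaded with the standard library. The following is a minimal sketch: the helper name load_model is hypothetical, and the interface of the unpickled object is not described here, so nothing is assumed about it. Unpickling will likely also require the project's own modules (for example custom_transformers.py) to be importable.

```python
import pickle
from pathlib import Path


def load_model(path="models/ensemble.pkl"):
    """Unpickle and return whatever object is stored at `path`.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    with Path(path).open("rb") as f:
        return pickle.load(f)
```

Only unpickle files you trust, since pickle can execute arbitrary code while loading.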

Model

Currently the only features this model uses are the frequency of comments made in different subreddits. A typical instance of data used for training looks like:

Example instance of data

"userMcUserFace01010101": {
  "stance": "libleft",
  "subs": {
    "rollercoasters": 1037,
    "CasualUK": 101,
    "PewdiepieSubmissions": 90,
    "polandball": 68,
    "unpopularopinion": 65,
    "todayilearned": 64,
    "LosAngeles": 62,
    "ShitAmericansSay": 53,
    "im14andthisisdeep": 53,
    "london": 32,
    "reclassified": 26,
    "TheRightCantMeme": 25,
    "CringeAnarchy": 22,
    "HongKong": 22,
    "MovieDetails": 22,
    "GenZ": 21
  }
}
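As a rough illustration of how per-subreddit comment counts like these could become model features, the sketch below builds a fixed subreddit vocabulary and turns each user's counts into relative frequencies. The function names, the normalisation step, and the second user are assumptions made for illustration; the project's actual preprocessing may differ.

```python
def build_vocabulary(profiles):
    """Return a sorted list of every subreddit seen in any profile."""
    return sorted({sub for p in profiles.values() for sub in p["subs"]})


def to_feature_vector(subs, vocabulary):
    """Relative comment frequency per subreddit, in vocabulary order."""
    total = sum(subs.values())
    if total == 0:
        return [0.0] * len(vocabulary)
    return [subs.get(sub, 0) / total for sub in vocabulary]


profiles = {
    # Abbreviated version of the example instance above.
    "userMcUserFace01010101": {
        "stance": "libleft",
        "subs": {"rollercoasters": 1037, "CasualUK": 101, "london": 32},
    },
    # Hypothetical second user, invented purely for illustration.
    "hypotheticalUser": {
        "stance": "authright",
        "subs": {"london": 3, "HongKong": 1},
    },
}

vocabulary = build_vocabulary(profiles)
features = {
    user: to_feature_vector(profile["subs"], vocabulary)
    for user, profile in profiles.items()
}
```

Normalising by the user's total comment count keeps very active and less active users on a comparable scale; raw counts would be an equally plausible choice.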

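In this example, "stance" is the target we want to predict. It is encoded as a pair of coordinates mirroring the stance's position on the political compass: the first value is the vertical (lib/auth) axis and the second is the horizontal (left/right) axis. The encoding is shown below: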

stancemap = {
  'libleft': (-1, -1), 
  'libright': (-1, 1), 
  'authleft': (1, -1), 
  'authright': (1, 1),
  'left': (0, -1),
  'right': (0, 1),
  'centrist': (0, 0),
  'auth': (1, 0),
  'lib': (-1, 0)
}
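A short sketch of how this mapping might be used in both directions: encoding a flair into its coordinate pair, and mapping a predicted pair back to the nearest flair. The helper names and the nearest-point decoding rule are assumptions for illustration; how the project actually converts predictions back to labels is not stated here.

```python
stancemap = {
    'libleft': (-1, -1),
    'libright': (-1, 1),
    'authleft': (1, -1),
    'authright': (1, 1),
    'left': (0, -1),
    'right': (0, 1),
    'centrist': (0, 0),
    'auth': (1, 0),
    'lib': (-1, 0),
}


def encode(stance):
    """Return the (vertical, horizontal) pair for a flair name."""
    return stancemap[stance]


def decode(pair):
    """Return the flair whose coordinates are closest to `pair`.

    Hypothetical rule: squared Euclidean distance, with ties broken
    by the order of entries in `stancemap`.
    """
    v, h = pair
    return min(
        stancemap,
        key=lambda name: (stancemap[name][0] - v) ** 2
        + (stancemap[name][1] - h) ** 2,
    )
```

For instance, `decode((-0.9, -0.8))` picks 'libleft', since (-1, -1) is the closest of the nine points.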

The target data are user flairs sampled from r/politicalcompassmemes.

Conclusion

At the time of writing, a precision and recall of ~0.8 can be achieved on the unseen test set. It is important to note, however, that there may be significant selection bias, as all instances of data come from users of r/politicalcompassmemes. It therefore remains to be seen whether this approach to identifying political positions will generalise to the Reddit population as a whole and make sensible predictions.

Due to the significant class imbalance present in the training data (the number of users who lean 'lib' on the vertical (v) axis is far greater than the number who lean 'auth'), it may be useful to consider alternative metrics such as balanced accuracy (bACC) or predicted positive condition rate (PPCR).
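To show why balanced accuracy can be more informative under class imbalance, here is a minimal pure-Python sketch that computes it as the mean of per-class recall. The labels are toy data invented for illustration, not results from this project. scikit-learn provides an equivalent `balanced_accuracy_score` in `sklearn.metrics`.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy, imbalanced example: a classifier that always predicts 'lib'.
y_true = ['lib'] * 8 + ['auth'] * 2
y_pred = ['lib'] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
```

In this toy case the plain accuracy is 0.8 even though every 'auth' user is misclassified, while the balanced accuracy is 0.5, the value a trivial majority-class predictor gets on two classes.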
