PaperSorter is an academic paper recommendation system that utilizes machine learning techniques to match users' interests. The system retrieves article alerts from RSS feeds and processes the title, author, journal name, and abstract of each article using Upstage's Solar LLM to generate embedding vectors. These vectors serve as input for a regression model that predicts the user's level of interest in each paper. PaperSorter sends notifications about high-scoring articles to a designated Slack channel, enabling timely discussion of relevant publications among colleagues. The prediction model can be trained incrementally with additional labels for new articles provided by user.
To install PaperSorter, use pip:
pip install papersorter
PaperSorter uses TheOldReader as its
feed source. After signing up for TheOldReader, you will receive
API access using your email and password. Before running PaperSorter,
make sure to set the TOR_EMAIL
and TOR_PASSWORD
environment
variables with your TheOldReader email and password, respectively.
This will allow PaperSorter to authenticate and retrieve the necessary
data from your feeds.
Embeddings API such OpenAI's
and Solar LLM
converts article titles and contents into numerical vectors.
Sign up on the OpenAI API or
Upstage console and create an API key as per the
documentation. Store the key securely and set the PAPERSORTER_API_KEY
environment
variable before running PaperSorter.
To send notifications to a Slack channel, create an incoming webhook
address as described in the Slack documentation.
Store the address securely and set the PAPERSORTER_WEBHOOK_URL
environment
variable before running PaperSorter.
To train a predictor for your article interests, ensure your TheOldReader account contains at least 1000 articles, including at least 100 positively labeled articles marked with stars. Ideally, aim for around 5000 articles with 500 starred items for optimal performance.
After populating your TheOldReader account, initialize the feed and embedding databases using:
papersorter init
Next, train your first model with:
papersorter train
If the ROCAUC performance metric meets your expectations, you're ready to send notifications about new interesting articles.
For the regular updates, this command retrieves updates, converts new items to embeddings, and finds interesting articles:
papersorter update
To send notifications for new interesting articles, run:
papersorter broadcast
You will receive formatted notifications in your Slack channel.
Here is an example of a shell script that runs PaperSorter's update
and broadcast
jobs in the background. This script sends notifications
about new interesting articles between 7 am and 9 pm, while only
performing updates during the night.
#!/bin/bash
PAPERSORTER_CMD=/path/to/papersorter
PAPERSORTER_DATADIR=/path/to/data
LOGFILE=background-updates.log
CURRENT_HOUR=$(date +%H)
cd $PAPERSORTER_DATADIR
$PAPERSORTER_CMD update -q --log-file $LOGFILE
if [ "$CURRENT_HOUR" -ge 7 ] && [ "$CURRENT_HOUR" -le 21 ]; then
$PAPERSORTER_CMD broadcast -q --log-file $LOGFILE
fi
Here is an example line for the crontab. It runs the update script on every hour at ten minutes past the hour.
10 * * * * /bin/bash /path/to/run-update.sh
To improve the model, provide more labels for the articles. First, extract the list of articles with the following command:
papersorter train -o model-temporary.pkl -f feedback.xlsx
This generates an Excel file, feedback.xlsx
, containing titles,
authors, prediction scores, and other details. Review each row and
fill in the label
column with 1
(interesting) or 0
(not interesting).
Leave it blank if unsure. Once you've labeled some articles, update
the feed database with:
papersorter feedback -i feedback.xlsx
Retrain the predictor with the updated labels using:
papersorter train
The new predictor is stored as model.pkl
, and your next feeds will
be assessed with the updated model.
Hyeshik Chang [email protected]