Simple pipeline for real-time text classification of tweets streamed from the Twitter API.

A message producer reads tweets matching one or more tracked words and streams them over Kafka, so a consumer can proceed with the text classification. Using Kafka here makes it easy to plug additional steps into this pipeline, or even to feed entirely different pipelines from the same stream.
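To make that pluggability concrete, here is a sketch (not code from this repo) of how an extra step could tap the same stream: Kafka delivers the full topic to every consumer group, so a new step just subscribes under its own `group_id`, and neither the producer nor the existing consumer has to change. The topic name, broker address, and group id below are assumptions.

```python
# Hypothetical sketch: an extra pipeline step subscribing to the same topic.
# Each distinct group_id receives its own full copy of the "tweets" stream.
from kafka import KafkaConsumer

archiver = KafkaConsumer(
    "tweets",                           # topic name is an assumption
    group_id="archiver",                # a new group = an independent step
    bootstrap_servers="localhost:9092", # broker address is an assumption
)

for message in archiver:
    # Store, re-publish, or otherwise process tweets without touching
    # the classifier consumer, which reads the same topic in its own group.
    print(message.value.decode("utf-8"))
```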
Before trying it out, you need to provide a few Twitter API keys; `.env.sample` shows where they go. And as you can see there, I suppose you'd like to use virtualenv too. Forgive my presumption, though.
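For reference, the keys in question are the usual Twitter credentials. A sketch of what `.env.sample` plausibly looks like (the exact variable names are assumptions; the file itself is the source of truth):

```sh
# Hypothetical contents; check the real .env.sample in the repo.
export TWITTER_CONSUMER_KEY=your-consumer-key
export TWITTER_CONSUMER_SECRET=your-consumer-secret
export TWITTER_ACCESS_TOKEN=your-access-token
export TWITTER_ACCESS_TOKEN_SECRET=your-access-token-secret
export KAFKA_BROKER=localhost:9092
```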
So our prerequisites here are:
- Python (https://www.python.org/)
- Virtualenv (https://virtualenv.pypa.io/)
- Apache Kafka (https://kafka.apache.org/)
- Twitter API (https://developer.twitter.com/)
As far as Kafka goes, if you have Docker installed, no worries, I've got your back, a.k.a. `docker-compose.yml`.
When you are good to go, with `python` and `virtualenv` installed and those Twitter keys on hand:

- Open a terminal, `git clone` this repo wherever you like, and `cd` into it
- Rename `.env.sample` to just `.env`
- Add those Twitter keys to `.env`
- Set `KAFKA_BROKER` in `.env` as you like, or leave it as is
- Run `virtualenv .venv` to create a new virtual environment
- Run `source .venv/bin/activate` to load this new virtual environment
- Run `pip install -r requirements.txt` to install all dependencies
Done.
The first thing to do is train the model; this is a one-time kind of thing. So open a terminal and fire:
```sh
$ source .env
$ python trainer.py
```
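`trainer.py`'s internals aren't reproduced here, but a minimal text-classification trainer along these lines would do the job. Everything below, the toy dataset, the model choice, and the `model.pkl` output path, is an assumption for illustration:

```python
# Hypothetical sketch of a one-time training step; the real trainer.py may differ.
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled examples; a real trainer would load a proper dataset.
texts = [
    "I love this new framework",
    "great release, very stable",
    "this bug ruined my whole day",
    "worst documentation I have ever read",
]
labels = ["positive", "positive", "negative", "negative"]

# TF-IDF features into a linear classifier: a simple, common baseline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Persist the fitted model so the consumer can load it later.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
```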
Now, if you don't already have Kafka up and running, you can use the provided `docker-compose.yml`:

```sh
$ export DOCKERHOST=`docker-machine ip`
$ docker-compose up -d
```

(If you run Docker natively rather than through `docker-machine`, set `DOCKERHOST` to whatever address your containers should advertise, e.g. `localhost`.)
Once it's ready, you can start the consumer in one terminal:

```sh
$ source .env
$ python consumer.py
```
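For a picture of what the consuming side typically involves, here is a sketch, not the repo's actual `consumer.py`; the topic name, the `KAFKA_BROKER` usage, and the `model.pkl` path are assumptions:

```python
# Hypothetical sketch of the consuming side; the real consumer.py may differ.
import os
import pickle

from kafka import KafkaConsumer

# Load the model persisted by the training step (path is an assumption).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

consumer = KafkaConsumer(
    "tweets",                                      # topic name is an assumption
    bootstrap_servers=os.environ["KAFKA_BROKER"],  # set via .env
    value_deserializer=lambda v: v.decode("utf-8"),
)

# Classify each tweet as it arrives from the producer.
for message in consumer:
    label = model.predict([message.value])[0]
    print(f"{label}: {message.value}")
```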
And finally, start the producer in another one:

```sh
$ source .env
$ python producer.py "Java" "PHP" "JavaScript"
```
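And the producing side, for context: a sketch assuming `tweepy` 3.x for the streaming API and `kafka-python` for publishing. The real `producer.py` may be organized differently; the credential names and the topic are assumptions.

```python
# Hypothetical sketch of the producing side; the real producer.py may differ.
import os
import sys

import tweepy
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=os.environ["KAFKA_BROKER"])

class TweetForwarder(tweepy.StreamListener):
    def on_status(self, status):
        # Publish each matching tweet's text onto the "tweets" topic.
        producer.send("tweets", status.text.encode("utf-8"))

auth = tweepy.OAuthHandler(
    os.environ["TWITTER_CONSUMER_KEY"],
    os.environ["TWITTER_CONSUMER_SECRET"],
)
auth.set_access_token(
    os.environ["TWITTER_ACCESS_TOKEN"],
    os.environ["TWITTER_ACCESS_TOKEN_SECRET"],
)

# Track the words given on the command line, e.g. "Java" "PHP" "JavaScript".
stream = tweepy.Stream(auth=auth, listener=TweetForwarder())
stream.filter(track=sys.argv[1:])
```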
Now you can stalk them all... ho ho ho