Tweet Analysis using Kafka and Spark Streaming

A real-time analytics dashboard that visualizes the trending hashtags and @mentions at a given location, using data from Twitter's real-time streaming API.

Installation Guide

Download and install Kafka, Spark, Python and npm.

You can refer to the following guide to install Kafka:

https://towardsdatascience.com/running-zookeeper-kafka-on-windows-10-14fc70dcc771

Spark can be downloaded from the following link:

https://spark.apache.org/downloads.html

How to run the code.

Create the Kafka topic.

You can refer to the link below:

https://dzone.com/articles/running-apache-kafka-on-windows-os

Or run the following command:

kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic twitter

Update the conf file with your secret key and access tokens.

Install the Python dependencies.

pip install -r requirements.txt

Install the Node.js dependencies.

npm install

Start ZooKeeper

Open cmd and execute:

zkserver

Start Kafka

Go to the Kafka installation directory, ..\kafka_2.11-2.3.1\bin\windows. Open cmd there and execute the following command:

kafka-server-start.bat C:\ProgramData\Java\kafka_2.11-2.3.1\config\server.properties

Run the Python script that fetches tweets.

python fetch_tweets.py

Run the Python script that analyzes tweets.

python analyze_tweets.py

Start the Node.js server with npm

npm start

Technology stack

Area	Technology
Front-End	HTML5, Bootstrap, CSS3, Socket.IO, highcharts.js
Back-End	Express, Node.js
Cluster Computing Framework	Apache Spark (Python)
Message Broker	Apache Kafka

Architecture

How it works

Tweets are extracted from Twitter's streaming API and published to a Kafka topic.
Spark listens to this topic, reads the data, analyzes it using Spark Streaming, and publishes the top 10 trending hashtags and @mentions to another Kafka topic.
Spark Streaming creates a DStream from the data it reads from Kafka and analyzes it by applying operations such as map, filter, updateStateByKey, countByValue and foreachRDD to the underlying RDDs; the top 10 hashtags and mentions are then obtained using Spark SQL.
Node.js picks up this data from the Kafka topic on the server side and emits it to the socket.
The socket pushes the data to the user's dashboard, which is rendered in real time using highcharts.js.
The dashboard is refreshed every 60 seconds.
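
The counting logic described above can be illustrated without a cluster. The following is a minimal plain-Python sketch, not the code in analyze_tweets.py: the regular expressions and the names extract_tags, update_count, process_batch and top_n are assumptions made for illustration. update_count has the shape of a state-update function that updateStateByKey accepts (a list of new values for a key plus the previous state), and process_batch stands in for one Spark Streaming micro-batch.

```python
import re
from collections import Counter

# Hypothetical patterns; the project's actual tokenization may differ.
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def extract_tags(text):
    """Return (hashtags, mentions) found in one tweet text, lowercased."""
    lowered = text.lower()
    return HASHTAG_RE.findall(lowered), MENTION_RE.findall(lowered)


def update_count(new_values, running_count):
    """State update in the shape updateStateByKey expects.

    new_values: counts seen for a key in the current batch.
    running_count: previous total, or None the first time a key appears.
    """
    return sum(new_values) + (running_count or 0)


def process_batch(tweets, state):
    """Fold one micro-batch of tweet texts into the running hashtag counts."""
    batch = Counter()
    for text in tweets:
        hashtags, _mentions = extract_tags(text)
        batch.update(hashtags)
    for tag, count in batch.items():
        state[tag] = update_count([count], state.get(tag))
    return state


def top_n(state, n=10):
    """Return the n most frequent keys as (key, count) pairs, highest first."""
    return sorted(state.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    state = {}
    process_batch(["Loving #Spark and #Kafka with @apache", "#spark streaming"], state)
    process_batch(["#kafka #spark"], state)
    print(top_n(state, 2))
```

Mentions could be counted the same way by keeping a second state dictionary and feeding it the second element returned by extract_tags. In the pipeline described above, the top 10 are obtained with Spark SQL rather than an in-memory sort; the sort here only stands in for that step.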
