Cloud-pipeline Subsystem


About the tool

This tool creates a simple cloud-side component of a big data processing architecture. The repository includes the basic code and tools required to readily set up the cloud component of a big data pipeline, so that developers and researchers can perform additional testing of the code. The project works in tandem with the edge-sim project to quickly create a three-tiered architecture.


Different components

The overall workflow/architecture can be seen in Figure 1.

The different components available are:

  • zookeeper: Service-discovery container required by Apache Kafka
  • kafka: Scalable message broker (see how to scale the service up/down below)
  • database: MongoDB database for stream-data ingestion
  • stream-processor: Provides streaming analytics by consuming Kafka messages over a fixed window interval
  • database-processor: Ingests the Kafka message stream into the MongoDB database
  • spark: The master/controller instance of the Apache Spark cluster
  • spark-worker: Worker nodes that connect to the spark service (see how to scale the service up/down below)
  • mongo-express: Web interface for inspecting the database service

In addition, Kafka message consumer code is available in the /Utils directory.

  • Figure 1: Workflow and architecture of the cloud pipeline sub-systems
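
Data enters the pipeline as Kafka messages, typically produced by the edge-sim side of the architecture. As a rough sketch (the broker address and the report topic are assumptions borrowed from the examples later in this README, not fixed values), a producer could publish readings into the pipeline like this:

# Minimal kafka-python producer sketch; not part of this repository.
# Broker address and topic name are assumptions; adjust them to your deployment.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="192.168.1.12:32812",                  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("report", {"sensor_id": 1, "value": 42.0})     # sample edge reading
producer.flush()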

Running the cloud pipeline

To run the cloud pipeline service, perform the following steps:

1. Starting Kafka

To start Kafka, first run zookeeper:

$ docker-compose up -d zookeeper

Next, start the Kafka brokers with:

$ docker-compose up --scale kafka=NUMBER_OF_BROKERS
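
Because the brokers are typically published on dynamically mapped host ports (as the 192.168.1.12:32812-style addresses later in this README suggest), it can help to confirm they are reachable before starting the consumers. A small kafka-python sketch for this check (the broker addresses are placeholders, not fixed values):

# Hedged connectivity check against the brokers started above.
# Replace the placeholder addresses with the host ports actually mapped
# for the kafka service (e.g. look them up with `docker ps`).
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers="192.168.1.12:32812,192.168.1.12:32814")
print(consumer.topics())    # set of topics visible on the cluster
consumer.close()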

2. Start the Mongo database

To start MongoDB, just run the command:

$ docker-compose up -d database
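
To confirm the database container is up, a quick pymongo ping can be used. The sketch below assumes MongoDB's default port 27017 is published on localhost; check docker-compose.yml for the actual mapping.

# Hedged sketch: verify the MongoDB container is reachable with pymongo.
# The connection string assumes the default port 27017 published on localhost.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=5000)
client.admin.command("ping")            # raises an error if MongoDB is unreachable
print(client.list_database_names())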

3. Start the database-processor

To start the Kafka consumer service, run the following command while Kafka is running:

$ docker-compose up --scale kafka=NUMBER_OF_BROKERS database-processor

Note: The Kafka consumer requires a comma-separated list of Kafka brokers, which has to be provided in the entrypoint config of the docker-compose.yml file. Example: entrypoint: ["python3", "MongoIngestor.py", "192.168.1.12:32812,192.168.1.12:32814", "kafka-database-consumer-group-2"]
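
MongoIngestor.py itself ships with the repository; the sketch below only illustrates the general shape of such an ingestor (a Kafka consumer in the given group writing every message into MongoDB). The topic, database, collection, and hostname used here are assumptions for illustration.

# Hedged sketch of a Kafka-to-MongoDB ingestor; NOT the repository's MongoIngestor.py.
# Takes the broker list and consumer group as arguments, mirroring the entrypoint above.
import json
import sys
from kafka import KafkaConsumer
from pymongo import MongoClient

brokers, group_id = sys.argv[1], sys.argv[2]       # "host:port,host:port" "group-name"

consumer = KafkaConsumer(
    "report",                                      # assumed topic name
    bootstrap_servers=brokers.split(","),
    group_id=group_id,
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
collection = MongoClient("mongodb://database:27017").pipeline.readings   # assumed names

for message in consumer:
    collection.insert_one(message.value)           # assumes each message is a JSON object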

4. Start Apache Spark

To start Spark, run the following docker-compose commands:

  • Start master/controller node
$ docker-compose up spark
  • Start multiple instances of the worker/responder node
$ docker-compose scale spark-worker=2
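
Once the master and workers are up, a PySpark application can attach to the cluster to verify it. The sketch below assumes the standalone master's default port 7077 is published on localhost; check docker-compose.yml for the actual mapping.

# Hedged sketch: attach a PySpark application to the standalone master started above.
# "spark://localhost:7077" assumes the default master port published on localhost.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://localhost:7077")
    .appName("cloud-pipeline-check")
    .getOrCreate()
)
print(spark.range(1000).count())    # trivial job that exercises the workers
spark.stop()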

5. Start the stream-processor application

To start the stream-processor application, use the following command:

$ docker-compose up stream-processor
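
The stream-processor consumes Kafka messages over a fixed window interval and computes analytics per window. Its actual implementation lives in the repository; the following is only a rough kafka-python sketch of that windowing idea, with the topic name, broker address, and a 10-second window as assumptions.

# Hedged sketch of fixed-window stream analytics over Kafka; not the repository's code.
# Collects messages for WINDOW_SECONDS, then emits a simple aggregate per window.
import json
import time
from kafka import KafkaConsumer

WINDOW_SECONDS = 10                                    # assumed window length

consumer = KafkaConsumer(
    "report",                                          # assumed topic name
    bootstrap_servers="192.168.1.12:32812",            # assumed broker address
    group_id="stream-processor-group",                 # separate group from the ingestor
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

while True:
    window_end = time.time() + WINDOW_SECONDS
    values = []
    while time.time() < window_end:
        batch = consumer.poll(timeout_ms=1000)         # fetch whatever arrived recently
        for records in batch.values():
            values.extend(r.value.get("value", 0) for r in records)
    if values:
        print(f"window avg={sum(values) / len(values):.2f} count={len(values)}")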

6. Start the database-ingestion application

Start this using:

$ docker-compose up --scale kafka=2 database-processor

Note: The database-processor and stream-processor applications both belong to separate consumer groups in Kafka. As such, running both of them will provide simultaneous stream ingestion and processing capability.
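
The point of the separate consumer groups is that Kafka delivers the full message stream to each group independently. A small, hedged demonstration of this behaviour (topic, broker address, and group names are assumptions):

# Hedged demo of the consumer-group behaviour described above: two consumers in
# different groups each receive every message on the topic independently.
from kafka import KafkaConsumer

def count_messages(group):
    consumer = KafkaConsumer(
        "report",                                  # assumed topic name
        bootstrap_servers="192.168.1.12:32812",    # assumed broker address
        group_id=group,
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,                  # stop iterating once the topic is idle
    )
    return sum(1 for _ in consumer)

print(count_messages("kafka-database-consumer-group-2"))   # ingestion group (from above)
print(count_messages("stream-processor-group"))            # assumed processing group
# Both counts cover the whole topic, so ingestion and processing run side by side.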

7. Kafka Message Consumer

The Python script to consume messages on any topic is present in the /Utils folder. Launch it as:

$ python3 client-report.py "kafka_broker" "topic_to_connect"

for example:

$ python3 client-report.py "192.168.1.12:32812,192.168.1.12:32814" report

Monitoring the application

The recommended application for monitoring is netdata.

  • Figure 2: Sample application monitoring (notice the containers at the bottom right)