RealTime StockStream is a streamlined system for processing live stock market data. It uses Apache Kafka for data input, Apache Spark for data handling, and Apache Cassandra for data storage, making it a powerful yet easy-to-use tool for financial data analysis πΉποΈ
This guide will walk you through setting up and running the RealTime StockStream on your local machine for development and testing.
Ensure you have the following software installed:
- Docker
- Python (version 3.11 or higher)
- Live Market Data Integration β
- Advanced Analytics Features β
- Interactive Data Visualization β
- Improved Scalability β
- User Customization Options β
- Stronger Security β
- Appache Kafka
- Appache Cassandra
- Appache ZooKeeper
- Appache Spark
- Python
Follow these steps to set up your development environment:
- Create a Kafka Topic:
kafka-topics.sh --create --topic stocks --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
- Grouping Aggregation: Summarize data by groups.
- Pivot Aggregation: Reshape data, converting rows to columns.
- Rollups and Cubes: Perform hierarchical and combinational aggregations.
- Ranking Functions: Assign ranks within data partitions.
- Analytic Functions: Compute aggregates while maintaining row-level details.
- Create a Keyspace and Table:
Execute the following CQL commands to set up your Cassandra database:
CREATE KEYSPACE stockdata WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 1}; CREATE TABLE stockdata.stocks ( stock text, trade_id uuid, price decimal, quantity int, trade_type text, trade_date date, trade_time time, PRIMARY KEY (stock, trade_id) );
-
Launch Services: Use Docker Compose to start Kafka, Zookeeper, Cassandra, and Spark services:
version: '3.9' name: "realtime-stock-market" services: zookeeper: image: bitnami/zookeeper:latest ports: - "2181:2181" environment: - ALLOW_ANONYMOUS_LOGIN=yes networks: stock-net: ipv4_address: 172.28.1.1 kafka: image: bitnami/kafka:latest ports: - "9092:9092" environment: - KAFKA_BROKER_ID=1 - KAFKA_CFG_LISTENERS=PLAINTEXT://:9092 - KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://172.28.1.2:9092 - KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181 - ALLOW_PLAINTEXT_LISTENER=yes depends_on: - zookeeper networks: stock-net: ipv4_address: 172.28.1.2 volumes: - ./scripts/init-kafka.sh:/init-kafka.sh # entrypoint: ["/bin/bash", "init-kafka.sh"] restart: always cassandra: image: cassandra:latest ports: - "9042:9042" volumes: - ./init-cassandra:/init-cassandra - ./scripts/init-cassandra-schema.sh:/init-cassandra-schema.sh environment: - CASSANDRA_START_RPC=true networks: stock-net: ipv4_address: 172.28.1.3 # entrypoint: ["/bin/bash", "init-cassandra-schema.sh"] restart: always spark: image: bitnami/spark:latest volumes: - ./spark:/opt/bitnami/spark/jobs - ./scripts/submit-spark-job.sh:/opt/bitnami/spark/submit-spark-job.sh ports: - "8080:8080" depends_on: - kafka networks: stock-net: ipv4_address: 172.28.1.4 # entrypoint: ["sh", "-c", "./submit-spark-job.sh"] restart: always kafka_producer: build: context: ./kafka-producer dockerfile: kafka_producer.dockerfile depends_on: - kafka networks: stock-net: ipv4_address: 172.28.1.8 restart: always plotly: build: context: ./plotly dockerfile: plotly.dockerfile volumes: - ./plotly/dashboard.py:/dashboard.py ports: - "8050:8050" depends_on: - cassandra networks: stock-net: ipv4_address: 172.28.1.9 restart: always networks: stock-net: driver: bridge ipam: config: - subnet: 172.28.0.0/16
-
Run Docker Compose:
docker-compose up -d
-
Run the Spark Job: Use the
spark-submit
command to run your Spark job.$ spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 spark_job.py stocks
-
Produce and Consume Data: Start producing data to the
stocks
topic and monitor the pipeline's output.
Check the logs for each service in their respective directories for monitoring and debugging.
To run the dashbaord, you need to run the following command:
$ cd plotly & python3 dashboard.py
Contributions to RealTime StockStream are welcome, just open a PR π.
This project is licensed under the MIT License.