Dockerised Apache Spark 2.x job to pull MovieLens data, process it, and store the results in a PostgreSQL database.
Requirements:
- git
- docker
- docker-compose
Steps:
- Clone the code base from this GitHub repo:

  git clone https://github.com/mithunmanohar/spark-postgres-etl.git
- Change into the project directory:

  cd spark-postgres-etl
- Build and start the Spark container with the docker-compose file in this folder (a hypothetical sketch of such a file follows this step):

  docker-compose up --build -d
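The actual compose file ships with the repo; purely as orientation, a minimal sketch of its likely shape is below. The service names, images, volumes, and credentials here are assumptions (the container name used in the next step suggests a service called `master`):

```yaml
# Hypothetical sketch; see the repo for the real docker-compose.yml.
version: "2"
services:
  master:                        # assumed service name, implied by sparkpostgresetl_master_1
    build: .                     # assumed: a Dockerfile bundling Spark 2.4.0 and Python
    tty: true                    # keep the container alive so we can exec into it
    volumes:
      - ./spark_etl:/spark_etl   # assumed mount of the ETL code into the container
    depends_on:
      - postgres
  postgres:                      # assumed PostgreSQL service the jobs write to
    image: postgres:9.6
    environment:
      POSTGRES_USER: postgres        # placeholder credentials
      POSTGRES_PASSWORD: postgres
```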
- Log into the master container:

  docker exec -it sparkpostgresetl_master_1 /bin/bash
- Move to the /spark_etl folder:

  cd /spark_etl
- Install the Python requirements with pip:

  pip install -r requirements.txt
- Run the script that pulls the MovieLens data to local storage, creates the database, and inserts the default schemas and data (a sketch of what this step involves follows below):

  python etl_manager.py
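For orientation only, here is a minimal sketch of the kind of work such a script performs, assuming psycopg2 and the MovieLens ml-100k archive. The dataset URL, hostname, and credentials are placeholders, not the repo's actual values; the table definitions mirror the schema section below:

```python
# Hypothetical sketch of an ETL bootstrap script like etl_manager.py.
import io
import urllib.request
import zipfile

import psycopg2

# Assumed dataset; the repo may pull a different MovieLens release.
DATA_URL = "http://files.grouplens.org/datasets/movielens/ml-100k.zip"

# Pull the MovieLens archive and unpack it for the Spark jobs to read.
with urllib.request.urlopen(DATA_URL) as resp:
    zipfile.ZipFile(io.BytesIO(resp.read())).extractall("/spark_etl/data")

# Placeholder connection details; the compose file defines the real ones.
conn = psycopg2.connect(host="postgres", dbname="movielens",
                        user="postgres", password="postgres")
with conn, conn.cursor() as cur:
    # Table definitions taken from the "Database schema" section below.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS t_movie_category (
            id          SERIAL PRIMARY KEY,
            category    VARCHAR(50) NOT NULL,
            category_id INT NOT NULL
        )""")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS t_movie_rank (
            id          SERIAL PRIMARY KEY,
            decade      INT NOT NULL,
            category_id INT NOT NULL,
            rank        INT NOT NULL,
            movie_id    INT NOT NULL,
            movie_name  VARCHAR(256)
        )""")
conn.close()
```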
- Run the first Spark job using spark-submit:

  /usr/spark-2.4.0/bin/spark-submit --driver-class-path /spark_etl/data_manager/postgresql-42.2.5.jar /spark_etl/jobs/top_movies_per_decade.py
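The real job lives in the repo; as a sketch of the underlying technique (a window-function ranking per decade and category, followed by a JDBC write into t_movie_rank), it might look like the following. The input path, column names, top-10 cutoff, and connection details are all assumptions:

```python
# Hypothetical sketch of a job like jobs/top_movies_per_decade.py.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("top_movies_per_decade").getOrCreate()

# Assumed input: one row per (movie, category) with a release year and
# an average rating, prepared earlier in the pipeline.
movies = spark.read.csv("/spark_etl/data/movies.csv",
                        header=True, inferSchema=True)

# Derive the decade from the release year, e.g. 1994 -> 1990.
movies = movies.withColumn("decade", (F.col("year") / 10).cast("int") * 10)

# Rank movies within each (decade, category) partition by average rating.
w = Window.partitionBy("decade", "category_id").orderBy(F.desc("avg_rating"))
ranked = (movies
          .withColumn("rank", F.row_number().over(w))
          .filter(F.col("rank") <= 10)  # assumed top-10 cutoff per partition
          .select("decade", "category_id", "rank", "movie_id", "movie_name"))

# Append the result to t_movie_rank over JDBC; the driver jar is the one
# passed via --driver-class-path above. Connection details are placeholders.
(ranked.write.format("jdbc")
       .option("url", "jdbc:postgresql://postgres:5432/movielens")
       .option("dbtable", "t_movie_rank")
       .option("user", "postgres")
       .option("password", "postgres")
       .option("driver", "org.postgresql.Driver")
       .mode("append")
       .save())

spark.stop()
```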
Database schema: two tables, storing the movie categories and the decade-wise movie ranks.
- t_movie_category: stores the movie categories

  id SERIAL PRIMARY KEY, category VARCHAR(50) NOT NULL, category_id INT NOT NULL
- t_movie_rank: stores decade-wise movie ranks per category

  id SERIAL PRIMARY KEY, decade INT NOT NULL, category_id INT NOT NULL, rank INT NOT NULL, movie_id INT NOT NULL, movie_name VARCHAR(256)
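After the job has run, one quick sanity check is to join the two tables and print a few rows; a small example with psycopg2 (connection details are placeholders):

```python
# Hypothetical check of the job output; adjust connection details as needed.
import psycopg2

conn = psycopg2.connect(host="postgres", dbname="movielens",
                        user="postgres", password="postgres")
with conn.cursor() as cur:
    cur.execute("""
        SELECT r.decade, c.category, r.rank, r.movie_name
        FROM t_movie_rank r
        JOIN t_movie_category c ON c.category_id = r.category_id
        ORDER BY r.decade, c.category, r.rank
        LIMIT 20""")
    for row in cur.fetchall():
        print(row)
conn.close()
```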
Output data: https://github.com/mithunmanohar/spark-postgres-etl/blob/master/spark_etl/output.txt