# spark-postgres-etl

Dockerised Apache Spark 2.x job that pulls MovieLens data, processes it, and stores the results in a PostgreSQL database.

## Setting up the environment

Requirements:

- git
- docker
- docker-compose
  1. Clone the code base from this GitHub repo:

    git clone https://github.com/mithunmanohar/spark-postgres-etl.git

  2. Change into the project directory:

    cd spark-postgres-etl

  3. Build and start the Spark containers using the Docker Compose file in this folder:

    docker-compose up --build -d

  4. Open a shell in the Spark master container:

    docker exec -it sparkpostgresetl_master_1 /bin/bash

  5. Move to the `/spark_etl` folder:

    cd /spark_etl

  6. Install the Python requirements with pip:

    pip install -r requirements.txt

  7. Run the script that pulls the data locally, creates the database, and inserts the default schemas:

    python etl_manager.py

  8. Run the first Spark job using spark-submit:

    /usr/spark-2.4.0/bin/spark-submit --driver-class-path /spark_etl/data_manager/postgresql-42.2.5.jar /spark_etl/jobs/top_movies_per_decade.py
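The ranking that `top_movies_per_decade.py` computes can be sketched in plain Python. This is a minimal illustration of the grouping logic, not the actual job: the sample records are hypothetical, and the real job works on Spark DataFrames built from the MovieLens files.

```python
from collections import defaultdict

# Hypothetical (movie_id, movie_name, category, year, rating) records;
# the real job reads these from the MovieLens dataset.
ratings = [
    (1, "Toy Story", "Animation", 1995, 4.5),
    (2, "Jumanji", "Adventure", 1995, 3.0),
    (3, "Heat", "Action", 1995, 4.0),
    (4, "GoldenEye", "Action", 1995, 3.5),
]

def top_movies_per_decade(records, top_n=1):
    """Group ratings by (decade, category) and rank movies by average rating."""
    sums = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))  # (decade, cat) -> movie -> [total, count]
    names = {}
    for movie_id, name, category, year, rating in records:
        decade = year // 10 * 10          # e.g. 1995 -> 1990
        acc = sums[(decade, category)][movie_id]
        acc[0] += rating
        acc[1] += 1
        names[movie_id] = name
    ranked = {}
    for key, per_movie in sums.items():
        avgs = [(mid, total / count) for mid, (total, count) in per_movie.items()]
        avgs.sort(key=lambda x: x[1], reverse=True)
        ranked[key] = [(rank + 1, mid, names[mid])
                       for rank, (mid, _) in enumerate(avgs[:top_n])]
    return ranked

result = top_movies_per_decade(ratings)
# Action in the 1990s: Heat (avg 4.0) outranks GoldenEye (avg 3.5)
print(result[(1990, "Action")])   # [(1, 3, 'Heat')]
```

In the real job the same idea maps to a `groupBy` on decade and category followed by an average-rating aggregation and a rank over each group.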

## Database schema

Two tables store the results: movie categories and decade-wise movie ranks.

  1. t_movie_category: stores the movie categories

    id SERIAL PRIMARY KEY, category VARCHAR(50) NOT NULL, category_id INT NOT NULL

  2. t_movie_rank: stores decade-wise movie ranks per category

    id SERIAL PRIMARY KEY, decade INT NOT NULL, category_id INT NOT NULL, rank INT NOT NULL, movie_id INT NOT NULL, movie_name VARCHAR(256)
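A row bound for `t_movie_rank` can be modelled as a plain dict mirroring the columns above. This is a hypothetical sketch of the row shape only; the actual job writes through the PostgreSQL JDBC driver, and `id` is a SERIAL column assigned by the database.

```python
def make_rank_row(decade, category_id, rank, movie_id, movie_name):
    """Build a dict matching the t_movie_rank columns (id is SERIAL, DB-assigned)."""
    if len(movie_name) > 256:             # movie_name is VARCHAR(256)
        raise ValueError("movie_name exceeds VARCHAR(256)")
    return {
        "decade": decade,
        "category_id": category_id,
        "rank": rank,
        "movie_id": movie_id,
        "movie_name": movie_name,
    }

row = make_rank_row(1990, 2, 1, 3, "Heat")
print(row)  # {'decade': 1990, 'category_id': 2, 'rank': 1, 'movie_id': 3, 'movie_name': 'Heat'}
```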

Output data: https://github.com/mithunmanohar/spark-postgres-etl/blob/master/spark_etl/output.txt