- Create a work directory and make it the current directory, for example:
  ```shell
  mkdir ~/workspace
  cd ~/workspace
  ```
- Create the application configuration (application.conf); a sketch is shown below.
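  The actual configuration content is not included in this excerpt. Purely as a sketch, assuming the framework reads HOCON and that pipeline variables can be given defaults here, it might look like the following; every key name below is an assumption, not the framework's documented schema:
  ```hocon
  # Hypothetical application.conf -- key names are assumptions.
  # Spark settings the job may apply at runtime.
  spark {
    sql.shuffle.partitions = 8
  }
  # A variable referenced by the pipeline; the spark-submit step below
  # overrides it with --var application.process_date=...
  application {
    process_date = "20200921"
  }
  ```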
- Prepare the test data:
  ```shell
  mkdir -p data/users
  mkdir -p data/train
  ```
  Then place the users and train datasets under these two directories.
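  The test files themselves are not shown in this excerpt. Purely for illustration, hypothetical inputs could look like this (the column names are assumptions, chosen to match the SQL sketch in the next step):
  ```text
  # data/users/users.csv (hypothetical)
  user_id,gender,age
  u001,F,29
  u002,M,35

  # data/train/train.csv (hypothetical)
  user_id,item_id,rating,event_date
  u001,i100,4,20200921
  u002,i101,5,20200921
  ```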
- Create the SQL statement for transforming users & train:
  ```shell
  mkdir scripts
  ```
  Then create a scripts/transform-user-train.sql file; a sketch is shown below.
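  The actual SQL is not included in this excerpt. A minimal sketch follows, assuming the read steps register their inputs as views named users and train, that variables are substituted with ${...} syntax, and that the columns match the hypothetical sample data above:
  ```sql
  -- Hypothetical transformation: view, column, and variable names are
  -- assumptions, not the tutorial's actual schema.
  select
    u.user_id,
    u.gender,
    u.age,
    t.item_id,
    t.rating
  from users u
    join train t on u.user_id = t.user_id
  where t.event_date = '${application.process_date}'
  ```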
- Create the pipeline definition (pipeline_fileRead-fileWrite.xml); a sketch is shown below.
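  The pipeline body is omitted in this excerpt. Going by the file name pipeline_fileRead-fileWrite.xml and the steps above, a read, transform, write flow is sketched below; the element layout, actor class names (FileReader, SqlTransformer, FileWriter under com.qwshen.etl), and property names are all assumptions to be checked against the framework's documentation:
  ```xml
  <!-- Hypothetical pipeline sketch: element, actor, and property names
       are assumptions, not the framework's verified schema. -->
  <pipeline-def name="user-train">
    <jobs>
      <job name="transform-user-train">
        <action name="read-users">
          <actor type="com.qwshen.etl.source.FileReader">
            <properties>
              <format>csv</format>
              <fileUri>data/users</fileUri>
            </properties>
          </actor>
          <output-view name="users" />
        </action>
        <action name="read-train">
          <actor type="com.qwshen.etl.source.FileReader">
            <properties>
              <format>csv</format>
              <fileUri>data/train</fileUri>
            </properties>
          </actor>
          <output-view name="train" />
        </action>
        <action name="transform">
          <actor type="com.qwshen.etl.transform.SqlTransformer">
            <properties>
              <sqlFile>scripts/transform-user-train.sql</sqlFile>
            </properties>
          </actor>
          <output-view name="features" />
        </action>
        <action name="write-features">
          <actor type="com.qwshen.etl.sink.FileWriter">
            <properties>
              <format>csv</format>
              <fileUri>data/features</fileUri>
            </properties>
          </actor>
        </action>
      </job>
    </jobs>
  </pipeline-def>
  ```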
- Compile the project & copy the jar file (spark-etl-framework-xxx.jar) to the current directory; a sketch of this step is shown below.
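  The exact build commands are not given in this excerpt. Assuming a Maven build (substitute the equivalent sbt tasks if the project builds with sbt), the step might look like:
  ```shell
  # Hypothetical build step: the build tool and artifact path are assumptions.
  mvn clean package
  cp target/spark-etl-framework-*.jar ~/workspace/
  ```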
- Submit the job:
  ```shell
  spark-submit --master local --deploy-mode client \
    --name user-train \
    --conf spark.executor.memory=8g --conf spark.driver.memory=4g \
    --class com.qwshen.Launcher spark-etl-framework-xxx.jar \
    --pipeline-def ./pipeline_fileRead-fileWrite.xml \
    --application-conf ./application.conf \
    --var application.process_date=20200921
  ```
- Check & review the result in data/features, for example:
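  Assuming the pipeline wrote plain-text/CSV part files (an assumption carried over from the pipeline sketch above), a quick check from the shell:
  ```shell
  ls -l data/features       # Spark writes one part-* file per partition
  head data/features/part-*
  ```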