This repository contains a parameterized PySpark job for Torus custom dataset creation, along with scripts to deploy, run, and manage dependencies and configurations for execution within AWS EMR Serverless.
The entrypoint for the PySpark job for custom dataset generation is defined in `job.py`. Supporting modules are found in the `dataset` directory. To be invoked in the AWS EMR Serverless environment, these files must be deployed to and accessible from an S3 bucket. The `deploy.sh` script automates packaging and uploading the PySpark job script and its dependencies to this S3 bucket.
- Run the `deploy.sh` script from the root directory: `./deploy.sh`
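Conceptually, the packaging half of `deploy.sh` amounts to bundling the supporting `dataset` modules into an archive that can ship alongside `job.py`. The sketch below illustrates that idea in Python; the directory layout and archive name are assumptions for illustration, not the script's actual behavior:

```python
# Hypothetical sketch of the packaging step: zip every .py file under the
# `dataset` directory, preserving relative paths, so PySpark can import the
# modules from the archive. Names and layout are assumptions.
import os
import zipfile

def package_dependencies(source_dir: str, archive_path: str) -> list[str]:
    """Zip all .py files under source_dir; return the stored archive names."""
    packaged = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(source_dir):
            for name in sorted(files):
                if name.endswith(".py"):
                    full = os.path.join(root, name)
                    # Store paths relative to the parent of source_dir,
                    # e.g. "dataset/util.py"
                    arcname = os.path.relpath(full, os.path.dirname(source_dir))
                    zf.write(full, arcname)
                    packaged.append(arcname)
    return packaged
```

The resulting archive, together with `job.py`, would then be uploaded to the deployment S3 bucket.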
A job can be invoked manually from EMR Serverless Studio, or directly from the command line using one of two helper bash scripts. These scripts are wrappers around the AWS CLI, which must be installed first (https://aws.amazon.com/cli/).
- Run the `run_job.sh` script from the root directory with arguments for the action, the event subtypes, and the section ids: `./run_job.sh attempt_evaluated part_attempt_evaluated 2342,2343`
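Under the hood, a wrapper like `run_job.sh` ultimately issues an EMR Serverless `StartJobRun` request, passing the CLI arguments through to the Spark entrypoint. A minimal sketch of how that job-driver payload could be assembled in Python; the S3 bucket and script path here are placeholders, not the repository's actual values:

```python
# Hypothetical sketch: build the jobDriver payload that an EMR Serverless
# StartJobRun call (via boto3 or the AWS CLI) expects. The S3 path is a
# placeholder assumption.
def build_job_driver(action: str, subtypes: str, section_ids: str) -> dict:
    return {
        "sparkSubmit": {
            # Entry point uploaded by deploy.sh (bucket name assumed)
            "entryPoint": "s3://my-deploy-bucket/job.py",
            # Positional arguments mirroring ./run_job.sh's CLI arguments
            "entryPointArguments": [action, subtypes, section_ids],
        }
    }

driver = build_job_driver(
    "attempt_evaluated", "part_attempt_evaluated", "2342,2343"
)
```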
- Generate the context JSON file using `context.sql` and manually upload it to the `torus-datasets-prod` bucket in the `contexts` folder, named after the job id specified in the next step.
- Run the `run_datashop.sh` script from the root directory with arguments for the job id and the course section ids: `./run_datashop.sh 1922 2342,2343`
For the above to work, the context file must be named `1922.json` and be present in the `contexts` folder.
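The naming convention above can be captured in a small helper; a sketch (the helper itself is hypothetical, but the bucket and folder names come from the steps above):

```python
# Sketch of the Datashop context-file convention described above: the
# context for a given job id lives in the contexts/ folder of the
# torus-datasets-prod bucket and is named <job_id>.json.
def context_s3_uri(job_id: int, bucket: str = "torus-datasets-prod") -> str:
    return f"s3://{bucket}/contexts/{job_id}.json"
```

For job id 1922 this yields `s3://torus-datasets-prod/contexts/1922.json`, matching the `1922.json` requirement above.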
The dependencies needed by code running on the PySpark driver and executor nodes are supplied via a custom EMR Docker image. This image may need to be updated periodically as the feature set expands. The Dockerfile lives at `config/Dockerfile`, and the `update_image.sh` script automates building and deploying it.
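Custom EMR Serverless images typically extend an official EMR Serverless Spark base image and install the extra Python packages on top of it. A minimal sketch of what such a Dockerfile can look like; the base image tag and the package list are assumptions, not the repository's actual contents:

```dockerfile
# Hypothetical sketch of config/Dockerfile; the base image tag and the
# installed packages are assumptions.
FROM public.ecr.aws/emr-serverless/spark/emr-6.15.0:latest

USER root
# Install the extra Python dependencies needed by the dataset modules
RUN pip3 install pandas boto3

# EMR Serverless requires the image to run as the hadoop user
USER hadoop:hadoop
```

After building, `update_image.sh` would push the image to a container registry (such as ECR) where EMR Serverless can pull it.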