
Custom Dataset Creation

This repository contains a parameterized PySpark job for Torus custom dataset creation, along with scripts to deploy, run, and manage dependencies and configurations for executing within AWS EMR Serverless.


Table of Contents

  1. Deployment
  2. Running the PySpark Job
  3. Updating the Custom Docker Image

Deployment

The entry point for the custom dataset generation PySpark job is defined in job.py, with supporting modules in the dataset directory. To be invoked in the AWS EMR Serverless environment, these files must be deployed to an S3 bucket where the job can access them.

The deploy.sh script automates packaging and uploading the PySpark job script and dependencies to this S3 bucket.

Steps to Deploy:

  1. Run the deploy.sh script from the root directory:
    ./deploy.sh
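
For reference, the script amounts to roughly the following (a sketch only; the actual bucket name and paths are configured inside deploy.sh):

    # Hypothetical sketch of deploy.sh -- the bucket name is a placeholder.
    BUCKET=s3://<your-dataset-bucket>
    # Package the supporting modules so EMR Serverless can load them.
    zip -r dataset.zip dataset/
    # Upload the entry point script and the packaged dependencies.
    aws s3 cp job.py "$BUCKET/job.py"
    aws s3 cp dataset.zip "$BUCKET/dataset.zip"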

Running the PySpark Job

A job can be invoked manually from EMR Serverless Studio, or directly from the command line using one of two helper Bash scripts. These scripts are wrappers around the AWS CLI, which must be installed first (https://aws.amazon.com/cli/).

Steps to Run a CSV raw data job:

  1. Run the run_job.sh script from the root directory with arguments for the action, event subtypes, and section ids:
    ./run_job.sh attempt_evaluated part_attempt_evaluated 2342,2343
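
Under the hood, a wrapper like this starts an EMR Serverless job run via the AWS CLI. The call below is a sketch; the application id, execution role ARN, and bucket are placeholders rather than the script's actual values:

    # Sketch of the underlying AWS CLI invocation; the id, ARN, and bucket are placeholders.
    aws emr-serverless start-job-run \
      --application-id <application-id> \
      --execution-role-arn <execution-role-arn> \
      --job-driver '{
        "sparkSubmit": {
          "entryPoint": "s3://<bucket>/job.py",
          "entryPointArguments": ["attempt_evaluated", "part_attempt_evaluated", "2342,2343"]
        }
      }'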

Steps to Run a Datashop XML job:

  1. Generate the context JSON file using context.sql and manually upload it to the torus-datasets-prod bucket in the contexts folder, named after the job id specified in the next step (see the upload example below).
  2. Run the run_datashop.sh script from the root directory with arguments for the job id and the course section ids:
    ./run_datashop.sh 1922 2342,2343

For the above to work, the context file must be named 1922.json and be present in the contexts folder.
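
The upload in step 1 can be done with the AWS CLI, for example:

    # Upload the generated context file, named after the job id (1922 in this example).
    aws s3 cp 1922.json s3://torus-datasets-prod/contexts/1922.json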

Updating the Custom Docker Image

The dependencies needed by code executing on PySpark worker and executor nodes are supplied via a custom EMR Docker image. This image may need periodic updates as the feature set expands. The Dockerfile lives at config/Dockerfile, and the update_image.sh script automates building and deploying it.
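
The update amounts to roughly a Docker build followed by a push to a container registry. The sketch below assumes an ECR repository, with placeholder account, region, and image names rather than the values configured in update_image.sh:

    # Hypothetical sketch of what update_image.sh automates; URI values are placeholders.
    docker build -t custom-emr-image -f config/Dockerfile .
    # Authenticate Docker with the ECR registry, then tag and push the image.
    aws ecr get-login-password --region <region> | \
      docker login --username AWS --password-stdin <account>.dkr.ecr.<region>.amazonaws.com
    docker tag custom-emr-image <account>.dkr.ecr.<region>.amazonaws.com/custom-emr-image:latest
    docker push <account>.dkr.ecr.<region>.amazonaws.com/custom-emr-image:latest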
