
Commit

Merge pull request #1 from Sanofi-Public/dev_readme_improvements
Update README.md
uvashisth authored Apr 22, 2024
2 parents 66169be + d776776 commit bd88746
Showing 2 changed files with 33 additions and 24 deletions.
31 changes: 19 additions & 12 deletions GETTING_STARTED.md
@@ -4,9 +4,9 @@ When you're ready to run and monitor a job in EMR, you may encounter various sce


## Scenario 1: Main Script Contains Pure PySpark Code and Is Standalone
### Recommendation
- There's no need to build the dependency package (library package or Python module package) using the `package-dependencies` command.
- Simply submit the job using the `run` command.
> [!TIP]
> - There's no need to build the dependency package (library package or Python module package) using the `package-dependencies` command.
> - Simply submit the job using the `run` command.
Example `emr_job.py` script:
```python
@@ -92,9 +92,13 @@ print(emr_job_id)
```

## Scenario 2: Main Script depends on external libraries not installed in EMR and is Standalone
### Recommendation
- Build the dependency package (that only includes library package) using `package-dependencies` command
- Simply submit the job via `run` command

> [!NOTE]
> EMR only supports core libraries: [emr-release](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-app-versions-6.x.html). If you have external libraries required for your script, this is the scenario for you.
> [!TIP]
> * Build the dependency package (that only includes library package) using `package-dependencies` command.
> * Simply submit the job via `run` command

```python
@@ -220,9 +224,9 @@ print(emr_job_id)
```
## Scenario 3: Main Script is not standalone. Moreover, it requires dependent Python modules from the project, such as src/utils, logging.yaml, config, etc. These additional dependencies are in the project.

### Recommendation
- Build the dependency package (that only includes project package) using `package-dependencies` command
- Simply submit the job via `run` command
> [!TIP]
> - Build the dependency package (that only includes project package) using `package-dependencies` command
> - Simply submit the job via `run` command
```python
# main.py
@@ -333,9 +337,12 @@ print(emr_job_id)
```
## Scenario 4: The Main Script is not standalone. Additionally, it requires both Python modules and Python libraries in EMR

### Recommendation
- Build the dependency package (includes both project and library) using `package-dependencies` command
- Simply submit the job via `run` command
> [!NOTE]
> EMR only supports core libraries: [emr-release](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-app-versions-6.x.html). If you have external libraries required for your script, this is the scenario for you.
> [!TIP]
> - Build the dependency package (includes both project and library) using `package-dependencies` command
> - Simply submit the job via `run` command
```python
# main.py
26 changes: 14 additions & 12 deletions README.md
@@ -1,18 +1,18 @@
# EMRFLOW :cyclone:
# EMRFlow :cyclone:


<span style="color:purple;">**EMRFLOW** </span> is designed to simplify the process of running PySpark jobs on Amazon EMR (Elastic Map Reduce). It abstracts the complexities of interacting with EMR APIs and provides an intuitive command-line interface to effortlessly submit, monitor, and list your EMR PySpark jobs.
<span style="color:purple;">**EMRFlow** </span> is designed to simplify the process of running PySpark jobs on [Amazon EMR](https://aws.amazon.com/emr/) (Elastic Map Reduce). It abstracts the complexities of interacting with EMR APIs and provides an intuitive command-line interface and python library to effortlessly submit, monitor, and list your EMR PySpark jobs.

<span style="color:purple;">**EMRFLOW** </span> serves as both a library and a command-line tool.
<span style="color:purple;">**EMRFlow** </span> serves as both a library and a command-line tool.

To install `EMRFLOW`, please run:
To install `EMRFlow`, please run:

```bash
pip install emrflow
```
## Configuration

Create an `emr_serverless_config.json` file containing the specified details and store it in your workbench's home directory
Create an `emr_serverless_config.json` file containing the specified details and store it in your home directory
```json
{
"application_id": "",
@@ -22,8 +22,9 @@ Create an `emr_serverless_config.json` file containing the specified details and
```
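As a minimal sketch of the configuration step, the Python below writes an `emr_serverless_config.json` into the home directory. Only the `application_id` key is visible in this diff, so the sketch writes that key alone; the remaining keys must be taken from the full README. The helper name `write_config` and its `directory` parameter are hypothetical and not part of EMRFlow.

```python
import json
from pathlib import Path
from typing import Optional


def write_config(application_id: str, directory: Optional[Path] = None) -> Path:
    """Write emr_serverless_config.json and return its path.

    Hypothetical helper, not part of EMRFlow. Only `application_id` is
    written because it is the only key shown in the diff; add the other
    keys listed in the full README before using the file.
    """
    # Default to the user's home directory, as the README instructs.
    target_dir = directory if directory is not None else Path.home()
    config_path = target_dir / "emr_serverless_config.json"
    config_path.write_text(json.dumps({"application_id": application_id}, indent=2))
    return config_path
```

Passing an explicit `directory` keeps the helper easy to try out without touching the real home directory.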

## Usage
Please read the [GETTING STARTED](GETTING_STARTED.md) to integrate <span style="color:purple;">**EMRFlow** </span> into your project.

<span style="color:purple;">**EMRFLOW** </span> offers several commands to manage your Pypark jobs. Let's explore some of the key functionalities:
<span style="color:purple;">**EMRFlow** </span> offers several commands to manage your Pypark jobs. Let's explore some key functionalities:


### Help
@@ -34,18 +35,21 @@ emrflow serverless --help


### Package Dependencies

You will need to package dependencies before running an EMR job if you have external libraries needing to be installed or local imports from your code base. See Scenario 2-4 in [GETTING STARTED](GETTING_STARTED.md).
```bash
emrflow serverless package-dependencies --help
```
![Serverless Options](images/emr-serverless-package-dependencies-help.png)
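
As a rough illustration of how the four scenarios in GETTING_STARTED.md map onto the packaging decision, the sketch below encodes them as two yes/no questions. The names `Plan` and `choose_scenario` are hypothetical and not part of EMRFlow; only the scenario numbering and the library/project distinction come from the guide.

```python
from typing import NamedTuple


class Plan(NamedTuple):
    """Which GETTING_STARTED scenario applies and what to package."""
    scenario: int
    package_libraries: bool  # external libraries not installed in EMR
    package_project: bool    # local project modules (e.g. src/utils, config)


def choose_scenario(needs_external_libraries: bool, needs_project_modules: bool) -> Plan:
    """Map the two questions onto Scenarios 1-4 (hypothetical helper)."""
    if needs_external_libraries and needs_project_modules:
        scenario = 4  # package both project and library
    elif needs_project_modules:
        scenario = 3  # package the project only
    elif needs_external_libraries:
        scenario = 2  # package the libraries only
    else:
        scenario = 1  # standalone pure PySpark: no packaging needed
    return Plan(scenario, needs_external_libraries, needs_project_modules)
```

When both fields of the returned `Plan` are false, the `package-dependencies` step can be skipped and the job submitted directly with `run`.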




### Submit PySpark Job
```bash
emrflow serverless run --help
```
![Serverless Options](images/emr-serverless-run-help.png)


### Submit PySpark Job
```bash
emrflow serverless run \
--job-name "<job-name>" \
@@ -105,15 +109,13 @@ emr_job_id = emr_serverless.run(
print(emr_job_id)
```

Please read the [GETTING STARTED](GETTING_STARTED.md) to integrated <span style="color:purple;">**EMRFLOW** </span> into your project


**And so much more.......!!!**


## Contributing

We welcome contributions to EMRFLOW. Please open issue and discussing the change you would like to see. Creating a feature branch to work on that issue.
We welcome contributions to EMRFlow. Please open an issue discussing the change you would like to see. Create a feature branch to work on that issue and open a Pull Request once it is ready for review.

### Code style

