Spark Jupyter getting started docker compose #295
Conversation
I want to make sure this is something we want to do before proceeding to add more to the PR. cc @collado-mike / @flyrain
Makes sense to me. Thanks @kevinjqliu! Do we have any docs for its usage? We may add docs if not.
@flyrain yep, I'll have a README in here, similar to the Trino one
Sounds good. We will need these docs to be on the Polaris doc site, like this https://polaris.apache.org/docs/overview/. I couldn't find Trino's doc there; this may involve doc publishing and linking. cc @jbonofre
I see, this is the README for Trino. I'll add a similar README for Spark. As a follow-up, we can change the Polaris doc to refer to these guides: https://polaris.apache.org/docs/quickstart
This looks good to me. We should change the name of the compose file to just
@collado-mike makes sense, will do. I have a question on Slack about being unable to assume the role
Force-pushed from e8f2187 to 92a2ad5
r? @flyrain @RussellSpitzer @collado-mike. Also opened #319 to update the Polaris doc site once this is merged.
I'm a bit conflicted about this doc. It feels like it doesn't really teach the reader anything about Polaris, although it does give you a really fast way to get bootstrapped.
Yes, I'll admit this README is filler for now, a way to get Spark & Polaris up and running quickly
I wonder if it might be easier to use the CLI here
Might be, if you want a spark-shell. I think the Jupyter notebook does a good job of explaining a lot of the concepts
Sorry, I meant the `polaris` CLI instead of using curl
Ah, I don't know how to use the `polaris` CLI, so I just copied directly from https://github.com/apache/polaris/blob/main/regtests/run_spark_sql.sh
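For reference, the curl-based flow in that script first exchanges the root credentials for a bearer token. A rough Python sketch of that step (the endpoint path, scope, and credential placeholders are assumptions based on the Iceberg REST spec, not values taken from this PR):

```python
import requests

# Sketch of the token exchange the script's curl commands perform.
# The endpoint, scope, and credentials below are assumptions; check
# run_spark_sql.sh for the real values.
POLARIS_URI = "http://localhost:8181/api/catalog"  # assumed local Polaris endpoint

response = requests.post(
    f"{POLARIS_URI}/v1/oauth/tokens",
    data={
        "grant_type": "client_credentials",
        "client_id": "<root-client-id>",          # printed by the Polaris service
        "client_secret": "<root-client-secret>",
        "scope": "PRINCIPAL_ROLE:ALL",
    },
)
response.raise_for_status()
token = response.json()["access_token"]
print(token)  # bearer token for subsequent catalog API calls
```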
Force-pushed from e097bb3 to b75d998
The md check intermittently shows failures.
It's OK to remove the link for now since we’re transitioning to Hugo.
@flyrain just had to run the CI a few times, it's unrelated to this change
Force-pushed from b75d998 to 3eda72b
Force-pushed from 3eda72b to 48a9f00
```
@@ -41,5 +41,5 @@ jobs:
    with:
      use-quiet-mode: 'yes'
      config-file: '.github/workflows/check-md-link-config.json'
      folder-path: 'regtests, regtests/client/python/docs, regtests/client/python, .github, build-logic, polaris-core, polaris-service, extension, spec, k8, notebooks'
```
This PR moved `notebooks/` from the top-level directory into the `getting-started/` directory
Thanks @kevinjqliu for working on it. LGTM overall. Left some comments and questions.
getting-started/spark/README.md (Outdated)
This will spin up 3 container services
* The `polaris` service for running Apache Polaris
Nit: could we be more explicit that it starts with an in-memory metastore?
getting-started/spark/README.md (Outdated)
This will spin up 3 container services
* The `polaris` service for running Apache Polaris
* The `jupyter` service for running Jupyter notebook with PySpark
* The `create-polaris-catalog` service to run setup script and create local catalog in Polaris
`local catalog` -> `a catalog backed by the local file system`?
```
SPARK_BEARER_TOKEN="${REGTEST_ROOT_BEARER_TOKEN:-principal:root;realm:default-realm}"
POLARIS_CATALOG_NAME="${POLARIS_CATALOG_NAME:-polaris_demo}"

# create a catalog backed by the local filesystem
```
I'm not entirely sure if we need this file. Could we handle everything directly within the notebook, like the other operations in `SparkPolaris.ipynb`? Would it simplify things if we moved the operations there?
We could, but I think it's a good idea to separate infra code (this script) from application code (the notebook)
We could initialize the catalog in the notebook as well. I feel it's more flexible that way; for example, you don't have to worry about an env variable for the catalog name. But I'm OK with either one. Not a blocker for me.
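For concreteness, a hedged sketch of what initializing the catalog from a notebook cell might look like, assuming the Polaris management API's create-catalog endpoint; the URI, payload shape, and names here are illustrative, not taken from this PR:

```python
import requests

# Hypothetical notebook cell creating the catalog instead of the shell script.
# Endpoint and payload shape are assumptions based on the Polaris management API.
POLARIS_MGMT_URI = "http://polaris:8181/api/management/v1"  # assumed in-compose hostname
token = "<bearer-token>"  # obtained via the OAuth sketch earlier in the thread

catalog = {
    "catalog": {
        "name": "polaris_demo",
        "type": "INTERNAL",
        "properties": {"default-base-location": "file:///tmp/polaris/"},
        # FILE storage, i.e. "a catalog backed by the local file system"
        "storageConfigInfo": {"storageType": "FILE"},
    }
}
resp = requests.post(
    f"{POLARIS_MGMT_URI}/catalogs",
    headers={"Authorization": f"Bearer {token}"},
    json=catalog,
)
resp.raise_for_status()
```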
# Getting Started with Apache Spark and Apache Polaris

This getting started guide provides a `docker-compose` file to set up [Apache Spark](https://spark.apache.org/) with Apache Polaris. Apache Polaris is configured as an Iceberg REST Catalog in Spark.
There are other ways to try Spark with Polaris without Docker; it's not a blocker, we can add them later.
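As an aside for readers: the quoted README line above says Polaris is configured as an Iceberg REST Catalog in Spark. A minimal sketch of such a Spark session, assuming standard Iceberg REST catalog properties (the package version, URI, and credentials are placeholders, not values from this PR):

```python
from pyspark.sql import SparkSession

# Sketch of a Spark session with Polaris as an Iceberg REST catalog.
# Version, URI, catalog name, and credentials are assumptions.
spark = (
    SparkSession.builder.appName("polaris-demo")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.type", "rest")
    .config("spark.sql.catalog.polaris.uri", "http://polaris:8181/api/catalog")
    .config("spark.sql.catalog.polaris.credential", "<client-id>:<client-secret>")
    .config("spark.sql.catalog.polaris.scope", "PRINCIPAL_ROLE:ALL")
    .config("spark.sql.catalog.polaris.warehouse", "polaris_demo")
    .getOrCreate()
)

# Sanity check: list namespaces through the REST catalog
spark.sql("SHOW NAMESPACES IN polaris").show()
```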
Thanks for the review @flyrain, addressed your comments
We cannot merge any PR until #374 is merged.
Thanks for the heads-up, I'll rebase once that PR's merged
Force-pushed from a0f6c9a to 797fabb
@flyrain took your advice, moved
Thanks a lot for working on it, @kevinjqliu! Thanks all for the review.
Description
This PR moves the `docker-compose-jupyter.yml` file (and the `notebooks/` directory), formerly in the top-level directory, into the `getting-started/spark/` folder. The purpose is to unify the "getting started" guides into the same directory.
Fixes #110
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
1. Open the `SparkPolaris.ipynb` Jupyter notebook
2. Grab the `root principal credentials` from the Polaris service and replace in the notebook cell
3. Run all cells in the notebook
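For step 2, the credentials land in a notebook cell roughly like the following (a hypothetical cell; the actual variable names in `SparkPolaris.ipynb` may differ):

```python
# Hypothetical notebook cell -- replace with the root principal credentials
# printed by the Polaris service on startup.
polaris_credential = "<root-client-id>:<root-client-secret>"
client_id, client_secret = polaris_credential.split(":")
```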
Checklist:
Please delete options that are not relevant.