Spark operator + jupyter notebook? #1652
Comments
@tamis-laan you can do it with the https://github.com/jupyter-server/enterprise_gateway backend for Jupyter. Here is my enterprise-gateway istio issue for reference: jupyter-server/enterprise_gateway#1168
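For context, a notebook server is usually pointed at Enterprise Gateway through its gateway client settings; a minimal sketch (the gateway service address below is a placeholder, not something from this thread):

```python
# jupyter_server_config.py -- sketch only; assumes Enterprise Gateway is already
# deployed in the cluster and exposed at this placeholder address.
c.GatewayClient.url = "http://enterprise-gateway.jupyter.svc.cluster.local:8888"

# Kernels (e.g. a PySpark-on-Kubernetes kernelspec shipped with Enterprise Gateway)
# are then launched by the gateway as pods in the cluster rather than as local
# processes next to the notebook server.
```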
I discovered Spark also allows for executing jobs directly on Kubernetes: when you use [...] So I'm not sure how the [...]
JupyterHub extends Jupyter notebooks (see https://zero-to-jupyterhub.readthedocs.io/en/stable/_images/architecture.png). Thus, JupyterHub starts a JupyterLab or plain Jupyter Notebook for your user within a pod and does some management around that.
Yes. Upstream Apache Spark can spawn a driver which in turn spawns N executors. The executors run your code; however, they do not have an interactive mode. Jupyter Notebooks are interactive in that they spawn a runtime rather than running a script.
See my argument regarding interactivity; AFAIU it won't work. However, if you want to run individual Spark jobs from your Jupyter Notebook (instead of running your Jupyter Notebook kernel in Spark), check out:
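One way to run individual jobs from a notebook is to have it create `SparkApplication` objects for the operator through the Kubernetes API. A minimal sketch using the official Kubernetes Python client, assuming the operator's `v1beta2` CRD is installed and a `spark` service account exists; the image, script path, and names are hypothetical:

```python
from kubernetes import client, config

config.load_incluster_config()  # notebook pod credentials; use load_kube_config() outside the cluster

# Hypothetical job definition -- image, script path, and resource sizes are placeholders.
spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "notebook-job", "namespace": "default"},
    "spec": {
        "type": "Python",
        "mode": "cluster",
        "image": "ghcr.io/apache/spark-docker/spark:3.5.0",
        "mainApplicationFile": "local:///opt/spark/work-dir/job.py",
        "sparkVersion": "3.5.0",
        "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},
        "executor": {"cores": 1, "instances": 1, "memory": "1g"},
    },
}

# Submit the job as a custom resource; the operator picks it up from there.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="default",
    plural="sparkapplications",
    body=spark_app,
)
```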
@tahesse Thanks for providing a starting point! I'm curious as to how the enterprise gateway kernel would work with the operator though? I understand that enterprise gateway would allow us to run the Jupyter kernel in a k8s cluster, but how would this enable us to submit jobs using the operator? Would the kernel generate job manifests?
Following -- looking forward to this.
@Shrinjay WDYM with job manifests? AFAIU enterprise-gateway is an operator itself. Your JupyterHub or JupyterLab (frontend-wise) communicates with enterprise-gateway if configured properly. Note that they do not spawn a Service with the Jupyter kernel pod (I thus didn't manage to make it work with Istio for that very reason... but I'm out of ideas right now). I hope my explanation clears up some of the confusion.
Perhaps what you need is a PySpark SparkSession connecting as a client to a Spark Connect server, available since Spark 3.4.0 (server-client mode): https://spark.apache.org/docs/latest/spark-connect-overview.html I've developed a module for deploying the latest 3.4.0 server-client mode on k8s, which supports configuring a PySpark session for direct connections. How about checking it out? Alternatively, a PySpark session can be deployed in client mode on k8s, also available in https://github.com/Wh1isper/sparglim#pyspark-app
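For reference, once a Spark Connect server is reachable from the notebook, the client side is just a remote session. A minimal sketch, assuming `pyspark>=3.4` with the Connect extras installed; the service address is a placeholder:

```python
from pyspark.sql import SparkSession

# Placeholder address: point this at wherever the Spark Connect server is exposed
# in the cluster; 15002 is the default Spark Connect port.
spark = SparkSession.builder.remote(
    "sc://spark-connect.default.svc.cluster.local:15002"
).getOrCreate()

# The session behaves like a regular SparkSession, but execution happens server-side.
spark.range(10).selectExpr("id", "id * 2 AS doubled").show()
```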
I got this working quite simply, following this explainer about running Spark in client mode: https://medium.com/@sephinreji98/understanding-spark-cluster-modes-client-vs-cluster-vs-local-d3c41ea96073

**Deploy Jupyter**

Spark manifest: include a headless service so the executors can reach the driver running in the notebook pod (client mode):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyter
  labels:
    app: jupyter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jupyter
  template:
    metadata:
      labels:
        app: jupyter
    spec:
      containers:
        - name: jupyter
          image: jupyter/pyspark-notebook:spark-3.5.0
          resources:
            requests:
              memory: 4096Mi
            limits:
              memory: 4096Mi
          env:
            - name: JUPYTER_PORT
              value: "8888"
          ports:
            - containerPort: 8888
      serviceAccount: spark
      serviceAccountName: spark
---
kind: Service
apiVersion: v1
metadata:
  name: jupyter
spec:
  type: ClusterIP
  selector:
    app: jupyter
  ports:
    - protocol: TCP
      port: 8888
      targetPort: 8888
---
kind: Service
apiVersion: v1
metadata:
  name: jupyter-headless
spec:
  clusterIP: None
  selector:
    app: jupyter
```

You can port-forward the `jupyter` service on port `8888` to reach the notebook.

**Connecting to the Spark Operator**

I got all configs from the documentation:

```python
import os
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("JupyterApp")
    # talk to the Kubernetes API server as the cluster master
    .master("k8s://https://kubernetes.default.svc.cluster.local:443")
    .config("spark.submit.deployMode", "client")
    .config("spark.executor.instances", "1")
    .config("spark.executor.memory", "1G")
    .config("spark.driver.memory", "1G")
    .config("spark.executor.cores", "1")
    .config("spark.kubernetes.namespace", "default")
    .config(
        "spark.kubernetes.container.image", "ghcr.io/apache/spark-docker/spark:3.5.0"
    )
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    # this notebook pod acts as the driver
    .config("spark.kubernetes.driver.pod.name", os.environ["HOSTNAME"])
    .config("spark.driver.bindAddress", "0.0.0.0")
    # executors reach the driver via the headless service
    .config("spark.driver.host", "jupyter-headless.default.svc.cluster.local")
    .getOrCreate()
)
```

This will create the executor pod with Jupyter as the client:

```
❯ kubectl get po -n default
NAME                                 READY   STATUS    RESTARTS   AGE
jupyter-7495cfdddc-864rd             1/1     Running   0          2m7s
jupyterapp-d3c7258ec363aa87-exec-1   1/1     Running   0          88s
```

If you have any issues, questions and/or improvements, let me know!
@JWDobken It works, but it doesn't communicate with the operator pod, is that right? I don't even need it running, it seems.
If I understand correctly, it's using client mode. I've been using this feature since Spark 3.1.2; if you need to build services via Spark, this has lower latency than submitting tasks, but is more costly to maintain (you may need to design the task's message queue). I built a simple gRPC data sampling service: https://github.com/Wh1isper/pyspark-sampling/ and provide an SDK for using Spark in client mode or deploying a Connect service on k8s: https://github.com/Wh1isper/sparglim If you want to use Spark deployed on a k8s cluster remotely from Jupyter (via client-server mode), I highly recommend trying https://github.com/Wh1isper/sparglim, given that I haven't seen any official documentation for this at this point (if there is, thanks to anyone who can point me to it!).
How do I delete the pod once it is in an Error state? @JWDobken
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
We are running the Spark k8s operator in order to process data using the YAML spec in production. This works great, but we also want to do exploratory data analysis using Jupyter notebooks. Is this possible using the Spark k8s operator?