Spark operator + jupyter notebook? #1652

Closed
tamis-laan opened this issue Dec 6, 2022 · 13 comments

@tamis-laan

We are running the Spark k8s operator in production to process data using the YAML spec. This works great, but we also want to do exploratory data analysis using Jupyter notebooks. Is this possible with the Spark k8s operator?

@tafaust commented Jan 2, 2023

@tamis-laan you can do it with the https://github.com/jupyter-server/enterprise_gateway backend for Jupyter.
Side note: enterprise-gateway does run on k8s as a backend for your JupyterHub, but it does not follow k8s best practices. The kernel pods do NOT start a service. I have not been able to get k8s service meshes (Istio in my case) running yet. If that's not a requirement for you, you're good to go. If you use the Istio service mesh and are able to fix it - I'm all ears. :)

Here is my enterprise-gateway istio issue for reference: jupyter-server/enterprise_gateway#1168
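
For reference, pointing the notebook server at a running gateway is essentially a one-line configuration. A minimal sketch, assuming the gateway is exposed in-cluster (the URL is a placeholder for your actual enterprise-gateway service address):

# jupyter_notebook_config.py -- route kernel management to the gateway instead of starting kernels locally
c.GatewayClient.url = "http://enterprise-gateway.enterprise-gateway.svc.cluster.local:8888"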

@tamis-laan (Author)

> @tamis-laan you can do it with the https://github.com/jupyter-server/enterprise_gateway backend for Jupyter. Side note: enterprise-gateway does run on k8s as a backend for your JupyterHub, but it does not follow k8s best practices. The kernel pods do NOT start a service. I have not been able to get k8s service meshes (Istio in my case) running yet. If that's not a requirement for you, you're good to go. If you use the Istio service mesh and are able to fix it - I'm all ears. :)
>
> Here is my enterprise-gateway Istio issue for reference: jupyter-server/enterprise_gateway#1168

I discovered Spark also allows for executing jobs directly on kubernetes:
https://spark.apache.org/docs/latest/running-on-kubernetes.html

When you use spark-submit and literally point Spark at your Kubernetes API server, k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>, it will start workers there as pods. So when you run JupyterHub in your cluster, it should be possible to use PySpark to start jobs on the cluster directly.
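
A rough sketch of what that could look like from a notebook, for illustration only (the app name is made up, and note that in client mode the executors must also be able to reach the driver, typically via spark.driver.host pointing at a headless service, as a later comment in this thread shows):

from pyspark.sql import SparkSession

# Client mode: the driver runs inside the notebook pod, executors are created as pods.
spark = (
    SparkSession.builder.appName("notebook-exploration")
    .master("k8s://https://kubernetes.default.svc:443")
    .config("spark.submit.deployMode", "client")
    .config("spark.kubernetes.container.image", "ghcr.io/apache/spark-docker/spark:3.5.0")
    .getOrCreate()
)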

So I'm not sure how the Jupyter Enterprise Gateway differs from this setup. Also, it says it doesn't provide JupyterHub, so it's not possible to have multiple users with multiple notebooks. Which one is better/preferred?

@tafaust commented Jan 2, 2023

JupyterHub extends Jupyter notebooks (see https://zero-to-jupyterhub.readthedocs.io/en/stable/_images/architecture.png).

Thus, JupyterHub starts a JupyterLab or plain Jupyter Notebook server for your user within a pod and does some management around that.
A Jupyter notebook can have a remote backend (https://jupyter-notebook.readthedocs.io/en/stable/public_server.html#using-a-gateway-server-for-kernel-management), and enterprise-gateway is an implementation of that. enterprise-gateway runs on various resource managers, including Kubernetes. It acts as a sort of operator that spawns your kernels as pods in Kubernetes (allowing for horizontal scaling of your kernel, e.g. a Python runtime).

> I discovered Spark also allows for executing jobs directly on kubernetes:
> https://spark.apache.org/docs/latest/running-on-kubernetes.html

Yes. Upstream Apache Spark can spawn a driver, which in turn spawns N executors. The executors run your code. However, they do not have an interactive mode; Jupyter notebooks are interactive in that they spawn a runtime rather than running a script.

> When you use spark-submit and literally point Spark at your Kubernetes API server, k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>, it will start workers there as pods. So when you run JupyterHub in your cluster, it should be possible to use PySpark to start jobs on the cluster directly.

See my argument regarding interactivity. AFAIU it won't work.

However, if you want to run individual Spark jobs from your Jupyter notebook (instead of running your Jupyter notebook kernel in Spark), check out:
https://github.com/TIBCOSoftware/snappy-on-k8s/blob/master/charts/jupyter-with-spark/README.md

@Shrinjay

@tahesse Thanks for providing a starting point! I'm curious how the Enterprise Gateway kernel would work with the operator, though. I understand that Enterprise Gateway would allow us to run the Jupyter kernel in a k8s cluster, but how would this enable us to submit jobs using the operator? Would the kernel generate job manifests?

@avishayse

Following -- looking forward to this.
How do you connect the operator when I want to run this at large scale with many users? -- Seems that zero-to-jupyterhub is a good fit.
Can I execute the code in Jupyter in cluster mode as well? Thanks.

@tafaust commented Jan 15, 2023

> @tahesse Thanks for providing a starting point! I'm curious how the Enterprise Gateway kernel would work with the operator, though. I understand that Enterprise Gateway would allow us to run the Jupyter kernel in a k8s cluster, but how would this enable us to submit jobs using the operator? Would the kernel generate job manifests?

@Shrinjay WDYM with job manifests? AFAIU, enterprise-gateway is an operator itself. Your JupyterHub or JupyterLab (frontend-wise) communicates with enterprise-gateway if configured properly.
In the operator's kernel launchers (here: https://github.com/jupyter-server/enterprise_gateway/blob/main/etc/kernel-launchers/operators/scripts/launch_custom_resource.py#L66-L68) it loads the declaration template (https://github.com/jupyter-server/enterprise_gateway/blob/main/etc/kernel-launchers/operators/scripts/sparkoperator.k8s.io-v1beta2.yaml.j2), which is then submitted to Kubernetes.
The Spark job spawns a Jupyter kernel (https://github.com/jupyter-server/enterprise_gateway/blob/main/etc/kernel-launchers/python/scripts/launch_ipykernel.py), which keeps communicating with enterprise-gateway via a socket.
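
In other words, after rendering that Jinja2 template the launcher creates a SparkApplication custom resource through the Kubernetes API, roughly like the following sketch (the field values here are made-up placeholders, not the template's actual contents):

from kubernetes import client, config

config.load_incluster_config()  # the launcher runs inside the cluster

# Hypothetical, minimal SparkApplication; the real one comes from the Jinja2 template above.
spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "example-kernel", "namespace": "default"},
    "spec": {
        "type": "Python",
        "mode": "cluster",
        "image": "example/spark-python-kernel:latest",  # placeholder image
        "mainApplicationFile": "local:///opt/launch_ipykernel.py",  # placeholder path
        "sparkVersion": "3.1.1",
        "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},
        "executor": {"cores": 1, "instances": 1, "memory": "1g"},
    },
}

# Create the custom resource; the spark operator then spawns driver and executor pods.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="default",
    plural="sparkapplications",
    body=spark_app,
)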

Note that they do not spawn a service with the jupyter kernel pod (I thus didn't manage to make it work with istio for that very reason... but I'm out of ideas right now).

I hope my explanation clears up some of the confusion.

@Wh1isper commented Jul 27, 2023

Perhaps what you need is a PySpark SparkSession connecting as a client to a Spark Connect server, available since Spark 3.4.0: https://spark.apache.org/docs/latest/spark-connect-overview.html
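
A minimal sketch of the client side, assuming a Spark Connect server is already reachable (the hostname is a placeholder; 15002 is the default Spark Connect port):

from pyspark.sql import SparkSession

# Requires pyspark>=3.4 with the connect extras installed (pip install "pyspark[connect]").
spark = SparkSession.builder.remote("sc://spark-connect.example.svc.cluster.local:15002").getOrCreate()
spark.range(10).show()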

I've developed a module for deploying the latest 3.4.0 server-client mode on k8s, with support for configuring a PySpark session for direct connections. How about checking it out?
Wh1isper/sparglim#spark-connect-server-on-k8s

Alternatively, a PySpark session can be deployed in client mode on k8s, also available at https://github.com/Wh1isper/sparglim#pyspark-app

@JWDobken commented Apr 9, 2024

I got this working quite simply.

Following this explainer about running Spark in client mode: https://medium.com/@sephinreji98/understanding-spark-cluster-modes-client-vs-cluster-vs-local-d3c41ea96073

Deploy the Jupyter Spark manifest

Include a headless service to run in client mode and provide the spark service account to the deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyter
  labels:
    app: jupyter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jupyter
  template:
    metadata:
      labels:
        app: jupyter
    spec:
      containers:
        - name: jupyter
          image: jupyter/pyspark-notebook:spark-3.5.0
          resources:
            requests:
              memory: 4096Mi
            limits:
              memory: 4096Mi
          env:
            - name: JUPYTER_PORT
              value: "8888"
          ports:
            - containerPort: 8888
      serviceAccount: spark
      serviceAccountName: spark
---
kind: Service
apiVersion: v1
metadata:
  name: jupyter
spec:
  type: ClusterIP
  selector:
    app: jupyter
  ports:
    - protocol: TCP
      port: 8888
      targetPort: 8888
---
kind: Service
apiVersion: v1
metadata:
  name: jupyter-headless
spec:
  clusterIP: None
  selector:
    app: jupyter

You can port-forward the jupyter service on port 8888 and use the access token from the logs.

Connecting to Spark Operator

I got all configs from the documentation:

import os
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("JupyterApp")
    .master("k8s://https://kubernetes.default.svc.cluster.local:443")
    .config("spark.submit.deployMode", "client")
    .config("spark.executor.instances", "1")
    .config("spark.executor.memory", "1G")
    .config("spark.driver.memory", "1G")
    .config("spark.executor.cores", "1")
    .config("spark.kubernetes.namespace", "default")
    .config(
        "spark.kubernetes.container.image", "ghcr.io/apache/spark-docker/spark:3.5.0"
    )
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .config("spark.kubernetes.driver.pod.name", os.environ["HOSTNAME"])
    .config("spark.driver.bindAddress", "0.0.0.0")
    .config("spark.driver.host", "jupyter-headless.default.svc.cluster.local")
    .getOrCreate()
)

This will create the executor pod with jupyter as the client:

❯ kubectl get po -n default
NAME                                 READY   STATUS      RESTARTS   AGE
jupyter-7495cfdddc-864rd             1/1     Running     0          2m7s
jupyterapp-d3c7258ec363aa87-exec-1   1/1     Running     0          88s
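
To sanity-check that work actually runs on the executor pod (and not just locally in the notebook), a quick smoke test with the session created above:

# Forces a distributed job; the result comes back from the executor(s).
spark.range(1_000_000).selectExpr("sum(id)").show()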

If you have any issues, questions, and/or improvements, let me know!

@pivettamarcos

@JWDobken It works, but it doesn't communicate with the operator pod, is that right? It seems I don't even need it running.

@Wh1isper commented Apr 20, 2024

@pivettamarcos

> but doesn't communicate with the operator pod, is that right?

If I understand correctly, it's using client mode as said, which is a different mode than submitting tasks, so naturally there's no operator involved.

I've been using this feature since Spark 3.1.2. If you need to build services via Spark, this has lower latency than submitting tasks, but it is more costly to maintain (you may need to design the task's message queue). I built a simple gRPC data sampling service: https://github.com/Wh1isper/pyspark-sampling/ and provide an SDK for using Spark in client mode or deploying a Connect service on k8s: https://github.com/Wh1isper/sparglim

If you want to remotely use Spark deployed on a k8s cluster from Jupyter (via client-server mode), I highly recommend you try https://github.com/Wh1isper/sparglim, given that I haven't seen any official documentation for this at this point (if there is any, thanks to anyone who can point me to it!).

@parthweprom

How do I delete the pod once it is in an Error state? @JWDobken

github-actions bot commented Nov 5, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot commented

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions bot closed this as not planned (stale) on Nov 25, 2024.