Question about High Availability for JEG on k8s #1156
Comments
Hi @chiawchen - yeah, the HA/DR machinery has not been fully resolved. It is primarily intended for hard failures, behaving more like …

It makes sense to make the automatic kernel shutdown sensitive to failover configuration, although I wonder if it should be an explicit option (so that we don't always orphan remote kernels), at least for now. Perhaps something like …

Also note that we now support …
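As a rough sketch of what such an explicit option guarding the automatic kernel shutdown might look like (the trait name, the `ShutdownPolicy` class, and the `is_remote()` helper are all hypothetical illustrations, not JEG's actual API):

```python
# Hypothetical sketch only -- the trait name, ShutdownPolicy class, and
# is_remote() helper are illustrations, not JEG's actual API.
from traitlets import Bool
from traitlets.config import Configurable


class ShutdownPolicy(Configurable):
    """Configurable guard for kernel shutdown during app termination."""

    leave_remote_kernels_on_shutdown = Bool(
        False,
        help="Skip remote-kernel shutdown on termination so a failover "
             "instance can re-hydrate them from the persistence store.",
    ).tag(config=True)


def is_remote(kernel_manager, kernel_id):
    # Placeholder: a real check would inspect the kernel's provisioner or
    # process proxy to distinguish pod-backed remote kernels from local ones.
    return True


def shutdown_kernels(policy, kernel_manager):
    """Shut down kernels at app exit, honoring the failover policy."""
    for kernel_id in list(kernel_manager.list_kernel_ids()):
        if policy.leave_remote_kernels_on_shutdown and is_remote(kernel_manager, kernel_id):
            continue  # leave the pod running; the persistence store keeps its session info
        kernel_manager.shutdown_kernel(kernel_id, now=True)
```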
Makes sense for the general use case. To prevent this, I think the operator side needs some auto-GC enabled as a final guard (e.g., delete all remote kernel pods after one week).
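A rough sketch of that kind of operator-side GC, which could run as a Kubernetes CronJob. It assumes kernel pods carry a `kernel_id` label and uses a seven-day retention window; both the selector and the cutoff are assumptions to adapt per deployment:

```python
# Sketch of an operator-side garbage collector for orphaned kernel pods.
# Assumes kernel pods carry a `kernel_id` label; adjust the selector,
# namespace, and retention window for your deployment.
from datetime import datetime, timedelta, timezone

from kubernetes import client, config


def gc_stale_kernel_pods(namespace="enterprise-gateway", max_age_days=7):
    config.load_incluster_config()  # use config.load_kube_config() outside the cluster
    v1 = client.CoreV1Api()
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)

    pods = v1.list_namespaced_pod(namespace, label_selector="kernel_id")
    for pod in pods.items:
        if pod.metadata.creation_timestamp < cutoff:
            print(f"Deleting stale kernel pod {pod.metadata.name}")
            v1.delete_namespaced_pod(pod.metadata.name, namespace)


if __name__ == "__main__":
    gc_stale_kernel_pods()
```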
Later last night I realized that, so long as there's another EG instance running at the time the first gets shut down (or even some time later), and that other instance shares the same kernel persistence store (which is assumed in HA configs), then the only kernel pods to be orphaned would be those with which a user never interacts following the stopped EG's shutdown. That is, even those kernel pods should become active again by virtue of the "hydration" that occurs when a user interacts with their kernel via interrupt, reconnect, etc.

But, yes, we've talked about introducing some admin-related endpoints - one of which could interrogate the kernel persistence store, compare that with the set of managed kernels (somehow checking with each EG instance), and present a list of currently unmanaged kernels. On Kubernetes, this application could present some of the labels, envs, etc. that reside on the kernel pod to help operators better understand whether they should be hydrated or terminated.

This leads me to wonder if kernel provisioners (and perhaps the older, to-be-obsoleted process proxies) should expose a method allowing users to access their "metadata" given a kernel_id (or whatever else is necessary to locate the kernel).
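Hypothetically, such a metadata lookup might look something like the following on Kubernetes; the function name, return shape, and `kernel_id` label selector are assumptions for illustration, not an existing provisioner or JEG API:

```python
# Illustrative only: a hypothetical metadata lookup for the pod backing a
# kernel_id, to help admin tooling decide whether an unmanaged kernel should
# be hydrated or terminated. Not an existing provisioner/JEG API.
from typing import Any, Dict

from kubernetes import client, config


def get_kernel_pod_metadata(kernel_id: str, namespace: str = "enterprise-gateway") -> Dict[str, Any]:
    """Return label/annotation/status hints for the pod backing kernel_id."""
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=f"kernel_id={kernel_id}")
    if not pods.items:
        return {}
    pod = pods.items[0]
    return {
        "pod_name": pod.metadata.name,
        "labels": pod.metadata.labels or {},
        "annotations": pod.metadata.annotations or {},
        "phase": pod.status.phase,
        "started": str(pod.status.start_time),
    }
```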
Description
Whenever K8s tries to terminate a pod, the application receives a SIGTERM signal [reference] and should ideally shut down gracefully; however, I found this line in JEG:
enterprise_gateway/enterprise_gateway/enterprisegatewayapp.py, line 343 (commit 7a9a646)
It triggers a shutdown of all existing kernels, so existing kernel information is eliminated even if external webhook kernel session persistence is configured [reference in the JEG docs]. Did I miss anything about how a restart on the server side is handled? This may happen quite frequently, depending on sidecar upgrades, JEG configuration changes, or even simply updating the hardcoded kernelspec.
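For illustration, the failure mode described above is roughly the "signal handler shuts everything down" pattern. A simplified sketch (not the actual JEG code; `kernel_manager` stands in for JEG's multi-kernel manager):

```python
# Simplified illustration (not the actual JEG code) of the pattern at issue:
# a SIGTERM handler that shuts down every managed kernel, including remote
# kernel pods whose session info lives in an external persistence store.
import asyncio
import signal


def install_sigterm_handler(kernel_manager):
    loop = asyncio.get_event_loop()

    def on_sigterm():
        # Every kernel is shut down, remote or local.
        for kernel_id in list(kernel_manager.list_kernel_ids()):
            kernel_manager.shutdown_kernel(kernel_id, now=True)

    loop.add_signal_handler(signal.SIGTERM, on_sigterm)
```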
Reproduce
kubectl delete pod <pod_name>
Expected behavior
Remote kernels shouldn't be shut down; only local kernels running on the JEG pod should be (since it's impossible to recover those processes).