Question about High Availability for JEG on k8s #1156
Comments
Hi @chiawchen - yeah, the HA/DR machinery has not been fully resolved. It is primarily intended for hard failures, behaving more like …

It makes sense to make the automatic kernel shutdown sensitive to failover configuration, although I wonder if it should be an explicit option (so that we don't always orphan remote kernels), at least for now. Perhaps something like …

Also note that we now support …
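As a rough sketch of what such an explicit option guarding the automatic kernel shutdown might look like (the trait name, the `ShutdownPolicy` class, and the `is_remote()` helper are all hypothetical illustrations, not JEG's actual API):

```python
# Hypothetical sketch only -- the trait name, ShutdownPolicy class, and
# is_remote() helper are illustrations, not JEG's actual API.
from traitlets import Bool
from traitlets.config import Configurable


class ShutdownPolicy(Configurable):
    """Configurable guard for kernel shutdown during app termination."""

    leave_remote_kernels_on_shutdown = Bool(
        False,
        help="Skip remote-kernel shutdown on termination so a failover "
             "instance can re-hydrate them from the persistence store.",
    ).tag(config=True)


def is_remote(kernel_manager, kernel_id):
    # Placeholder: a real check would inspect the kernel's provisioner or
    # process proxy to distinguish pod-backed remote kernels from local ones.
    return True


def shutdown_kernels(policy, kernel_manager):
    """Shut down kernels at app exit, honoring the failover policy."""
    for kernel_id in list(kernel_manager.list_kernel_ids()):
        if policy.leave_remote_kernels_on_shutdown and is_remote(kernel_manager, kernel_id):
            continue  # leave the pod running; the persistence store keeps its session info
        kernel_manager.shutdown_kernel(kernel_id, now=True)
```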
Makes sense for the general use case. To prevent this, I think the operator side needs some auto-GC enabled as a final guard (e.g., delete all remote kernel pods after one week).
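A rough sketch of that kind of operator-side GC, which could run as a Kubernetes CronJob. It assumes kernel pods carry a `kernel_id` label and uses a seven-day retention window; both the selector and the cutoff are assumptions to adapt per deployment:

```python
# Sketch of an operator-side garbage collector for orphaned kernel pods.
# Assumes kernel pods carry a `kernel_id` label; adjust the selector,
# namespace, and retention window for your deployment.
from datetime import datetime, timedelta, timezone

from kubernetes import client, config


def gc_stale_kernel_pods(namespace="enterprise-gateway", max_age_days=7):
    config.load_incluster_config()  # use config.load_kube_config() outside the cluster
    v1 = client.CoreV1Api()
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)

    pods = v1.list_namespaced_pod(namespace, label_selector="kernel_id")
    for pod in pods.items:
        if pod.metadata.creation_timestamp < cutoff:
            print(f"Deleting stale kernel pod {pod.metadata.name}")
            v1.delete_namespaced_pod(pod.metadata.name, namespace)


if __name__ == "__main__":
    gc_stale_kernel_pods()
```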
Later last night I realized that, so long as there's another EG instance running at the time the first gets shut down (or even some time later), and that other instance shares the same kernel persistence store (which is assumed in HA configs), then the only kernel pods to be orphaned would be those with which a user never interacts following the stopped EG's shutdown. That is, even those kernel pods should become active again by virtue of the "hydration" that occurs when a user interacts with their kernel via interrupt, reconnect, etc.

But, yes, we've talked about introducing some admin-related endpoints - one of which could interrogate the kernel persistence store, compare that with the set of managed kernels (somehow checking with each EG instance), and present a list of currently unmanaged kernels. On Kubernetes, this application could present some of the labels, envs, etc. that reside on the kernel pod to help operators better understand whether they should be hydrated or terminated.

This leads me to wonder if kernel provisioners (and perhaps the older, to-be-obsoleted process proxies) should expose a method allowing users to access their "metadata" given a kernel_id (or whatever else is necessary to locate the kernel).
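Hypothetically, such a metadata lookup might look something like the following on Kubernetes; the function name, return shape, and `kernel_id` label selector are assumptions for illustration, not an existing provisioner or JEG API:

```python
# Illustrative only: a hypothetical metadata lookup for the pod backing a
# kernel_id, to help admin tooling decide whether an unmanaged kernel should
# be hydrated or terminated. Not an existing provisioner/JEG API.
from typing import Any, Dict

from kubernetes import client, config


def get_kernel_pod_metadata(kernel_id: str, namespace: str = "enterprise-gateway") -> Dict[str, Any]:
    """Return label/annotation/status hints for the pod backing kernel_id."""
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=f"kernel_id={kernel_id}")
    if not pods.items:
        return {}
    pod = pods.items[0]
    return {
        "pod_name": pod.metadata.name,
        "labels": pod.metadata.labels or {},
        "annotations": pod.metadata.annotations or {},
        "phase": pod.status.phase,
        "started": str(pod.status.start_time),
    }
```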
Description
Whenever K8s tries to terminate a pod, the application receives a SIGTERM signal [reference] and should ideally shut down gracefully; however, I found this line in JEG:
enterprise_gateway/enterprise_gateway/enterprisegatewayapp.py, line 343 (commit 7a9a646)
It triggers a shutdown of all existing kernels, so existing kernel information is eliminated even if external webhook kernel session persistence is configured [reference in the JEG docs]. Did I miss anything about how a restart on the server side is handled? This may happen quite frequently, depending on sidecar upgrades, JEG configuration changes, or even simply updating the hardcoded kernelspec.
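For illustration, the failure mode described above is roughly the "signal handler shuts everything down" pattern. A simplified sketch (not the actual JEG code; `kernel_manager` stands in for JEG's multi-kernel manager):

```python
# Simplified illustration (not the actual JEG code) of the pattern at issue:
# a SIGTERM handler that shuts down every managed kernel, including remote
# kernel pods whose session info lives in an external persistence store.
import asyncio
import signal


def install_sigterm_handler(kernel_manager):
    loop = asyncio.get_event_loop()

    def on_sigterm():
        # Every kernel is shut down, remote or local.
        for kernel_id in list(kernel_manager.list_kernel_ids()):
            kernel_manager.shutdown_kernel(kernel_id, now=True)

    loop.add_signal_handler(signal.SIGTERM, on_sigterm)
```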
Reproduce
kubectl delete pod <pod_name>
Expected behavior
Remote kernels shouldn't be shut down; only local kernels running on the JEG pod should be (since it's impossible to recover those processes).