Checkpoint/Restore fails when container runtime hooks are present #7632
Comments
@adrianreber @kolyshkin PTAL
@nuwang this looks like an issue with the mentioned hook (rather than with cri-o, runc, or criu). I suggest you file a bug at https://github.com/NVIDIA/gpu-operator
I am pretty sure no one has tried this before. I also have no experience with any nvidia systems. The CRIU message about the action scripts is also rather unusual.
Thanks for looking into this. I do not think this is limited to the operator in question, because I've also tried switching to CDI: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/cdi.html, and that doesn't work either (although with a different set of errors). My understanding of this is extremely sketchy, but in the runtime hooks case, the runc JSON spec appears to be modified by the hook to inject the device. In the CDI case (https://github.com/cncf-tags/container-device-interface), the OCI spec appears to be modified instead. In both cases, the sequence of events during CRI-O/CRIU's restore doesn't seem to allow the hooks to execute in the expected order?
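For reference, the hook injection described above usually shows up in the container's OCI config.json as a prestart hook entry. An illustrative fragment (the path and arguments are assumptions, not taken from this environment) looks roughly like:

```json
{
  "hooks": {
    "prestart": [
      {
        "path": "/usr/bin/nvidia-container-runtime-hook",
        "args": ["nvidia-container-runtime-hook", "prestart"]
      }
    ]
  }
}
```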
@nuwang it may be helpful to put together an OCI runtime spec that doesn't use nvidia but does hit this issue. Is that possible?
I don't know at which point the hooks modify
Yes, that would be helpful. If you could share a checkpoint archive which fails, we should be able to see which information from the checkpointed container needs to be added to the restored container. We make a number of changes like this starting here: https://github.com/cri-o/cri-o/blob/main/server/container_restore.go#L152 If we know what nvidia needs, we can pull the information from the checkpointed container and add it to the restored container. Or maybe we also need to run some hooks during restore; it depends on the change.
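To make the idea concrete, here is a rough sketch (not the actual cri-o code; the helper name and the exact fields to copy are assumptions) of what pulling device-related information from the checkpointed container's spec into the restored container's spec could look like:

```go
// Package restoredemo sketches how device-related state from a
// checkpointed container's OCI spec could be merged into the spec
// built for the restored container. Illustrative only.
package restoredemo

import (
	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// mergeCheckpointedDeviceState copies mounts, device nodes, and prestart
// hooks that exist only in the checkpointed spec into the restore spec.
func mergeCheckpointedDeviceState(checkpointed, restore *specs.Spec) {
	// Carry over bind mounts (e.g. driver libraries injected by a hook)
	// that the restored spec does not already have.
	existing := map[string]bool{}
	for _, m := range restore.Mounts {
		existing[m.Destination] = true
	}
	for _, m := range checkpointed.Mounts {
		if !existing[m.Destination] {
			restore.Mounts = append(restore.Mounts, m)
		}
	}

	// Carry over device nodes (e.g. /dev/nvidia*).
	if checkpointed.Linux != nil {
		if restore.Linux == nil {
			restore.Linux = &specs.Linux{}
		}
		restore.Linux.Devices = append(restore.Linux.Devices, checkpointed.Linux.Devices...)
	}

	// Alternatively, re-run the original prestart hooks on restore.
	if checkpointed.Hooks != nil {
		if restore.Hooks == nil {
			restore.Hooks = &specs.Hooks{}
		}
		restore.Hooks.Prestart = append(restore.Hooks.Prestart, checkpointed.Hooks.Prestart...)
	}
}
```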
I think the CDI case is a better candidate to target for a first pass than the runtime hook, because according to https://github.com/cncf-tags/container-device-interface?tab=readme-ov-file#cri-o-configuration, CDI is enabled in CRI-O by default. That should make reproduction easier, and CDI should probably be made to work with CRI-O/CRIU regardless. I hadn't even heard of CDI until I ran into this issue, so it's entirely possible that what I'm proposing is incorrect, but the CDI spec itself appears to be fairly simple. It looks like it should result in something that could be used as a test case? I have blown away my environment, but I did preserve a docker container with a checkpointed archive. If you need something else, let me know, but it might take me a while to recreate the necessary environments.
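For reference, a CDI spec is a small JSON (or YAML) document listing the device nodes, mounts, and other edits the runtime should apply. A made-up, illustrative example (device names and paths are assumptions) could look like:

```json
{
  "cdiVersion": "0.6.0",
  "kind": "nvidia.com/gpu",
  "devices": [
    {
      "name": "0",
      "containerEdits": {
        "deviceNodes": [
          { "path": "/dev/nvidia0" }
        ]
      }
    }
  ],
  "containerEdits": {
    "deviceNodes": [
      { "path": "/dev/nvidiactl" },
      { "path": "/dev/nvidia-uvm" }
    ]
  }
}
```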
Hi @nuwang, how did you make the checkpoint work? I can't even checkpoint when the nvidia container runtime is used (although no GPU is requested):
I believe those mounts are added by Nvidia's
I added the `enable-external-masters` option, which can be added to `/etc/criu/runc.conf`
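Assuming CRIU's usual configuration-file syntax (one option per line, written without the leading dashes), the file would simply contain:

```
# /etc/criu/runc.conf
enable-external-masters
```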
@nuwang I tried to add `enable-external-masters` to `/etc/criu/runc.conf`, but still see the same mount error. Then I tried to modify the criu code to simply ignore the nvidia mounts which criu can't handle. The checkpoint passed, but then I see the same restore error as you.
A friendly reminder that this issue had no activity for 30 days. |
Closing this issue since it had no activity in the past 90 days. |
What happened?
I'm trying to use CRIU to restore a k8s pod. It works perfectly when the nvidia container runtime is not installed. However, when it is installed, the restore fails, even though the pod itself does not request a GPU. The problem appears to be due to the presence of container runtime hooks installed by nvidia (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/arch-overview.html#the-nvidia-container-runtime-hook). My understanding is that this is a prestart hook which injects the nvidia drivers into the container. I can see that failing in the k8s event log with:
What did you expect to happen?
Although the pod is running on a node with a GPU, the pod should successfully restore since it does not use a GPU. It can be successfully checkpointed, just not restored if nvidia runtime hooks are configured.
Alternatively, is there a workaround?
How can we reproduce it (as minimally and precisely as possible)?
Checkpoint and restore a pod running the `tensorflow/tensorflow:latest-gpu-jupyter` image on a node where the nvidia container runtime hooks are installed. The same steps on a node without the hooks should work successfully.

Anything else we need to know?
I have tried this on Ubuntu 20.04 EKS nodes as well as Amazon Linux 2023; both produce the same issue. I have also tried CRIU 3.17 and 3.19.
CRI-O and Kubernetes version
OS version
Additional environment details (AWS, VirtualBox, physical, etc.)
CRIU full dump/restore logs:
Output of `criu --version`: