Checkpoint/Restore fails when container runtime hooks are present #7632

Closed
nuwang opened this issue Dec 22, 2023 · 12 comments
Labels: checkpoint/restore, kind/bug, lifecycle/rotten, lifecycle/stale

Comments


nuwang commented Dec 22, 2023

What happened?

I'm trying to use CRIU to restore a k8s pod. It works perfectly when the nvidia container runtime is not installed. However, when it is installed, the restore fails, even though the pod itself does not request a GPU. The problem appears to be due to the presence of container runtime hooks installed by nvidia (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/arch-overview.html#the-nvidia-container-runtime-hook). My understanding is that this is a prestart hook which injects the nvidia drivers into the container. I can see it failing in the k8s event log with:

"Error: failed to restore container tf-notebook: container restore failed: time="2023-12-22T09:14:17Z" level=error msg="error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: mount error: mount operation failed: /var/lib/containers/storage/overlay/fdc919dd15499b9af16593868b23eb3c1573c489e9f12bcbf2c0aa86bf59d014/merged/proc/driver/nvidia: no such file or directory\n"

What did you expect to happen?

Although the pod is running on a node with a GPU, the pod should successfully restore since it does not use a GPU. It can be successfully checkpointed, just not restored if nvidia runtime hooks are configured.

Alternatively, is there a workaround?

How can we reproduce it (as minimally and precisely as possible)?

  1. Configure the nvidia container runtime hooks for cri-o. I used the NVIDIA GPU operator (https://github.com/NVIDIA/gpu-operator) to do this, by installing the v23.9.0 helm chart with the value operator.defaultRuntime=crio set (see the commands after this list).
  2. Launch a node without a GPU and checkpoint/restore the following container: tensorflow/tensorflow:latest-gpu-jupyter. This should work successfully.
  3. Launch a node with a GPU and restore the same container. This fails.
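For step 1, the install looks roughly like this (a sketch; the `nvidia` helm repo alias and the namespace are my choices, not requirements):

```shell
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version v23.9.0 \
  --set operator.defaultRuntime=crio
```

For steps 2 and 3, checkpointing goes through the kubelet checkpoint API (ContainerCheckpoint feature gate), e.g.:

```shell
curl -sk -X POST "https://localhost:10250/checkpoint/<namespace>/<pod>/<container>" \
  --cert /var/lib/kubelet/pki/kubelet-client-current.pem \
  --key /var/lib/kubelet/pki/kubelet-client-current.pem
```

(cert paths vary by distro).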

Anything else we need to know?

I have tried this on Ubuntu 20.04 EKS nodes as well as Amazon Linux 2023; both produce the same issue. I have also tried CRIU 3.17 and 3.19.

CRI-O and Kubernetes version

$ crio --version
crio version 1.27.1
Version:        1.27.1
GitCommit:      unknown
GitCommitDate:  unknown
GitTreeState:   clean
BuildDate:      2023-12-04T18:20:22Z
GoVersion:      go1.21.3
Compiler:       gc
Platform:       linux/amd64
Linkmode:       dynamic
BuildTags:
  rpm_crashtraceback
  exclude_graphdriver_btrfs
  btrfs_noversion
  exclude_graphdriver_devicemapper
  libdm_no_deferred_remove
  seccomp
  selinux
LDFlags:          -X github.com/cri-o/cri-o/internal/pkg/criocli.DefaultsPath= -X  github.com/cri-o/cri-o/internal/version.buildDate=2023-12-04T18:20:22Z -X  github.com/cri-o/cri-o/internal/version.gitCommit=65a8134d7c4722e1c39d0e1c473532a17c240682 -X  github.com/cri-o/cri-o/internal/version.version=1.27.1 -X  github.com/cri-o/cri-o/internal/version.gitTreeState=clean  -B 0x0c580110e194b9773bdac2711b57cc3578edc6f0 -extldflags '-Wl,-z,relro -Wl,--as-needed  -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  ' -compressdwarf=false
SeccompEnabled:   true
AppArmorEnabled:  false
$ kubectl version --output=json
{
  "clientVersion": {
    "major": "1",
    "minor": "28",
    "gitVersion": "v1.28.3",
    "gitCommit": "a8a1abc25cad87333840cd7d54be2efaf31a3177",
    "gitTreeState": "clean",
    "buildDate": "2023-10-18T11:42:52Z",
    "goVersion": "go1.20.10",
    "compiler": "gc",
    "platform": "darwin/amd64"
  },
  "kustomizeVersion": "v5.0.4-0.20230601165947-6ce0bf390ce3",
  "serverVersion": {
    "major": "1",
    "minor": "27+",
    "gitVersion": "v1.27.8-eks-8cb36c9",
    "gitCommit": "fca3a8722c88c4dba573a903712a6feaf3c40a51",
    "gitTreeState": "clean",
    "buildDate": "2023-11-22T21:52:13Z",
    "goVersion": "go1.20.11",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}

OS version

# On Linux:
$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
ID_LIKE="fedora"
VERSION_ID="2023"
PLATFORM_ID="platform:al2023"
PRETTY_NAME="Amazon Linux 2023"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
HOME_URL="https://aws.amazon.com/linux/"
BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
SUPPORT_END="2028-03-15"
$ uname -a
Linux ip-192-168-13-151.ap-southeast-1.compute.internal 6.1.66-91.160.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Dec 13 04:50:24 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Additional environment details (AWS, VirtualBox, physical, etc.)

CRIU full dump/restore logs:

(00.000000) Unable to get $HOME directory, local configuration file will not be used.
(00.000000) Parsing config file /etc/criu/runc.conf
(00.000019) Version: 3.17.1 (gitid 0)
(00.000026) Running on ip-192-168-13-151.ap-southeast-1.compute.internal Linux 6.1.66-91.160.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Dec 13 04:50:24 UTC 2023 x86_64
(00.000028) Would overwrite RPC settings with values from /etc/criu/runc.conf
(00.000042) Loaded kdat cache from /run/criu/criu.kdat
(00.000060) Hugetlb size 2 Mb is supported but cannot get dev's number
(00.000069) Hugetlb size 1024 Mb is supported but cannot get dev's number
(00.000102) Added ipc:/var/run/ipcns/144e1a5f-e236-49a1-80f1-cfabccc9cc1e join namespace
(00.000107) Added uts:/var/run/utsns/144e1a5f-e236-49a1-80f1-cfabccc9cc1e join namespace
(00.000116) Parsing config file /etc/criu/runc.conf
(00.000169) Will drop all TCP connections on restore
(00.000171) Will allow link remaps on FS
(00.000172) Will skip non-existent sysctls on restore
(00.000181) rlimit: RLIMIT_NOFILE unlimited for self
(00.000227) cpu: x86_family 6 x86_vendor_id GenuineIntel x86_model_id Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
(00.000232) cpu: fpu: xfeatures_mask 0x2f5 xsave_size 2696 xsave_size_max 2696 xsaves_size 2568
(00.000246) cpu: fpu: x87 floating point registers     xstate_offsets      0 / 0      xstate_sizes    160 / 160
(00.000248) cpu: fpu: AVX registers                    xstate_offsets    576 / 576    xstate_sizes    256 / 256
(00.000250) cpu: fpu: MPX CSR                          xstate_offsets   1024 / 832    xstate_sizes     64 / 64
(00.000251) cpu: fpu: AVX-512 opmask                   xstate_offsets   1088 / 896    xstate_sizes     64 / 64
(00.000253) cpu: fpu: AVX-512 Hi256                    xstate_offsets   1152 / 960    xstate_sizes    512 / 512
(00.000254) cpu: fpu: AVX-512 ZMM_Hi256                xstate_offsets   1664 / 1472   xstate_sizes   1024 / 1024
(00.000256) cpu: fpu: Protection Keys User registers   xstate_offsets   2688 / 2496   xstate_sizes      8 / 8
(00.000257) cpu: fpu:1 fxsr:1 xsave:1 xsaveopt:1 xsavec:1 xgetbv1:1 xsaves:1
(00.000280) kernel pid_max=4194304
(00.000281) Reading image tree
(00.000297) Add mnt ns 13 pid 1
(00.000298) Add net ns 10 pid 1
(00.000300) Add pid ns 9 pid 1
(00.000303) pstree pid_max=1
(00.000308) Will restore in 6c020000 namespaces
(00.000313) NS mask to use 6c020000
(00.000348) Collecting 51/56 (flags 3)
(00.000355) No memfd.img image
(00.000357)  `- ... done
(00.000358) Collecting 40/54 (flags 2)
(00.000367) Collected [usr/bin/python3.11] ID 0x1
(00.000370) Collected [usr/lib/python3.11/lib-dynload/_multibytecodec.cpython-311-x86_64-linux-gnu.so] ID 0x2
(00.000372) Collected [usr/local/lib/python3.11/dist-packages/charset_normalizer/md__mypyc.cpython-311-x86_64-linux-gnu.so] ID 0x3
(00.000374) Collected [usr/lib/x86_64-linux-gnu/libsqlite3.so.0.8.6] ID 0x4
(00.000376) Collected [usr/lib/python3.11/lib-dynload/_sqlite3.cpython-311-x86_64-linux-gnu.so] ID 0x5
(00.000378) Collected [usr/lib/python3.11/lib-dynload/_lsprof.cpython-311-x86_64-linux-gnu.so] ID 0x6
(00.000380) Collected [usr/lib/x86_64-linux-gnu/libdl.so.2] ID 0x7
(00.000382) Collected [usr/local/lib/python3.11/dist-packages/rpds/rpds.cpython-311-x86_64-linux-gnu.so] ID 0x8
(00.000384) Collected [usr/lib/python3.11/lib-dynload/resource.cpython-311-x86_64-linux-gnu.so] ID 0x9
(00.000386) Collected [usr/lib/python3.11/lib-dynload/mmap.cpython-311-x86_64-linux-gnu.so] ID 0xa
(00.000387) Collected [usr/lib/x86_64-linux-gnu/libuuid.so.1.3.0] ID 0xb
(00.000389) Collected [usr/local/lib/python3.11/dist-packages/markupsafe/_speedups.cpython-311-x86_64-linux-gnu.so] ID 0xc
(00.000391) Collected [usr/lib/x86_64-linux-gnu/liblzma.so.5.2.5] ID 0xd
(00.000393) Collected [usr/lib/python3.11/lib-dynload/_lzma.cpython-311-x86_64-linux-gnu.so] ID 0xe
(00.000395) Collected [usr/lib/x86_64-linux-gnu/libbz2.so.1.0.4] ID 0xf
(00.000397) Collected [usr/lib/x86_64-linux-gnu/libmpdec.so.2.5.1] ID 0x10
(00.000401) Collected [usr/lib/python3.11/lib-dynload/_decimal.cpython-311-x86_64-linux-gnu.so] ID 0x11
(00.000405) Collected [usr/lib/x86_64-linux-gnu/libtinfo.so.6.3] ID 0x12
(00.000407) Collected [usr/lib/x86_64-linux-gnu/libncursesw.so.6.3] ID 0x13
(00.000409) Collected [usr/lib/python3.11/lib-dynload/_curses.cpython-311-x86_64-linux-gnu.so] ID 0x14
(00.000411) Collected [usr/local/lib/python3.11/dist-packages/charset_normalizer/md.cpython-311-x86_64-linux-gnu.so] ID 0x15
(00.000413) Collected [usr/local/lib/python3.11/dist-packages/yaml/_yaml.cpython-311-x86_64-linux-gnu.so] ID 0x16
(00.000415) Collected [usr/lib/python3.11/lib-dynload/_asyncio.cpython-311-x86_64-linux-gnu.so] ID 0x17
(00.000420) Collected [usr/lib/x86_64-linux-gnu/libcrypto.so.3] ID 0x18
(00.000422) Collected [usr/lib/x86_64-linux-gnu/libssl.so.3] ID 0x19
(00.000426) Collected [usr/lib/python3.11/lib-dynload/_ssl.cpython-311-x86_64-linux-gnu.so] ID 0x1a
(00.000428) Collected [usr/lib/python3.11/lib-dynload/_json.cpython-311-x86_64-linux-gnu.so] ID 0x1b
(00.000429) Collected [usr/local/lib/python3.11/dist-packages/zmq/backend/cython/utils.cpython-311-x86_64-linux-gnu.so] ID 0x1c
(00.000431) Collected [usr/local/lib/python3.11/dist-packages/zmq/backend/cython/error.cpython-311-x86_64-linux-gnu.so] ID 0x1d
(00.000433) Collected [usr/local/lib/python3.11/dist-packages/zmq/backend/cython/_version.cpython-311-x86_64-linux-gnu.so] ID 0x1e
(00.000436) Collected [usr/local/lib/python3.11/dist-packages/zmq/backend/cython/_proxy_steerable.cpython-311-x86_64-linux-gnu.so] ID 0x1f
(00.000438) Collected [usr/local/lib/python3.11/dist-packages/zmq/backend/cython/_poll.cpython-311-x86_64-linux-gnu.so] ID 0x20
(00.000440) Collected [usr/lib/python3.11/lib-dynload/_uuid.cpython-311-x86_64-linux-gnu.so] ID 0x21
(00.000442) Collected [usr/lib/python3.11/lib-dynload/_bz2.cpython-311-x86_64-linux-gnu.so] ID 0x22
(00.000446) Collected [usr/lib/python3.11/lib-dynload/termios.cpython-311-x86_64-linux-gnu.so] ID 0x23
(00.000447) Collected [usr/lib/python3.11/lib-dynload/_hashlib.cpython-311-x86_64-linux-gnu.so] ID 0x24
(00.000449) Collected [usr/lib/python3.11/lib-dynload/_queue.cpython-311-x86_64-linux-gnu.so] ID 0x25
(00.000451) Collected [usr/local/lib/python3.11/dist-packages/zmq/backend/cython/message.cpython-311-x86_64-linux-gnu.so] ID 0x26
(00.000453) Collected [usr/local/lib/python3.11/dist-packages/zmq/backend/cython/socket.cpython-311-x86_64-linux-gnu.so] ID 0x27
(00.000455) Collected [usr/local/lib/python3.11/dist-packages/zmq/backend/cython/context.cpython-311-x86_64-linux-gnu.so] ID 0x28
(00.000457) Collected [usr/lib/x86_64-linux-gnu/libgcc_s.so.1] ID 0x29
(00.000459) Collected [usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30] ID 0x2a
(00.000462) Collected [usr/local/lib/python3.11/dist-packages/pyzmq.libs/libsodium-cb25555f.so.23.3.0] ID 0x2b
(00.000464) Collected [usr/local/lib/python3.11/dist-packages/pyzmq.libs/libzmq-f468291a.so.5.2.4] ID 0x2c
(00.000466) Collected [usr/local/lib/python3.11/dist-packages/zmq/backend/cython/_device.cpython-311-x86_64-linux-gnu.so] ID 0x2d
(00.000468) Collected [usr/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so] ID 0x2e
(00.000470) Collected [usr/local/lib/python3.11/dist-packages/tornado/speedups.abi3.so] ID 0x2f
(00.000472) Collected [usr/lib/x86_64-linux-gnu/libpthread.so.0] ID 0x30
(00.000474) Collected [usr/lib/x86_64-linux-gnu/librt.so.1] ID 0x31
(00.000475) Collected [usr/lib/x86_64-linux-gnu/libffi.so.8.1.0] ID 0x32
(00.000477) Collected [usr/lib/python3.11/lib-dynload/_contextvars.cpython-311-x86_64-linux-gnu.so] ID 0x33
(00.000481) Collected [usr/lib/python3.11/lib-dynload/_typing.cpython-311-x86_64-linux-gnu.so] ID 0x34
(00.000483) Collected [usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache] ID 0x35
(00.000485) Collected [usr/lib/locale/C.utf8/LC_CTYPE] ID 0x36
(00.000486) Collected [usr/lib/x86_64-linux-gnu/libc.so.6] ID 0x37
(00.000488) Collected [usr/lib/x86_64-linux-gnu/libexpat.so.1.8.7] ID 0x38
(00.000490) Collected [usr/lib/x86_64-linux-gnu/libz.so.1.2.11] ID 0x39
(00.000492) Collected [usr/lib/x86_64-linux-gnu/libm.so.6] ID 0x3a
(00.000494) Collected [usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2] ID 0x3b
(00.000497) Collected [dev/null] ID 0x3c
(00.000502) Collected pipe entry ID 0x3d PIPE ID 0xac3d
(00.000509) Found id pipe:[44093] (fd 2) in inherit fd list
(00.000511) Collected pipe entry ID 0x3e PIPE ID 0xac3e
(00.000515) Found id pipe:[44094] (fd 4) in inherit fd list
(00.000519) epoll: Collected eventpoll: id 0x00003f flags 0x02
(00.000526) unix:  `- Got id 0x41 ino 39413 type SOCK_STREAM state TCP_ESTABLISHED peer 39412 (name - dir -)
(00.000531) unix:  `- Got id 0x40 ino 39412 type SOCK_STREAM state TCP_ESTABLISHED peer 39413 (name - dir -)
(00.000543) Collected [tf] ID 0x43
(00.000545) Collected [.] ID 0x44
(00.000549)  `- ... done
(00.000550) Collecting 46/68 (flags 0)
(00.000555) No remap-fpath.img image
(00.000557)  `- ... done
(00.000561) No apparmor.img image
(00.000573) cg: Preparing cgroups yard (cgroups restore mode 0x4)
(00.000665) cg: Opening .criu.cgyard.67Z7Lz as cg yard
(00.000673) cg: 	Making controller dir .criu.cgyard.67Z7Lz/unified ()
(00.000695) cg: Determined cgroup dir unified/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6846ad3e_2542_4718_b9cf_c57d0f665225.slice/crio-9184ce03a7817e2c28fd27c9c502dbc71fea51f0f2a0d819c09a0624499687f2.scope already exist
(00.000696) cg: Skip restoring properties on cgroup dir unified/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6846ad3e_2542_4718_b9cf_c57d0f665225.slice/crio-9184ce03a7817e2c28fd27c9c502dbc71fea51f0f2a0d819c09a0624499687f2.scope
(00.000710) Running pre-restore scripts
(00.000711) 	RPC
(00.000820) Saved netns fd for links restore
(00.000844) mnt: Reading mountpoint images (id 13 pid 1)
(00.000849) mnt: 		Changed mount 'context=' to context="system_u:object_r:container_file_t:s0:c913,c995"
(00.000851) mnt: 		Will mount 2778 from /
(00.000854) mnt: 		Will mount 2778 @ /tmp/.criu.mntns.peE5uw/mnt-0000002778 /sys/firmware
(00.000855) mnt: 	Read 2778 mp @ /sys/firmware
(00.000858) mnt: 		Changed mount 'context=' to context="system_u:object_r:container_file_t:s0:c913,c995",size=65536k,mode=755
(00.000860) mnt: 		Will mount 2777 from /dev/null (E)
(00.000862) mnt: 		Will mount 2777 @ /tmp/.criu.mntns.peE5uw/mnt-0000002777 /proc/timer_list
(00.000863) mnt: 	Read 2777 mp @ /proc/timer_list
(00.000865) mnt: 		Changed mount 'context=' to context="system_u:object_r:container_file_t:s0:c913,c995",size=65536k,mode=755
(00.000867) mnt: 		Will mount 2776 from /dev/null (E)
(00.000868) mnt: 		Will mount 2776 @ /tmp/.criu.mntns.peE5uw/mnt-0000002776 /proc/latency_stats
(00.000869) mnt: 	Read 2776 mp @ /proc/latency_stats
(00.000873) mnt: 		Changed mount 'context=' to context="system_u:object_r:container_file_t:s0:c913,c995",size=65536k,mode=755
(00.000874) mnt: 		Will mount 2775 from /dev/null (E)
(00.000876) mnt: 		Will mount 2775 @ /tmp/.criu.mntns.peE5uw/mnt-0000002775 /proc/keys
(00.000877) mnt: 	Read 2775 mp @ /proc/keys
(00.000879) mnt: 		Changed mount 'context=' to context="system_u:object_r:container_file_t:s0:c913,c995",size=65536k,mode=755
(00.000880) mnt: 		Will mount 2774 from /dev/null (E)
(00.000882) mnt: 		Will mount 2774 @ /tmp/.criu.mntns.peE5uw/mnt-0000002774 /proc/kcore
(00.000883) mnt: 	Read 2774 mp @ /proc/kcore
(00.000885) mnt: 		Changed mount 'context=' to context="system_u:object_r:container_file_t:s0:c913,c995"
(00.000886) mnt: 		Will mount 2773 from /
(00.000887) mnt: 		Will mount 2773 @ /tmp/.criu.mntns.peE5uw/mnt-0000002773 /proc/acpi
(00.000888) mnt: 	Read 2773 mp @ /proc/acpi
(00.000890) mnt: 		Will mount 2772 from /sysrq-trigger
(00.000892) mnt: 		Will mount 2772 @ /tmp/.criu.mntns.peE5uw/mnt-0000002772 /proc/sysrq-trigger
(00.000893) mnt: 	Read 2772 mp @ /proc/sysrq-trigger
(00.000894) mnt: 		Will mount 2771 from /sys
(00.000896) mnt: 		Will mount 2771 @ /tmp/.criu.mntns.peE5uw/mnt-0000002771 /proc/sys
(00.000897) mnt: 	Read 2771 mp @ /proc/sys
(00.000901) mnt: 		Will mount 2770 from /irq
(00.000902) mnt: 		Will mount 2770 @ /tmp/.criu.mntns.peE5uw/mnt-0000002770 /proc/irq
(00.000903) mnt: 	Read 2770 mp @ /proc/irq
(00.000905) mnt: 		Will mount 2769 from /fs
(00.000908) mnt: 		Will mount 2769 @ /tmp/.criu.mntns.peE5uw/mnt-0000002769 /proc/fs
(00.000909) mnt: 	Read 2769 mp @ /proc/fs
(00.000911) mnt: 		Will mount 2768 from /bus
(00.000912) mnt: 		Will mount 2768 @ /tmp/.criu.mntns.peE5uw/mnt-0000002768 /proc/bus
(00.000913) mnt: 	Read 2768 mp @ /proc/bus
(00.000915) mnt: 		Will mount 2737 from /
(00.000917) mnt: 		Will mount 2737 @ /tmp/.criu.mntns.peE5uw/mnt-0000002737 /proc/driver/nvidia
(00.000918) mnt: 	Read 2737 mp @ /proc/driver/nvidia
(00.000919) mnt: 		Will mount 1520 from /
(00.000921) mnt: 		Will mount 1520 @ /tmp/.criu.mntns.peE5uw/mnt-0000001520 /run/secrets/kubernetes.io/serviceaccount
(00.000922) mnt: 	Read 1520 mp @ /run/secrets/kubernetes.io/serviceaccount
(00.000926) mnt: 		Will mount 1519 from /run/containers/storage/overlay-containers/157059dcd937721eec8e72aa589f8aad60857a0f8f8d61fb2c3ccceb6fc489ed/userdata/run/secrets (E)
(00.000928) mnt: 		Will mount 1519 @ /tmp/.criu.mntns.peE5uw/mnt-0000001519 /run/secrets
(00.000929) mnt: 	Read 1519 mp @ /run/secrets
(00.000931) mnt: 		Will mount 1517 from /var/lib/kubelet/pods/25e440ad-5904-412d-92ec-662251bd941a/containers/tf-notebook/92571801 (E)
(00.000933) mnt: 		Will mount 1517 @ /tmp/.criu.mntns.peE5uw/mnt-0000001517 /dev/termination-log
(00.000934) mnt: 	Read 1517 mp @ /dev/termination-log
(00.000936) mnt: 		Will mount 2733 from /var/lib/kubelet/pods/25e440ad-5904-412d-92ec-662251bd941a/etc-hosts (E)
(00.000937) mnt: 		Will mount 2733 @ /tmp/.criu.mntns.peE5uw/mnt-0000002733 /etc/hosts
(00.000939) mnt: 	Read 2733 mp @ /etc/hosts
(00.000941) mnt: 		Will mount 2732 from /run/containers/storage/overlay-containers/8223726421ff39b9edd799cc6f81ea17bd1f65ac289c889e0da91f4e65dafe5a/userdata/.containerenv (E)
(00.000942) mnt: 		Will mount 2732 @ /tmp/.criu.mntns.peE5uw/mnt-0000002732 /run/.containerenv
(00.000943) mnt: 	Read 2732 mp @ /run/.containerenv
(00.000947) mnt: 		Will mount 2731 from /run/containers/storage/overlay-containers/8223726421ff39b9edd799cc6f81ea17bd1f65ac289c889e0da91f4e65dafe5a/userdata/hostname (E)
(00.000948) mnt: 		Will mount 2731 @ /tmp/.criu.mntns.peE5uw/mnt-0000002731 /etc/hostname
(00.000949) mnt: 	Read 2731 mp @ /etc/hostname
(00.000951) mnt: 		Will mount 2729 from /run/containers/storage/overlay-containers/8223726421ff39b9edd799cc6f81ea17bd1f65ac289c889e0da91f4e65dafe5a/userdata/resolv.conf (E)
(00.000953) mnt: 		Will mount 2729 @ /tmp/.criu.mntns.peE5uw/mnt-0000002729 /etc/resolv.conf
(00.000954) mnt: 	Read 2729 mp @ /etc/resolv.conf
(00.000956) mnt: 		Changed mount 'context=' to context="system_u:object_r:container_file_t:s0:c913,c995",size=65536k
(00.000957) mnt: 		Will mount 2726 from /run/containers/storage/overlay-containers/8223726421ff39b9edd799cc6f81ea17bd1f65ac289c889e0da91f4e65dafe5a/userdata/shm (E)
(00.000959) mnt: 		Will mount 2726 @ /tmp/.criu.mntns.peE5uw/mnt-0000002726 /dev/shm
(00.000960) mnt: 	Read 2726 mp @ /dev/shm
(00.000962) mnt: 		Will mount 2724 from /
(00.000963) mnt: 		Will mount 2724 @ /tmp/.criu.mntns.peE5uw/mnt-0000002724 /sys/fs/cgroup
(00.000964) mnt: 	Read 2724 mp @ /sys/fs/cgroup
(00.000966) mnt: 		Will mount 2722 from /
(00.000970) mnt: 		Will mount 2722 @ /tmp/.criu.mntns.peE5uw/mnt-0000002722 /sys
(00.000972) mnt: 	Read 2722 mp @ /sys
(00.000973) mnt: 		Will mount 2720 from /
(00.000975) mnt: 		Will mount 2720 @ /tmp/.criu.mntns.peE5uw/mnt-0000002720 /dev/mqueue
(00.000976) mnt: 	Read 2720 mp @ /dev/mqueue
(00.000978) mnt: 		Changed mount 'context=' to context="system_u:object_r:container_file_t:s0:c913,c995",gid=5,mode=620,ptmxmode=666,newinstance
(00.000979) mnt: 		Will mount 2719 from /
(00.000980) mnt: 		Will mount 2719 @ /tmp/.criu.mntns.peE5uw/mnt-0000002719 /dev/pts
(00.000982) mnt: 	Read 2719 mp @ /dev/pts
(00.000983) mnt: 		Changed mount 'context=' to context="system_u:object_r:container_file_t:s0:c913,c995",size=65536k,mode=755
(00.000985) mnt: 		Will mount 2718 from /
(00.000986) mnt: 		Will mount 2718 @ /tmp/.criu.mntns.peE5uw/mnt-0000002718 /dev
(00.000987) mnt: 	Read 2718 mp @ /dev
(00.000989) mnt: 		Will mount 2262 from /
(00.000990) mnt: 		Will mount 2262 @ /tmp/.criu.mntns.peE5uw/mnt-0000002262 /proc
(00.000992) mnt: 	Read 2262 mp @ /proc
(00.001004) mnt: 		Changed mount 'context=' to context="system_u:object_r:container_file_t:s0:c913,c995",lowerdir=/var/lib/containers/storage/overlay/l/4ZEWMEU4RQJO7J3R4DCWMIWBE4:/var/lib/containers/storage/overlay/l/OP3DR7GUJELQNUVTCXEUNCSPJD:/var/lib/containers/storage/overlay/l/XL6OE33IAGYDVQBOXCWCAB6ZID:/var/lib/containers/storage/overlay/l/JZNTZZQZ2IW55FWR7JR74ZEA2P:/var/lib/containers/storage/overlay/l/ESKISPFTDB2D3RVPC6EBX6TSPU:/var/lib/containers/storage/overlay/l/7Y7Z72E566UBLUPYNCVGNGTOUP:/var/lib/containers/storage/overlay/l/Z6VTQ6JDQ23Y4RHVY56SEIVM7P:/var/lib/containers/storage/overlay/l/KYZ77I3TWP5WFGRRPOKVOMR4TC:/var/lib/containers/storage/overlay/l/4GATLGJ53Z3DGS3EWHK47FVRFS:/var/lib/containers/storage/overlay/l/YIXNEQZRPXGZDSZCE56DJZF5PY:/var/lib/containers/storage/overlay/l/5J3X2KU4MP5NLNLD6ZWPD2EEFG:/var/lib/containers/storage/overlay/l/CQUUMUOCD3SAD4KDDQFEZWCUAU:/var/lib/containers/storage/overlay/l/QKEYBZRBNCMI73E7E6ASOBSR2U:/var/lib/containers/storage/overlay/l/GVE6WTKXNC2W3UFLR3LILOFL4L:/var/lib/containers/storage/overlay/l/HTJ4VVD6IMMMSJAC2UIS4EBS73:/var/lib/containers/storage/overlay/l/CLI6RSOGHAMDJ7JGTTRZTO3W53:/var/lib/containers/storage/overlay/l/MXANSFSYVLJDAKQ6FRMR5FKC25:/var/lib/containers/storage/overlay/l/QPCZK3BOLA7JKBUBKVB7ZJ26G5:/var/lib/containers/storage/overlay/l/XIKLJ4H7TXQLD6TRSOFAKS6YR6:/var/lib/containers/storage/overlay/l/EAYWK6DQHRIGLFNAYINJP7QC6B:/var/lib/containers/storage/overlay/l/3H5ZXTDT7RR353QVAH76UEVFRB:/var/lib/containers/storage/overlay/l/SVWOSFLQFVZT5WYB233D2VKATX:/var/lib/containers/storage/overlay/l/3SH6LQUPNLPJZWNBDTORGGWHET:/var/lib/containers/storage/overlay/l/NTSVNFTR5EKQVSFSEWIFIZIKVT,upperdir=/var/lib/containers/storage/overlay/b0573031adf17e9181a472c61e1273d7dc11821ed11d3db330045ea48eb77302/diff,workdir=/var/lib/containers/storage/overlay/b0573031adf17e9181a472c61e1273d7dc11821ed11d3db330045ea48eb77302/work,metacopy=on,volatile
(00.001007) mnt: 		Will mount 1857 from /
(00.001009) mnt: 		Will mount 1857 @ /tmp/.criu.mntns.peE5uw/mnt-0000001857 /
(00.001010) mnt: 	Read 1857 mp @ /
(00.001013) mnt: Building mountpoints tree
(00.001015) mnt: 	Building plain mount tree
(00.001016) mnt: 		Working on 1857->2346
(00.001017) mnt: 		Working on 2262->1857
(00.001018) mnt: 		Working on 2718->1857
(00.001019) mnt: 		Working on 2719->2718
(00.001020) mnt: 		Working on 2720->2718
(00.001022) mnt: 		Working on 2722->1857
(00.001023) mnt: 		Working on 2724->2722
(00.001024) mnt: 		Working on 2726->2718
(00.001025) mnt: 		Working on 2729->1857
(00.001026) mnt: 		Working on 2731->1857
(00.001027) mnt: 		Working on 2732->1857
(00.001028) mnt: 		Working on 2733->1857
(00.001029) mnt: 		Working on 1517->2718
(00.001030) mnt: 		Working on 1519->1857
(00.001031) mnt: 		Working on 1520->1519
(00.001032) mnt: 		Working on 2737->2262
(00.001033) mnt: 		Working on 2768->2262
(00.001034) mnt: 		Working on 2769->2262
(00.001036) mnt: 		Working on 2770->2262
(00.001037) mnt: 		Working on 2771->2262
(00.001038) mnt: 		Working on 2772->2262
(00.001039) mnt: 		Working on 2773->2262
(00.001040) mnt: 		Working on 2774->2262
(00.001041) mnt: 		Working on 2775->2262
(00.001042) mnt: 		Working on 2776->2262
(00.001043) mnt: 		Working on 2777->2262
(00.001044) mnt: 		Working on 2778->2722
(00.001045) mnt: 	Resorting children of 1857 in mount order
(00.001047) mnt: 	Resorting children of 2729 in mount order
(00.001048) mnt: 	Resorting children of 2731 in mount order
(00.001049) mnt: 	Resorting children of 2732 in mount order
(00.001050) mnt: 	Resorting children of 2733 in mount order
(00.001051) mnt: 	Resorting children of 1519 in mount order
(00.001052) mnt: 	Resorting children of 1520 in mount order
(00.001053) mnt: 	Resorting children of 2262 in mount order
(00.001056) mnt: 	Resorting children of 2737 in mount order
(00.001057) mnt: 	Resorting children of 2768 in mount order
(00.001058) mnt: 	Resorting children of 2769 in mount order
(00.001059) mnt: 	Resorting children of 2770 in mount order
(00.001061) mnt: 	Resorting children of 2771 in mount order
(00.001062) mnt: 	Resorting children of 2772 in mount order
(00.001063) mnt: 	Resorting children of 2773 in mount order
(00.001064) mnt: 	Resorting children of 2774 in mount order
(00.001065) mnt: 	Resorting children of 2775 in mount order
(00.001066) mnt: 	Resorting children of 2776 in mount order
(00.001067) mnt: 	Resorting children of 2777 in mount order
(00.001068) mnt: 	Resorting children of 2718 in mount order
(00.001070) mnt: 	Resorting children of 2719 in mount order
(00.001071) mnt: 	Resorting children of 2720 in mount order
(00.001072) mnt: 	Resorting children of 2726 in mount order
(00.001073) mnt: 	Resorting children of 1517 in mount order
(00.001074) mnt: 	Resorting children of 2722 in mount order
(00.001075) mnt: 	Resorting children of 2724 in mount order
(00.001076) mnt: 	Resorting children of 2778 in mount order
(00.001077) mnt: Done:
(00.001078) mnt: [/](1857->2346)
(00.001079) mnt:  [/etc/resolv.conf](2729->1857)
(00.001081) mnt:  <--
(00.001082) mnt:  [/etc/hostname](2731->1857)
(00.001083) mnt:  <--
(00.001084) mnt:  [/run/.containerenv](2732->1857)
(00.001085) mnt:  <--
(00.001086) mnt:  [/etc/hosts](2733->1857)
(00.001087) mnt:  <--
(00.001088) mnt:  [/run/secrets](1519->1857)
(00.001090) mnt:   [/run/secrets/kubernetes.io/serviceaccount](1520->1519)
(00.001091) mnt:   <--
(00.001092) mnt:  <--
(00.001093) mnt:  [/proc](2262->1857)
(00.001094) mnt:   [/proc/driver/nvidia](2737->2262)
(00.001095) mnt:   <--
(00.001096) mnt:   [/proc/bus](2768->2262)
(00.001097) mnt:   <--
(00.001098) mnt:   [/proc/fs](2769->2262)
(00.001100) mnt:   <--
(00.001101) mnt:   [/proc/irq](2770->2262)
(00.001102) mnt:   <--
(00.001103) mnt:   [/proc/sys](2771->2262)
(00.001104) mnt:   <--
(00.001105) mnt:   [/proc/sysrq-trigger](2772->2262)
(00.001106) mnt:   <--
(00.001107) mnt:   [/proc/acpi](2773->2262)
(00.001108) mnt:   <--
(00.001109) mnt:   [/proc/kcore](2774->2262)
(00.001110) mnt:   <--
(00.001111) mnt:   [/proc/keys](2775->2262)
(00.001113) mnt:   <--
(00.001114) mnt:   [/proc/latency_stats](2776->2262)
(00.001115) mnt:   <--
(00.001116) mnt:   [/proc/timer_list](2777->2262)
(00.001117) mnt:   <--
(00.001118) mnt:  <--
(00.001119) mnt:  [/dev](2718->1857)
(00.001120) mnt:   [/dev/pts](2719->2718)
(00.001121) mnt:   <--
(00.001122) mnt:   [/dev/mqueue](2720->2718)
(00.001123) mnt:   <--
(00.001124) mnt:   [/dev/shm](2726->2718)
(00.001126) mnt:   <--
(00.001127) mnt:   [/dev/termination-log](1517->2718)
(00.001128) mnt:   <--
(00.001129) mnt:  <--
(00.001130) mnt:  [/sys](2722->1857)
(00.001131) mnt:   [/sys/fs/cgroup](2724->2722)
(00.001132) mnt:   <--
(00.001133) mnt:   [/sys/firmware](2778->2722)
(00.001134) mnt:   <--
(00.001135) mnt:  <--
(00.001136) mnt: <--
(00.001138) mnt: 	The mount 2768 is bind for 2262 (@/proc/bus -> @/proc)
(00.001139) mnt: 	The mount 2769 is bind for 2262 (@/proc/fs -> @/proc)
(00.001141) mnt: 	The mount 2770 is bind for 2262 (@/proc/irq -> @/proc)
(00.001142) mnt: 	The mount 2771 is bind for 2262 (@/proc/sys -> @/proc)
(00.001143) mnt: 	The mount 2772 is bind for 2262 (@/proc/sysrq-trigger -> @/proc)
(00.001144) mnt: 	The mount 2774 is bind for 2718 (@/proc/kcore -> @/dev)
(00.001146) mnt: 	The mount 2775 is bind for 2718 (@/proc/keys -> @/dev)
(00.001147) mnt: 	The mount 2776 is bind for 2718 (@/proc/latency_stats -> @/dev)
(00.001148) mnt: 	The mount 2777 is bind for 2718 (@/proc/timer_list -> @/dev)
(00.001150) mnt: 	The mount 2731 is bind for 2729 (@/etc/hostname -> @/etc/resolv.conf)
(00.001151) mnt: 	The mount 2732 is bind for 2729 (@/run/.containerenv -> @/etc/resolv.conf)
(00.001152) mnt: 	The mount 1519 is bind for 2729 (@/run/secrets -> @/etc/resolv.conf)
(00.001153) mnt: 	The mount 1517 is bind for 2733 (@/dev/termination-log -> @/etc/hosts)
(00.001155) mnt: Start with 1857:/
(00.001160) mnt-v2: Inspecting sharing on 2726 shared_id 0 master_id 739 (@/dev/shm)
(00.001162) mnt-v2: Inspecting sharing on 2729 shared_id 0 master_id 12 (@/etc/resolv.conf)
(00.001163) mnt-v2: Inspecting sharing on 2731 shared_id 0 master_id 12 (@/etc/hostname)
(00.001165) mnt-v2: Inspecting sharing on 2732 shared_id 0 master_id 12 (@/run/.containerenv)
(00.001168) mnt-v2: Detected external slavery for shared group (0, 12) with source /run/containers/storage/overlay-containers/8223726421ff39b9edd799cc6f81ea17bd1f65ac289c889e0da91f4e65dafe5a/userdata/.containerenv
(00.001170) mnt-v2: Detected external slavery for shared group (0, 739) with source /run/containers/storage/overlay-containers/8223726421ff39b9edd799cc6f81ea17bd1f65ac289c889e0da91f4e65dafe5a/userdata/shm
(00.001171) mnt: Mountpoint 1857 (@/) moved to the root yard
(00.001183) No pidns-9.img image
(00.001231) Warn  (criu/cr-restore.c:1296): Set CLONE_PARENT | CLONE_NEWPID but it might cause restore problem,because not all kernels support such clone flags combinations!
(00.001233) Forking task with 1 pid (flags 0x6c028000)
(00.001234) Creating process using clone3()
(00.001662) PID: real 98768 virt 1
(00.001742) Wait until namespaces are created
(00.001750)      1: Found id extRootNetNS (fd 14) in inherit fd list
(00.001903)      1: timens: monotonic -138 389360018
(00.001912)      1: timens: boottime -138 389351230
(00.001952) Running setup-namespaces scripts
(00.001955) 	RPC
(00.045753) Client exited unexpectedly
(00.045757) Error (criu/action-scripts.c:137): One of more action scripts failed
(00.046145) Error (criu/cr-restore.c:2536): Restoring FAILED.
(00.046221) Error (criu/cr-service.c:120): Can't send response: Broken pipe
(00.046223) Error (criu/cr-service.c:796): Can't send response: Broken pipe

Output of `criu --version`:

Version: 3.17.1

AWS EKS cluster with custom AMIs running cri-o. NVIDIA GPU operator v23.9.0 installed via helm chart with operator.defaultRuntime=crio
nuwang added the kind/bug label on Dec 22, 2023
haircommander (Member) commented:

@adrianreber @kolyshkin PTAL

kolyshkin (Collaborator) commented:

@nuwang this looks like an issue with the mentioned hook (rather than with cri-o, runc, or criu). I suggest you file a bug at https://github.com/NVIDIA/gpu-operator

adrianreber (Member) commented:

I am pretty sure no one has tried this before.

I also have no experience with any nvidia systems.

The CRIU message about the action scripts is also rather unusual.

nuwang (Author) commented Jan 2, 2024

Thanks for looking into this. I do not think this is limited to the operator in question, because I've also tried switching to CDI (https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/cdi.html), and that doesn't work either (although with a different set of errors).

My understanding of this is extremely sketchy, but in the runtime hooks case, the runc json spec appears to be getting modified by the hook to inject the device. In the CDI case (https://github.com/cncf-tags/container-device-interface), what appears to be happening is that the OCI spec is modified by the hook. In both cases, it appears that the sequence of events during CRI-O/CRIU's restore process doesn't allow hooks to execute in the expected order?

haircommander (Member) commented:

@nuwang it may be helpful to put together an OCI runtime spec that doesn't use nvidia but does hit this issue, is that possible?

adrianreber (Member) commented:

> My understanding of this is extremely sketchy, but in the runtime hooks case, the runc json spec appears to be getting modified by the hook to inject the device. In the CDI case (https://github.com/cncf-tags/container-device-interface), what appears to be happening is that the OCI spec is modified by the hook. In both cases, it appears that the sequence of events during CRI-O/CRIU's restore process doesn't allow hooks to execute in the expected order?

I don't know at which point the hooks modify config.json. But you are right, during restore we have a slightly different flow: we take the information from the checkpoint and create a new config.json, filling it only with the information from the checkpointed container that we know is necessary. As far as I know, no hooks are executed during restore.

> @nuwang it may be helpful to put together an OCI runtime spec that doesn't use nvidia but does hit this issue, is that possible?

Yes, that would be helpful. If you could share a checkpoint archive which fails, we should be able to see which information from the checkpointed container needs to be added to the restored container.

We make a lot of changes like this starting here: https://github.com/cri-o/cri-o/blob/main/server/container_restore.go#L152

If we know what nvidia needs, we can pull the information from the checkpointed container and add it to the restored container. Or maybe we also need to run some hooks during restore; it depends on the change.
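
To illustrate what that flow implies (a hypothetical sketch using the OCI runtime-spec Go bindings, not CRI-O's actual code; `restoreSpec` is an invented name):

```go
package main

import (
	"fmt"

	rspec "github.com/opencontainers/runtime-spec/specs-go"
)

// restoreSpec sketches the idea: selected fields are copied from the spec
// captured at checkpoint time (dumpSpec) into the freshly generated spec
// for the restored container (newSpec). Anything not copied -- including
// whatever a prestart hook injected into the original spec -- is lost.
func restoreSpec(dumpSpec, newSpec *rspec.Spec) {
	newSpec.Process.Env = dumpSpec.Process.Env // environment variables
	newSpec.Mounts = dumpSpec.Mounts           // bind mounts and volumes
	// dumpSpec.Hooks is deliberately not consulted here, so device
	// injections made by hooks (like NVIDIA's) never reach the new spec.
}

func main() {
	dump := &rspec.Spec{
		Process: &rspec.Process{Env: []string{"PATH=/usr/bin"}},
		Mounts:  []rspec.Mount{{Destination: "/proc/driver/nvidia"}},
	}
	fresh := &rspec.Spec{Process: &rspec.Process{}}
	restoreSpec(dump, fresh)
	fmt.Println(fresh.Mounts) // the mount came across; the hook itself did not
}
```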

nuwang (Author) commented Jan 2, 2024

I think the CDI case seems like a better candidate to target for a first pass instead of the runtime hook, because according to https://github.com/cncf-tags/container-device-interface?tab=readme-ov-file#cri-o-configuration, CDI is enabled in CRI-O by default. That should make reproduction easier, and CDI should probably be made to work with CRI-O/CRIU regardless?

I hadn't even heard of CDI until I ran into this issue, so it's entirely possible that what I'm proposing is incorrect, but the CDI spec itself appears to be fairly simple. It looks like
a. simply putting a json spec with some hardcoded folders/devices in the /etc/cdi folder (https://github.com/cncf-tags/container-device-interface?tab=readme-ov-file#full-blown-cdi-specification), and
b. passing in container annotations to activate that dummy device somehow (containerd/containerd#7329)

should result in something that could be used in a test case? A rough sketch follows this list.
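
For example, something like this dropped into /etc/cdi/vendor.json might do (a sketch adapted from the linked README; the vendor, device, and file names are placeholders):

```json
{
  "cdiVersion": "0.5.0",
  "kind": "vendor.com/device",
  "devices": [
    {
      "name": "myDevice",
      "containerEdits": {
        "deviceNodes": [
          { "path": "/dev/null", "type": "c" }
        ]
      }
    }
  ]
}
```

combined with a pod annotation along the lines of `cdi.k8s.io/test: vendor.com/device=myDevice` to request the dummy device.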

I have blown away my environment, but I did preserve a docker container with a checkpointed archive.
This is the checkpoint image: quay.io/nuwan_ag/tensorflow:cdi
And this is the original container: tensorflow/tensorflow:latest-gpu-jupyter

If you need something else let me know, but it might take me a while to recreate the necessary environments.

xiongzubiao commented Jan 25, 2024

Hi @nuwang, how did you make the checkpoint work? I can't even checkpoint when the nvidia container runtime is used (although a GPU is not requested):

(00.008393) Error: mnt: Mount 4401 ./proc/driver/nvidia/gpus/0000:00:1e.0 (master_id: 350 shared_id: 0) has unreachable sharing. Try --enable-external-masters.

I believe those mounts are added by Nvidia's prestart hook, after the container is created but before the process is started. They are not in the container's bundle spec (config.json). Because runc doesn't know about them, it doesn't pass external-mount (ExtMnt) options for them to CRIU.
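
(For context: the mechanism CRIU documents for such mounts is external bind mounts, roughly the CLI form below; runc passes the equivalent over RPC for every mount it finds in config.json, which is why hook-injected mounts never make it in. The nvidia path and cookie name here are just examples.)

```shell
# dump: declare the hook-injected mount as external, keyed by a cookie
criu dump ... --external mnt[/proc/driver/nvidia]:nvidia-proc

# restore: map the cookie back to a bind-mount source on the host
criu restore ... --external mnt[nvidia-proc]:/proc/driver/nvidia
```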

nuwang (Author) commented Jan 28, 2024

I added the enable-external-masters option, which can be put in /etc/criu/runc.conf (snippet below).
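
For reference, CRIU config files take the same options as the CLI, minus the leading dashes, one per line, so the file ends up looking like this:

```
# /etc/criu/runc.conf
enable-external-masters
```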

xiongzubiao commented Feb 2, 2024

@nuwang I tried adding enable-external-masters to /etc/criu/runc.conf, but I still see the same mount error. I then modified the criu code to simply ignore the nvidia mounts that criu can't handle; the checkpoint passed, but then I hit the same restore error as you.

github-actions bot commented Mar 4, 2024

A friendly reminder that this issue had no activity for 30 days.

github-actions bot added the lifecycle/stale label on Mar 4, 2024
github-actions bot commented Jun 2, 2024

Closing this issue since it had no activity in the past 90 days.

github-actions bot added the lifecycle/rotten label on Jun 2, 2024
github-actions bot closed this as not planned on Jun 2, 2024