
crio context deadline exceeded #8007

Open
pycgo opened this issue Apr 12, 2024 · 11 comments

Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@pycgo

pycgo commented Apr 12, 2024

What happened?

[root@node7 ~]# crictl ps
FATA[0002] validate service connection: validate CRI v1 runtime API for endpoint "unix:///var/run/crio/crio.sock": rpc error: code = DeadlineExceeded desc = context deadline exceeded
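For context, crictl's client-side RPC timeout defaults to 2 seconds (consistent with the FATA[0002] above), so retrying with a larger timeout is one way to tell a merely slow runtime API apart from one that is fully hung. A minimal sketch, against the same endpoint as in the error:

# Retry the same RPCs with a longer client-side deadline (default is 2s).
# If these still fail with DeadlineExceeded, the server side is genuinely hung,
# not just slow to answer.
crictl --runtime-endpoint unix:///var/run/crio/crio.sock --timeout 30s info
crictl --runtime-endpoint unix:///var/run/crio/crio.sock --timeout 30s ps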

What did you expect to happen?

The k8s node becomes NotReady.

How can we reproduce it (as minimally and precisely as possible)?

It only happens after CRI-O has been running for a while and is not easy to reproduce. Over the past year I have hit this problem about five or six times with CRI-O in different clusters.

Anything else we need to know?

When I run systemctl restart cri-o, the problem goes away.
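Recovery is just a restart plus a quick check that the socket answers again; a minimal sketch, assuming the systemd unit is named cri-o as in the command above (on some installs it is crio):

# Restart the runtime and confirm the CRI endpoint responds again.
systemctl restart cri-o            # or: systemctl restart crio, depending on the unit name
systemctl status cri-o --no-pager
crictl --runtime-endpoint unix:///var/run/crio/crio.sock ps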

CRI-O and Kubernetes version

$ crio --version
1.27.1 and 1.28.2 (different clusters)
$ kubectl version --output=json
crio 1.27.1 used in k8s 1.27.4
crio 1.28.2 used in k8s 1.28.5

OS version

# On Linux:
$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
$ uname -a
Linux master 5.4.225-200.el7.x86_64 #1 SMP Mon Jan 2 15:36:14 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Additional environment details (AWS, VirtualBox, physical, etc.)

@pycgo pycgo added the kind/bug Categorizes issue or PR as related to a bug. label Apr 12, 2024
@haircommander
Member

can you post the cri-o logs while this is happening?

@pycgo
Author

pycgo commented Apr 12, 2024

can you post the cri-o logs while this is happening?

journalctl -u cri-o?

@haircommander
Member

yeah!

@pycgo
Author

pycgo commented Apr 12, 2024

These are the messages from the node while this is happening:

Apr 11 21:50:04 node7 crio: time="2024-04-11 21:50:04.343272892+08:00" level=info msg="Checking image status: registry.aliyuncs.com/linkflow/pause:3.9" id=76e1a622-ce3d-4a1f-a686-58b026b73c8c name=/runtime.v1.ImageService/ImageStatus
Apr 11 21:50:04 node7 crio: time="2024-04-11 21:50:04.343488104+08:00" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:e6f1816883972d4be47bd48879a08919b96afcd344132622e4d444987919323c,RepoTags:[registry.aliyuncs.com/linkflow/pause:3.9],RepoDigests:[registry.aliyuncs.com/linkflow/pause@sha256:0fc1f3b764be56f7c881a69cbd553ae25a2b5523c6901fbacb8270307c29d0c4],Size_:748009,Uid:&Int64Value{Value:65535,},Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{},},Pinned:false,},Info:map[string]string{},}" id=76e1a622-ce3d-4a1f-a686-58b026b73c8c name=/runtime.v1.ImageService/ImageStatus
Apr 11 21:51:01 node7 systemd: Started Session 74940 of user root.
Apr 11 21:52:01 node7 systemd: Started Session 74941 of user root.
Apr 11 21:53:01 node7 systemd: Started Session 74942 of user root.
Apr 11 21:54:01 node7 systemd: Started Session 74943 of user root.
Apr 11 21:55:01 node7 systemd: Started Session 74944 of user root.
Apr 11 21:55:04 node7 crio: time="2024-04-11 21:55:04.346715313+08:00" level=info msg="Checking image status: registry.aliyuncs.com/linkflow/pause:3.9" id=efa5b073-1589-4f68-ba45-2ac4b892dd76 name=/runtime.v1.ImageService/ImageStatus
Apr 11 21:55:04 node7 crio: time="2024-04-11 21:55:04.346949509+08:00" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:e6f1816883972d4be47bd48879a08919b96afcd344132622e4d444987919323c,RepoTags:[registry.aliyuncs.com/linkflow/pause:3.9],RepoDigests:[registry.aliyuncs.com/linkflow/pause@sha256:0fc1f3b764be56f7c881a69cbd553ae25a2b5523c6901fbacb8270307c29d0c4],Size_:748009,Uid:&Int64Value{Value:65535,},Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{},},Pinned:false,},Info:map[string]string{},}" id=efa5b073-1589-4f68-ba45-2ac4b892dd76 name=/runtime.v1.ImageService/ImageStatus
Apr 11 21:56:01 node7 systemd: Started Session 74945 of user root.
Apr 11 21:57:01 node7 systemd: Started Session 74946 of user root.
Apr 11 21:58:01 node7 systemd: Started Session 74947 of user root.
Apr 11 21:59:01 node7 systemd: Started Session 74948 of user root.
Apr 11 22:00:01 node7 systemd: Started Session 74949 of user root.
Apr 11 22:00:04 node7 crio: time="2024-04-11 22:00:04.350012002+08:00" level=info msg="Checking image status: registry.aliyuncs.com/linkflow/pause:3.9" id=66416206-0a2d-4460-89a3-f2d749917e50 name=/runtime.v1.ImageService/ImageStatus
Apr 11 22:00:04 node7 crio: time="2024-04-11 22:00:04.350245730+08:00" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:e6f1816883972d4be47bd48879a08919b96afcd344132622e4d444987919323c,RepoTags:[registry.aliyuncs.com/linkflow/pause:3.9],RepoDigests:[registry.aliyuncs.com/linkflow/pause@sha256:0fc1f3b764be56f7c881a69cbd553ae25a2b5523c6901fbacb8270307c29d0c4],Size_:748009,Uid:&Int64Value{Value:65535,},Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{},},Pinned:false,},Info:map[string]string{},}" id=66416206-0a2d-4460-89a3-f2d749917e50 name=/runtime.v1.ImageService/ImageStatus
Apr 11 22:01:01 node7 systemd: Started Session 74950 of user root.
Apr 11 22:01:01 node7 systemd: Started Session 74951 of user root.
Apr 11 22:02:01 node7 systemd: Started Session 74952 of user root.
Apr 11 22:03:01 node7 systemd: Started Session 74953 of user root.
Apr 11 22:03:35 node7 kubelet: E0411 22:03:35.137593   14277 remote_runtime.go:479] "ExecSync cmd from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="586b18245a375bf09f4da25d029c2471d1e72f943fd5b5f78f784f6db43d8e0f" cmd=[/bin/calico-node -bird-ready -felix-ready]
Apr 11 22:04:01 node7 systemd: Started Session 74954 of user root.
Apr 11 22:04:11 node7 kubelet: E0411 22:04:11.103826   14277 remote_runtime.go:277] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" filter="&PodSandboxFilter{Id:,State:&PodSandboxStateValue{State:SANDBOX_READY,},LabelSelector:map[string]string{},}"
Apr 11 22:04:11 node7 kubelet: E0411 22:04:11.103895   14277 kuberuntime_sandbox.go:297] "Failed to list pod sandboxes" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Apr 11 22:04:11 node7 kubelet: E0411 22:04:11.103915   14277 kubelet_pods.go:1103] "Error listing containers" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Apr 11 22:04:11 node7 kubelet: E0411 22:04:11.103928   14277 kubelet.go:2427] "Failed cleaning pods" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Apr 11 22:04:11 node7 kubelet: E0411 22:04:11.103941   14277 kubelet.go:2431] "Housekeeping took longer than expected" err="housekeeping took too long" expected="1s" actual="2m50.48s"
Apr 11 22:04:11 node7 kubelet: E0411 22:04:11.103956   14277 kubelet.go:2281] "Skipping pod synchronization" err="container runtime is down"
Apr 11 22:04:11 node7 kubelet: E0411 22:04:11.103843   14277 remote_runtime.go:415] "ContainerStatus from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="5107706393b4c2f329c856f1dce971434f64dc790cb2897440ed8464f5f496d3"
Apr 11 22:04:11 node7 kubelet: E0411 22:04:11.103985   14277 remote_runtime.go:277] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" filter="nil"
Apr 11 22:04:11 node7 kubelet: E0411 22:04:11.103994   14277 kuberuntime_sandbox.go:297] "Failed to list pod sandboxes" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Apr 11 22:04:11 node7 kubelet: E0411 22:04:11.104005   14277 container_log_manager.go:235] "Failed to get container status" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="5107706393b4c2f329c856f1dce971434f64dc790cb2897440ed8464f5f496d3"
Apr 11 22:04:11 node7 kubelet: E0411 22:04:11.104218   14277 kubelet.go:1371] "Container garbage collection failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Apr 11 22:05:01 node7 systemd: Started Session 74955 of user root.
Apr 11 22:06:01 node7 systemd: Started Session 74956 of user root.
Apr 11 22:06:11 node7 kubelet: E0411 22:06:11.113781   14277 remote_runtime.go:415] "ContainerStatus from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="7f375a77bb311b99f25074f99058bdbb6f5b44ece39378c15d92503bc36bc49e"
Apr 11 22:06:11 node7 kubelet: E0411 22:06:11.113844   14277 container_log_manager.go:235] "Failed to get container status" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="7f375a77bb311b99f25074f99058bdbb6f5b44ece39378c15d92503bc36bc49e"
Apr 11 22:06:11 node7 crio: time="2024-04-11 22:06:11.120433243+08:00" level=info msg="Checking image status: registry.aliyuncs.com/linkflow/pause:3.9" id=2815d546-2efa-4cef-9f10-ba0dcdaaec66 name=/runtime.v1.ImageService/ImageStatus
Apr 11 22:06:11 node7 crio: time="2024-04-11 22:06:11.120643208+08:00" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:e6f1816883972d4be47bd48879a08919b96afcd344132622e4d444987919323c,RepoTags:[registry.aliyuncs.com/linkflow/pause:3.9],RepoDigests:[registry.aliyuncs.com/linkflow/pause@sha256:0fc1f3b764be56f7c881a69cbd553ae25a2b5523c6901fbacb8270307c29d0c4],Size_:748009,Uid:&Int64Value{Value:65535,},Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{},},Pinned:false,},Info:map[string]string{},}" id=2815d546-2efa-4cef-9f10-ba0dcdaaec66 name=/runtime.v1.ImageService/ImageStatus
Apr 11 22:06:11 node7 kubelet: I0411 22:06:11.121778   14277 setters.go:552] "Node became not ready" node="node7" condition={Type:Ready Status:False LastHeartbeatTime:2024-04-11 22:06:11.121712389 +0800 CST m=+4423015.057618713 LastTransitionTime:2024-04-11 22:06:11.121712389 +0800 CST m=+4423015.057618713 Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m59.017556433s ago; threshold is 3m0s}
Apr 11 22:06:16 node7 kubelet: E0411 22:06:16.164187   14277 remote_runtime.go:479] "ExecSync cmd from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="586b18245a375bf09f4da25d029c2471d1e72f943fd5b5f78f784f6db43d8e0f" cmd=[/bin/calico-node -bird-ready -felix-ready]
Apr 11 22:07:01 node7 systemd: Started Session 74957 of user root.
Apr 11 22:08:01 node7 systemd: Started Session 74958 of user root.
Apr 11 22:08:11 node7 kubelet: E0411 22:08:11.123785   14277 remote_runtime.go:277] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" filter="&PodSandboxFilter{Id:,State:&PodSandboxStateValue{State:SANDBOX_READY,},LabelSelector:map[string]string{},}"
Apr 11 22:08:11 node7 kubelet: E0411 22:08:11.123866   14277 remote_runtime.go:277] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" filter="nil"
Apr 11 22:08:11 node7 kubelet: E0411 22:08:11.123903   14277 kuberuntime_sandbox.go:297] "Failed to list pod sandboxes" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Apr 11 22:08:11 node7 kubelet: E0411 22:08:11.123871   14277 kuberuntime_sandbox.go:297] "Failed to list pod sandboxes" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Apr 11 22:08:11 node7 kubelet: E0411 22:08:11.123952   14277 kubelet_pods.go:1103] "Error listing containers" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Apr 11 22:08:11 node7 kubelet: E0411 22:08:11.123971   14277 kubelet.go:2427] "Failed cleaning pods" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Apr 11 22:08:11 node7 kubelet: E0411 22:08:11.123990   14277 kubelet.go:2431] "Housekeeping took longer than expected" err="housekeeping took too long" expected="1s" actual="3m59.919s"
Apr 11 22:08:11 node7 kubelet: E0411 22:08:11.124018   14277 kubelet.go:2281] "Skipping pod synchronization" err="[container runtime is down, PLEG is not healthy: pleg was last seen active 3m59.00963546s ago; threshold is 3m0s]"

And this is the cri-o journal:

-- Logs begin at Tue 2024-02-20 17:20:35 CST, end at Fri 2024-04-12 22:53:01 CST. --
4月 11 21:50:04 node7 crio[6968]: time="2024-04-11 21:50:04.343272892+08:00" level=info msg="Checking image status: registry.aliyuncs.com/linkflow/pause:3.9" id=76e1a622-ce3d-4a1f-a686-58b026b73c8c name=/runtime.v1.ImageService/ImageStatus
4月 11 21:50:04 node7 crio[6968]: time="2024-04-11 21:50:04.343488104+08:00" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:e6f1816883972d4be47bd48879a08919b96afcd344132622e4d444987919323c,RepoTags:[registry.aliyuncs.com/
4月 11 21:55:04 node7 crio[6968]: time="2024-04-11 21:55:04.346715313+08:00" level=info msg="Checking image status: registry.aliyuncs.com/linkflow/pause:3.9" id=efa5b073-1589-4f68-ba45-2ac4b892dd76 name=/runtime.v1.ImageService/ImageStatus
4月 11 21:55:04 node7 crio[6968]: time="2024-04-11 21:55:04.346949509+08:00" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:e6f1816883972d4be47bd48879a08919b96afcd344132622e4d444987919323c,RepoTags:[registry.aliyuncs.com/
4月 11 22:00:04 node7 crio[6968]: time="2024-04-11 22:00:04.350012002+08:00" level=info msg="Checking image status: registry.aliyuncs.com/linkflow/pause:3.9" id=66416206-0a2d-4460-89a3-f2d749917e50 name=/runtime.v1.ImageService/ImageStatus
4月 11 22:00:04 node7 crio[6968]: time="2024-04-11 22:00:04.350245730+08:00" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:e6f1816883972d4be47bd48879a08919b96afcd344132622e4d444987919323c,RepoTags:[registry.aliyuncs.com/
4月 11 22:06:11 node7 crio[6968]: time="2024-04-11 22:06:11.120433243+08:00" level=info msg="Checking image status: registry.aliyuncs.com/linkflow/pause:3.9" id=2815d546-2efa-4cef-9f10-ba0dcdaaec66 name=/runtime.v1.ImageService/ImageStatus
4月 11 22:06:11 node7 crio[6968]: time="2024-04-11 22:06:11.120643208+08:00" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:e6f1816883972d4be47bd48879a08919b96afcd344132622e4d444987919323c,RepoTags:[registry.aliyuncs.com/
4月 11 22:14:11 node7 crio[6968]: time="2024-04-11 22:14:11.152505028+08:00" level=info msg="Checking image status: registry.aliyuncs.com/linkflow/pause:3.9" id=aed7c6bf-d6b6-450b-b4cd-8e6433addf63 name=/runtime.v1.ImageService/ImageStatus
4月 11 22:14:11 node7 crio[6968]: time="2024-04-11 22:14:11.152874810+08:00" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:e6f1816883972d4be47bd48879a08919b96afcd344132622e4d444987919323c,RepoTags:[registry.aliyuncs.com/
4月 11 22:22:11 node7 crio[6968]: time="2024-04-11 22:22:11.175745207+08:00" level=info msg="Checking image status: registry.aliyuncs.com/linkflow/pause:3.9" id=d6f88bac-1113-48b0-8bbf-0a281c28ce49 name=/runtime.v1.ImageService/ImageStatus
4月 11 22:22:11 node7 crio[6968]: time="2024-04-11 22:22:11.176084107+08:00" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:e6f1816883972d4be47bd48879a08919b96afcd344132622e4d444987919323c,RepoTags:[registry.aliyuncs.com/
4月 11 22:30:11 node7 crio[6968]: time="2024-04-11 22:30:11.198768268+08:00" level=info msg="Checking image status: registry.aliyuncs.com/linkflow/pause:3.9" id=d48c3c46-57e7-49f8-8526-c12cb33cd380 name=/runtime.v1.ImageService/ImageStatus
4月 11 22:30:11 node7 crio[6968]: time="2024-04-11 22:30:11.199023930+08:00" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:e6f1816883972d4be47bd48879a08919b96afcd344132622e4d444987919323c,RepoTags:[registry.aliyuncs.com/
4月 11 22:38:11 node7 crio[6968]: time="2024-04-11 22:38:11.237726770+08:00" level=info msg="Checking image status: registry.aliyuncs.com/linkflow/pause:3.9" id=97745b1a-4526-4090-92f4-9d3bc4d1481d name=/runtime.v1.ImageService/ImageStatus
4月 11 22:38:11 node7 crio[6968]: time="2024-04-11 22:38:11.237985537+08:00" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:e6f1816883972d4be47bd48879a08919b96afcd344132622e4d444987919323c,RepoTags:[registry.aliyuncs.com/
4月 11 22:46:11 node7 crio[6968]: time="2024-04-11 22:46:11.273071957+08:00" level=info msg="Checking image status: registry.aliyuncs.com/linkflow/pause:3.9" id=41c06890-1cf3-405b-90d5-5b5952943139 name=/runtime.v1.ImageService/ImageStatus
4月 11 22:46:11 node7 crio[6968]: time="2024-04-11 22:46:11.273854824+08:00" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:e6f1816883972d4be47bd48879a08919b96afcd344132622e4d444987919323c,RepoTags:[registry.aliyuncs.com/
4月 11 22:50:11 node7 crio[6968]: time="2024-04-11 22:50:11.293857819+08:00" level=info msg="Stopping container: adc20c68f0b41b037ff20bfb4f2bfec2d2c8c5e8a86db35cf26ec37fa76dce54 (timeout: 30s)" id=09d05bda-c52d-42f7-9d56-1b139ea5ffb9 name=/r
4月 11 22:50:11 node7 crio[6968]: time="2024-04-11 22:50:11.294096082+08:00" level=info msg="Stopping container: 5107706393b4c2f329c856f1dce971434f64dc790cb2897440ed8464f5f496d3 (timeout: 30s)" id=a12df590-31cc-4572-b65f-b35bb2bb6a45 name=/r
4月 11 22:50:11 node7 crio[6968]: time="2024-04-11 22:50:11.294225625+08:00" level=info msg="Stopping container: fa2bd8715b03b25ae517513b8d0b0b67d963239dedcec63e8d293d730e897586 (timeout: 30s)" id=c4921199-cb7e-48b7-98ba-de1e66b1e3cf name=/r
4月 11 22:52:11 node7 crio[6968]: time="2024-04-11 22:52:11.300156247+08:00" level=warning msg="Stopping container adc20c68f0b41b037ff20bfb4f2bfec2d2c8c5e8a86db35cf26ec37fa76dce54 with stop signal timed out: context canceled" id=09d05bda-c52
4月 11 22:52:11 node7 crio[6968]: time="2024-04-11 22:52:11.300197590+08:00" level=info msg="Stopping pod sandbox: 50d6c02598dc65bd386245ec41b9a1b7276cd93fec725f2c44a2e2089bf4a33d" id=699f3c89-b39d-4760-bc28-59dac6a215d3 name=/runtime.v1.Run
4月 11 22:52:11 node7 crio[6968]: time="2024-04-11 22:52:11.300332571+08:00" level=info msg="Stopping pod sandbox: 7b3dab6725265e8a2339f22d2a966550bd4683cfb475c31e0aa9b407189a6cea" id=e7e87036-26a7-4862-9989-84a5c0c2ea58 name=/runtime.v1.Run
4月 11 22:52:11 node7 crio[6968]: time="2024-04-11 22:52:11.300403789+08:00" level=info msg="Got pod network &{Name:salesorder2hudi-3242818dc5f34103-exec-1 Namespace:cdp ID:50d6c02598dc65bd386245ec41b9a1b7276cd93fec725f2c44a2e2089bf4a33d UID
4月 11 22:52:11 node7 crio[6968]: time="2024-04-11 22:52:11.300159723+08:00" level=warning msg="Stopping container 5107706393b4c2f329c856f1dce971434f64dc790cb2897440ed8464f5f496d3 with stop signal timed out: timeout reached after 30 seconds
4月 11 22:52:11 node7 crio[6968]: time="2024-04-11 22:52:11.300615444+08:00" level=info msg="Deleting pod cdp_salesorder2hudi-3242818dc5f34103-exec-1 from CNI network \"k8s-pod-network\" (type=calico)"
4月 11 22:52:11 node7 crio[6968]: time="2024-04-11 22:52:11.300730065+08:00" level=info msg="Got pod network &{Name:journey2hudi-69f59a8dc5f3477e-exec-1 Namespace:cdp ID:7b3dab6725265e8a2339f22d2a966550bd4683cfb475c31e0aa9b407189a6cea UID:d8
4月 11 22:52:11 node7 crio[6968]: time="2024-04-11 22:52:11.300878130+08:00" level=info msg="Deleting pod cdp_journey2hudi-69f59a8dc5f3477e-exec-1 from CNI network \"k8s-pod-network\" (type=calico)"
4月 11 22:52:11 node7 crio[6968]: time="2024-04-11 22:52:11.300910807+08:00" level=info msg="Stopping pod sandbox: 5191dc7b42bf05769b0c164deb6cb1bd62a1cbafb07a94931723eae410e79301" id=1c284dd0-4ace-4bfe-8d52-e0ba4800c8a9 name=/runtime.v1.Run
4月 11 22:52:11 node7 crio[6968]: time="2024-04-11 22:52:11.301247991+08:00" level=info msg="Got pod network &{Name:dim2hudi-2bc8818dc5f333b7-exec-1 Namespace:cdp ID:5191dc7b42bf05769b0c164deb6cb1bd62a1cbafb07a94931723eae410e79301 UID:31dba5
4月 11 22:52:11 node7 crio[6968]: time="2024-04-11 22:52:11.301418523+08:00" level=info msg="Deleting pod cdp_dim2hudi-2bc8818dc5f333b7-exec-1 from CNI network \"k8s-pod-network\" (type=calico)"
4月 11 22:54:11 node7 crio[6968]: time="2024-04-11 22:54:11.313770085+08:00" level=info msg="Checking image status: registry.aliyuncs.com/linkflow/pause:3.9" id=d0ba6b91-38d9-48e9-aa6d-ed20484c631c name=/runtime.v1.ImageService/ImageStatus
4月 11 22:54:11 node7 crio[6968]: time="2024-04-11 22:54:11.314008887+08:00" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:e6f1816883972d4be47bd48879a08919b96afcd344132622e4d444987919323c,RepoTags:[registry.aliyuncs.com/
4月 11 22:58:11 node7 crio[6968]: time="2024-04-11 22:58:11.338855472+08:00" level=info msg="Stopped container fa2bd8715b03b25ae517513b8d0b0b67d963239dedcec63e8d293d730e897586: cdp/dim2hudi-2bc8818dc5f333b7-exec-1/spark-kubernetes-executor"
4月 11 23:02:11 node7 crio[6968]: time="2024-04-11 23:02:11.353814587+08:00" level=info msg="Checking image status: registry.aliyuncs.com/linkflow/pause:3.9" id=0ee1ebef-94d7-4c49-926c-bbc228f62040 name=/runtime.v1.ImageService/ImageStatus

@haircommander
Member

How are the resources on the node? Is system.slice big enough? I haven't seen this exact thing happen, but I'm wondering if cri-o is being throttled or something. The other thought is that something is weird with localhost communication, so some messages aren't getting connected. Both of these guesses are long shots, though.
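A few commands that could help answer that; a rough sketch, assuming cri-o runs as a systemd unit under system.slice (the unit name is assumed to be crio here and may be cri-o on this install):

# Is cri-o (or its parent system.slice) CPU-throttled or memory-limited?
systemctl status crio --no-pager
systemctl show crio --no-pager | grep -E 'CPUQuota|MemoryLimit|MemoryMax|TasksMax'
systemctl show system.slice --no-pager | grep -E 'CPUQuota|MemoryLimit|MemoryMax|TasksMax'
systemd-cgtop system.slice         # interactive per-unit CPU/memory view; press q to quit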

@pycgo
Author

pycgo commented Apr 12, 2024

No resource problems. One of these clusters had only just had k8s installed and was not running any workloads; about two months later one node went down (the cluster has 9 nodes). But some nodes in the same cluster haven't had any problems for a year.

@pycgo
Author

pycgo commented Apr 12, 2024

[root@node7 cni]# free -g
              total        used        free      shared  buff/cache   available
Mem:            125           2         114           0           9         122
Swap:             0           0           0
[root@node7 cni]#

@pycgo
Author

pycgo commented Apr 12, 2024

How are the resources on the node? Is system.slice big enough? I haven't seen this exact thing happen, but I'm wondering if cri-o is being throttled or something. The other thought is that something is weird with localhost communication, so some messages aren't getting connected. Both of these guesses are long shots, though.

How did you identify which log reflects the communication issue?

@haircommander
Member

haircommander commented Apr 19, 2024

Well, I see cri-o doing some communicating with the kubelet as it's reacting to stop container requests. But for some reason the execs are timing out. Maybe they're taking too long? Can you try enabling the debug log_level in cri-o to see if that helps find the issue?
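For reference, a minimal sketch of turning on debug logging, assuming these CRI-O versions honor the /etc/crio/crio.conf.d/ drop-in directory and keep log_level in the [crio.runtime] table (the 99-debug.conf file name is just an example):

# Raise cri-o's log verbosity via a drop-in, then restart and watch the journal.
cat <<'EOF' > /etc/crio/crio.conf.d/99-debug.conf
[crio.runtime]
log_level = "debug"
EOF
systemctl restart cri-o            # or: systemctl restart crio
journalctl -u cri-o -f             # the runtime logs will be much noisier now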

@pycgo
Author

pycgo commented Apr 19, 2024

Well, I see cri-o doing some communicating with the kubelet as it's reacting to stop container requests. But for some reason the execs are timing out. Maybe they're taking too long? Can you try enabling the debug log_level in cri-o to see if that helps find the issue?

Okay, I can start debugging, but reproducing the problem may take a long time.


A friendly reminder that this issue had no activity for 30 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 20, 2024
@kwilczynski kwilczynski removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 20, 2024