Stopping a container blocks all further stop attempts for the same container #8030
Comments
interesting, I would expect this to work given we have a test to check it would: https://github.com/cri-o/cri-o/blob/main/test/ctr.bats#L1150 Though, we don't verify we don't wait on the lock... I will try to look at this tomorrow or next week. please poke me if I forget 🙏
Hey @haircommander, thanks for the reply! I've looked a bit at the tests you shared and it looks like the same scenario I'm describing. However, if I understand correctly, every If that is the case, I suspect reducing the client timeout (or increasing the SIGTERM timeout) will cause the test to fail. That's my take, as far as I understand.
Hi @haircommander, any updates on this?
What happened?
We've noticed that with version 1.29.1 we can no longer issue multiple `crictl stop` commands for the same container. If there is a `StopContainer` operation active, any other `StopContainer` operation will time out on the client side, waiting for the server to take a lock that is held by the initial stop operation.

Context
In a scenario where you have one container that is unwilling to die gracefully over a long period of time (e.g. a big database that ignores SIGTERM for a long period of time while closing connections), the following regression happens:

1. `crictl stop -t 300 <container>` will send a SIGTERM and after 300s send a SIGKILL. This command hangs as it should while we wait for the container to die gracefully or get forcefully terminated after 300s.
2. `crictl stop -t 0 <container>` to immediately kill it.

Currently, the second command will not take effect, and `crictl` will get a context timeout from the `crio` server. The following logs show the symptoms:
This behavior was not present on version 1.27.1 and multiple stop container operations can be issued.
What did you expect to happen?
I'd expect to have the same behavior we did in 1.27, where the second `crictl stop` command goes through and manages to kill the container. The container runtime should honor all stop requests; even if one is currently waiting on a longer timeout, we should be able to stop a container forcefully, even if we mistakenly or purposely tried with a longer timeout before.

I think this might be a regression introduced around the time of this refactor.
How can we reproduce it (as minimally and precisely as possible)?
To reproduce the issue, in a host running k8s with cri-o as container runtime:

1. Create the pod. This uses a `busybox` image to run a `sleep` command for 600 seconds, and has a `terminationGracePeriodSeconds` of 3600.
2. Delete the pod with `kubectl`. This will trigger a `StopContainer` call with a timeout of 3600, which will hold the stop lock on the crio server, preventing any future stops from working.
3. Attempt a `crictl stop`.
4. Attempt to stop with explicit timeouts:
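For reference, a pod spec matching the description in step 1 could look like the following. This is a reconstruction from the text above, not the reporter's original manifest, and the pod and container names are made up:

```yaml
# Hypothetical reconstruction of the reproducer pod: busybox sleeping for
# 600s, with a one-hour termination grace period.
apiVersion: v1
kind: Pod
metadata:
  name: slow-to-stop   # made-up name
spec:
  terminationGracePeriodSeconds: 3600
  containers:
  - name: sleeper      # made-up name
    image: busybox
    command: ["sleep", "600"]
```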
Note: the same can be achieved using `crictl stop -t 3600 <container>` on step 3.

Anything else we need to know?
I've run a goroutine dump of the state of my cri-o server during step 4 of the reproduction steps and confirmed that there is at least one goroutine in the method `WaitOnStopTimeout`, while all the others cannot enter the main stop method.

This is the first `crictl stop` issued, with the high timeout, waiting on this timeout channel if I'm not mistaken. It is holding the `stopLock`.

Here is one of the following `crictl stop` calls, attempting to get the same lock for the `ShouldBeStopped` function.

CRI-O and Kubernetes version
OS version
Additional environment details (AWS, VirtualBox, physical, etc.)