Make sure controller and tests shut down cleanly #215
base: main
Conversation
This isn't a massive difference, just a little neater. Signed-off-by: Michael Bridgen <[email protected]>
This commit
- fixes a problem whereby the runner loop didn't exit (the good old "break out of the select but not out of the for" mistake; sketched just below)
- uses a WaitGroup to make sure all the runner children (i.e., client caches) have exited
- puts in more logging of things starting up and shutting down

Signed-off-by: Michael Bridgen <[email protected]>
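For context, the "break out of the select but not out of the for" mistake refers to the Go pitfall illustrated below. This is a generic sketch with placeholder names, not the actual runner code; the WaitGroup part of the fix is sketched further down alongside the runner description.

```go
package example

import "context"

// buggyLoop never exits: `break` only leaves the select statement, not the
// enclosing for loop, so the loop keeps spinning after ctx is cancelled.
func buggyLoop(ctx context.Context, work <-chan func()) {
	for {
		select {
		case <-ctx.Done():
			break // BUG: breaks out of the select only
		case job := <-work:
			job()
		}
	}
}

// fixedLoop exits the loop (and hence its goroutine) once ctx is cancelled.
func fixedLoop(ctx context.Context, work <-chan func()) {
	for {
		select {
		case <-ctx.Done():
			return // or use a labelled break to leave the for loop too
		case job := <-work:
			job()
		}
	}
}
```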
Signed-off-by: Michael Bridgen <[email protected]>
Force-pushed from 905e23f to 2619dfc
Hmm no, this takes care to tell everything to shut down, and waits for it, but at the point it exits there are still two HTTP connections open. It's tricky to tell why, because the stack trace does not go back to what's actually using them, but I strongly suspect they are Kubernetes API client connections. This has implications outside testing -- it might mean the controller fails to shut down gracefully.
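One generic way to see what is still alive at this point (not necessarily what was done here) is to dump every goroutine's stack just before exiting, which shows, for example, HTTP transport goroutines still parked on open connections:

```go
package example

import (
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes a full stack trace for every live goroutine to
// stderr; debug=2 gives the fully symbolised, panic-style format.
func dumpGoroutines() {
	_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}
```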
As part of testing pipelines with targets in remote clusters, an envtest Environment `leafEnv` is created to act as a remote cluster. But it is left running after the tests, as one could see with `ps` once `go test` has completed.

Since the controller is run in TestMain, the environment must be shut down in TestMain -- i.e., not within an individual test. This commit adds a simple mechanism for keeping track of envs to shut down after the tests have run, and uses it both for the "main" cluster environment and the leaf cluster.

Signed-off-by: Michael Bridgen <[email protected]>
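A minimal sketch of such a mechanism, assuming controller-runtime's envtest package; the names (`envsToStop`, `startEnv`) are illustrative rather than taken from the commit:

```go
package leveltriggered

import (
	"os"
	"testing"

	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

// envsToStop records every envtest.Environment started for the tests (the
// "main" cluster and any leaf clusters), so they can all be shut down in
// TestMain rather than in individual tests.
var envsToStop []*envtest.Environment

// startEnv starts an environment and registers it for shutdown.
func startEnv(env *envtest.Environment) error {
	if _, err := env.Start(); err != nil {
		return err
	}
	envsToStop = append(envsToStop, env)
	return nil
}

func TestMain(m *testing.M) {
	mainEnv := &envtest.Environment{}
	if err := startEnv(mainEnv); err != nil {
		panic(err)
	}

	code := m.Run()

	// Stop every environment, so etcd and kube-apiserver processes do not
	// outlive `go test`.
	for _, env := range envsToStop {
		_ = env.Stop()
	}
	os.Exit(code)
}
```

A test that needs a remote cluster would then create its `leafEnv` via `startEnv` and leave the teardown to TestMain.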
Yiannis obligingly ran through this with me, from scratch, and what we saw was:
1. on tracing the code through apimachinery and client-go/tools, it appears that cancelling the context used to start a cache should stop everything, as expected; and
2. when the tests were completed and all cleanup had been done, there were still two goroutines blocked on waiting for processes to exit.

Then it clicked -- these were the etcd and kube-apiserver processes for the leaf environment.
I noticed that `etcd` and `kube-apiserver` processes were hanging around after the tests had completed, and this prompted me to make sure everything in controllers/leveltriggered/ was shutting down cleanly.

The level-triggered controller needs to keep a set of caches, and to shut them down when it no longer needs them. To do this it uses a couple of thin abstractions -- one to keep track of the cache goroutines (`runner`), one to garbage collect unused caches (`gc`). Of these, `runner` starts its own goroutines, so to shut down gracefully it needs to wait for them all to exit before exiting itself (a rough sketch of this pattern follows below). Exiting without waiting for cache goroutines seems to be the cause of `kube-apiserver` not terminating -- I think because they leave watches in an indeterminate state.

So:
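As a rough illustration of the wait-before-exit behaviour described above -- hypothetical types and names, not the PR's actual runner implementation -- each cache is started in its own goroutine under a cancellable context, and the runner only returns from its wait once every one of those goroutines has exited:

```go
package example

import (
	"context"
	"sync"
)

// startable is anything with a blocking Start, such as a controller-runtime
// cache, whose Start(ctx) returns once ctx is cancelled.
type startable interface {
	Start(ctx context.Context) error
}

// runner tracks the goroutines it starts so it can wait for them on shutdown.
type runner struct {
	wg sync.WaitGroup
}

// run starts c in its own goroutine and returns a cancel func for stopping it.
func (r *runner) run(ctx context.Context, c startable) context.CancelFunc {
	ctx, cancel := context.WithCancel(ctx)
	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		_ = c.Start(ctx) // blocks until ctx is cancelled
	}()
	return cancel
}

// wait blocks until every goroutine started by run has exited, so the caller
// knows all caches (and their watches) have shut down before it returns.
func (r *runner) wait() {
	r.wg.Wait()
}
```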