Make sure controller and tests shut down cleanly #215
base: main
Conversation
This isn't a massive difference, just a little neater. Signed-off-by: Michael Bridgen <[email protected]>
This commit
- fixes a problem whereby the runner loop didn't exit (the good old "break out of the select but not out of the for" mistake; sketched just below)
- uses a WaitGroup to make sure all the runner children (i.e., client caches) have exited
- puts in more logging of things starting up and shutting down

Signed-off-by: Michael Bridgen <[email protected]>
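For context, the "break out of the select but not out of the for" mistake refers to the Go pitfall illustrated below. This is a generic sketch with placeholder names, not the actual runner code; the WaitGroup part of the fix is sketched further down alongside the runner description.

```go
package example

import "context"

// buggyLoop never exits: `break` only leaves the select statement, not the
// enclosing for loop, so the loop keeps spinning after ctx is cancelled.
func buggyLoop(ctx context.Context, work <-chan func()) {
	for {
		select {
		case <-ctx.Done():
			break // BUG: breaks out of the select only
		case job := <-work:
			job()
		}
	}
}

// fixedLoop exits the loop (and hence its goroutine) once ctx is cancelled.
func fixedLoop(ctx context.Context, work <-chan func()) {
	for {
		select {
		case <-ctx.Done():
			return // or use a labelled break to leave the for loop too
		case job := <-work:
			job()
		}
	}
}
```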
Signed-off-by: Michael Bridgen <[email protected]>
Force-pushed from 905e23f to 2619dfc
Hmm no, this takes care to tell everything to shut down, and waits for it, but at the point it exits there are still two HTTP connections open. It's tricky to tell why, because the stack trace does not go back to what's actually using them, but I strongly suspect they are Kubernetes API client connections. This has implications outside testing -- it might mean the controller fails to shut down gracefully.
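One generic way to see what is still alive at this point (not necessarily what was done here) is to dump every goroutine's stack just before exiting, which shows, for example, HTTP transport goroutines still parked on open connections:

```go
package example

import (
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes a full stack trace for every live goroutine to
// stderr; debug=2 gives the fully symbolised, panic-style format.
func dumpGoroutines() {
	_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}
```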
As part of testing pipelines with targets in remote clusters, an envtest Environment `leafEnv` is created to act as a remote cluster. But it is left running after the tests, as one could see with `ps` once `go test` has completed.

Since the controller is run in TestMain, the environment must be shut down in TestMain -- i.e., not within an individual test. This commit adds a simple mechanism for keeping track of envs to shut down after the tests have run, and uses it both for the "main" cluster environment and the leaf cluster.

Signed-off-by: Michael Bridgen <[email protected]>
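A minimal sketch of such a mechanism, assuming controller-runtime's envtest package; the names (`envsToStop`, `startEnv`) are illustrative rather than taken from the commit:

```go
package leveltriggered

import (
	"os"
	"testing"

	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

// envsToStop records every envtest.Environment started for the tests (the
// "main" cluster and any leaf clusters), so they can all be shut down in
// TestMain rather than in individual tests.
var envsToStop []*envtest.Environment

// startEnv starts an environment and registers it for shutdown.
func startEnv(env *envtest.Environment) error {
	if _, err := env.Start(); err != nil {
		return err
	}
	envsToStop = append(envsToStop, env)
	return nil
}

func TestMain(m *testing.M) {
	mainEnv := &envtest.Environment{}
	if err := startEnv(mainEnv); err != nil {
		panic(err)
	}

	code := m.Run()

	// Stop every environment, so etcd and kube-apiserver processes do not
	// outlive `go test`.
	for _, env := range envsToStop {
		_ = env.Stop()
	}
	os.Exit(code)
}
```

A test that needs a remote cluster would then create its `leafEnv` via `startEnv` and leave the teardown to TestMain.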
Yiannis obligingly ran through this with me, from scratch, and what we saw was:
1. on tracing the code through apimachinery and client-go/tools, it appears that cancelling the context used to start a cache should stop everything, as expected; and
2. when the tests were completed and all cleanup had been done, there were still two goroutines blocked on waiting for processes to exit.

Then it clicked -- these were the etcd and kube-apiserver processes for the leaf environment.
I noticed that `etcd` and `kube-apiserver` processes were hanging around after the tests had completed, and this prompted me to make sure everything in controllers/leveltriggered/ was shutting down cleanly.

The level-triggered controller needs to keep a set of caches, and to shut them down when it no longer needs them. To do this it uses a couple of thin abstractions -- one to keep track of the cache goroutines (`runner`), one to garbage collect unused caches (`gc`). Of these, `runner` starts its own goroutines, so to shut down gracefully it needs to wait for them all to exit before exiting itself (a rough sketch of this pattern follows below). Exiting without waiting for cache goroutines seems to be the cause of `kube-apiserver` not terminating -- I think because they leave watches in an indeterminate state.

So:
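As a rough illustration of the wait-before-exit behaviour described above -- hypothetical types and names, not the PR's actual runner implementation -- each cache is started in its own goroutine under a cancellable context, and the runner only returns from its wait once every one of those goroutines has exited:

```go
package example

import (
	"context"
	"sync"
)

// startable is anything with a blocking Start, such as a controller-runtime
// cache, whose Start(ctx) returns once ctx is cancelled.
type startable interface {
	Start(ctx context.Context) error
}

// runner tracks the goroutines it starts so it can wait for them on shutdown.
type runner struct {
	wg sync.WaitGroup
}

// run starts c in its own goroutine and returns a cancel func for stopping it.
func (r *runner) run(ctx context.Context, c startable) context.CancelFunc {
	ctx, cancel := context.WithCancel(ctx)
	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		_ = c.Start(ctx) // blocks until ctx is cancelled
	}()
	return cancel
}

// wait blocks until every goroutine started by run has exited, so the caller
// knows all caches (and their watches) have shut down before it returns.
func (r *runner) wait() {
	r.wg.Wait()
}
```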