Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sure controller and tests shut down cleanly #215

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

squaremo
Copy link
Contributor

I noticed that etcd and kube-apiserver processes were hanging around after the tests had completed, and this prompted me to make sure everything in controllers/leveltriggered/ was shutting down cleanly. So:

The level-triggered controller needs to keep a set of caches, and to shut them down when it no longer needs them. To do this it uses a couple of thin abstractions -- one to keep track of the cache goroutines (runner), one to garbage collect unused caches (gc). Of these, runner starts its own goroutines, so to shut down gracefully it needs to wait for them all to exit before exiting itself. Exiting without waiting for cache goroutines seems to be the cause of kube-apiserver not terminating -- I think because they leave watches in an indeterminate state.

So:

  • use a WaitGroup to make sure all the cache goroutines have finished
  • log all these things: starting up and especially, shutting down

This isn't a massive difference, just a little neater.

Signed-off-by: Michael Bridgen <[email protected]>
This commit

 - fixes a problem whereby the runner loop didn't exit (the good old
   "break out of the select but not out of the for" mistake)
 - uses a WaitGroup to make sure all the runner children (i.e., client
   caches) have exited
 - puts in more logging of things starting up and shutting down

Signed-off-by: Michael Bridgen <[email protected]>
Signed-off-by: Michael Bridgen <[email protected]>
@squaremo squaremo marked this pull request as draft November 27, 2023 14:19
@squaremo
Copy link
Contributor Author

Hmm no, this takes care to tell everything to shut down, and waits for it, but at the point it exits there's still two HTTP connections open. It's tricky to tell why, because the stack trace does not go back to what's actually using them. But I strongly suspect they are Kubernetes API client connections. This has implications outside testing -- it might mean the controller fails to shut down gracefully.

As part of testing pipelines with targets in remote clusters, an
envtest Environment `leafEnv` is created to act as a remote
cluster. But it is left running after the tests, as one could see with
`ps` once `go test` has completed.

Since the controller is run in TestMain, the environment must be shut
down in TestMain -- i.e., not within an individual test. This commit
adds a simple mechanism for keeping track of envs to shut down after
the tests have run, and uses it both for the "main" cluster
environment and the leaf cluster.

Signed-off-by: Michael Bridgen <[email protected]>
@squaremo
Copy link
Contributor Author

squaremo commented Dec 21, 2023

at the point it exits there's still two HTTP connections open

Yiannis obligingly ran through this with me, from scratch, and what we saw was 1. on tracing the code through apimachinery and client-go/tools, it appears that cancelling the context used to start a cache should stop everything, as expected; and 2. that when the tests were completed and all cleanup had been done, there were still two goroutines blocked on waiting for processes to exit. Then it clicked -- these were the etcd and kube-apiserver processes for the envtest.Environment used to represent a leaf cluster, leafEnv; and lo, leafEnv.Stop() is not called. The last commit arranges for this to happen.

@squaremo squaremo changed the title Make sure envtest can shut down cleanly Make sure controller and tests shut down cleanly Dec 21, 2023
@squaremo squaremo marked this pull request as ready for review December 21, 2023 13:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant