-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
"Usually once" promotions #214
Draft
squaremo
wants to merge
6
commits into
main
Choose a base branch
from
197-reliable-promotions
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
- promotion status goes in the environment status, and is optional since it might not be happening - revision gives the logical point in time that the record applies to, so you can judge whether it's relevant or not - lastAttemptedTime is informational - succeeded is true if it's done, false if it needs to be retried - PR details (optional, fill out if it's a PR)
This is the least test that pipeline fills in promotion status (and it doesn't pass, yet).
This way it can be used to check an invariant.
This commit records what happens when a promotion is attempted, in the pipeline status. This is enough to get the test to pass. Two slightly wider changes are included here: - be more careful about checking whether a particular environment status exists, when looping through them; - don't start with an empty map for environment status. This used to be reasonable, since we were building the whole picture up again by looking at the targets; now, it would lose the promotion status, which can't be observed each time. Signed-off-by: Michael Bridgen <[email protected]>
If you've set up promotions via PR, or CI, or (hypothetical) chatbot, there will be a human-scale lag between a promotion being triggered and it taking effect in the target environment. If the controller processes the pipeline during that time, it shouldn't trigger the promotion again. This commit expands the mock promotion machinery into a little state machine, so the gap between promotion trigger and completion can be explicit in test cases. It asserts that no promotion of {env, version} will be attempted twice, which (_provided_ we don't roll any environment back a version) is what we want to prove here. There's a race between the cache getting an object with the promotion status, and the reconciler processing the object again. It's possible for the reconciler, having just triggered a promotion, to run again and see a version of the pipeline object that doesn't have the promotion recorded -- and that will mean attempting the promotion again. In testing, this happens almost always, because patches will cause the object to be requeued before it's finished reconciling, and it'll run again as soon as its exited. I would expect it to also be common in normal operation; and the outcome is that a promotion is re-run, which is not great! To mitigate the race, I've added a predicate to the watch which will drop updates that didn't change the spec. This means that the patches earlier in Reconcile won't cause the object to be requeued, and it is much, much less likely that it'll run again straight away. It's still possible -- maybe the spec changed while it was running. Other, more involved ways to avoid it: - use conflict detection on write to bail before running a promotion, if the observed pipeline has changed - create Promotion objects and guard promotions by trying to create (another form of detecting conflict on write)
If a cache has not yet synced, or a target is not to be found, and the controller has incomplete information, then the safest thing to do is not make any decisions on promotions, and instead requeue the pipeline for another try later. This fixes a problem that was exposed by adding a predicate to the watch on Pipeline: if a cluster cache isn't ready when the controller looks, and the target doesn't exist, there's nothing to trigger another go-round. This presented itself as a problem with the remote and Kustomization tests, which explicitly check for a "not found" target status. (I fixed another problem with those tests, in passing -- if they are going to check for an updated target status, they had better fetch the object again.) Signed-off-by: Michael Bridgen <[email protected]>
squaremo
force-pushed
the
197-reliable-promotions
branch
from
December 21, 2023 17:12
e3b40a6
to
ed4cfb2
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
- [ ] Retry failed promotions (needs a minor bit of design -- e.g., retry how many times?)EDIT: I'm removing the retry requirement from this PR, so I don't tangle the PR in more design. Recording the failure will be enough for here.