"Usually once" promotions #214

squaremo · 2023-11-23T11:50:16Z

Attempt each ready promotion just once
~~- [ ] Retry failed promotions (needs a minor bit of design -- e.g., retry how many times?)~~
Record particulars of promotion, e.g., PR URL, in status

EDIT: I'm removing the retry requirement from this PR, so I don't tangle the PR in more design. Recording the failure will be enough for here.

- promotion status goes in the environment status, and is optional since it might not be happening - revision gives the logical point in time that the record applies to, so you can judge whether it's relevant or not - lastAttemptedTime is informational - succeeded is true if it's done, false if it needs to be retried - PR details (optional, fill out if it's a PR)

This is the least test that pipeline fills in promotion status (and it doesn't pass, yet).

This way it can be used to check an invariant.

This commit records what happens when a promotion is attempted, in the pipeline status. This is enough to get the test to pass. Two slightly wider changes are included here: - be more careful about checking whether a particular environment status exists, when looping through them; - don't start with an empty map for environment status. This used to be reasonable, since we were building the whole picture up again by looking at the targets; now, it would lose the promotion status, which can't be observed each time. Signed-off-by: Michael Bridgen <[email protected]>

If you've set up promotions via PR, or CI, or (hypothetical) chatbot, there will be a human-scale lag between a promotion being triggered and it taking effect in the target environment. If the controller processes the pipeline during that time, it shouldn't trigger the promotion again. This commit expands the mock promotion machinery into a little state machine, so the gap between promotion trigger and completion can be explicit in test cases. It asserts that no promotion of {env, version} will be attempted twice, which (_provided_ we don't roll any environment back a version) is what we want to prove here. There's a race between the cache getting an object with the promotion status, and the reconciler processing the object again. It's possible for the reconciler, having just triggered a promotion, to run again and see a version of the pipeline object that doesn't have the promotion recorded -- and that will mean attempting the promotion again. In testing, this happens almost always, because patches will cause the object to be requeued before it's finished reconciling, and it'll run again as soon as its exited. I would expect it to also be common in normal operation; and the outcome is that a promotion is re-run, which is not great! To mitigate the race, I've added a predicate to the watch which will drop updates that didn't change the spec. This means that the patches earlier in Reconcile won't cause the object to be requeued, and it is much, much less likely that it'll run again straight away. It's still possible -- maybe the spec changed while it was running. Other, more involved ways to avoid it: - use conflict detection on write to bail before running a promotion, if the observed pipeline has changed - create Promotion objects and guard promotions by trying to create (another form of detecting conflict on write)

If a cache has not yet synced, or a target is not to be found, and the controller has incomplete information, then the safest thing to do is not make any decisions on promotions, and instead requeue the pipeline for another try later. This fixes a problem that was exposed by adding a predicate to the watch on Pipeline: if a cluster cache isn't ready when the controller looks, and the target doesn't exist, there's nothing to trigger another go-round. This presented itself as a problem with the remote and Kustomization tests, which explicitly check for a "not found" target status. (I fixed another problem with those tests, in passing -- if they are going to check for an updated target status, they had better fetch the object again.) Signed-off-by: Michael Bridgen <[email protected]>

squaremo changed the title ~~197 reliable promotions~~ "Usually once" promotions Nov 23, 2023

squaremo added 6 commits December 21, 2023 16:35

Verify pipeline has promotion status

2e031fd

This is the least test that pipeline fills in promotion status (and it doesn't pass, yet).

Generalise the "verify promotion" assertion

d248d75

This way it can be used to check an invariant.

squaremo force-pushed the 197-reliable-promotions branch from e3b40a6 to ed4cfb2 Compare December 21, 2023 17:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"Usually once" promotions #214

"Usually once" promotions #214

squaremo commented Nov 23, 2023 •

edited

Loading

"Usually once" promotions #214

Are you sure you want to change the base?

"Usually once" promotions #214

Conversation

squaremo commented Nov 23, 2023 • edited Loading

squaremo commented Nov 23, 2023 •

edited

Loading