Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Usually once" promotions #214

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from
Draft

"Usually once" promotions #214

wants to merge 6 commits into from

Commits on Dec 21, 2023

  1. Add promotion status to API types

     - promotion status goes in the environment status, and is optional since it might not be happening
       - revision gives the logical point in time that the record applies to, so you can judge whether it's relevant or not
       - lastAttemptedTime is informational
       - succeeded is true if it's done, false if it needs to be retried
     - PR details (optional, fill out if it's a PR)
    squaremo committed Dec 21, 2023
    Configuration menu
    Copy the full SHA
    d2c3c03 View commit details
    Browse the repository at this point in the history
  2. Verify pipeline has promotion status

    This is the least test that pipeline fills in promotion status (and it
    doesn't pass, yet).
    squaremo committed Dec 21, 2023
    Configuration menu
    Copy the full SHA
    2e031fd View commit details
    Browse the repository at this point in the history
  3. Generalise the "verify promotion" assertion

    This way it can be used to check an invariant.
    squaremo committed Dec 21, 2023
    Configuration menu
    Copy the full SHA
    d248d75 View commit details
    Browse the repository at this point in the history
  4. Record promotion status

    This commit records what happens when a promotion is attempted, in the
    pipeline status. This is enough to get the test to pass.
    
    Two slightly wider changes are included here:
    
     - be more careful about checking whether a particular environment
       status exists, when looping through them;
    
     - don't start with an empty map for environment status. This used to
       be reasonable, since we were building the whole picture up again by
       looking at the targets; now, it would lose the promotion
       status, which can't be observed each time.
    
    Signed-off-by: Michael Bridgen <[email protected]>
    squaremo committed Dec 21, 2023
    Configuration menu
    Copy the full SHA
    f852a07 View commit details
    Browse the repository at this point in the history
  5. Check that a promotion isn't invoked repeatedly

    If you've set up promotions via PR, or CI, or (hypothetical) chatbot,
    there will be a human-scale lag between a promotion being triggered
    and it taking effect in the target environment. If the controller
    processes the pipeline during that time, it shouldn't trigger the
    promotion again.
    
    This commit expands the mock promotion machinery into a little state
    machine, so the gap between promotion trigger and completion can be
    explicit in test cases. It asserts that no promotion of {env, version}
    will be attempted twice, which (_provided_ we don't roll any
    environment back a version) is what we want to prove here.
    
    There's a race between the cache getting an object with the promotion
    status, and the reconciler processing the object again. It's possible
    for the reconciler, having just triggered a promotion, to run again
    and see a version of the pipeline object that doesn't have the
    promotion recorded -- and that will mean attempting the promotion
    again.
    
    In testing, this happens almost always, because patches will cause the
    object to be requeued before it's finished reconciling, and it'll run
    again as soon as its exited. I would expect it to also be common in
    normal operation; and the outcome is that a promotion is re-run, which
    is not great!
    
    To mitigate the race, I've added a predicate to the watch which will
    drop updates that didn't change the spec. This means that the patches
    earlier in Reconcile won't cause the object to be requeued, and it is
    much, much less likely that it'll run again straight away. It's still
    possible -- maybe the spec changed while it was running.
    
    Other, more involved ways to avoid it:
    
     - use conflict detection on write to bail before running a promotion,
       if the observed pipeline has changed
    
     - create Promotion objects and guard promotions by trying to create
       (another form of detecting conflict on write)
    squaremo committed Dec 21, 2023
    Configuration menu
    Copy the full SHA
    9244c67 View commit details
    Browse the repository at this point in the history
  6. Requeue when the pipeline isn't ready

    If a cache has not yet synced, or a target is not to be found, and the
    controller has incomplete information, then the safest thing to do is
    not make any decisions on promotions, and instead requeue the
    pipeline for another try later.
    
    This fixes a problem that was exposed by adding a predicate to the
    watch on Pipeline: if a cluster cache isn't ready when the controller
    looks, and the target doesn't exist, there's nothing to trigger
    another go-round. This presented itself as a problem with the remote
    and Kustomization tests, which explicitly check for a "not found"
    target status.
    
    (I fixed another problem with those tests, in passing -- if they are
    going to check for an updated target status, they had better fetch the
    object again.)
    
    Signed-off-by: Michael Bridgen <[email protected]>
    squaremo committed Dec 21, 2023
    Configuration menu
    Copy the full SHA
    ed4cfb2 View commit details
    Browse the repository at this point in the history