Make ASG Warmpool depend on ASG Lifecycle hook #16583

Merged
merged 1 commit into from
Jun 5, 2024

Conversation

jim-barber-he
Contributor

Fixes: #16582

When creating a new ASG with a Warmpool using Lifecycle hooks, the instances that first join the Warmpool when the ASG is created could come up before the Lifecycle hook is in effect.
This can lead to problems such as those instances not calling the Lifecycle hook notification when pressed into service, causing the ASG to terminate them approximately 10 minutes into performing work in the cluster.

Making the Warmpool depend on any Lifecycle hooks means that the Warmpool won't be created until the hooks are ready.
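
To illustrate the ordering idea, here is a minimal, self-contained Go sketch. It is not the kOps task code: the `Task` type, the `Order` function, and the task names are hypothetical stand-ins, and the real change may be wired differently. It only shows that once the Warmpool declares the Lifecycle hook as a dependency, a dependency-ordered run cannot create the Warmpool first.

```go
package main

import (
	"fmt"
	"sort"
)

// Task is a hypothetical stand-in for a provisioning task.
// It is NOT the kOps task interface.
type Task struct {
	Name string
	Deps []string // names of tasks that must complete first
}

// Order returns task names so that every task appears after all of its
// dependencies (depth-first topological sort with cycle detection).
func Order(tasks []Task) ([]string, error) {
	byName := make(map[string]Task, len(tasks))
	names := make([]string, 0, len(tasks))
	for _, t := range tasks {
		byName[t.Name] = t
		names = append(names, t.Name)
	}
	sort.Strings(names) // deterministic traversal order

	const (
		visiting = 1
		done     = 2
	)
	state := make(map[string]int, len(tasks))
	var out []string

	var visit func(name string) error
	visit = func(name string) error {
		t, ok := byName[name]
		if !ok {
			return fmt.Errorf("unknown task %q", name)
		}
		switch state[name] {
		case done:
			return nil
		case visiting:
			return fmt.Errorf("dependency cycle at %q", name)
		}
		state[name] = visiting
		deps := append([]string(nil), t.Deps...)
		sort.Strings(deps)
		for _, d := range deps {
			if err := visit(d); err != nil {
				return err
			}
		}
		state[name] = done
		out = append(out, name)
		return nil
	}

	for _, n := range names {
		if err := visit(n); err != nil {
			return nil, err
		}
	}
	return out, nil
}

func main() {
	// The Warmpool depends on both the group and its Lifecycle hook.
	tasks := []Task{
		{Name: "WarmPool", Deps: []string{"AutoscalingGroup", "LifecycleHook"}},
		{Name: "LifecycleHook", Deps: []string{"AutoscalingGroup"}},
		{Name: "AutoscalingGroup"},
	}
	order, err := Order(tasks)
	if err != nil {
		panic(err)
	}
	fmt.Println(order) // [AutoscalingGroup LifecycleHook WarmPool]
}
```

Removing `"LifecycleHook"` from the Warmpool's `Deps` in this sketch would leave the relative order of the hook and the Warmpool up to the traversal, which is the kind of unordered creation the PR description is concerned about.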

Signed-off-by: Jim Barber <[email protected]>
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 21, 2024
@k8s-ci-robot
Contributor

Hi @jim-barber-he. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 21, 2024
@k8s-ci-robot k8s-ci-robot added the area/provider/aws Issues or PRs related to aws provider label May 21, 2024
@justinsb
Member

justinsb commented Jun 4, 2024

Question: is this race specific to the instances in the WarmPool, or can it apply to the (main) ASG instances also? I'm wondering if we should be setting LifecycleHookSpecificationList in CreateAutoScalingGroup...

@jim-barber-he
Contributor Author

Question: is this race specific to the instances in the WarmPool, or can it apply to the (main) ASG instances also?

It happens to all the instances that are created at the time the ASG is created and before its lifecycle hook is in effect.
But the instances that skipped the Warmpool and went straight into service won't be reaped by the ASG because the Lifecycle hook wasn't active when they joined (even though they tried to send the lifecycle notification but couldn't).

So all hosts are affected, but only the Warmpool ones are reaped when they later go into service, because by that time the Lifecycle hook is in place on the ASG but they don't seem to attempt to call it.

I'm wondering if we should be setting LifecycleHookSpecificationList in CreateAutoScalingGroup...

I attempted to fix it for all hosts using the LifecycleHookSpecificationList and succeeded in that no hosts started on the ASG until the hook was in place.
But my change had issues if you tried to edit existing instance groups to add or remove warmpools.
I'm still learning Go and struggled to get this to work properly, so I gave up.
The Slack thread talking about this with Hakman is here: https://kubernetes.slack.com/archives/C3QUFP0QM/p1715733980241579

Since the first instances that skip the Warmpool when the ASG is created are safe from being reaped by the ASG, and only the Warmpool instances were at risk, I figured I'd raise this PR as it gets us out of hot water.
Feel free to close this if you want to develop the LifecycleHookSpecificationList approach instead; at the moment that's beyond my abilities.
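
For readers following the two approaches discussed above, here is a toy Go model of the race. It assumes the worst case, where the initial instances launch the moment the group is created, and every type and function in it (`Group`, `CreateGroup`, `PutHook`, `Unhooked`) is hypothetical: none of it is the AWS SDK or kOps code. It contrasts adding the hook in a separate later call with supplying it in the create request, which is the idea behind LifecycleHookSpecificationList.

```go
package main

import "fmt"

// Instance records whether a lifecycle hook existed when it launched.
type Instance struct {
	ID                 string
	HookActiveAtLaunch bool
}

// Group is a toy model of an Auto Scaling group (not an AWS SDK type).
type Group struct {
	Hooks     []string
	Instances []Instance
}

// launch models the group starting n instances right now.
func (g *Group) launch(n int) {
	for i := 0; i < n; i++ {
		g.Instances = append(g.Instances, Instance{
			ID:                 fmt.Sprintf("i-%d", len(g.Instances)),
			HookActiveAtLaunch: len(g.Hooks) > 0,
		})
	}
}

// CreateGroup models group creation. Hooks passed here are in place
// before the initial instances launch (worst-case immediate launch).
func CreateGroup(initial int, hooks ...string) *Group {
	g := &Group{Hooks: append([]string(nil), hooks...)}
	g.launch(initial)
	return g
}

// PutHook models adding a hook in a separate call after creation.
func (g *Group) PutHook(name string) {
	g.Hooks = append(g.Hooks, name)
}

// Unhooked counts instances that launched before any hook existed.
func (g *Group) Unhooked() int {
	n := 0
	for _, inst := range g.Instances {
		if !inst.HookActiveAtLaunch {
			n++
		}
	}
	return n
}

func main() {
	// Hook added after creation: the initial instances miss it.
	late := CreateGroup(2)
	late.PutHook("launching")

	// Hook supplied at creation: no instance launches without it.
	early := CreateGroup(2, "launching")

	fmt.Println(late.Unhooked(), early.Unhooked()) // 2 0
}
```

The model says nothing about editing existing instance groups to add or remove warmpools, which is where the attempt described above ran into trouble.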

@justinsb
Member

justinsb commented Jun 5, 2024

Thanks @jim-barber-he , this makes a lot of sense and is a good fix for warmpools, even if we end up (somehow) also precreating the hooks with LifecycleHookSpecificationList (for non-warmpools).

Thanks also for the detailed comments on the issue (#16582), I'm going to look that over and see if I have any ideas!

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 5, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: justinsb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 5, 2024
@k8s-ci-robot k8s-ci-robot merged commit e6c27b2 into kubernetes:master Jun 5, 2024
21 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.30 milestone Jun 5, 2024
@jim-barber-he
Contributor Author

@justinsb

Are you okay if I raise a PR to cherry pick this back to kOps 1.29 as well?

@justinsb
Member

justinsb commented Jun 6, 2024

Are you okay if I raise a PR to cherry pick this back to kOps 1.29 as well?

That would be wonderful, thank you! If you haven't seen them, there is a script you can use described here: https://github.com/kubernetes/kops/blob/master/docs/contributing/proposing-a-cherry-pick.md

@jim-barber-he
Contributor Author

@justinsb The cherry-pick PR is here: #16603

k8s-ci-robot added a commit that referenced this pull request Jun 8, 2024
…f-#16583-upstream-release-1.29

Automated cherry pick of #16583: Make ASG Warmpool depend on ASG Lifecycle hook
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/provider/aws Issues or PRs related to aws provider blocks-next cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/office-hours lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

ASG Warmpool instances join before Lifecycle hook is in effect
4 participants