Replace lifecycled and self-termination with ASG lifecycle hooks #964
Closed
This will be started by an SSM action run by a lifecycle hook to enable warm pool
keithduncan
commented
Nov 23, 2021
@@ -176,7 +176,6 @@ experiment="${BUILDKITE_AGENT_EXPERIMENTS}"
priority=%n
spawn=${BUILDKITE_AGENTS_PER_INSTANCE}
no-color=true
disconnect-after-idle-timeout=${BUILDKITE_SCALE_IN_IDLE_PERIOD}
disconnect-after-job=${BUILDKITE_TERMINATE_INSTANCE_AFTER_JOB}
Will need a story for this
keithduncan force-pushed the keithduncan/add-asg-lifecycle branch from bd9b774 to ca1e1c4 on November 23, 2021 04:07
keithduncan
commented
Nov 23, 2021
Type: AWS::Events::Rule
Properties:
  Description: !Sub Run the spot interruption AWS SSM Automation for ${AgentAutoScaleGroup}
  EventPattern:
Since there’s no way to capture just the instances that belong to the Auto Scaling group from the event, some tests to perform:
- What happens if an EC2 Spot instance outside the ASG experiences an interruption? Does the autoscaling: TerminateInstanceInAutoScalingGroup action fail and ignore the instance?
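One way to reason about the question above is a guard that only acts on interruption events for instances the stack's ASG owns. The following is a minimal sketch, not the automation from this PR: `matches_pattern`, `should_terminate`, and the `lookup_asg_name` callable are hypothetical names introduced for illustration, and the instance IDs and group names are made up. The event fields (`source`, `detail-type`, `detail.instance-id`) follow the EC2 Spot Instance Interruption Warning event shape.

```python
# Event pattern for Spot interruption warnings. It matches every Spot
# instance in the account and region, not only those in one ASG.
SPOT_INTERRUPTION_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Spot Instance Interruption Warning"],
}


def matches_pattern(event, pattern=SPOT_INTERRUPTION_PATTERN):
    """Return True if the event's top-level fields match the pattern."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


def instance_id_from_event(event):
    """Return the interrupted instance ID carried in the event detail."""
    return event["detail"]["instance-id"]


def should_terminate(event, asg_name, lookup_asg_name):
    """Decide whether this stack's automation should act on the event.

    lookup_asg_name is a hypothetical callable that maps an instance ID to
    the name of its Auto Scaling group, or None if it belongs to no group.
    """
    if not matches_pattern(event):
        return False
    owner = lookup_asg_name(instance_id_from_event(event))
    return owner is not None and owner == asg_name
```

In a real deployment the lookup could plausibly be backed by the Auto Scaling `DescribeAutoScalingInstances` API; here a plain dictionary's `.get` method is enough to exercise the logic, for example `should_terminate(event, "agents-asg", {"i-0aaa": "agents-asg"}.get)`.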
keithduncan force-pushed the keithduncan/add-asg-lifecycle branch from bd1dc71 to c1b4421 on November 23, 2021 08:59
…gGroup action parameters
…thduncan/add-asg-lifecycle
6 tasks
…stanceInAutoScalingGroup
Experimental look at replacing instance self-termination with ASG-guided lifecycle management via lifecycle hooks, EventBridge, and SSM.
If this proves reliable, it would be the foundation for adding a series of improvements.
The boot hook enables:
The boot hook could also be made conditional on enabling the warm pool, to reduce the impact that landing this may have on the reliability of booting agents.
The shutdown hook enables:
MinSize>0, ensuring agents stay healthy

Finally, a global service-side spot interruption notice rule triggers instance termination using an SSM Automation, which allows us to remove lifecycled from the instances.
This could also ultimately be combined with a termination lambda to guide termination selection towards instances that are presently idle (and the natural race condition there would be handled by the termination lifecycle hook).
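As a rough illustration of guiding termination towards idle instances, the sketch below picks scale-in candidates from a snapshot of per-instance job counts. It is an assumption-laden example, not the proposed lambda: `pick_idle_instances` and the `running_jobs` mapping are hypothetical, and the source does not say where the job counts would come from.

```python
def pick_idle_instances(running_jobs, count):
    """Return up to `count` instance IDs that have no running jobs.

    running_jobs maps instance ID -> number of jobs currently running on
    that instance's agents (hypothetical input for illustration).
    Candidates are sorted by ID so the selection is deterministic.
    """
    if count <= 0:
        return []
    idle = sorted(iid for iid, jobs in running_jobs.items() if jobs == 0)
    return idle[:count]
```

The snapshot can be stale by the time termination starts: an instance chosen here may have accepted a job in the meantime. That is the race the description expects the termination lifecycle hook to absorb, since the hook can hold the instance until its agents have finished.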
Some reasons this could be bad and that we shouldn’t do it:
TODO
DependsOn section. This should ensure that CloudFormation creates and deletes resources in the right order, and prevent teardown of the hook roles and SSM automations until the ASG itself has been destroyed.
Fixes #943
Fixes #944