Replace lifecycled and self-termination with ASG lifecycle hooks #964

keithduncan · 2021-11-22T05:10:52Z

Experimental look at replacing instance self termination with ASG guided lifecycle management via lifecycle hooks, EventBridge, and SSM.

If this proves reliable, it would be the foundation for adding a series of improvements.

The boot hook enables:

instance warm pool by decoupling the agent lifetime and boot from instance boot and delaying agent start until the instance moves to InService

The boot hook could also be made conditional on the warm pool being enabled, to reduce the impact that landing this may have on the reliability of booting agents.

The shutdown hook enables:

max instance lifetime, allowing periodic rotation of the base load instances in stacks with a MinSize>0, ensuring agents stay healthy
capacity rebalance, enabling pre-emptive spot termination ahead of the hard 2-minute termination deadline, giving a longer termination window and less build interruption
re-enabling instance refresh, allowing manual refresh of all instances for agent health

It would also allow us to remove the idle exit of agents on the static MinSize>0 instances, which customers find confusing.
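
As a rough illustration of what finishing the shutdown hook involves, the sketch below maps the `detail` of an "EC2 Instance-terminate Lifecycle Action" event onto the arguments of the Auto Scaling `CompleteLifecycleAction` API. This PR drives that step from an SSM Automation rather than Python, so the function name and every sample value here are hypothetical.

```python
def completion_args(event, result="CONTINUE"):
    """Build CompleteLifecycleAction arguments from a lifecycle hook event."""
    if result not in ("CONTINUE", "ABANDON"):
        raise ValueError(f"unsupported lifecycle result: {result}")
    detail = event["detail"]
    return {
        "AutoScalingGroupName": detail["AutoScalingGroupName"],
        "LifecycleHookName": detail["LifecycleHookName"],
        "LifecycleActionToken": detail["LifecycleActionToken"],
        "InstanceId": detail["EC2InstanceId"],
        "LifecycleActionResult": result,
    }


# Hypothetical event; the names and IDs are placeholders, not values from this stack.
sample_event = {
    "source": "aws.autoscaling",
    "detail-type": "EC2 Instance-terminate Lifecycle Action",
    "detail": {
        "AutoScalingGroupName": "example-asg",
        "LifecycleHookName": "example-shutdown-hook",
        "LifecycleActionToken": "00000000-0000-0000-0000-000000000000",
        "EC2InstanceId": "i-0123456789abcdef0",
    },
}

args = completion_args(sample_event)
print(args)
```

With boto3 available, the resulting dictionary could be passed as `boto3.client("autoscaling").complete_lifecycle_action(**args)` once the agent has stopped.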

Finally, a global, service-side spot interruption notice rule triggers instance termination using an SSM Automation, which allows us to remove lifecycled from the instances.

This could also ultimately be combined with a termination lambda to guide termination selection towards instances that are presently idle (and the natural race condition there would be handled by the termination lifecycle hook).
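
To make the routing concrete, here is a minimal sketch of the three kinds of event the rules would need to tell apart. The patterns are plain dictionaries in the shape EventBridge expects, and `matches` is a deliberately simplified stand-in for EventBridge matching (exact values only), written so the example runs offline. The group and hook names are hypothetical. Note that the spot interruption pattern has no Auto Scaling group field to filter on.

```python
def matches(pattern, event):
    """Simplified EventBridge-style match: lists are allowed values, dicts recurse."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


def lifecycle_pattern(detail_type, group_name, hook_name):
    """Pattern limited to one hook on one Auto Scaling group."""
    return {
        "source": ["aws.autoscaling"],
        "detail-type": [detail_type],
        "detail": {
            "AutoScalingGroupName": [group_name],
            "LifecycleHookName": [hook_name],
        },
    }


# Hypothetical names, for illustration only.
RULES = {
    "boot": lifecycle_pattern(
        "EC2 Instance-launch Lifecycle Action", "example-asg", "example-boot-hook"
    ),
    "shutdown": lifecycle_pattern(
        "EC2 Instance-terminate Lifecycle Action", "example-asg", "example-shutdown-hook"
    ),
    "spot": {
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Spot Instance Interruption Warning"],
    },
}


def route(event):
    """Return the name of every rule whose pattern matches the event."""
    return [name for name, pattern in RULES.items() if matches(pattern, event)]


terminate_event = {
    "source": "aws.autoscaling",
    "detail-type": "EC2 Instance-terminate Lifecycle Action",
    "detail": {
        "AutoScalingGroupName": "example-asg",
        "LifecycleHookName": "example-shutdown-hook",
        "EC2InstanceId": "i-0123456789abcdef0",
    },
}
spot_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "detail": {"instance-id": "i-0123456789abcdef0", "instance-action": "terminate"},
}

print(route(terminate_event))
print(route(spot_event))
```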

Some reasons this could be bad and that we shouldn’t do it:

SSM Automation as the executor of the lifecycle hooks could prove unreliable. Spot checking at small scale has shown it to work, but how will it fare at a scale of hundreds or thousands of nodes?
Removing the idle exit means instances are more likely to be terminated immediately after they finish their work. Scaling will be more responsive overall, but because instances exit more frequently their caches are less likely to be warm. This is a double-edged sword: more frequent exits and removal of the idle period will reduce operating cost, as instances will not sit idle for long, but that saving needs to be weighed against the expense of potentially longer builds. Should we in fact combine the idle exit behaviour with a termination lifecycle hook to address ASG-initiated termination? We could potentially expose this as a customer-facing option and keep instances protected from scale-in until the "idle exit period" has elapsed.
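
The "protected until the idle exit period" idea comes down to a small decision, sketched below under stated assumptions: something on or near the instance knows how many jobs are running and when the instance last became idle. The function and parameter names are hypothetical and nothing like this exists in the PR.

```python
from datetime import datetime, timedelta, timezone


def should_stay_protected(running_jobs, idle_since, now, idle_period):
    """Keep scale-in protection while busy, or while idle for less than idle_period."""
    if running_jobs > 0:
        return True
    return now - idle_since < idle_period


now = datetime(2021, 11, 23, 12, 0, tzinfo=timezone.utc)
idle_period = timedelta(minutes=10)

# Busy instance: stays protected regardless of timestamps.
print(should_stay_protected(1, now - timedelta(hours=1), now, idle_period))
# Idle for 3 minutes: still inside the idle period, so stays protected.
print(should_stay_protected(0, now - timedelta(minutes=3), now, idle_period))
# Idle for 15 minutes: protection can be released.
print(should_stay_protected(0, now - timedelta(minutes=15), now, idle_period))
```

The boolean could then be applied through the Auto Scaling `SetInstanceProtection` API (`ProtectedFromScaleIn`), which would retain warm caches for the idle period at the cost of slower scale-in.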

TODO

Windows support
Re-add support for the IdleExit period so that instances aren’t immediately reclaimed on scale-in
The lifecycle hooks could be moved to the Auto Scaling group’s LifecycleHookSpecificationList and the IAM roles required for them added to the ASG’s DependsOn section. This should ensure that CloudFormation creates and deletes resources in the right order during stack create and delete, and prevent teardown of the hook roles and SSM automations until the ASG itself has been destroyed
Review zero-downtime-ness of this template structure when updating stacks: can it safely replace the ASG in an update without interrupting jobs? (Can it today? Probably yes, with instance protection from scale-in)
Note in the release notes for any release that incorporates this change that the service role has changed
Filter the "EC2 Spot Instance Interruption Warning" to instances that belong to the Auto Scaling group being managed, to ensure we don’t terminate random instances
Re-add support for one-shot jobs. Previously this relied on the terminate-instance script running after the agent exited, which has been removed. That script and systemd unit dependency should be conditionally added based on whether that template parameter is set, to preserve the behaviour under this system. Unlike before, we should disable the agent systemd unit restart completely when this behaviour is desired, so that the agent isn’t allowed to restart and grab a new job.

Fixes #943
Fixes #944


keithduncan · 2021-11-23T00:07:29Z

packer/linux/conf/bin/bk-install-elastic-stack.sh

@@ -176,7 +176,6 @@ experiment="${BUILDKITE_AGENT_EXPERIMENTS}"
 priority=%n
 spawn=${BUILDKITE_AGENTS_PER_INSTANCE}
 no-color=true
-disconnect-after-idle-timeout=${BUILDKITE_SCALE_IN_IDLE_PERIOD}
 disconnect-after-job=${BUILDKITE_TERMINATE_INSTANCE_AFTER_JOB}


Will need a story for this

keithduncan · 2021-11-23T06:42:53Z

templates/aws-stack.yml

+    Type: AWS::Events::Rule
+    Properties:
+      Description: !Sub Run the spot interruption AWS SSM Automation for ${AgentAutoScaleGroup}
+      EventPattern:


Since there’s no way to capture just instances that belong to the Auto Scaling group from the event, some tests to perform:

What happens if an EC2 Spot instance outside the ASG experiences interruption? Does the autoscaling:TerminateInstanceInAutoScalingGroup action fail and ignore the instance?
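
Independently of how that test turns out, one option is an explicit membership check before terminating. The sketch below uses the Auto Scaling `DescribeAutoScalingInstances` call and assumes it returns no entry for an instance that is not in any group. The client is passed in so that a stub can stand in for `boto3.client("autoscaling")`; the stub, group name, and instance IDs are hypothetical.

```python
def instance_in_group(autoscaling, instance_id, group_name):
    """Return True if instance_id is currently a member of group_name."""
    response = autoscaling.describe_auto_scaling_instances(InstanceIds=[instance_id])
    return any(
        item["AutoScalingGroupName"] == group_name
        for item in response.get("AutoScalingInstances", [])
    )


class FakeAutoScaling:
    """Offline stand-in for the boto3 Auto Scaling client (illustration only)."""

    def __init__(self, membership):
        self._membership = membership  # instance id -> group name

    def describe_auto_scaling_instances(self, InstanceIds):
        return {
            "AutoScalingInstances": [
                {"InstanceId": i, "AutoScalingGroupName": self._membership[i]}
                for i in InstanceIds
                if i in self._membership
            ]
        }


client = FakeAutoScaling({"i-0aaa": "example-asg", "i-0bbb": "other-asg"})
print(instance_in_group(client, "i-0aaa", "example-asg"))  # member of this group
print(instance_in_group(client, "i-0bbb", "example-asg"))  # member of another group
print(instance_in_group(client, "i-0ccc", "example-asg"))  # not in any group
```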


nitrocode and others added 26 commits May 5, 2021 23:07

Add MaxInstanceLifetime

da91dd9

Remove extra spacing

155478b

Merge branch 'master' into MaxInstanceLifetime

62c209b

Merge branch 'master' into MaxInstanceLifetime

d06ab5e

Remove disconnect after idle setting

913a1b1

Remove terminate instance after agent exit

b458f03

Bump agent scaler and enable scale in

3fca7f3

Remove lifecycled

be157f4

Don't start the agent on boot

3a259c7

This will be started by an SSM action run by a lifecycle hook to enable warm pool

Remove scale in protection

974946d

Remove AZ rebalance suspension lambda

462623d

Add skeleton of hooks and lambdas to respond to them

51188df

Give instances permission to complete lifecycle actions

bb40e70

Make the terminate hook continue by default

5a23ec2

Fix typo in SSM IAM action

9b72181

Remove lambdas

921301a

Add an SSM role and boot hook automation

ba2fd09

Remove instance permission to complete lifecycle actions

9268219

Add a terminate instance ssm automation

9700a02

Move the automation role

42d6860

Add EventBridge rules to route ASG lifecycle events to the SSM automa…

aa1deb9

…tions

Add account id field to the ssm document ARN

56ae297

Add ssm:CreateDocument permission to the service role

5244f5d

Fix capitalisation of the SSM document aws:runCommand

a02572f

Add more ssm document permissions

c1f101e

Add tag actions

448e169

keithduncan commented Nov 23, 2021


keithduncan added 3 commits November 23, 2021 11:17

Make systemctl stop wait 1 hour for the process to exit

22fa12b

Fix event pattern structure

63e9a79

Add missing iam permission

223cb37

keithduncan added 4 commits November 23, 2021 13:27

Strings strings strings

d1c312b

Add more automation role iam requirements

624ea72

Add windows support for ssm automations

8bb0941

Limit the event match to the specific hook we have created

ca1e1c4

keithduncan force-pushed the keithduncan/add-asg-lifecycle branch from bd9b774 to ca1e1c4 Compare November 23, 2021 04:07

keithduncan added 2 commits November 23, 2021 14:42

Make the rules for boot and automation depend on the hooks

1352c0d

Add a spot interruption rule and automation that terminates the instance

8ae7f15

keithduncan commented Nov 23, 2021


keithduncan added 2 commits November 23, 2021 17:01

Fix reference to the shutdown hook from the shutdown rule

d839604

Make the windows ssm start command timeout 10 minutes

c1b4421

keithduncan force-pushed the keithduncan/add-asg-lifecycle branch from bd1dc71 to c1b4421 Compare November 23, 2021 08:59

Add comment to spot interruption event schema

5675c98

keithduncan mentioned this pull request Nov 23, 2021

Scaler seems to bypass Lifecycle Hooks buildkite/buildkite-agent-scaler#8

Closed

keithduncan added 3 commits November 25, 2021 14:09

Remove auto scaling group name from the TerminateInstanceInAutoScalin…

d48e9a0

…gGroup action parameters

Merge remote-tracking branch 'nitrocode/MaxInstanceLifetime' into kei…

2d30b78

…thduncan/add-asg-lifecycle

Enable capacity rebalancing

d50bb4b

keithduncan mentioned this pull request Nov 25, 2021

Add warm pool with new ASG lifecycle hooks #966

Closed

6 tasks

Give the automation role permission to invoke autoscaling:TerminateIn…

b4dde06

…stanceInAutoScalingGroup

keithduncan mentioned this pull request Dec 7, 2021

Add support for instance cordoning #972

Open

keithduncan closed this Nov 25, 2022

raylu mentioned this pull request Jul 11, 2024

Max instance lifetime #839

Open
