Replace lifecycled and self-termination with ASG lifecycle hooks #964

Closed
wants to merge 42 commits into from

keithduncan
Contributor

@keithduncan keithduncan commented Nov 22, 2021

Experimental look at replacing instance self termination with ASG guided lifecycle management via lifecycle hooks, EventBridge, and SSM.

If this proves reliable, it would be the foundation for adding a series of improvements.

The boot hook enables:

  • instance warm pool by decoupling the agent lifetime and boot from instance boot and delaying agent start until the instance moves to InService

The boot hook could also be made conditional on enabling the warm pool, to reduce the impact that landing this may have on the reliability of booting agents.

The shutdown hook enables:

  • max instance lifetime, allowing periodic rotation of the base load instances in stacks with a MinSize>0, ensuring agents stay healthy
  • capacity rebalance, enabling pre-emptive spot termination ahead of the hard two-minute termination deadline, which gives a longer termination window and less build interruption
  • re-enabling instance refresh, allowing manual refresh of all instances for agent health
  • removing the confusing-to-customers idle exit of agents on the static MinSize>0 instances
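
As a rough illustration of the data flow behind both hooks, the sketch below maps an EventBridge lifecycle event to the arguments for the Auto Scaling CompleteLifecycleAction call. It is only a sketch: in this PR the hooks are executed by SSM Automation, not Python, and the function name `completion_params` is hypothetical. The event field names follow the shape of EC2 Auto Scaling lifecycle action events.

```python
from typing import Any, Dict

# Transition names used by EC2 Auto Scaling lifecycle hooks.
LAUNCHING = "autoscaling:EC2_INSTANCE_LAUNCHING"
TERMINATING = "autoscaling:EC2_INSTANCE_TERMINATING"


def completion_params(event: Dict[str, Any], result: str = "CONTINUE") -> Dict[str, str]:
    """Build CompleteLifecycleAction arguments from a lifecycle hook event.

    Hypothetical helper: the real executor in this PR is an SSM Automation.
    """
    if result not in ("CONTINUE", "ABANDON"):
        raise ValueError(f"unsupported lifecycle action result: {result}")
    detail = event["detail"]
    if detail.get("LifecycleTransition") not in (LAUNCHING, TERMINATING):
        raise ValueError("not a lifecycle hook event")
    return {
        "AutoScalingGroupName": detail["AutoScalingGroupName"],
        "LifecycleHookName": detail["LifecycleHookName"],
        "LifecycleActionToken": detail["LifecycleActionToken"],
        "InstanceId": detail["EC2InstanceId"],
        "LifecycleActionResult": result,
    }
```

For a shutdown hook, the agent would be stopped gracefully before these arguments are sent, so the instance is only released once its running jobs have finished.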

Finally, a global service side spot interruption notice rule triggers instance termination using an SSM Automation that allows us to remove lifecycled from the instances.

This could also ultimately be combined with a termination lambda to guide termination selection towards instances that are presently idle (and the natural race condition there would be handled by the termination lifecycle hook).

Some reasons this could be bad and that we shouldn’t do it:

  • SSM Automation as the executor of the lifecycle hooks could prove unreliable. Spot checking at small scale has shown it to work, but how will it fare at a scale of hundreds or thousands of nodes?
  • Removing the idle exit means instances are more likely to be terminated immediately after they finish their work. Scaling will be more responsive overall, but because instances exit more frequently their caches are less likely to be warm. This is a double-edged sword: more frequent exits and removal of the idle period will reduce the cost of operating this, as instances will not sit idle for long, but this needs to be weighed against the expense of potentially longer builds. Should we in fact combine the idle exit behaviour with a termination lifecycle hook to address ASG-initiated termination? We could potentially expose this as a customer-facing option and keep instances protected from scale-in until the "idle exit period" elapses.
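
One way to picture the "protected from scale-in until the idle exit period" idea is as a small decision function. This is a sketch under assumptions, not something in this PR: the function name and its parameters are hypothetical, and how the result would be applied (for example via instance scale-in protection) is left open.

```python
def should_protect_from_scale_in(
    running_jobs: int, idle_seconds: float, idle_exit_period: float
) -> bool:
    """Decide whether an instance should stay protected from scale-in.

    Hypothetical policy: protect while any job is running, and keep
    protecting an idle instance until it has been idle for the full
    idle exit period, so its warm caches remain available for new work.
    """
    if running_jobs > 0:
        return True
    return idle_seconds < idle_exit_period
```

With a 600 second idle exit period, an instance that has been idle for 30 seconds would stay protected, while one that has been idle for 900 seconds would become eligible for scale-in.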

TODO

  • Windows support
  • Re-add support for the IdleExit period so that instances aren’t immediately reclaimed on scale-in
  • The lifecycle hooks could be moved to the Auto Scaling group’s LifecycleHookSpecificationList and the IAM roles required for them added to the ASG’s DependsOn section. This should ensure CloudFormation creates and destroys resources in the right order during stack create and delete, and prevent teardown of the hook roles and SSM automations until the ASG itself has been destroyed
  • Review the zero-downtime-ness of this template structure when updating stacks: can it safely replace the ASG in an update without interrupting jobs? (Can it today? Probably yes, with instance protection from scale-in.)
  • Note in the release notes for any release that incorporates this change that the service role has changed
  • Filter the "EC2 Spot Instance Interruption Warning" to instances that belong to the Auto Scaling group being managed, to ensure we don’t terminate random instances
  • Re-add support for one-shot jobs. Previously this relied on the terminate-instance script running after the agent exited, which has been removed. That script and its systemd unit dependency should be added conditionally, based on whether that template parameter is set, to preserve the behaviour under this system. Unlike before, we should disable the agent systemd unit restart completely when this behaviour is desired, so that the agent isn’t allowed to restart and grab a new job.
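
The spot interruption filter from the list above could look roughly like the following. This is a minimal sketch, assuming the membership check is done by a lookup performed when the event is handled; `instance_to_terminate` and `lookup_group` are hypothetical names, with `lookup_group` standing in for a query such as the Auto Scaling DescribeAutoScalingInstances API.

```python
from typing import Any, Callable, Dict, Optional

SPOT_WARNING = "EC2 Spot Instance Interruption Warning"


def instance_to_terminate(
    event: Dict[str, Any],
    managed_group: str,
    lookup_group: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the instance ID to terminate, or None if the event is not ours.

    lookup_group is a hypothetical callable that maps an instance ID to the
    name of its Auto Scaling group, or None if it belongs to no group.
    """
    if event.get("source") != "aws.ec2" or event.get("detail-type") != SPOT_WARNING:
        return None
    instance_id = event["detail"]["instance-id"]
    if lookup_group(instance_id) != managed_group:
        return None  # Instance is outside the managed ASG: leave it alone.
    return instance_id
```

Injecting the lookup keeps the decision logic testable without AWS credentials, and makes it explicit that instances outside the managed group are ignored rather than relying on the terminate call failing.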

Fixes #943
Fixes #944

@@ -176,7 +176,6 @@ experiment="${BUILDKITE_AGENT_EXPERIMENTS}"
priority=%n
spawn=${BUILDKITE_AGENTS_PER_INSTANCE}
no-color=true
disconnect-after-idle-timeout=${BUILDKITE_SCALE_IN_IDLE_PERIOD}
disconnect-after-job=${BUILDKITE_TERMINATE_INSTANCE_AFTER_JOB}
Contributor Author

Will need a story for this

@keithduncan keithduncan force-pushed the keithduncan/add-asg-lifecycle branch from bd9b774 to ca1e1c4 Compare November 23, 2021 04:07
  Type: AWS::Events::Rule
  Properties:
    Description: !Sub Run the spot interruption AWS SSM Automation for ${AgentAutoScaleGroup}
    EventPattern:
Contributor Author

Since there’s no way to capture just instances that belong to the Auto Scaling group from the event, some tests to perform:

  • What happens if an EC2 Spot instance outside the ASG experiences an interruption? Does the autoscaling:TerminateInstanceInAutoScalingGroup action fail and ignore the instance?

@keithduncan keithduncan force-pushed the keithduncan/add-asg-lifecycle branch from bd1dc71 to c1b4421 Compare November 23, 2021 08:59
Development

Successfully merging this pull request may close these issues.

Enable AzRebalance and Capacity Rebalance processes
Support Auto Scaling group Instance Refresh
2 participants