
Improve service rolling update #7576

Closed
davidMcneil opened this issue Mar 20, 2020 · 6 comments

davidMcneil commented Mar 20, 2020

This epic explores making service rolling updates more useful.

The current rolling update functionality works as follows:

  1. Service group elects an update leader
  2. Update leader continuously checks for updates
  3. When an update is found, the update leader updates:
    • Download the new package
    • Stop the service
    • Start the service
  4. Each follower updates one after another
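For reference, the rolling update behavior described above is opted into when the service is loaded. A minimal sketch, assuming a hypothetical myorigin/redis package and the default service group:

```bash
# Load the service with the rolling update strategy, following the "stable" channel.
hab svc load myorigin/redis --strategy rolling --channel stable --group default
```

With `--strategy rolling`, the Supervisors in the service group elect the update leader described in step 1.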

This progression of events leaves room for some improvement:

  1. It would be nice if we waited for the update to be successful before proceeding to the next peer's update.

    • Should we only ensure the leader was successful, or should we ensure each peer was successful?
    • What should be the measure of success? Health checks seem like the natural choice (see the sketch after this list), but we could add another hook if we wanted even more configurability.
  2. What if the service is not ready to update (e.g. it needs to finish processing a request or drain a queue)? It would be nice if the service could indicate that it was ready for an update.
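As a concrete illustration of "health checks as the measure of success", a rolling update could gate on the service's health_check hook. A minimal sketch, assuming a redis service whose plan exposes a port setting (the redis-cli probe is only an illustrative assumption):

```bash
#!/bin/bash
# hooks/health_check (sketch) - the Supervisor runs this periodically and records
# the result. Exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.

if redis-cli -p {{cfg.port}} ping | grep -q PONG; then
  exit 0   # healthy; a gating rolling update could let the next peer proceed
else
  exit 2   # CRITICAL; a gating rolling update would pause here
fi
```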

Aha! Link: https://chef.aha.io/features/SH-103

danielcbright commented Apr 1, 2020

Hey @davidMcneil! I just wanted to add some thoughts here, as I'm working with customers in the field to implement hab:

Expected behavior (redis cluster for example):

  • start the hab sup on 6 nodes, in a supervisor ring, with 3 permanent peers
  • hab svc load my redis package, configured to do rolling updates from stable
  • hab pkg upload/promote my redis package to stable; while following all of my nodes with journalctl -fu hab-sup, I see the update leader election happen properly, which is great.
  • habitat re-runs the install hook since I've modified something in it - also good
  • habitat should run the run hook through to completion, marking itself "healthy" in the supervisor ring
  • only after the node that is currently updating marks itself "healthy" should the next node in line to update start the process.
  • the update process should also be able to survive reboots, as there are certain circumstances where a cluster needs to have its nodes rebooted one after another; as long as the supervisor comes up, runs the run hook, and marks itself healthy, this shouldn't be a blocker
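For reference, the commands behind the walkthrough above might look like the following (origin, version, release, and peer addresses are illustrative assumptions):

```bash
# On the 3 nodes designated as permanent peers:
hab sup run --permanent-peer

# On the remaining nodes, join the ring through those peers:
hab sup run --peer 10.0.0.1 --peer 10.0.0.2 --peer 10.0.0.3

# On each node, load redis with rolling updates from the stable channel:
hab svc load myorigin/redis --strategy rolling --channel stable

# From a workstation, upload the new build and promote it to stable,
# which is what kicks off the rolling update:
hab pkg upload ./results/myorigin-redis-4.0.14-20200101000000-x86_64-linux.hart
hab pkg promote myorigin/redis/4.0.14/20200101000000 stable

# Watch each Supervisor pick up the change:
journalctl -fu hab-sup
```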

Observed behavior (redis cluster for example):

  • GOOD start the hab sup on 6 nodes, in a supervisor ring, with 3 permanent peers
  • GOOD hab svc load my redis package, configured to do rolling updates from stable
  • GOOD hab pkg upload/promote my redis package to stable; while following all of my nodes with journalctl -fu hab-sup, I see the update leader election happen properly, which is great.
  • GOOD habitat re-runs the install hook since I've modified something in it - also good
  • BAD habitat then runs the run hook, and before it's even finished running, that run hook firing seems to trigger the next node to start updating
  • BAD if my install hook has a reboot in it, the next update node will start its update process around 10 seconds after the node reboots and departs from the supervisor ring; there should be a tunable (and maybe there is and I'm unaware) that allows the supervisor ring to survive a reboot and continue the process.
  • BAD if something happens and I have to go in and manually do some things to save my cluster, there doesn't seem to be a way to halt the update process mid-update; once things are set in motion, they play out all the way to the end

TL;DR - the rolling update process should allow for an explicitly set condition to mark a node's update as "successful" and then move on to the next node, without concern of what happens in-between

davidMcneil commented May 5, 2020

This is an attempt to summarize the specific feature requests and open questions on how to implement those features:

  1. Wait until the leader successfully updates before continuing to the next peer (Rolling Update Strategy does not Respect Health Check #7324)
    • How do we qualify/quantify a successful update?
    • Should we also require the previous peer to update successfully before starting the current peer?
  2. A way to halt the update process or recover from a failing update
    • [Assuming number 1 is implemented] If the leader never successfully updates, how can we cancel the current update?
  3. Allow the service to indicate when it is ready to update (e.g. it needs to finish processing a request or drain a queue); a hypothetical sketch of this follows the list
    • The Supervisor needs a way to tell the service that an update is available. The service needs a way to tell the Supervisor it is ready for the update. How should this communication happen?
  4. Survive node reboots as part of the update process
    • This would be very difficult under the current architecture because when a node disappears it is completely removed from the gossip ring
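For feature 3, one possible shape for that communication (purely hypothetical; no such hook exists today) is a hook the Supervisor would run once it sees a pending update, holding the update until the hook exits zero:

```bash
#!/bin/bash
# hooks/update_ready (HYPOTHETICAL - not an existing Habitat hook)
# The Supervisor would run this when an update is pending; a non-zero exit
# would mean "not yet", and the Supervisor would retry later.

# Example readiness condition: refuse the update while a work queue
# (path is an illustrative assumption) still has entries to drain.
queue_depth=$(wc -l 2>/dev/null < /var/lib/myservice/queue || echo 0)
if [ "$queue_depth" -gt 0 ]; then
  exit 1   # still draining; the Supervisor should wait and retry
fi
exit 0     # ready; the Supervisor may stop the service and update
```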

These are my thoughts on the open questions. I think addressing 1 and 2 has the most payoff (i.e. solves the biggest pain points), so I will only address those for now.

How do we qualify/quantify a successful update?

A successful health check following the update indicates that the update was successful.

Should we also require the previous peer to update successfully before starting the current peer?

I can see arguments for both implementations, but I think the correct answer is "yes".

[Assuming number 1 is implemented] If the leader never successfully updates, how can we cancel the current update?

If the package gets demoted from the channel, that indicates that the update should be canceled.
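Under that design, canceling a bad rolling update would amount to a channel demotion, e.g. (the package identifier is an illustrative assumption):

```bash
# Demote the release from the channel the service group is following;
# the Supervisors would treat this as "cancel the in-flight update".
hab pkg demote myorigin/redis/4.0.14/20200101000000 stable
```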

Roadmap for implementing features 1 and 2:

  1. Add the last health check result to the gossip protocol
  2. Add logic to check the previous peer's health check result before starting an update
  3. Add logic to cancel the update when the package is demoted from the channel
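As a point of reference for step 1, the last health check result is already observable locally through the Supervisor's HTTP gateway; a sketch of how an operator (or the update logic) could read it, assuming the default gateway port 9631 and a redis.default service group:

```bash
# Query the local Supervisor's HTTP gateway for the service's most recent
# health check result; the JSON "status" field reflects the last check.
curl -s http://localhost:9631/services/redis/default/health
```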

Related Issues - #3249 #5325 #5326 #7324

sdelano commented Oct 14, 2020

Comment added by Prashanth Nanjundappa in Aha!

@Trevor Hess who are the customers who have asked for this? Could we tag them in the EPIC?

sdelano commented Jan 29, 2021

Comment added by Lisa Stidham in Aha!

Issue 7576

stale bot commented Jan 30, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.

stale bot added the Stale label Jan 30, 2022
stale bot commented Mar 10, 2022

This issue has been automatically closed after being stale for 400 days. We still value your input and contribution. Please re-open the issue if desired and leave a comment with details.

stale bot closed this as completed Mar 10, 2022