
Improve service rolling update #7576

Closed
davidMcneil opened this issue Mar 20, 2020 · 6 comments

davidMcneil commented Mar 20, 2020

This epic explores making service rolling updates more useful.

The current rolling update functionality works as follows:

  1. Service group elects an update leader
  2. Update leader continuously checks for updates
  3. When an update is found, the update leader updates:
    • Download the new package
    • Stop the service
    • Start the service
  4. Each follower updates one after another
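For reference, the rolling update behavior described above is opted into when the service is loaded. A minimal sketch, assuming a hypothetical myorigin/redis package and the default service group:

```bash
# Load the service with the rolling update strategy, following the "stable" channel.
hab svc load myorigin/redis --strategy rolling --channel stable --group default
```

With `--strategy rolling`, the Supervisors in the service group elect the update leader described in step 1.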

This progression of events leaves room for some improvement:

  1. It would be nice if we waited for the update to be successful before proceeding to the next peer's update.

    • Should we only ensure the leader was successful, or should we ensure each peer was successful?
    • What should be the measure of success? Health checks seem like the natural choice (see the sketch after this list), but we could add another hook if we wanted even more configurability.
  2. What if the service is not ready to update (e.g. it needs to finish processing a request or drain a queue)? It would be nice if the service could indicate that it was ready for an update.
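As a concrete illustration of "health checks as the measure of success", a rolling update could gate on the service's health_check hook. A minimal sketch, assuming a redis service whose plan exposes a port setting (the redis-cli probe is only an illustrative assumption):

```bash
#!/bin/bash
# hooks/health_check (sketch) - the Supervisor runs this periodically and records
# the result. Exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.

if redis-cli -p {{cfg.port}} ping | grep -q PONG; then
  exit 0   # healthy; a gating rolling update could let the next peer proceed
else
  exit 2   # CRITICAL; a gating rolling update would pause here
fi
```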

Aha! Link: https://chef.aha.io/features/SH-103

danielcbright commented Apr 1, 2020

Hey @davidMcneil! I just wanted to add some thoughts here, as I'm working with customers in the field to implement hab:

Expected behavior (redis cluster for example):

  • start the hab sup on 6 nodes, in a supervisor ring, with 3 permanent peers
  • hab svc load my redis package, configured to do rolling updates from stable
  • hab pkg upload/promote my redis package to stable; while following all of my nodes with journalctl -fu hab-sup, I see the update leader election happen properly, which is great.
  • habitat re-runs the install hook since I've modified something in it - also good
  • habitat should run the run hook through to completion, marking itself "healthy" in the supervisor ring
  • only after the node that is currently updating marks itself "healthy" should the next node in line to update start the process.
  • the update process should also be able to survive reboots, as there are certain circumstances where a cluster needs to have its nodes rebooted one after another; as long as the supervisor comes up, runs the run hook, and marks itself healthy, this shouldn't be a blocker
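For reference, the commands behind the walkthrough above might look like the following (origin, version, release, and peer addresses are illustrative assumptions):

```bash
# On the 3 nodes designated as permanent peers:
hab sup run --permanent-peer

# On the remaining nodes, join the ring through those peers:
hab sup run --peer 10.0.0.1 --peer 10.0.0.2 --peer 10.0.0.3

# On each node, load redis with rolling updates from the stable channel:
hab svc load myorigin/redis --strategy rolling --channel stable

# From a workstation, upload the new build and promote it to stable,
# which is what kicks off the rolling update:
hab pkg upload ./results/myorigin-redis-4.0.14-20200101000000-x86_64-linux.hart
hab pkg promote myorigin/redis/4.0.14/20200101000000 stable

# Watch each Supervisor pick up the change:
journalctl -fu hab-sup
```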

Observed behavior (redis cluster for example):

  • GOOD start the hab sup on 6 nodes, in a supervisor ring, with 3 permanent peers
  • GOOD hab svc load my redis package, configured to do rolling updates from stable
  • GOOD hab pkg upload/promote my redis package to stable; while following all of my nodes with journalctl -fu hab-sup, I see the update leader election happen properly, which is great.
  • GOOD habitat re-runs the install hook since I've modified something in it - also good
  • BAD habitat then runs the run hook, and before it's even finished running, that run hook firing seems to trigger the next node to start updating
  • BAD if my install hook has a reboot in it, the next update node will start its update process around 10 seconds after the node reboots and departs from the supervisor ring; there should be a tunable (and maybe there is and I'm unaware) that allows the supervisor ring to survive a reboot and continue the process.
  • BAD if something happens and I have to go in and manually do some things to save my cluster, there doesn't seem to be a way to halt the update process mid-update; once things are set in motion, they play out all the way to the end

TL;DR - the rolling update process should allow for an explicitly set condition to mark a node's update as "successful" and then move on to the next node, without concern of what happens in-between

davidMcneil commented May 5, 2020

This is an attempt to summarize the specific feature requests and open questions on how to implement those features:

  1. Wait until the leader successfully updates before continuing to the next peer (Rolling Update Strategy does not Respect Health Check #7324)
    • How do we qualify/quantify a successful update?
    • Should we also require the previous peer to update successfully before starting the current peer?
  2. A way to halt the update process or recover from a failing update
    • [Assuming number 1 is implemented] If the leader never successfully updates, how can we cancel the current update?
  3. Allow the service to indicate when it is ready to update (e.g. it needs to finish processing a request or drain a queue); a hypothetical sketch of this follows the list
    • The Supervisor needs a way to tell the service that an update is available. The service needs a way to tell the Supervisor it is ready for the update. How should this communication happen?
  4. Survive node reboots as part of the update process
    • This would be very difficult under the current architecture because when a node disappears it is completely removed from the gossip ring
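For feature 3, one possible shape for that communication (purely hypothetical; no such hook exists today) is a hook the Supervisor would run once it sees a pending update, holding the update until the hook exits zero:

```bash
#!/bin/bash
# hooks/update_ready (HYPOTHETICAL - not an existing Habitat hook)
# The Supervisor would run this when an update is pending; a non-zero exit
# would mean "not yet", and the Supervisor would retry later.

# Example readiness condition: refuse the update while a work queue
# (path is an illustrative assumption) still has entries to drain.
queue_depth=$(wc -l 2>/dev/null < /var/lib/myservice/queue || echo 0)
if [ "$queue_depth" -gt 0 ]; then
  exit 1   # still draining; the Supervisor should wait and retry
fi
exit 0     # ready; the Supervisor may stop the service and update
```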

These are my thoughts on the open questions. I think addressing 1 and 2 has the most payoff (i.e. solves the biggest pain points), so I will only address those for now.

How do we qualify/quantify a successful update?

A successful health check following the update indicates that the update was successful.

Should we also require the previous peer to update successfully before starting the current peer?

I can see arguments for both implementations, but I think the correct answer is "yes".

[Assuming number 1 is implemented] If the leader never successfully updates, how can we cancel the current update?

If the package gets demoted from the channel, that indicates that the update should be canceled.
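Under that design, canceling a bad rolling update would amount to a channel demotion, e.g. (the package identifier is an illustrative assumption):

```bash
# Demote the release from the channel the service group is following;
# the Supervisors would treat this as "cancel the in-flight update".
hab pkg demote myorigin/redis/4.0.14/20200101000000 stable
```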

Roadmap for implementing features 1 and 2:

  1. Add the last health check result to the gossip protocol
  2. Add logic to check the previous peer's health check result before starting an update
  3. Add logic to cancel the update when the package is demoted from the channel
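As a point of reference for step 1, the last health check result is already observable locally through the Supervisor's HTTP gateway; a sketch of how an operator (or the update logic) could read it, assuming the default gateway port 9631 and a redis.default service group:

```bash
# Query the local Supervisor's HTTP gateway for the service's most recent
# health check result; the JSON "status" field reflects the last check.
curl -s http://localhost:9631/services/redis/default/health
```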

Related Issues - #3249 #5325 #5326 #7324

sdelano commented Oct 14, 2020

Comment added by Prashanth Nanjundappa in Aha!

@Trevor Hess who are the customers who have asked for this? Could we tag them in the EPIC?

sdelano commented Jan 29, 2021

Comment added by Lisa Stidham in Aha!

Issue 7576

stale bot commented Jan 30, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.

stale bot added the Stale label Jan 30, 2022
stale bot commented Mar 10, 2022

This issue has been automatically closed after being stale for 400 days. We still value your input and contribution. Please re-open the issue if desired and leave a comment with details.

stale bot closed this as completed Mar 10, 2022