Reboots do not work when using reboot resource set to delay_mins 0 #1801
Comments
I have an update on this issue. This seems to work as long as you do something like the following:
reboot 'default' do
  delay_mins 1
  action :reboot_now
end
It does not work (most of the time) if you just do the following:
reboot 'default' do
  action :reboot_now
end
I would like to weigh in on this issue, as I am facing similar challenges with the way test-kitchen behaves, although I'm still trying to understand why.

From what I have been able to piece together, when you use a reboot resource, Chef ultimately executes this method: https://github.com/chef/chef/blob/v17.6.2/lib/chef/platform/rebooter.rb#L34

The problem is that once those shutdown commands are executed, things move very fast. I haven't tested on Linux VMs (as I assume you do), but the behavior is likely similar. chef-client tries to terminate in a controlled manner when a reboot is ordered: it immediately raises an exception, which is caught by an error handler that maps it to exit code 35 and exits. This exit code is then returned by the SSH or WinRM transport to kitchen, which then concludes failure... unless the

Regardless, there seems to be a race condition here, because the shutdown executable will more or less immediately instruct all running processes to terminate, including the Ruby process that powers chef-client inside the VM. I have not been able to verify this with chef-client, as it terminates so quickly (and I don't know how to send fake signals on Windows), but Ruby seems to react to Windows signals by exiting immediately, and chef-client does not seem to have anything that hooks into that behavior and delays termination long enough for a clean exit.

As a result, I suspect that the "clean" exit with exit code 35 doesn't happen if Ruby dies before it returns from the shell execution of shutdown, raises the reboot exception, and handles it by mapping it to an exit code and exiting. And even if chef-client is able to exit cleanly, the news might not reach test-kitchen if the SSH server or WinRM server is killed before it can relay the exit code (and close the session). Your error
seems to suggest that your VM's SSH server has indeed been killed and the TCP connection closed abruptly.

In other words, I think delay_mins appears to work because it ensures that Chef can exit the way it wants, and that the transport (SSH or WinRM) can conclude that the command (chef-client) failed due to a reboot order and close gracefully, since it will be another 60 seconds before the OS pulls the plug.

The only immediate solution I can think of would be for chef-client to run the shutdown command in a separate detached process with a 1-2 second delay and then raise the reboot exception immediately, so chef-client can exit on its own terms just before the OS ends it. But I'm not a contributor (yet), so I wouldn't know if that would be acceptable or where to put it. It would be nice if a Chef maintainer could comment on this issue and confirm (or correct) these findings...
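For illustration, the detached, delayed shutdown suggested above could be sketched roughly as follows. This is not Chef's actual implementation: `RebootRequested`, `delayed_shutdown_command`, `schedule_reboot`, `exit_code_for`, and `DEFAULT_SPAWNER` are hypothetical names, the shell command assumes a Linux guest with `sleep` and `shutdown` available, and `pgroup: true` is not supported on Windows. Only the exit code 35 is taken from the discussion above.

```ruby
# Hypothetical sketch of a "detach, delay, then exit cleanly" reboot.
# None of these names come from Chef itself.

# Stand-in for the exception chef-client raises when a reboot is ordered.
class RebootRequested < StandardError; end

# Exit code that signals "reboot scheduled", as described in the thread.
REBOOT_EXIT_CODE = 35

# Build a shell command that waits a short grace period before rebooting.
# Assumes a Linux guest; Integer() rejects non-numeric input.
def delayed_shutdown_command(grace_seconds: 2)
  "sleep #{Integer(grace_seconds)} && shutdown -r now"
end

# Default spawner: start the command in its own process group with its
# output discarded, then detach so the parent does not wait for it.
DEFAULT_SPAWNER = lambda do |command|
  pid = Process.spawn(command, pgroup: true, out: File::NULL, err: File::NULL)
  Process.detach(pid)
  pid
end

# Schedule the reboot in the background, then raise right away so the
# caller can unwind and exit before the OS starts killing processes.
# The spawner is injectable so the flow can be dry-run without rebooting.
def schedule_reboot(grace_seconds: 2, spawner: DEFAULT_SPAWNER)
  spawner.call(delayed_shutdown_command(grace_seconds: grace_seconds))
  raise RebootRequested, "reboot scheduled in #{grace_seconds}s"
end

# Map the reboot exception to an exit code, as an error handler might.
def exit_code_for
  yield
  0
rescue RebootRequested
  REBOOT_EXIT_CODE
end
```

A dry run with a recording spawner, such as `exit_code_for { schedule_reboot(grace_seconds: 1, spawner: ->(cmd) { puts cmd }) }`, returns the reboot exit code without touching the machine. The grace period only needs to be long enough for the process to exit and for the SSH or WinRM session to relay the exit code.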
@ramereth @nielsbuus We had this issue in the past as well,
Hi
and doesn’t work in case
but when I try, it (kitchen converge) fails in both cases
Kitchen.yml
recipes/default.rb
🗣️ Foreword
Thanks for taking the time to fill out this bug report fully. Without it we may not be able to fix the bug, and the issue may be closed without resolution.
👻 Brief Description
Using a reboot resource with :reboot_now to reboot a VM using test-kitchen fails to work.
Version
Test Kitchen version: 3.0.0
Environment
sous-chefs/apparmor#28
Scenario
Rebooting a VM to test disabling AppArmor.
Steps to Reproduce
See PR above for an example and try any of the disable suites.
Expected Result
Rebooting a VM without an error.
Actual Result
➕ Additional context
Add any other context about the problem here. e.g. related issues or existing pull requests.