Incorrect telegraf service termination during operating system reboot #14834
Comments
Hi,
I do not have access to Windows Server or Exchange, so I will not be able to reproduce on a similar system. I do have access to Windows 11 and I do not see any difference in reboot time between before the service was installed, while the service was installed, and again after uninstalling the service. My suggestion to you is to do the following:
|
Hi! I did some research. I rolled back the telegraf version step by step and found experimentally that the problem is reliably reproducible starting from version v1.18.1. On version 1.18.0, the operating system terminates the telegraf service correctly and the reboot proceeds without freezing. |
Hi! The reboot problem exists on a server if MS Exchange is installed on it. What changed between version 1.18.0 and version 1.18.1? |
the config you provided did not even interact with exchange right?
that version is from 8 years ago :D and there appear to be no differences related to your report: 1.8.0...1.8.1 |
Instead of laughing at me, you could install the MS Exchange server role and reproduce this problem. The telegraf service, running in this environment with an "empty" config, does not allow the operating system to reboot correctly. We can set up a video meeting where I will show you this strange case. |
I'm not sure why you think I am laughing at you, but I certainly was not.
If you are going to be dismissive of this entire conversation and the back and forth we have had then I am happy to close this issue. I have both tried to reproduce this issue with what I have available to me and worked to ask for additional details from you. As I started off with, I do not have access to exchange. I am not sure what else a video or a video call will provide in terms of resolving the issue. I need something to point to an actual issue in Telegraf itself. You said this only occurs a) when telegraf is run as a service and b) when exchange is installed. However, that does not actually prove an issue with telegraf if an external service also needs to be installed which we are not interacting with. The minimum config you provided collects the processor stats, nothing from exchange. If you remove the |
Let me also ask how are you creating the service? Because this would also support the idea that telegraf itself is not at fault here and that something with how the service is managed or created is causing issues. |
I replaced the input plugin, as you recommended, and nothing changed. :(
The service was installed with the following command in PowerShell: |
I am trying to understand why the telegraf service on version 1.18.0, with the same "empty" telegraf config that does not collect any Exchange metrics, does not block the operating system reboot and successfully writes the system event log record "The Telegraf for Metrics service entered the stopped state." But if I stop the telegraf service, replace the binary with version 1.18.1 and start the service again, the operating system reboot waits in the "stopping services" state (after the reboot, the system event log contains no record that the telegraf service was stopped). |
I see you updated your previous message ;) You previously said 1.8, not 1.18. The diff is a little different now: v1.18.0...v1.18.1
There is one addition to the agent where we attempt to call close outputs. A way you can test if this is the issue is to update your
[[outputs.file]]
  files = ["C:/Program Files/Telegraf/telegraf-metrics_outputs.file.txt"]
  data_format = "influx"
use [[outputs.file]]
Update the service with that new config and give it a try. Thanks! |
Hi! I apologize for misleading you. Note: if you open the URL of each release directly, you can find the zip archive for Windows there. I double-checked everything by repeating the tests and can see that my problem starts between these two versions: 1.18.0rc0 gives a quick reboot, 1.18.0rc1 gives a long reboot. Do you have any idea what the cause could be? |
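The version-by-version rollback described above is a manual bisection. A minimal sketch of the same search in Python follows, in case it helps anyone repeating the test. The version list contains only the releases named in this thread, and `slow_reboot` is a hypothetical stand-in for the manual step (install the build, reboot, time it); it is hard-coded from the rc0/rc1 result reported above and assumes every later release behaves like rc1.

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first version for which is_bad is true.

    Assumes versions are in release order, the first one is good, the last
    one is bad, and once a version is bad every later one is bad too.
    """
    lo, hi = 0, len(versions) - 1  # versions[lo] is good, versions[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid
    return versions[hi]


# Releases named in this thread, in release order (not a complete list).
VERSIONS = ["1.18.0rc0", "1.18.0rc1", "1.18.0", "1.18.1", "1.28.5"]


def slow_reboot(version: str) -> bool:
    # Hypothetical stand-in for the manual reboot test.
    # Assumption: rc0 reboots quickly, rc1 and everything after it does not.
    return VERSIONS.index(version) >= VERSIONS.index("1.18.0rc1")


print(first_bad(VERSIONS, slow_reboot))
```

With a longer release list the same function needs only about log2(n) reboots instead of one per release.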
ok so here is the diff: v1.18.0-rc0...v1.18.0-rc1 Some notable changes:
Can you provide the logs from telegraf-metrics.log for both those versions? I'm curious to see the final set of messages. |
Hi!
Yes, of course! Note: 1.18.0rc0
Messages from system log:
Messages from telegraf log:
1.18.0rc1
Messages from system log:
Messages from telegraf log:
Attachments: |
Both of these messages from the telegraf binary appear in a timely fashion. Meaning telegraf got the signal to shutdown, and completed the shutdown steps successfully. For v1.18.0rc1, if I combine the logs to make it easier to see the delay:
24.02.2024 0:01:06 EventID 1074 The process C:\Windows\system32\wbem\wmiprvse.exe (UK-N2-MBX02) has initiated the restart of computer UK-N2-MBX02...
2024-02-23T17:01:09Z D! [agent] Stopping service inputs
2024-02-23T17:01:09Z D! [agent] Input channel closed
2024-02-23T17:01:09Z I! [agent] Hang on, flushing any cached metrics before shutdown
2024-02-23T17:01:09Z D! [outputs.file] Wrote batch of 11 metrics in 0s
2024-02-23T17:01:09Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics
2024-02-23T17:01:09Z D! [agent] Stopped Successfully
24.02.2024 0:11:11 EventID 7036 The Telegraf for Metrics service entered the stopped state.
24.02.2024 0:14:55 EventID 6005 The Event log service was started.
24.02.2024 0:14:56 EventID 6013 The system uptime is 66 seconds.
There are no changes in the diff between those versions, other than the Go version, that I can think of that would cause some sort of delay. Given you said this only occurs with Exchange, my hunch is with the way the service is created or run or something else on the system, like a virus scanner, is causing issues for you. Questions:
|
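To put a number on the delay in the combined log above, here is a small Python sketch that compares the two timestamp formats. It assumes the Windows event log shows local time at UTC+7; that offset is not stated anywhere in the thread and is inferred from 0:01:06 in the event log lining up with 17:01:09Z in the telegraf log.

```python
from datetime import datetime, timedelta, timezone

# Assumption: event log stamps are local time at UTC+7 (inferred, not stated).
LOCAL = timezone(timedelta(hours=7))


def event_time(stamp: str) -> datetime:
    """Parse an event log stamp such as '24.02.2024 0:11:11' into UTC."""
    local = datetime.strptime(stamp, "%d.%m.%Y %H:%M:%S").replace(tzinfo=LOCAL)
    return local.astimezone(timezone.utc)


def telegraf_time(stamp: str) -> datetime:
    """Parse a telegraf log stamp such as '2024-02-23T17:01:09Z' (already UTC)."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


restart_requested = event_time("24.02.2024 0:01:06")    # EventID 1074
agent_stopped = telegraf_time("2024-02-23T17:01:09Z")   # "[agent] Stopped Successfully"
service_stopped = event_time("24.02.2024 0:11:11")      # EventID 7036

print("restart request -> agent stopped:", agent_stopped - restart_requested)
print("agent stopped -> service stopped:", service_stopped - agent_stopped)
```

Under that offset assumption, the agent finishes its shutdown 3 seconds after the restart request, while the service is only reported as stopped 10 minutes and 2 seconds later.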
Hi, @powersj! I'm researching this problem too.
|
Can you provide some details on your system? Windows version? Do you also have exchange installed? or any other services? How did you come to determine it was telegraf service getting hung? |
All details were provided earlier by Erikov-K. We are researching the same problem together. |
Ah you didn't mention that :) My only remaining thought would be that it has to do with a change to Go itself, as between those two versions we upgraded to go v1.16. You could try building telegraf v1.18rc0 with go v1.16 and if it reproduces that would mean this is something with upstream go's service management code. Otherwise, without the ability to reproduce or any additional logs I am out of ideas. |
I compiled two different executables using different versions of Go (1.16 and 1.15.8) from the same sources for telegraf-1.18.0-rc0,
|
Thanks for trying that out. That would point at this being a change in the upstream Go library that we use to create services. The next step would be to look at what changed between those versions with respect to the service calls we make. |
Additionally,
|
Additionally,
|
Relevant telegraf.conf
Logs from Telegraf
System info
Telegraf 1.28.5, Windows Server 2022, Exchange 2019 CU13
Docker
No response
Steps to reproduce
I have a strange issue: the server hangs during operating system reboot at the "stopping services" step.
My environment consists of the following components:
In telegraf we use the input plugins "inputs.win_perf_counters" and "inputs.win_services" and the output plugin "outputs.prometheus_client".
When the telegraf service is running, the server hangs during the reboot at the "stopping services" step for 10-15 minutes.
In the Windows System log I can see that, during this time, various services go into the stopped state.
I can also see in the telegraf log file the messages "I! [agent] Hang on, flushing any cached metrics before shutdown" and "I! [agent] Stopping running outputs". But the Windows System log does not contain a record about the telegraf service stopping.
At this point, if I connect to the server (using Enter-PSSession) and run the command get-service telegraf, I see that the telegraf service state is Running.
If I then run the command stop-service telegraf, the server reboots immediately, and in the System event log I see the event record "The telegraf service entered the stopped state."
But if I do not stop the telegraf service manually during the reboot, the server stays in the "stopping services" state for 10-15 minutes...
The state of the Exchange services (running or stopped) does not affect the telegraf service state during the system reboot.
Can you reproduce this case and tell me why the telegraf service does not stop correctly?
Expected behavior
When the operating system reboots, the Windows System event log must contain a record saying that the telegraf service stopped successfully.
Actual behavior
When the operating system receives the reboot command, the telegraf log records that the agent is shutting down, but the telegraf service continues running.
Additional info
No response