Detect and report slow IO requests #2371
Merged
avikivity
merged 5 commits into
scylladb:master
from
xemul:br-io-queue-stall-detector-and-report-a
Aug 8, 2024
@avikivity, please review
Looks good. But I'd like it to be optional and opt-in by the application, with the lower threshold configured by the application (like the CPU stall detector). That is part of making it more of a library and less of a framework.
Will be useful for the IO stall detector.

Signed-off-by: Pavel Emelyanov <[email protected]>
The timer is currently in charge of updating the flow-ratio moving average. It is going to perform more "smoothing" updates, so rename it accordingly.

Signed-off-by: Pavel Emelyanov <[email protected]>
Signed-off-by: Pavel Emelyanov <[email protected]>
Detect requests whose execution delay exceeds the configured threshold and print a warning to the logs that includes the number of queued and executing requests for the queue.

In addition, include the number of polls that happened while the request was in the disk. For that, the executing request records the reactor::_polls count when it gets dispatched. This metric should help tell disk stalls from reactor misbehavior. If the request took too long to complete due to disk or kernel problems, the difference in polls count should correspond to the observed delay: polls times latency goal should be comparable to or greater than the request delay. A polls count that is too low would mean that the reactor failed to fetch the completion from the kernel.

The threshold is doubled every time a stall is detected to avoid a storm of warnings. The averaging timer (introduced in the previous patch) slowly decreases the threshold back so that "short" stalls keep being caught. The initial threshold is set to 10x the IO latency goal.

Signed-off-by: Pavel Emelyanov <[email protected]>
This option controls the IO stall detector's minimal threshold. By default it is set to the maximum value, which turns the detector off.

Signed-off-by: Pavel Emelyanov <[email protected]>
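The threshold handling and the polls check described in the commit messages above can be sketched roughly as follows. This is a minimal standalone illustration, not the code from the patch: the names `stall_threshold_sketch`, `on_complete`, `on_timer` and `polls_explain_delay` are hypothetical, and the decay step (shrinking by 1/16 per timer tick) is an assumption, since the commit message only says the threshold decreases "slowly".

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>

using duration = std::chrono::microseconds;

// Hypothetical stand-in for the per-queue detector state (not Seastar's class).
class stall_threshold_sketch {
    duration _min;      // configured lower bound for the threshold
    duration _current;  // threshold applied to the next completed request
public:
    explicit stall_threshold_sketch(duration min_threshold)
        : _min(min_threshold), _current(min_threshold) {
        assert(min_threshold > duration::zero());
    }

    // Called when a request completes with the given execution delay.
    // Returns true when a warning should be logged.
    bool on_complete(duration delay) {
        if (delay <= _current) {
            return false;
        }
        _current *= 2;  // double the threshold to avoid a storm of warnings
        return true;
    }

    // Called from the periodic averaging timer.
    // ASSUMPTION: a 1/16 decay per tick; the real rate is not stated in the text.
    void on_timer() {
        _current = std::max(_min, _current - _current / 16);
    }

    duration current() const { return _current; }
};

// Polls check: if the reactor polled often enough while the request was in
// the disk, polls * latency_goal covers the delay and the disk or kernel is
// the likely culprit. Otherwise the reactor itself probably failed to poll.
// The strict >= is a simplification of "comparable to or greater than".
inline bool polls_explain_delay(std::uint64_t polls_at_dispatch,
                                std::uint64_t polls_at_complete,
                                duration latency_goal,
                                duration delay) {
    auto polls = static_cast<duration::rep>(polls_at_complete - polls_at_dispatch);
    return latency_goal * polls >= delay;
}
```

Constructing the sketch with `duration::max()` as the minimal threshold mirrors the default described for the option: no finite delay exceeds it, so `on_complete` never reports a stall.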
xemul
force-pushed
the
br-io-queue-stall-detector-and-report-a
branch
from
August 7, 2024 07:14
d0b1cc4
to
7911006
Compare
xemul
added a commit
to scylladb/scylla-seastar
that referenced
this pull request
Aug 21, 2024
scylladb/seastar#2371
* xemul/br-io-queue-stall-detector-and-report-a:
  reactor: Add --io-completion-notify-ms option
  io_queue: Stall detector
  io_queue: Keep local variable with request execution delay
  io_queue: Rename flow ratio timer to be more generic
  reactor: Export _polls counter (internally)
xemul
added a commit
to scylladb/scylla-seastar
that referenced
this pull request
Sep 4, 2024
scylladb/seastar#2371
* xemul/br-io-queue-stall-detector-and-report-a:
  reactor: Add --io-completion-notify-ms option
  io_queue: Stall detector
  io_queue: Keep local variable with request execution delay
  io_queue: Rename flow ratio timer to be more generic
  reactor: Export _polls counter (internally)
The scheduler's goal is to make sure that dispatched requests complete no later than the io-latency-goal duration after dispatch (it defaults to 1.5 times the reactor CPU latency goal). When the disk or kernel slows down, the flow-ratio guard slows down the dispatch rate as well [1]. Sometimes, however, that is not enough and requests take even longer to execute. Slowly executing requests are indirectly reported via the total-disk-delay metric [2], but this metric accumulates several requests into one counter, so spikes can be smoothed out by good requests. This detector aims to detect individual slow requests and log them along with related statistics that should help in understanding what is going on.
refs: #1766 [1] (eefa837) [2] (0238d25)
fixes: #1311
closes: #1609
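To illustrate why an accumulated delay counter can hide a single slow request, here is a small standalone sketch. The helper names `accumulated_delay` and `count_slow` are hypothetical and the numbers in the note below are made up for illustration; neither comes from the Seastar code.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

using duration = std::chrono::microseconds;

// What an accumulating metric sees: one sum over all requests.
inline duration accumulated_delay(const std::vector<duration>& delays) {
    duration total{0};
    for (auto d : delays) {
        total += d;
    }
    return total;
}

// What a per-request detector sees: each delay compared to a threshold.
inline std::size_t count_slow(const std::vector<duration>& delays, duration threshold) {
    std::size_t n = 0;
    for (auto d : delays) {
        if (d > threshold) {
            ++n;
        }
    }
    return n;
}
```

With 99 requests delayed by 500us each and one delayed by 50ms, the accumulated delay is 99.5ms, which averages to just under 1ms per request and looks unremarkable. A per-request check against an illustrative threshold of 7.5ms still flags the single outlier.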