Detect and report slow IO requests #2371

Merged

Conversation

xemul
Contributor

@xemul xemul commented Jul 26, 2024

The scheduler's goal is to make sure that dispatched requests complete no later than the io-latency-goal duration (which defaults to 1.5 times the reactor CPU latency goal). When the disk or kernel slows down, the flow-ratio guard slows down the dispatch rate as well [1]. However, sometimes that is not enough and request execution is delayed even further. Slowly executing requests are indirectly reported via the total-disk-delay metrics [2], but this metric accumulates several requests into one counter, so spikes can be smoothed out by good requests. This detector is aimed at detecting individual slow requests and logging them along with some related statistics that should help to understand what's going on.

refs: #1766 [1] (eefa837) [2] (0238d25)
fixes: #1311
closes: #1609
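
To illustrate why an accumulated delay counter can hide a single slow request, here is a minimal standalone sketch. It is not Seastar code: `delay_t`, `mean_delay` and `count_slow` are hypothetical names, and the numbers used with them are made up for illustration.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical alias for a per-request execution delay (not a Seastar type).
using delay_t = std::chrono::microseconds;

// What an accumulated counter effectively exposes: total delay / request count.
inline delay_t mean_delay(const std::vector<delay_t>& delays) {
    if (delays.empty()) {
        return delay_t::zero();
    }
    const auto total = std::accumulate(delays.begin(), delays.end(), delay_t::zero());
    return total / static_cast<delay_t::rep>(delays.size());
}

// What a per-request detector looks at: each delay against a threshold.
inline std::size_t count_slow(const std::vector<delay_t>& delays, delay_t threshold) {
    return static_cast<std::size_t>(
        std::count_if(delays.begin(), delays.end(),
                      [threshold](delay_t d) { return d > threshold; }));
}
```

With 99 requests taking 500us each and one taking 100ms, the mean stays at 1495us, well below a 7.5ms threshold, while the per-request check still flags the one slow request.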

@xemul xemul requested a review from avikivity July 26, 2024 10:46
@xemul
Contributor Author

xemul commented Aug 6, 2024

@avikivity , please review

@avikivity
Member

Looks good. But I'd like it to be optional and opt-in by the application, and the lower threshold configured from the application (like the CPU stall detector). Part of making it more of a library and less of a framework.

Will be useful for IO stall detector

Signed-off-by: Pavel Emelyanov <[email protected]>
The timer is currently in charge of updating the flow-ratio moving average.
It's going to do more "smoothing" updates, so rename it accordingly.

Signed-off-by: Pavel Emelyanov <[email protected]>
Detect requests with an execution delay larger than the configured
threshold and print a warning into the logs that includes the number of
queued and executing requests for the queue.

Additionally, include the number of polls that happened while the
request was on disk. For that, the executing request records the
reactor::_polls count when it gets dispatched. This metric should help
tell disk stalls from reactor misbehavior. If the request didn't
complete for too long due to disk/kernel problems, the polls count
difference should correspond to the observed delay: polls times latency
goal should be comparable to or greater than the request delay. Too low
a polls count would mean that the reactor failed to fetch the completion
from the kernel.
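
The poll-count reasoning above can be sketched as a small standalone check. This is only an illustration under assumptions, not the code from the patch: `stall_suspect` and `classify_stall` are hypothetical names, and "comparable to or greater than" is approximated by a plain `>=` comparison.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

using namespace std::chrono_literals;

// Hypothetical classification of where a slow request spent its time.
enum class stall_suspect { disk_or_kernel, reactor };

// polls_at_dispatch:   poll counter value recorded when the request was dispatched
// polls_at_completion: poll counter value when its completion was processed
// latency_goal:        expected upper bound on the time between two polls
// request_delay:       observed execution delay of the request
inline stall_suspect classify_stall(std::uint64_t polls_at_dispatch,
                                    std::uint64_t polls_at_completion,
                                    std::chrono::microseconds latency_goal,
                                    std::chrono::microseconds request_delay) {
    assert(polls_at_completion >= polls_at_dispatch);
    const auto polls = static_cast<std::int64_t>(polls_at_completion - polls_at_dispatch);
    // Time the reactor could have covered if every poll took up to the latency goal.
    const std::chrono::microseconds polled_time = polls * latency_goal;
    // Enough polls happened: the reactor kept looking, so the completion was late
    // coming from the disk or kernel. Too few polls: the reactor itself was not
    // polling often enough to fetch the completion.
    return polled_time >= request_delay ? stall_suspect::disk_or_kernel
                                        : stall_suspect::reactor;
}
```

For example, with an assumed 750us latency goal, a 500ms delay observed across 1000 polls points at the disk or kernel, while the same delay across only 3 polls points at the reactor.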

The threshold is doubled every time a stall is detected to avoid a storm
of warnings. The averaging timer (introduced in the previous patch) slowly
decreases the threshold back to keep catching "short" stalls. The initial
threshold is set to 10x the io latency goal.
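
A minimal sketch of the double-on-detect, decay-on-timer scheme described above. The class name `stall_threshold` and its methods are hypothetical, and the 10% decay per timer tick is an assumption made for illustration; the commit message only says the threshold decreases "slowly".

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Hypothetical adaptive threshold, not the class used in the patch.
class stall_threshold {
public:
    using duration = std::chrono::microseconds;

    // `floor` is the minimal threshold the detector relaxes back to.
    explicit stall_threshold(duration floor) : _floor(floor), _current(floor) {
        assert(floor > duration::zero());
    }

    // Returns true when `delay` should be reported as a stall. On a report,
    // the threshold is doubled so that a burst of slow requests does not
    // flood the log.
    bool check(duration delay) {
        if (delay <= _current) {
            return false;
        }
        // Saturate instead of overflowing when the threshold is already huge.
        _current = (_current <= duration::max() / 2) ? _current * 2 : duration::max();
        return true;
    }

    // Called from a periodic timer: shrink the threshold by 10% (assumed
    // rate), never going below the floor.
    void relax() {
        _current = std::max(_floor, _current - _current / 10);
    }

    duration current() const { return _current; }

private:
    duration _floor;
    duration _current;
};
```

Starting from a 7.5ms floor, one 10ms request is reported and raises the threshold to 15ms; a second 10ms request is then ignored until enough `relax()` calls bring the threshold back down.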

Signed-off-by: Pavel Emelyanov <[email protected]>
This option controls the IO stall detector's minimal threshold. By default
it's set to the maximum, thus turning the detector off.
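
Why a maximum threshold disables the detector can be shown with a short sketch. The helper names `threshold_from_option` and `is_stall` are hypothetical, and modelling the unset option as `std::nullopt` is an assumption; the actual option handling in the patch may differ.

```cpp
#include <chrono>
#include <optional>

using stall_duration = std::chrono::microseconds;

// Hypothetical mapping from an optional user-supplied value in milliseconds
// to the detector's minimal threshold. When the application does not opt in,
// the threshold is the largest representable duration.
inline stall_duration threshold_from_option(std::optional<unsigned> notify_ms) {
    if (!notify_ms) {
        return stall_duration::max();
    }
    return std::chrono::duration_cast<stall_duration>(
        std::chrono::milliseconds(*notify_ms));
}

// A request counts as stalled only if its delay exceeds the threshold, so a
// threshold of stall_duration::max() can never be exceeded.
inline bool is_stall(stall_duration delay, stall_duration threshold) {
    return delay > threshold;
}
```

With no value supplied, even a day-long delay is not reported; with a value of 20 (milliseconds), a 50ms delay is reported and a 10ms delay is not.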

Signed-off-by: Pavel Emelyanov <[email protected]>
@xemul xemul force-pushed the br-io-queue-stall-detector-and-report-a branch from d0b1cc4 to 7911006 Compare August 7, 2024 07:14
@xemul
Contributor Author

xemul commented Aug 7, 2024

upd:

  • rebased (no conflicts)
  • fixed formatting of request execution delay
  • made threshold configurable

@avikivity avikivity merged commit 144aa53 into scylladb:master Aug 8, 2024
14 checks passed
xemul added a commit to scylladb/scylla-seastar that referenced this pull request Aug 21, 2024
scylladb/seastar#2371

* xemul/br-io-queue-stall-detector-and-report-a:
  reactor: Add --io-completion-notify-ms option
  io_queue: Stall detector
  io_queue: Keep local variable with request execution delay
  io_queue: Rename flow ratio timer to be more generic
  reactor: Export _polls counter (internally)
xemul added a commit to scylladb/scylla-seastar that referenced this pull request Sep 4, 2024
scylladb/seastar#2371

* xemul/br-io-queue-stall-detector-and-report-a:
  reactor: Add --io-completion-notify-ms option
  io_queue: Stall detector
  io_queue: Keep local variable with request execution delay
  io_queue: Rename flow ratio timer to be more generic
  reactor: Export _polls counter (internally)
Successfully merging this pull request may close these issues.

Disk stall detector
2 participants