Detect and report slow IO requests #2371

xemul · 2024-07-26T10:46:15Z

The scheduler's goal is to make sure that dispatched requests complete no later than the io-latency-goal duration (which defaults to 1.5 times the reactor CPU latency goal). When the disk or kernel slows down, a flow-ratio guard slows down the dispatch rate as well [1]. However, sometimes this is not enough and requests are delayed even further. Slowly executing requests are indirectly reported via the total-disk-delay metric [2], but this metric accumulates several requests into one counter, so spikes can be smoothed out by well-behaved requests. This detector aims to detect individual slow requests and log them along with some related statistics that should help explain what is going on.
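To illustrate why an accumulated counter can hide a single slow request, here is a minimal sketch. It is not Seastar code: `average_delay` and `count_slow` are hypothetical helpers, and the window of delays is made up for the example.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

using ms = std::chrono::milliseconds;

// Hypothetical helper: mean execution delay over a window of requests,
// roughly what one can derive from an accumulated delay counter.
inline ms average_delay(const std::vector<ms>& delays) {
    if (delays.empty()) {
        return ms(0);
    }
    ms total = std::accumulate(delays.begin(), delays.end(), ms(0));
    return total / static_cast<ms::rep>(delays.size());
}

// Hypothetical helper: number of individual requests whose delay
// exceeds the given threshold (the per-request view).
inline std::size_t count_slow(const std::vector<ms>& delays, ms threshold) {
    return static_cast<std::size_t>(
        std::count_if(delays.begin(), delays.end(),
                      [threshold](ms d) { return d > threshold; }));
}
```

With 99 requests of 1 ms and one request of 500 ms, the average comes out at about 5 ms and looks harmless, while the per-request check against a 100 ms threshold still flags the single outlier.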

refs: #1766 [1] (eefa837) [2] (0238d25)
fixes: #1311
closes: #1609

xemul · 2024-08-06T10:22:35Z

@avikivity , please review

avikivity · 2024-08-06T10:37:10Z

Looks good. But I'd like it to be optional and opt-in by the application, and the lower threshold configured from the application (like the CPU stall detector). Part of making it more of a library and less of a framework.

io_queue: Stall detector

Detect requests with an execution delay larger than the configured threshold and print a warning into the logs that includes the number of queued and executing requests for the queue.

In addition, include the number of polls that happened while the request was in disk. For that, the executing request records the reactor::_polls count when it gets dispatched. This metric should help tell disk stalls from reactor misbehavior. If the request did not complete for too long due to disk/kernel problems, the polls count difference should correspond to the observed delay: polls times latency goal should be comparable to or greater than the request delay. Too low a polls count would mean that the reactor failed to fetch the completion from the kernel.

The threshold is doubled every time a stall is detected to avoid a storm of warnings. The averaging timer (introduced in the previous patch) slowly decreases the threshold back to keep catching "short" stalls. The initial threshold is set to 10x the IO latency goal.

Signed-off-by: Pavel Emelyanov <[email protected]>
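The logic described in this commit message can be sketched roughly as follows. This is not the patch itself: the class and member names are hypothetical, the 10% decay step per timer tick is an assumption (the message only says the threshold decreases "slowly"), and the strict comparison used for the polls heuristic is one possible reading of "comparable to or greater than".

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <optional>

using duration = std::chrono::microseconds;

// Hypothetical report produced when a request is considered stalled.
struct stall_report {
    duration delay;
    uint64_t polls;        // reactor polls while the request was in disk
    bool reactor_suspect;  // too few polls to account for the delay
};

class stall_detector_sketch {
    duration _min_threshold;
    duration _threshold;
    duration _latency_goal;
public:
    stall_detector_sketch(duration min_threshold, duration latency_goal)
        : _min_threshold(min_threshold)
        , _threshold(min_threshold)
        , _latency_goal(latency_goal) {}

    // Called when a request completes. polls_at_dispatch is the poll
    // counter value recorded when the request was dispatched.
    std::optional<stall_report> on_complete(duration delay,
                                            uint64_t polls_at_dispatch,
                                            uint64_t polls_now) {
        if (delay <= _threshold) {
            return std::nullopt;
        }
        uint64_t polls = polls_now - polls_at_dispatch;
        // Heuristic from the description: polls * latency goal should be
        // comparable to or greater than the delay. If it is smaller, the
        // reactor probably did not poll for completions often enough.
        bool reactor_suspect =
            _latency_goal * static_cast<duration::rep>(polls) < delay;
        _threshold *= 2;  // back off to avoid a storm of warnings
        return stall_report{delay, polls, reactor_suspect};
    }

    // Called from the periodic averaging timer. The 10% step is an
    // assumption made for this sketch.
    void on_timer() {
        _threshold = std::max(_min_threshold, _threshold - _threshold / 10);
    }

    duration threshold() const { return _threshold; }
};
```

A caller would log the returned report together with the queue's queued and executing request counts. Keeping the decay bounded by the minimal threshold means the detector returns to its configured sensitivity once stalls stop.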

xemul · 2024-08-07T07:14:49Z

upd:

rebased (no conflicts)
fixed formatting of request execution delay
made threshold configurable

scylladb/seastar#2371

* xemul/br-io-queue-stall-detector-and-report-a:
  reactor: Add --io-completion-notify-ms option
  io_queue: Stall detector
  io_queue: Keep local variable with request execution delay
  io_queue: Rename flow ratio timer to be more generic
  reactor: Export _polls counter (internally)

xemul requested a review from avikivity July 26, 2024 10:46

xemul added 5 commits August 7, 2024 09:29

reactor: Export _polls counter (internally)

089bb01

Will be useful for IO stall detector Signed-off-by: Pavel Emelyanov <[email protected]>

io_queue: Rename flow ratio timer to be more generic

746b08d

The timer is currently in charge of updating the flow-ratio moving average. It is going to do more "smoothing" updates, so name it accordingly. Signed-off-by: Pavel Emelyanov <[email protected]>

io_queue: Keep local variable with request execution delay

5030604

Signed-off-by: Pavel Emelyanov <[email protected]>

reactor: Add --io-completion-notify-ms option

7911006

This option controls the IO stall detector's minimal threshold. By default it is set to the maximum value, which turns the detector off. Signed-off-by: Pavel Emelyanov <[email protected]>
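A small sketch of why a maximum default acts as an off switch: no measured delay can exceed the largest representable duration, so the comparison never fires. The names `detector_off` and `is_stalled` are hypothetical, and using `std::chrono::milliseconds::max()` as the default is an assumption made for illustration.

```cpp
#include <chrono>

using ms = std::chrono::milliseconds;

// Hypothetical stand-in for the option's default value: the largest
// representable duration, which no real delay can exceed.
constexpr ms detector_off = ms::max();

// Hypothetical check: a request counts as stalled only when its delay
// is strictly greater than the configured minimal threshold.
inline bool is_stalled(ms delay, ms min_threshold = detector_off) {
    return delay > min_threshold;
}
```

An application that wants the reports opts in by passing a finite value, for example through the `--io-completion-notify-ms` option named in this commit.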

xemul force-pushed the br-io-queue-stall-detector-and-report-a branch from d0b1cc4 to 7911006 Compare August 7, 2024 07:14

avikivity merged commit 144aa53 into scylladb:master Aug 8, 2024
14 checks passed
