tests: Add perf test for shard_token_bucket #1724
Conversation
@avikivity, please review
```cpp
PERF_TEST_F(perf_capped_context, sleeping_throughput) { return test_sleeping(); }
PERF_TEST_F(perf_capped_context, yielding_throughput) { return test_yielding(); }
PERF_TEST_F(perf_capped_context, sleeping_throughput_with_hog) { return test_sleeping_with_hog(); }
```
Please add lots of comments explaining the test scenario and documenting what a good result is (and why). I'm sufficiently detached from the internals that it's hard for me to figure it out directly.
Yup, my bad. Added
The test checks whether the token-bucket "rate" is held under various circumstances:

- when shards sleep between grabbing tokens
- when shards poll the token bucket frequently
- when shards are disturbed by CPU hogs

So far the test shows four problems:

- With few shards, token deficiency produces zero sleep time, so the "good" user that sleeps between grabs effectively turns into a polling ("bad") user (fixed by scylladb#1722)
- Sometimes replenishing rounding errors accumulate and result in a lower rate than configured (fixed by scylladb#1723)
- When run with CPU hogs, the individual shards' rates may differ too much (see scylladb#1083). E.g. with the bucket configured at a rate of 100k tokens/sec, 48 shards, and a 4-second run, the "slowest" vs "fastest" shards get this many tokens:
  - no hog: 6931 ... 9631
  - with hog: 2135 ... 29412
  (the sum rate is 100k with the aforementioned fixes)
- With a "capped-release" token bucket, tokens released by timer at the configured rate, and hogs, the resulting throughput can be as low as 50% of the configured rate (see scylladb#1641)

Created token-bucket 1000000.0 t/s
perf_pure_context.sleeping_throughput_with_hog: 999149.3 t/s
perf_capped_context.sleeping_throughput: 859995.9 t/s
perf_capped_context.sleeping_throughput_with_hog: 512912.0 t/s

Signed-off-by: Pavel Emelyanov <[email protected]>
…m Pavel Emelyanov

There are three places where the IO dispatch loop is throttled:

* self-throttling with a token bucket according to the math model
* a per-shard one-tick threshold
* a 2-bucket approach where tokens are replenished only after they are released from disk

This PR removes the last one, because it leads to self-slowdown in case of reactor stalls. This back-link was introduced to catch the case when the disk suddenly slows down, so that dispatch stops and doesn't over-load it with requests. Effectively, however, this back-link measures not the real disk dispatch rate, but the disk+kernel+reactor dispatch rate. Although the "kernel" part is tiny, the reactor part can grow large, triggering the self-slowdown effect.

Here's some math. Let's assume that at some point the scheduler dispatched N_d requests. It means it was able to grab N_d tokens in a duration T_d, so the dispatch rate is R_d = N_d/T_d. The requests are to be completed by the reactor on the next tick. Let's assume it takes T_c time until the reactor gets there, and it completes N_c requests. The completion rate is thus R_c = N_c/T_c. Clearly N_c <= N_d, because the kernel cannot complete more requests than were queued. If the reactor experiences a stall during the completion tick, then T_c > T_d, and since N_c <= N_d, consequently N_d/T_d > N_c/T_c. If the reactor doesn't stall, the number of requests that will complete is N_c = N_d/T_d * T_c, because this is how the dispatch rate is defined; this is equivalent to N_c/T_c = N_d/T_d. Finally:

R_d >= R_c

i.e. the dispatch rate is equal to or greater than the completion rate, where the "equal" part is less likely and holds only if the reactor runs like clockwork and doesn't stall. The mentioned back-link enforces R_d <= R_c; coupled with stalls (even small ones), this drives R_d down each tick, which drives R_c down as well, and so on.

The removed fuse is replaced with a flow monitor based on the dispatch-to-completion rate.
Normally, the number of requests dispatched over a certain duration divided by the number of requests completed over the same duration must be 1.0; otherwise requests would accumulate in the disk. However, this ratio cannot hold at every instant, and in the longer run it tends to be slightly greater than 1.0: no matter how often the reactor polls the kernel for IO completions, it won't get back more requests than it dispatched, while even a small delay in polling makes Nr_completed / duration smaller because of the larger denominator.

Having said that, the new backlink is based on this flow ratio. When the "average" value of the dispatched/completed rate exceeds some threshold (configurable, 1.5 by default), the "cost" of individual requests increases, thus reducing the dispatch rate. The main difference from the current implementation is that the new backlink is not "immediate": the averaging is an exponential moving average filter with 100ms updates and a 0.95 smoothing factor. The current backlink is immediate in the sense that a delay in delivering a completion immediately slows down the next tick's dispatch, thus accumulating spontaneous reactor micro-stalls. This can be reproduced by the test introduced in #1724. It's not (yet) in the PR, but making the token-release loop artificially release ~1% more tokens fixes this case, which also supports the theory that the reduced completion rate is the culprit. BTW, that cannot be the fix, because the ... over-release factor is not constant and is hard to calculate.
fixes: #1641
refs: #1311
refs: #1492 (*) in fact, _this_ is the metric that correlates with the flow ratio growing above 1.0, but this metric is sort of a look at quota violation from the IO angle
refs: #1774 (this PR has attached metrics screenshots demonstrating the effect on stressed scylla)

Closes #1766

* github.com:scylladb/seastar:
  doc: Add document describing all the math behind IO scheduler
  io_queue: Add flow-rate based self slowdown backlink
  io_queue: Make main throttler uncapped
  io_queue: Add queue-wide metrics
  io_queue: Introduce "flow monitor"
  io_queue: Count total number of dispatched and completed requests so far
  io_queue: Introduce io_group::io_latency_goal()
The test checks whether the token-bucket "rate" is held under various circumstances:

- when shards sleep between grabbing tokens
- when shards poll the token bucket frequently
- when shards are disturbed by CPU hogs

So far the test shows four problems:

- With few shards, token deficiency produces zero sleep time, so the "good" user that sleeps between grabs effectively turns into a polling ("bad") user (fixed by "shared_token_bucket: Fix duration_for() underflow" #1722)
- Sometimes replenishing rounding errors accumulate and result in a lower rate than configured (fixed by "shared_token_bucket: Make duration->tokens conversion more solid" #1723)
- When run with CPU hogs, the individual shards' rates may differ too much (see "IO capacity balancing is not well balanced" #1083). E.g. with the bucket configured at a rate of 100k tokens/sec, 48 shards, and a 4-second run, the "slowest" vs "fastest" shards get this many tokens:
  - no hog: 6931 ... 9631
  - with hog: 2135 ... 29412
  (the sum rate is 100k with the aforementioned fixes)
- With a "capped-release" token bucket, tokens released by timer at the configured rate, and hogs, the resulting throughput can be as low as 30% of the configured rate (see "IO sched native throttler is probably too throttly" #1641)
Created token-bucket 1000000.0 t/s
perf_pure_context.sleeping_throughput_with_hog: 966646.1 t/s
perf_capped_context.sleeping_throughput: 838035.2 t/s
perf_capped_context.sleeping_throughput_with_hog: 317685.3 t/s