IO capacity balancing is not well balanced #1083

Open
xemul opened this issue Jun 1, 2022 · 10 comments · May be fixed by #2294

@xemul
Contributor

xemul commented Jun 1, 2022

As seen on an i3.large node (2 cores) in scylladb/scylladb#10704

@vladzcloudius
Contributor

vladzcloudius commented Feb 6, 2023

@mykaul @avikivity this is "field-urgent".

In gist, what happens is that the I/O queue on a specific shard ("shard A") can get long due to a temporary overload of that shard (something heavy was going on, like a memtable flush or a stall). Even after that overload is gone, if other shards also do I/O, e.g. compactions, the long I/O queue on shard A stays long the whole time, because the new I/O scheduler allocates an equal amount of I/O budget to every shard that needs to do I/O at the moment.

As a result, that long I/O queue causes high I/O queue latency, which translates into high read latency.

This is a regression compared to the old I/O scheduler (2021.1) behavior in the same situation.

This means that the new I/O scheduler solved some problems but created new ones.

We need to give the fix for this issue the highest priority.
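
To make the effect concrete, here is a minimal toy model of the behavior described above. It is not Seastar code; the shard count, per-tick capacity, and queue numbers are made up. Each shard gets capacity/smp::count tokens per tick, so a shard that picked up a spike never drains it as long as its steady arrivals already consume its whole fair share.

```
// Toy model of per-tick token distribution across shards (illustrative only,
// not Seastar code). The scheduler hands every shard capacity/smp::count
// tokens per tick, so a shard that picked up a backlog never drains it while
// its steady arrivals already consume its whole fair share.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const int nr_shards = 12;
    const int disk_capacity = 120;                    // requests the disk serves per tick
    const int fair_share = disk_capacity / nr_shards; // 10 requests per shard per tick

    std::vector<int> queue(nr_shards, 0);
    queue[0] = 500;                                   // shard 0 took a spike (e.g. after a stall)

    for (int tick = 0; tick < 1000; ++tick) {
        for (int s = 0; s < nr_shards; ++s) {
            queue[s] += fair_share;                     // steady demand on every shard...
            queue[s] -= std::min(queue[s], fair_share); // ...but each may dispatch only its fair share
        }
    }
    // Prints 500: the spike on shard 0 is still there after 1000 ticks.
    std::printf("shard 0 backlog: %d requests\n", queue[0]);
}
```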

@vladzcloudius
Contributor

vladzcloudius commented Feb 6, 2023

@tomer-sandler @dorlaor @harel-z FYI

@scylladb deleted a comment from tomer-sandler Feb 7, 2023
@avikivity
Member

(deleted comment mentioning customers)

avikivity added a commit that referenced this issue Feb 28, 2023
The jobs from io-tester's config all live in their own sched groups and io classes. This is not very flexible; add the ability to share sched classes, as was done for the RPC tester in 4d0ddc4 (rpc_tester: Allow sharing sched groups)

refs: #1083

Closes #1481

* github.com:scylladb/seastar:
  io_tester: Add option to share classes between jobs
  io_tester: Post-assign sched classes
  io_tester: Register io-class early
@dorlaor
Contributor

dorlaor commented Feb 28, 2023

How complex is it to fix this?

@xemul
Contributor Author

xemul commented Feb 28, 2023

I have a patch, but it has two problems:

  • it's not confirmed on anything but an io-tester-based reproducer
  • it has a "configuration parameter" that has to be configured by hand, and there are no good ideas (yet) on how to select it automatically

@avikivity
Member

Please post it; we can use it as a base for brainstorming.

xemul added a commit to xemul/seastar that referenced this issue Feb 28, 2023
The natural lack of cross-shard fairness may lead to a nasty imbalance
problem. When a shard gets lots of requests queued (a spike) it will
try to drain its queue by dispatching requests on every tick. However,
if all other shards have something to do so that the disk capacity is
close to being exhausted, this overloaded shard will have little chance to
drain itself because every tick it will only get its "fair" amount of
capacity tokens, which is capacity/smp::count and that's it.

In order to drain the overloaded queue a shard should get more capacity
tokens than other shards. This will increase the pressure on other
shards, of course, "spreading" one shard's queue among others and thus
reducing the average latency of requests. When increasing the amount of
grabbed tokens there are two pitfalls to avoid.

Both come from the fact that under the described circumstances shared
capacity is likely all exhausted and shards are "fighting" for tokens in
the "pending" state -- i.e. when they line up in the shared token bucket
for _future_ tokens, that will get there eventually as requests
complete. So...

1. If the capacity is all claimed by shards and shards continue to claim
   more, they will end-up in the "pending" state, which is -- they grab
   extra tokens from the shared capacity and "remember" their position
   in the shared queue when they are to get it. Thus, if an urgent
   request arrives at random shard in the worst case it will have to
   wait for this whole over-claimed line before it can get dispatched.
   Currently, the maximum length of the over-claimed queue is limited to
   one request per shard, which eventually equals the io-latency-goal.
   Claiming _more_ than that would exceed this time by the amount of
   over-claimed tokens, so it shouldn't be too large.

2. When increasing the pressure on the shared capacity, a shard has no
   idea whether any other shard does the same. This means that a shard
   should try to avoid increasing the pressure "just because"; there should
   be some yes-no reason for doing it, so that only "overloaded" shards try
   to grab more. If all shards suddenly get into this aggressive state,
   they will compensate each other, but according to p.1 the worst-case
   preemption latency would grow too high.

With the above two assumptions at hand, the proposed solution is to

a. Over-claim at most one (1) request from the local queue
b. Start over-claiming once the local queue length goes above some
   threshold, and apply hysteresis on exiting this state to avoid
   resonance.

The thresholds in this patch are pretty much arbitrary -- 12 and 8 -- and
that's its biggest problem.
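
A minimal sketch of the hysteresis described in (b), using a hypothetical per-shard state object; the 12/8 values are just the placeholder thresholds mentioned above, and this is not the actual fair_queue code:

```
// Hysteresis on the local queue length (an illustrative sketch, not the
// actual fair_queue code). A shard enters the "over-claim" state when its
// queue grows past an upper threshold and leaves it only when the queue
// falls below a lower one, so it does not flap around a single boundary.
struct overclaim_state {
    static constexpr unsigned enter_threshold = 12;  // placeholder values from the patch
    static constexpr unsigned exit_threshold = 8;
    bool aggressive = false;

    // How many extra requests' worth of tokens to claim this tick:
    // at most one, and only while the shard counts as overloaded.
    unsigned extra_claims(unsigned queue_length) {
        if (!aggressive && queue_length > enter_threshold) {
            aggressive = true;
        } else if (aggressive && queue_length < exit_threshold) {
            aggressive = false;
        }
        return aggressive ? 1 : 0;
    }
};
```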

The issue can be reproduced with the help of recent io-tester over a
/dev/null storage :)

The io-properties.yaml:
```
disks:
  - mountpoint: /dev/null
    read_iops: 1200
    read_bandwidth: 1GB
    write_iops: 1200
    write_bandwidth: 1GB
```

The jobs conf.yaml:
```
- name: latency_reads_1
  shards: all
  type: randread
  data_size: 1GB
  shard_info:
    parallelism: 80
    rps: 1
    reqsize: 512
    shares: 1000

- name: latency_reads_1a
  shards: [0]
  type: randread
  data_size: 1GB
  shard_info:
    parallelism: 10
    limit: 100
    reqsize: 512
    class: latency_reads_1
```

Running it with 1 io group and 12 shards would result in shard 0
suffering from a never-draining queue and huge final latencies:

    shard p99 latency (usec)
       0: 1208561
       1: 14520
       2: 17456
       3: 15777
       4: 15488
       5: 14576
       6: 19251
       7: 20222
       8: 18338
       9: 21267
      10: 17083
      11: 16188

With this patch applied, shard-0 scatters its queue among the other
shards within several ticks, lowering its latency at the cost of the
other shards' latencies:

    shard p99 latency (usec)
       0: 108345
       1: 102907
       2: 106900
       3: 105244
       4: 109214
       5: 107881
       6: 114278
       7: 114289
       8: 113560
       9: 105411
      10: 113898
      11: 112615

However, the longer the test runs, the smaller the latencies become for
the 2nd test (and for the 1st too, but not for shard-0).

refs: scylladb#1083

Signed-off-by: Pavel Emelyanov <[email protected]>
xemul added a commit to xemul/seastar that referenced this issue Mar 1, 2023
The natural lack of cross-shard fairness may lead to a nasty imbalance
problem. When a shard gets lots of requests queued (a spike) it will
try to drain its queue by dispatching requests on every tick. However,
if all other shards have something to do so that the disk capacity is
close to being exhausted, this overloaded shard will have little chance to
drain itself because every tick it will only get its "fair" amount of
capacity tokens, which is capacity/smp::count and that's it.

In order to drain the overloaded queue a shard should get more capacity
tokens than other shards. This will increase the pressure on other
shards, of course, "spreading" one shard's queue among others and thus
reducing the average latency of requests. When increasing the amount of
grabbed tokens there are two pitfalls to avoid.

Both come from the fact that under the described circumstances shared
capacity is likely all exhausted and shards are "fighting" for tokens in
the "pending" state -- i.e. when they line up in the shared token bucket
for _future_ tokens, that will get there eventually as requests
complete. So...

1. If the capacity is all claimed by shards and shards continue to claim
   more, they will end-up in the "pending" state, which is -- they grab
   extra tokens from the shared capacity and "remember" their position
   in the shared queue when they are to get it. Thus, if an urgent
   request arrives at random shard in the worst case it will have to
   wait for this whole over-claimed line before it can get dispatched.
   Currently, the maximum length of the over-claimed queue is limited to
   one request per shard, which eventually equals the io-latency-goal.
   Claiming _more_ than that would exceed this time by the amount of
   over-claimed tokens, so it shouldn't be too large.

2. When increasing the pressure on the shared capacity, a shard has no
   idea whether any other shard does the same. This means that a shard
   should try to avoid increasing the pressure "just because"; there should
   be some yes-no reason for doing it, so that only "overloaded" shards try
   to grab more. If all shards suddenly get into this aggressive state,
   they will compensate each other, but according to p.1 the worst-case
   preemption latency would grow too high.

With the above two assumptions at hand, the proposed solution is to
introduce a per-class capacity-claim measure which grows monotonically
with the class queue length and is proportional to the class shares.

a. Over-claim at most one (1) request from the local queue

b. Start over-claiming once the capacity claim goes above some threshold,
   and apply hysteresis on exiting this state to avoid resonance.

The capacity claim is deliberately selected to grow faster for high-prio
queues with short requests (scylla query class) and grow much slower for
low-prio queues with fat requests (scylla compaction/flush classes). So
it doesn't care about request lengths, but depends on the shares value.

Also, since several classes may fluctuate around the claim thresholds,
over-claiming kicks in when there's at least one class of that kind.

The thresholds in this patch are pretty much arbitrary -- 12000 and 8000 --
and that's its biggest problem.
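
A sketch of how such a capacity claim could look, assuming the claim is simply shares multiplied by the queued request count; the message does not spell out the exact formula, so treat this only as an illustration of the intent, with the 12000/8000 placeholder thresholds:

```
// Illustrative sketch of a per-class capacity claim (assumed here to be
// shares * queued requests; the commit does not give the exact formula).
// A high-share class with short requests (e.g. a query class at 1000 shares)
// crosses the upper threshold after ~12 queued requests, while a low-share
// class (e.g. compaction at 100 shares) needs ~120 of them.
#include <algorithm>
#include <vector>

struct io_class_state {
    unsigned shares;
    unsigned queued;   // requests currently waiting in this class

    unsigned long capacity_claim() const {
        return static_cast<unsigned long>(shares) * queued;
    }
};

constexpr unsigned long claim_enter = 12000;   // placeholder thresholds from the patch
constexpr unsigned long claim_exit = 8000;

// A shard starts over-claiming if at least one of its classes crosses the
// upper threshold and keeps doing so until all of them drop below the lower
// one (the hysteresis mentioned in point (b) above).
bool should_overclaim(const std::vector<io_class_state>& classes, bool aggressive) {
    unsigned long max_claim = 0;
    for (const auto& c : classes) {
        max_claim = std::max(max_claim, c.capacity_claim());
    }
    return aggressive ? (max_claim >= claim_exit) : (max_claim > claim_enter);
}
```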

The issue can be reproduced with the help of recent io-tester over a
/dev/null storage :)

The io-properties.yaml:
```
disks:
  - mountpoint: /dev/null
    read_iops: 1200
    read_bandwidth: 1GB
    write_iops: 1200
    write_bandwidth: 1GB
```

The jobs conf.yaml:
```
- name: latency_reads_1
  shards: all
  type: randread
  data_size: 1GB
  shard_info:
    parallelism: 80
    rps: 1
    reqsize: 512
    shares: 1000

- name: latency_reads_1a
  shards: [0]
  type: randread
  data_size: 1GB
  shard_info:
    parallelism: 10
    limit: 100
    reqsize: 512
    class: latency_reads_1
```

Running it with 1 io group and 12 shards would result in shard 0
suffering from a never-draining queue and huge final latencies:

    shard p99 latency (usec)
       0: 1208561
       1: 14520
       2: 17456
       3: 15777
       4: 15488
       5: 14576
       6: 19251
       7: 20222
       8: 18338
       9: 21267
      10: 17083
      11: 16188

With this patch applied, shard-0 scatters its queue among the other
shards within several ticks, lowering its latency at the cost of the
other shards' latencies:

    shard p99 latency (usec)
       0: 108345
       1: 102907
       2: 106900
       3: 105244
       4: 109214
       5: 107881
       6: 114278
       7: 114289
       8: 113560
       9: 105411
      10: 113898
      11: 112615

However, the longer the test runs, the smaller the latencies become for
the 2nd test (and for the 1st too, but not for shard-0).

refs: scylladb#1083

Signed-off-by: Pavel Emelyanov <[email protected]>
xemul added a commit to xemul/seastar that referenced this issue Jul 6, 2023
The test checks if the token-bucket "rate" is held under various
circumstances:

- when shards sleep between grabbing tokens
- when shards poll the t.b. frequently
- when shards are disturbed with CPU hogs

So far the test shows three problems:

- With few shards, token deficiency produces zero sleep time, so the
  "good" user that sleeps between grabs effectively converts into a
  polling ("bad") user (fixed by scylladb#1722)

- Sometimes replenishing rounding errors accumulate and render a lower
  resulting rate than configured (fixed by scylladb#1723)

- When run with CPU hogs, individual shards' rates may differ too
  much (see scylladb#1083). E.g. the bucket configured with the rate of 100k
  tokens/sec, 48 shards, run 4 seconds.

  "Slowest" shard vs "fastest" shards get this amount of tokens:

    no hog:   6931 ... 9631
    with hog: 2135 ... 29412

  (sum rate is 100k with the aforementioned fixes)

Signed-off-by: Pavel Emelyanov <[email protected]>
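
For reference, a standalone sketch of the kind of rate check this test performs, using plain std::thread workers in place of shards and a simple atomic counter in place of Seastar's shared token bucket; all names and numbers below are illustrative:

```
// Illustrative rate check for a shared token bucket (not Seastar's
// shared_token_bucket): replenish at a fixed rate, let several "shards" grab
// tokens concurrently, then compare the per-shard and total counts against
// rate * duration.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    using clock = std::chrono::steady_clock;
    const double rate = 100000.0;                 // tokens per second
    const int nr_shards = 4;
    const auto duration = std::chrono::seconds(2);

    std::atomic<long> bucket{0};
    std::atomic<bool> stop{false};
    std::vector<long> grabbed(nr_shards, 0);

    // Replenisher: top the bucket up according to elapsed time, like a timer.
    std::thread replenisher([&] {
        const auto start = clock::now();
        long added = 0;
        while (!stop) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
            double elapsed = std::chrono::duration<double>(clock::now() - start).count();
            long target = static_cast<long>(elapsed * rate);
            bucket += target - added;
            added = target;
        }
    });

    std::vector<std::thread> shards;
    for (int s = 0; s < nr_shards; ++s) {
        shards.emplace_back([&, s] {
            while (!stop) {
                long have = bucket.load();
                // Grab one token if available; CAS keeps the accounting exact.
                if (have > 0 && bucket.compare_exchange_weak(have, have - 1)) {
                    ++grabbed[s];
                }
            }
        });
    }

    std::this_thread::sleep_for(duration);
    stop = true;
    replenisher.join();
    for (auto& t : shards) {
        t.join();
    }

    long total = 0;
    for (int s = 0; s < nr_shards; ++s) {
        std::printf("shard %d grabbed %ld tokens\n", s, grabbed[s]);
        total += grabbed[s];
    }
    // The total should stay close to rate * duration; the per-shard split is
    // what becomes very uneven once some shards are disturbed by CPU hogs.
    double expected = rate * std::chrono::duration<double>(duration).count();
    std::printf("total: %ld (expected ~%.0f)\n", total, expected);
}
```
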
xemul added a commit to xemul/seastar that referenced this issue Jul 29, 2023
The test checks if the token-bucket "rate" is held under various
circumstances:

- when shards sleep between grabbing tokens
- when shards poll the t.b. frequently
- when shards are disturbed with CPU hogs

So far the test shows four problems:

- With few shards, token deficiency produces zero sleep time, so the
  "good" user that sleeps between grabs effectively converts into a
  polling ("bad") user (fixed by scylladb#1722)

- Sometimes replenishing rounding errors accumulate and render a lower
  resulting rate than configured (fixed by scylladb#1723)

- When run with CPU hogs, individual shards' rates may differ too
  much (see scylladb#1083). E.g. the bucket configured with the rate of 100k
  tokens/sec, 48 shards, run 4 seconds.

  "Slowest" shard vs "fastest" shards get this amount of tokens:

    no hog:   6931 ... 9631
    with hog: 2135 ... 29412

  (sum rate is 100k with the aforementioned fixes)

- With "capped-release" token bucket and token releasing by-timer with
  the configured rate and hogs the resulting throughput can be as low as
  30% of the configured (see scylladb#1641)

  Created token-bucket 1000000.0 t/s
  perf_pure_context.sleeping_throughput_with_hog:   966646.1 t/s
  perf_capped_context.sleeping_throughput:          838035.2 t/s
  perf_capped_context.sleeping_throughput_with_hog: 317685.3 t/s

Signed-off-by: Pavel Emelyanov <[email protected]>
xemul added a commit to xemul/seastar that referenced this issue Jul 29, 2023
The test checks if the token-bucket "rate" is held under various
circumstances:

- when shards sleep between grabbing tokens
- when shards poll the t.b. frequently
- when shards are disturbed with CPU hogs

So far the test shows four problems:

- With few shards, token deficiency produces zero sleep time, so the
  "good" user that sleeps between grabs effectively converts into a
  polling ("bad") user (fixed by scylladb#1722)

- Sometimes replenishing rounding errors accumulate and render a lower
  resulting rate than configured (fixed by scylladb#1723)

- When run with CPU hogs, individual shards' rates may differ too
  much (see scylladb#1083). E.g. the bucket configured with the rate of 100k
  tokens/sec, 48 shards, run 4 seconds.

  "Slowest" shard vs "fastest" shards get this amount of tokens:

    no hog:   6931 ... 9631
    with hog: 2135 ... 29412

  (sum rate is 100k with the aforementioned fixes)

- With "capped-release" token bucket and token releasing by-timer with
  the configured rate and hogs the resulting throughput can be as low as
  30% of the configured (see scylladb#1641)

  Created token-bucket 1000000.0 t/s
  perf_pure_context.sleeping_throughput_with_hog:   966646.1 t/s
  perf_capped_context.sleeping_throughput:          838035.2 t/s
  perf_capped_context.sleeping_throughput_with_hog: 317685.3 t/s

Signed-off-by: Pavel Emelyanov <[email protected]>
xemul added a commit to xemul/seastar that referenced this issue Jul 31, 2023
The test checks if the token-bucket "rate" is held under various
circumstances:

- when shards sleep between grabbing tokens
- when shards poll the t.b. frequently
- when shards are disturbed with CPU hogs

So far the test shows four problems:

- With few shards, token deficiency produces zero sleep time, so the
  "good" user that sleeps between grabs effectively converts into a
  polling ("bad") user (fixed by scylladb#1722)

- Sometimes replenishing rounding errors accumulate and render a lower
  resulting rate than configured (fixed by scylladb#1723)

- When run with CPU hogs, individual shards' rates may differ too
  much (see scylladb#1083). E.g. the bucket configured with the rate of 100k
  tokens/sec, 48 shards, run 4 seconds.

  "Slowest" shard vs "fastest" shards get this amount of tokens:

    no hog:   6931 ... 9631
    with hog: 2135 ... 29412

  (sum rate is 100k with the aforementioned fixes)

- With "capped-release" token bucket and token releasing by-timer with
  the configured rate and hogs the resulting throughput can be as low as
  50% of the configured (see scylladb#1641)

  Created token-bucket 1000000.0 t/s
  perf_pure_context.sleeping_throughput_with_hog:   999149.3 t/s
  perf_capped_context.sleeping_throughput:          859995.9 t/s
  perf_capped_context.sleeping_throughput_with_hog: 512912.0 t/s

Signed-off-by: Pavel Emelyanov <[email protected]>
xemul added a commit to xemul/seastar that referenced this issue Jul 31, 2023
The test checks if the token-bucket "rate" is held under various
circumstances:

- when shards sleep between grabbing tokens
- when shards poll the t.b. frequently
- when shards are disturbed with CPU hogs

So far the test shows four problems:

- With few shards, token deficiency produces zero sleep time, so the
  "good" user that sleeps between grabs effectively converts into a
  polling ("bad") user (fixed by scylladb#1722)

- Sometimes replenishing rounding errors accumulate and render a lower
  resulting rate than configured (fixed by scylladb#1723)

- When run with CPU hogs, individual shards' rates may differ too
  much (see scylladb#1083). E.g. the bucket configured with the rate of 100k
  tokens/sec, 48 shards, run 4 seconds.

  "Slowest" shard vs "fastest" shards get this amount of tokens:

    no hog:   6931 ... 9631
    with hog: 2135 ... 29412

  (sum rate is 100k with the aforementioned fixes)

- With "capped-release" token bucket and token releasing by-timer with
  the configured rate and hogs the resulting throughput can be as low as
  50% of the configured (see scylladb#1641)

  Created token-bucket 1000000.0 t/s
  perf_pure_context.sleeping_throughput_with_hog:   999149.3 t/s
  perf_capped_context.sleeping_throughput:          859995.9 t/s
  perf_capped_context.sleeping_throughput_with_hog: 512912.0 t/s

Signed-off-by: Pavel Emelyanov <[email protected]>
avikivity pushed a commit that referenced this issue Aug 7, 2023
The test checks if the token-bucket "rate" is held under various
circumstances:

- when shards sleep between grabbing tokens
- when shards poll the t.b. frequently
- when shards are disturbed with CPU hogs

So far the test shows four problems:

- With few shards, token deficiency produces zero sleep time, so the
  "good" user that sleeps between grabs effectively converts into a
  polling ("bad") user (fixed by #1722)

- Sometimes replenishing rounding errors accumulate and render a lower
  resulting rate than configured (fixed by #1723)

- When run with CPU hogs, individual shards' rates may differ too
  much (see #1083). E.g. the bucket configured with the rate of 100k
  tokens/sec, 48 shards, run 4 seconds.

  "Slowest" shard vs "fastest" shards get this amount of tokens:

    no hog:   6931 ... 9631
    with hog: 2135 ... 29412

  (sum rate is 100k with the aforementioned fixes)

- With "capped-release" token bucket and token releasing by-timer with
  the configured rate and hogs the resulting throughput can be as low as
  50% of the configured (see #1641)

  Created token-bucket 1000000.0 t/s
  perf_pure_context.sleeping_throughput_with_hog:   999149.3 t/s
  perf_capped_context.sleeping_throughput:          859995.9 t/s
  perf_capped_context.sleeping_throughput_with_hog: 512912.0 t/s

Signed-off-by: Pavel Emelyanov <[email protected]>
@mykaul

mykaul commented Oct 22, 2023

What's the latest status of this issue? (seeing if it'll make it to Scylla 5.4)

@xemul
Contributor Author

xemul commented Oct 23, 2023

It's at lower priority, because there's a "workaround" -- one needs to configure more io-groups than seastar auto-detects, to make the groups smaller and thus reduce the per-group imbalance. Avi thinks this should be the default behavior.

@bhalevy
Member

bhalevy commented Nov 5, 2023

It's at lower priority, because there's a "workaround" -- one needs to configure more io-groups than seastar auto-detects, to make the groups smaller and thus reduce the per-group imbalance. Avi thinks this should be the default behavior.

@xemul so are we making this the default behavior?

@xemul
Contributor Author

xemul commented Nov 6, 2023

@xemul so are we making this the default behavior?

I lean towards it, but I have no good ideas on how to calculate/estimate what number of shards per group is good enough.
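
For illustration only, one conceivable heuristic (not something Seastar implements) would be to cap how many shards may share a single I/O group and derive the group count from that; the cap of 8 below is an arbitrary number:

```
// Hypothetical heuristic (not what Seastar does): cap how many shards may
// share one token bucket and derive the group count from that, so each
// shard's fair share of its group stays reasonably large.
unsigned pick_io_groups(unsigned nr_shards, unsigned max_shards_per_group = 8) {
    return (nr_shards + max_shards_per_group - 1) / max_shards_per_group;
}
// e.g. 48 shards with a cap of 8 -> 6 groups, so a backlogged shard competes
// with 7 neighbours for its group's tokens instead of 47.
```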
