
io: optimizing the chances of large write in copy_bidirectional and copy #6532

Open · Armillus wants to merge 10 commits into master

Conversation

Armillus commented May 3, 2024

Although this PR was originally related to #6519 (comment), it now aims to improve the chances of a large write in io::copy and io::copy_bidirectional.

Motivation

tokio::io::copy_bidirectional and tokio::io::copy can, in some circumstances, miss opportunities for a large write when both the writer and the reader are pending (see here). For more information, refer to the issue linked above.

This PR aims to improve this behavior by reading more often.

Solution

Changelog

  • Try to read whenever the buffer is not full
  • Partially reset the buffer state as soon as the buffered data has been fully written (both points are sketched below)
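
For illustration, here is a minimal sketch of both points, assuming a state struct shaped like tokio's internal CopyBuffer (pos/cap indices into a fixed buffer). The field names and exact conditions are an approximation of the idea, not the actual diff:

```rust
// Simplified state mirroring tokio's internal `CopyBuffer` (assumed shape).
struct CopyState {
    buf: Box<[u8]>,
    pos: usize,      // start of data not yet written out
    cap: usize,      // end of data read so far
    read_done: bool, // reader reached EOF
}

impl CopyState {
    // Old condition: refill only once everything buffered has been written.
    fn should_read_old(&self) -> bool {
        self.pos == self.cap && !self.read_done
    }

    // New condition: refill whenever there is spare capacity, so the next
    // poll_write can hand the writer a larger contiguous chunk.
    fn should_read_new(&self) -> bool {
        self.cap < self.buf.len() && !self.read_done
    }

    // "Partially reset the buffer state as soon as the buffered data has been
    // fully written": once pos catches up with cap, both go back to zero so
    // the whole buffer is available for the next read.
    fn reset_if_drained(&mut self) {
        if self.pos == self.cap {
            self.pos = 0;
            self.cap = 0;
        }
    }
}
```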

Benchmarks

copy_mem_to_mem         time:   [131.88 µs 133.09 µs 134.48 µs]
                        change: [-4.3908% -2.6619% -1.0456%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

copy_mem_to_slow_hdd    time:   [181.26 ms 181.30 ms 181.35 ms]
                        change: [-0.0183% +0.0132% +0.0477%] (p = 0.44 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

copy_chunk_to_mem       time:   [128.11 ms 128.13 ms 128.16 ms]
                        change: [-0.0458% -0.0191% +0.0075%] (p = 0.16 > 0.05)
                        No change in performance detected.

copy_chunk_to_slow_hdd  time:   [141.29 ms 141.43 ms 141.59 ms]
                        change: [-5.1731% -4.7362% -4.2937%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) high mild
  9 (9.00%) high severe

mox692 added the A-tokio (Area: The main tokio crate) and M-io (Module: tokio/io) labels on May 4, 2024
mox692 (Member) commented May 4, 2024

Overall performance seemed to improve a bit with the new read condition

I think this change could make the benchmark worse in cases where the poll_flush takes a long time, because now the poll_write is always executed after the flush finishes in each loop.

I tested using copy benchmark:

$ cd benches && cargo bench --bench copy


copy_mem_to_slow_hdd    time:   [767.41 ms 770.31 ms 773.42 ms]
                        change: [+294.24% +298.58% +302.79%] (p = 0.00 < 0.05)
                        Performance has regressed.

copy_chunk_to_slow_hdd  time:   [789.88 ms 792.86 ms 795.96 ms]
                        change: [+425.55% +433.04% +440.10%] (p = 0.00 < 0.05)
                        Performance has regressed.

mox692 (Member) commented May 4, 2024

My question here is, in AsyncWrite, should poll_write not be executed while poll_flush is pending?

If so, I think this change is acceptable even if it worsens performance, because this PR is simply trying to do the right thing under that assumption.

Armillus (Author) commented May 4, 2024

I think this change could make the benchmark worse in cases where the poll_flush takes a long time, because now the poll_write is always executed after the flush finishes in each loop.

I tested using copy benchmark:

$ cd benches && cargo bench --bench copy


copy_mem_to_slow_hdd    time:   [767.41 ms 770.31 ms 773.42 ms]
                        change: [+294.24% +298.58% +302.79%] (p = 0.00 < 0.05)
                        Performance has regressed.

copy_chunk_to_slow_hdd  time:   [789.88 ms 792.86 ms 795.96 ms]
                        change: [+425.55% +433.04% +440.10%] (p = 0.00 < 0.05)
                        Performance has regressed.

Thank you for pointing this out, I had missed this benchmark. The huge regression was due to the fact that I tried to flush even when no flush had been issued during a pending read. I've therefore implemented a few changes that reduce this regression to something more acceptable, at the cost of an additional boolean in CopyBuffer.

The flushing futures are still completed properly with those changes.
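
For context, a rough sketch of that bookkeeping; only the flush_in_progress flag itself comes from the description above, the surrounding shape is assumed:

```rust
// Assumed shape; only `flush_in_progress` is taken from the comment above.
struct CopyState {
    // ... existing fields (buf, pos, cap, read_done, amt) ...
    flush_in_progress: bool,
}

impl CopyState {
    // Called when the reader returns Pending while data has already been
    // handed to the writer: this is the only point where a flush is started.
    fn on_read_pending_with_buffered_writes(&mut self) {
        self.flush_in_progress = true;
    }

    // The copy loop keeps polling poll_flush only while a flush is actually
    // outstanding, instead of flushing unconditionally on every pending read.
    fn should_poll_flush(&self) -> bool {
        self.flush_in_progress
    }

    // Called once poll_flush returns Ready, so a started flush is always
    // driven to completion.
    fn on_flush_complete(&mut self) {
        self.flush_in_progress = false;
    }
}
```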

copy_mem_to_mem         time:   [159.04 µs 161.96 µs 165.48 µs]
                        change: [-2.3046% +1.7540% +5.9405%] (p = 0.40 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

copy_mem_to_slow_hdd    time:   [181.27 ms 181.42 ms 181.60 ms]
                        change: [+0.0776% +0.1623% +0.2530%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  14 (14.00%) high severe

copy_chunk_to_mem       time:   [128.06 ms 128.10 ms 128.15 ms]
                        change: [-0.1713% -0.0813% -0.0033%] (p = 0.05 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

copy_chunk_to_slow_hdd  time:   [176.05 ms 176.75 ms 177.42 ms]
                        change: [+16.939% +17.675% +18.433%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) low mild

As a nice side effect, it makes the flushing condition more readable in my opinion.

Armillus (Author) commented May 4, 2024

My question here is, in AsyncWrite, should poll_write not be executed while poll_flush is pending?

That's my understanding, unless it's implementation-defined. Either way, I think it should be documented if possible.

What is certain is that the flushing future needs to be polled to completion, which was not the case before. With the new changes, I think the small performance regression is no longer a big deal (I may be wrong though).

However, note that if we consider that poll_write can be executed while poll_flush is pending, the code can likely be optimized further using the new boolean member flush_in_progress.

Darksonn (Contributor) commented May 4, 2024

I can't completely tell what's going on here, but it sounds like your AsyncWrite is incorrectly implemented. Starting a flush with poll_flush and then changing your mind to write more data with poll_write, even though your last poll_flush returned pending, is allowed.

If your AsyncWrite doesn't support that, then you may need to have your poll_write method wait for the current flush to finish before it accepts any writes.

The rules must be this way, because otherwise AsyncWriteExt::{write,flush} would not be cancel safe.
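
For illustration, here is a minimal sketch of that suggestion for an inner writer that cannot interleave the two operations. FlushFirst and flush_pending are hypothetical names, not part of tokio: poll_write simply resumes any flush that a previous poll_flush left unfinished before passing the write through.

```rust
use std::io;
use std::pin::Pin;
use std::task::{ready, Context, Poll};

use tokio::io::AsyncWrite;

// Hypothetical wrapper, not part of tokio: it resumes an interrupted flush
// before accepting new writes.
struct FlushFirst<W> {
    inner: W,
    flush_pending: bool,
}

impl<W: AsyncWrite + Unpin> AsyncWrite for FlushFirst<W> {
    fn poll_write(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &[u8],
    ) -> Poll<io::Result<usize>> {
        if self.flush_pending {
            // Finish the flush that a previous poll_flush started but never
            // completed (e.g. because the caller moved on after Pending).
            ready!(Pin::new(&mut self.inner).poll_flush(cx))?;
            self.flush_pending = false;
        }
        Pin::new(&mut self.inner).poll_write(cx, buf)
    }

    fn poll_flush(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        // Record that a flush has been started so poll_write can resume it
        // if this call returns Pending and is never polled again.
        self.flush_pending = true;
        ready!(Pin::new(&mut self.inner).poll_flush(cx))?;
        self.flush_pending = false;
        Poll::Ready(Ok(()))
    }

    fn poll_shutdown(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        Pin::new(&mut self.inner).poll_shutdown(cx)
    }
}
```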

Armillus changed the title from "io: fix of incomplete futures in copy_bidirectional and copy" to "io: optimizing the chances of large write in copy_bidirectional and copy" on May 5, 2024
Armillus (Author) commented May 5, 2024

As discussed with @Darksonn in the last comments of #6519 (comment), I've adapted this PR to slightly improve the chances of a large write in poll_copy.

The benefits are modest, but they might still be worth keeping (2 slight improvements out of 4 benchmarks, one more significant than the other):

copy_mem_to_mem         time:   [131.88 µs 133.09 µs 134.48 µs]
                        change: [-4.3908% -2.6619% -1.0456%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

copy_mem_to_slow_hdd    time:   [181.26 ms 181.30 ms 181.35 ms]
                        change: [-0.0183% +0.0132% +0.0477%] (p = 0.44 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

copy_chunk_to_mem       time:   [128.11 ms 128.13 ms 128.16 ms]
                        change: [-0.0458% -0.0191% +0.0075%] (p = 0.16 > 0.05)
                        No change in performance detected.

copy_chunk_to_slow_hdd  time:   [141.29 ms 141.43 ms 141.59 ms]
                        change: [-5.1731% -4.7362% -4.2937%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) high mild
  9 (9.00%) high severe

Although the gains on copy_mem_to_mem vary a bit from one run to another, they're pretty consistent for copy_chunk_to_slow_hdd and are usually closer to 6%.

mox692 (Member) left a comment

Looks good to me.

Darksonn (Contributor) commented May 7, 2024

I'm having a bit of a hard time understanding what's going on. Can you explain which cases your change would improve?

Armillus (Author) commented May 7, 2024

I'm having a bit of a hard time understanding what's going on. Can you explain which cases your change would improve?

Of course, sorry if I was unclear. I'm trying to optimize the cases where there is a lot of data to read and the reader is faster than the writer (even slightly).

In such cases, the writer might be pending, in which case we try to perform a poll_read. When the reader itself is pending too, poll_read will not be attempted again before the next write attempt, which can lose the chance of a larger write, because the reader might have become ready by then.

Here is a concrete example I observed during my tests. Let's imagine our buffer is 8192 bytes long; with the current implementation of the function:

  • Our filled buffer has been reset to 0 (everything has been written previously)
  • We read 77 bytes immediately
  • We then try to write, but both poll_write and poll_read are pending
  • When entering the function poll_copy again, the reader is ready, but because the current condition does not allow it, we don't poll it and we try to write again
  • poll_write returns Poll::Ready, which means we have written only 77 bytes

The goal of this PR is to rework this workflow to achieve this instead:

  • Our filled buffer has been reset to 0 (everything has been written previously)
  • We read 77 bytes immediately
  • We then try to write, but both poll_write and poll_read are pending
  • When entering the function poll_copy again, there is some space left in the buffer, so we try to read
  • poll_read is ready this time and returns; 3500 bytes have been read
  • We try to write again
  • poll_write returns Poll::Ready, which means we have written 3577 bytes, versus only 77 previously

The only drawback I can think of is that with my reworked implementation, even if there is only one free byte left in the buffer, a poll_read will still be performed, which could be inefficient even if the read is ready. The condition could be slightly modified to overcome this potential issue, for instance by checking that the free space in the buffer is at least a certain size (see the sketch below).
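
A sketch of that refinement; MIN_READ_HEADROOM and its value are purely illustrative, not something proposed in the PR:

```rust
// Purely illustrative threshold, not something proposed in the PR.
const MIN_READ_HEADROOM: usize = 512;

// Only attempt another read when the free space is worth filling, so we do
// not pay for a poll_read that can only add a handful of bytes.
fn should_read(cap: usize, buf_len: usize, read_done: bool) -> bool {
    !read_done && buf_len - cap >= MIN_READ_HEADROOM
}
```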

Note that a pending poll_read is acted upon only when the buffer is empty, to match the current behaviour. Indeed, it does not make sense to flush or to return from the function while we are trying to maximize the chances of a large write, since that could lead to a deadlock or reduce performance instead of improving it.

Although some of the benchmarks provided with Tokio are only slightly faster, I noticed a significant improvement in a project of mine. With a TcpStream as the reader, a bounded mpsc Sender as the writer (encapsulated in a lightweight abstraction), and a bit of contention, my modifications improved throughput by about 10% with "only" a gigabyte of data transferred, and it seems it could be even better with more data.

I hope it's more understandable now, otherwise I'll be happy to clarify further :)

Labels: A-tokio (Area: The main tokio crate), M-io (Module: tokio/io)
3 participants