
io: optimizing the chances of large write in copy_bidirectional and copy #6532

Open · Armillus wants to merge 10 commits into master

Conversation

Armillus commented May 3, 2024

Although this PR was originally related to #6519 (comment), it now aims to improve the chances of a large write in io::copy and io::copy_bidirectional.

Motivation

tokio::io::copy_bidirectional and tokio::io::copy can, in some circumstances, miss opportunities for a large write when both the writer and the reader are pending (see here). For more information, refer to the issue linked above.

This PR aims to improve this behavior by reading more often.

Solution

Changelog

  • Try to read whenever the buffer is not full
  • Partially reset the buffer state as soon as the buffered data has been fully written (both points are sketched below)
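
For illustration, here is a minimal sketch of both points, assuming a state struct shaped like tokio's internal CopyBuffer (pos/cap indices into a fixed buffer). The field names and exact conditions are an approximation of the idea, not the actual diff:

```rust
// Simplified state mirroring tokio's internal `CopyBuffer` (assumed shape).
struct CopyState {
    buf: Box<[u8]>,
    pos: usize,      // start of data not yet written out
    cap: usize,      // end of data read so far
    read_done: bool, // reader reached EOF
}

impl CopyState {
    // Old condition: refill only once everything buffered has been written.
    fn should_read_old(&self) -> bool {
        self.pos == self.cap && !self.read_done
    }

    // New condition: refill whenever there is spare capacity, so the next
    // poll_write can hand the writer a larger contiguous chunk.
    fn should_read_new(&self) -> bool {
        self.cap < self.buf.len() && !self.read_done
    }

    // "Partially reset the buffer state as soon as the buffered data has been
    // fully written": once pos catches up with cap, both go back to zero so
    // the whole buffer is available for the next read.
    fn reset_if_drained(&mut self) {
        if self.pos == self.cap {
            self.pos = 0;
            self.cap = 0;
        }
    }
}
```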

Benchmarks

copy_mem_to_mem         time:   [131.88 µs 133.09 µs 134.48 µs]
                        change: [-4.3908% -2.6619% -1.0456%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

copy_mem_to_slow_hdd    time:   [181.26 ms 181.30 ms 181.35 ms]
                        change: [-0.0183% +0.0132% +0.0477%] (p = 0.44 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

copy_chunk_to_mem       time:   [128.11 ms 128.13 ms 128.16 ms]
                        change: [-0.0458% -0.0191% +0.0075%] (p = 0.16 > 0.05)
                        No change in performance detected.

copy_chunk_to_slow_hdd  time:   [141.29 ms 141.43 ms 141.59 ms]
                        change: [-5.1731% -4.7362% -4.2937%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) high mild
  9 (9.00%) high severe

mox692 added the A-tokio (Area: The main tokio crate) and M-io (Module: tokio/io) labels on May 4, 2024
mox692 (Member) commented May 4, 2024

Overall performance seemed to improve a bit with the new read condition

I think this change could make the benchmark worse in cases where the poll_flush takes a long time, because now the poll_write is always executed after the flush finishes in each loop.

I tested using copy benchmark:

$ cd benches && cargo bench --bench copy


copy_mem_to_slow_hdd    time:   [767.41 ms 770.31 ms 773.42 ms]
                        change: [+294.24% +298.58% +302.79%] (p = 0.00 < 0.05)
                        Performance has regressed.

copy_chunk_to_slow_hdd  time:   [789.88 ms 792.86 ms 795.96 ms]
                        change: [+425.55% +433.04% +440.10%] (p = 0.00 < 0.05)
                        Performance has regressed.

mox692 (Member) commented May 4, 2024

My question here is, in AsyncWrite, should poll_write not be executed while poll_flush is pending?

If so, I think this change is acceptable even if it worsens performance, because this PR is simply trying to do the right thing under that assumption.

Armillus (Author) commented May 4, 2024

I think this change could make the benchmark worse in cases where the poll_flush takes a long time, because now the poll_write is always executed after the flush finishes in each loop.

I tested using copy benchmark:

$ cd benches && cargo bench --bench copy


copy_mem_to_slow_hdd    time:   [767.41 ms 770.31 ms 773.42 ms]
                        change: [+294.24% +298.58% +302.79%] (p = 0.00 < 0.05)
                        Performance has regressed.

copy_chunk_to_slow_hdd  time:   [789.88 ms 792.86 ms 795.96 ms]
                        change: [+425.55% +433.04% +440.10%] (p = 0.00 < 0.05)
                        Performance has regressed.

Thank you for pointing this out, I had missed this benchmark. The huge regression was due to the fact that I tried to flush even when no flush had been issued during a pending read. I've therefore implemented a few changes that reduce this regression to something more acceptable, at the cost of an additional boolean in CopyBuffer.

The flushing futures are still completed properly with those changes.
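
For context, a rough sketch of that bookkeeping; only the flush_in_progress flag itself comes from the description above, the surrounding shape is assumed:

```rust
// Assumed shape; only `flush_in_progress` is taken from the comment above.
struct CopyState {
    // ... existing fields (buf, pos, cap, read_done, amt) ...
    flush_in_progress: bool,
}

impl CopyState {
    // Called when the reader returns Pending while data has already been
    // handed to the writer: this is the only point where a flush is started.
    fn on_read_pending_with_buffered_writes(&mut self) {
        self.flush_in_progress = true;
    }

    // The copy loop keeps polling poll_flush only while a flush is actually
    // outstanding, instead of flushing unconditionally on every pending read.
    fn should_poll_flush(&self) -> bool {
        self.flush_in_progress
    }

    // Called once poll_flush returns Ready, so a started flush is always
    // driven to completion.
    fn on_flush_complete(&mut self) {
        self.flush_in_progress = false;
    }
}
```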

copy_mem_to_mem         time:   [159.04 µs 161.96 µs 165.48 µs]
                        change: [-2.3046% +1.7540% +5.9405%] (p = 0.40 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

copy_mem_to_slow_hdd    time:   [181.27 ms 181.42 ms 181.60 ms]
                        change: [+0.0776% +0.1623% +0.2530%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  14 (14.00%) high severe

copy_chunk_to_mem       time:   [128.06 ms 128.10 ms 128.15 ms]
                        change: [-0.1713% -0.0813% -0.0033%] (p = 0.05 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

copy_chunk_to_slow_hdd  time:   [176.05 ms 176.75 ms 177.42 ms]
                        change: [+16.939% +17.675% +18.433%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) low mild

As a nice side effect, it makes the flushing condition more readable in my opinion.

Armillus (Author) commented May 4, 2024

My question here is, in AsyncWrite, should poll_write not be executed while poll_flush is pending?

That's my understanding, unless it's implementation-defined. Either way, I think it should be documented if possible.

What is certain is that the flushing future needs to be polled to completion, which was not the case before. With the new changes, I think the small performance regression is no longer a big deal (I may be wrong though).

However, note that if we consider that poll_write can be executed while poll_flush is pending, the code can likely be optimized further using the new boolean member flush_in_progress.

Darksonn (Contributor) commented May 4, 2024

I can't completely tell what's going on here, but it sounds like your AsyncWrite is incorrectly implemented. Starting a flush with poll_flush and then changing your mind to write more data with poll_write, even though your last poll_flush returned pending, is allowed.

If your AsyncWrite doesn't support that, then you may need to have your poll_write method wait for the current flush to finish before it accepts any writes.

The rules must be this way, because otherwise AsyncWriteExt::{write,flush} would not be cancel safe.
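
For illustration, here is a minimal sketch of that suggestion for an inner writer that cannot interleave the two operations. FlushFirst and flush_pending are hypothetical names, not part of tokio: poll_write simply resumes any flush that a previous poll_flush left unfinished before passing the write through.

```rust
use std::io;
use std::pin::Pin;
use std::task::{ready, Context, Poll};

use tokio::io::AsyncWrite;

// Hypothetical wrapper, not part of tokio: it resumes an interrupted flush
// before accepting new writes.
struct FlushFirst<W> {
    inner: W,
    flush_pending: bool,
}

impl<W: AsyncWrite + Unpin> AsyncWrite for FlushFirst<W> {
    fn poll_write(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &[u8],
    ) -> Poll<io::Result<usize>> {
        if self.flush_pending {
            // Finish the flush that a previous poll_flush started but never
            // completed (e.g. because the caller moved on after Pending).
            ready!(Pin::new(&mut self.inner).poll_flush(cx))?;
            self.flush_pending = false;
        }
        Pin::new(&mut self.inner).poll_write(cx, buf)
    }

    fn poll_flush(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        // Record that a flush has been started so poll_write can resume it
        // if this call returns Pending and is never polled again.
        self.flush_pending = true;
        ready!(Pin::new(&mut self.inner).poll_flush(cx))?;
        self.flush_pending = false;
        Poll::Ready(Ok(()))
    }

    fn poll_shutdown(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        Pin::new(&mut self.inner).poll_shutdown(cx)
    }
}
```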

Armillus changed the title from "io: fix of incomplete futures in copy_bidirectional and copy" to "io: optimizing the chances of large write in copy_bidirectional and copy" on May 5, 2024
Armillus (Author) commented May 5, 2024

As discussed with @Darksonn in the last comments of #6519 (comment), I've adapted this PR to slightly improve the chances of a large write in poll_copy.

The benefits are modest, but they might still be worth keeping (2 slight improvements out of 4 benchmarks, one more significant than the other):

copy_mem_to_mem         time:   [131.88 µs 133.09 µs 134.48 µs]
                        change: [-4.3908% -2.6619% -1.0456%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

copy_mem_to_slow_hdd    time:   [181.26 ms 181.30 ms 181.35 ms]
                        change: [-0.0183% +0.0132% +0.0477%] (p = 0.44 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

copy_chunk_to_mem       time:   [128.11 ms 128.13 ms 128.16 ms]
                        change: [-0.0458% -0.0191% +0.0075%] (p = 0.16 > 0.05)
                        No change in performance detected.

copy_chunk_to_slow_hdd  time:   [141.29 ms 141.43 ms 141.59 ms]
                        change: [-5.1731% -4.7362% -4.2937%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) high mild
  9 (9.00%) high severe

Although the gains on copy_mem_to_mem vary a bit from one run to another, they're pretty consistent for copy_chunk_to_slow_hdd and are usually closer to 6%.

mox692 (Member) left a comment

Looks good to me.

Darksonn (Contributor) commented May 7, 2024

I'm having a bit of a hard time understanding what's going on. Can you explain which cases your change would improve?

Armillus (Author) commented May 7, 2024

I'm having a bit of a hard time understanding what's going on. Can you explain which cases your change would improve?

Of course, sorry if I was unclear. I'm trying to optimize the cases where there is a lot of data to read and the reader is faster than the writer (even slightly).

In such cases, the writer might be pending, in which case we try to perform a poll_read. When the reader itself is pending too, poll_read will not be attempted again before the next write attempt, which can lose the chance of a larger write, because the reader might have become ready by then.

Here is a concrete example I observed during my tests. Let's imagine our buffer is 8192 bytes long; with the current implementation of the function:

  • Our filled buffer has been reset to 0 (everything has been written previously)
  • We read 77 bytes immediately
  • We then try to write, but both poll_write and poll_read are pending
  • When entering the function poll_copy again, the reader is ready, but because the current condition does not allow it, we don't poll it and we try to write again
  • poll_write returns Poll::Ready, which means we have written only 77 bytes

The goal of this PR is to rework this workflow to achieve this instead:

  • Our filled buffer has been reset to 0 (everything has been written previously)
  • We read 77 bytes immediately
  • We then try to write, but both poll_write and poll_read are pending
  • When entering the function poll_copy again, there is some space left in the buffer, so we try to read
  • poll_read is ready this time and returns; 3500 bytes have been read
  • We try to write again
  • poll_write returns Poll::Ready, which means we have written 3577 bytes, versus only 77 previously

The only drawback I can think of is that with my reworked implementation, even if there is only one free byte left in the buffer, a poll_read will still be performed, which could be inefficient even if the read is ready. The condition could be slightly modified to overcome this potential issue, for instance by checking that the free space in the buffer is at least a certain size (see the sketch below).
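
A sketch of that refinement; MIN_READ_HEADROOM and its value are purely illustrative, not something proposed in the PR:

```rust
// Purely illustrative threshold, not something proposed in the PR.
const MIN_READ_HEADROOM: usize = 512;

// Only attempt another read when the free space is worth filling, so we do
// not pay for a poll_read that can only add a handful of bytes.
fn should_read(cap: usize, buf_len: usize, read_done: bool) -> bool {
    !read_done && buf_len - cap >= MIN_READ_HEADROOM
}
```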

Note that a pending poll_read is acted upon only when the buffer is empty, to match the current behaviour. Indeed, it does not make sense to flush or to return from the function while we are trying to maximize the chances of a large write, since that could lead to a deadlock or reduce performance instead of improving it.

Although some of the benchmarks provided with Tokio are only slightly faster, I noticed a significant improvement in a project of mine. With a TcpStream as the reader, a bounded mpsc Sender as the writer (encapsulated in a lightweight abstraction), and a bit of contention, my modifications improved throughput by about 10% with "only" a gigabyte of data transferred, and it seems it could be even better with more data.

I hope it's more understandable now, otherwise I'll be happy to clarify further :)

Labels: A-tokio (Area: The main tokio crate), M-io (Module: tokio/io)
3 participants