Sentmap refactorisation #314

Open
wants to merge 3 commits into master

Conversation

AloysAugustin

Hi Kazuho,

I looked at the performance of quicly yesterday and noticed a bottleneck in the sentmap at high throughputs. This PR proposes a new sentmap structure based on a doubly-linked list (dlist) of packets, which improves throughput by 5-7% in our tests.
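To make the idea concrete, a dlist-based sentmap along these lines might look roughly as follows; the names and layout below are invented for illustration and do not match the actual quicly or PR sources. Each sent packet gets its own node, so packets can be appended and later removed with constant-time pointer fixups.

```c
/* Illustrative sketch only -- invented names, not the PR's actual code. */
#include <stdint.h>

typedef struct st_sent_packet_t {
    uint64_t packet_number;        /* packet number of the sent packet */
    struct st_sent_packet_t *prev; /* previously sent packet */
    struct st_sent_packet_t *next; /* packet sent after this one */
    /* per-packet data (frames, send time, ...) would follow */
} sent_packet_t;

typedef struct st_sentmap_t {
    sent_packet_t head; /* sentinel node; head.next is the oldest tracked packet */
} sentmap_t;

/* initialize an empty sentmap: the sentinel points to itself */
static void sentmap_init(sentmap_t *map)
{
    map->head.prev = map->head.next = &map->head;
}

/* record a newly sent packet at the tail of the list in O(1) */
static void sentmap_record(sentmap_t *map, sent_packet_t *p)
{
    p->prev = map->head.prev;
    p->next = &map->head;
    p->prev->next = p;
    map->head.prev = p;
}
```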

@kazuho
Member

kazuho commented Apr 4, 2020

Thank you for the PR.

Am I correct to assume that the benchmark was conducted on a low-latency, high-bandwidth network, but one that still sees some congestion? I would like to reproduce the numbers you saw.

@AloysAugustin
Author

AloysAugustin commented Apr 4, 2020

Yes, the tests were performed on a very low-latency, high-throughput network.
We tested with our VPP integration, where this change allowed us to go from 5.9 to 6.3 Gbit/s with a 2000-byte MTU.
The improvement can also be seen with qperf (https://github.com/rbruenig/qperf), which is probably simpler to use.

Here are the results I obtained with qperf on a Skylake machine running Ubuntu 18.04, with all default options:
quicly master:

~/qperf/build$ ./qperf -c localhost
running client with host=localhost and runtime=10s
connection establishment time: 5ms
time to first byte: 5ms
second 0: 2.02 gbit/s (271150745 bytes received)
second 1: 2.083 gbit/s (279641717 bytes received)
second 2: 2.141 gbit/s (287408614 bytes received)
second 3: 2.156 gbit/s (289314994 bytes received)
second 4: 2.145 gbit/s (287874893 bytes received)
second 5: 2.156 gbit/s (289403653 bytes received)
second 6: 2.181 gbit/s (292782550 bytes received)
second 7: 2.181 gbit/s (292740093 bytes received)
second 8: 2.165 gbit/s (290612150 bytes received)
second 9: 2.105 gbit/s (282572043 bytes received)
connection closed

With this change:

~/qperf/build$ ./qperf -c localhost
running client with host=localhost and runtime=10s
connection establishment time: 5ms
time to first byte: 5ms
second 0: 1021 mbit/s (133791763 bytes received)
second 1: 2.271 gbit/s (304856781 bytes received)
second 2: 2.312 gbit/s (310342380 bytes received)
second 3: 2.297 gbit/s (308330612 bytes received)
second 4: 2.308 gbit/s (309803357 bytes received)
second 5: 2.237 gbit/s (300284822 bytes received)
second 6: 2.301 gbit/s (308882886 bytes received)
second 7: 2.349 gbit/s (315291804 bytes received)
second 8: 2.393 gbit/s (321175290 bytes received)
second 9: 2.398 gbit/s (321790661 bytes received)
connection closed

@kazuho
Member

kazuho commented Apr 8, 2020

Thank you for elaborating.

I tested this PR using my testbed with a 1 Gbps link, capping the CPU clock frequency of the sender machine to 400 MHz so that the sender is the bottleneck. The receiver's CPU was running at a much higher clock rate, and there was plenty of idle time.

master: 333.1 Mbps
this PR: 330.3 Mbps

As can be seen, I did not observe an improvement.

The approach adopted by this PR involves the following trade-off: it reduces the cost of skipping 1 to 15 entries in the sentmap, at the cost of using more memory and calling malloc and free more frequently. I wonder if we could eliminate the skips without the negative side effects of the latter.
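To make that trade-off concrete, a simplified sketch of the skip-based scheme being described might look as follows; this is invented for illustration and is not quicly's actual sentmap code. Acknowledged entries are only marked as unused, so traversals have to step over the resulting holes, whereas a dlist removes the node outright at the price of per-packet allocations.

```c
/* Simplified sketch only -- invented names, not quicly's actual sentmap. */
#include <stddef.h>
#include <stdint.h>

#define SENTMAP_CAPACITY 1024
#define PN_DISCARDED UINT64_MAX /* marks a slot whose packet has been acked */

typedef struct {
    uint64_t packet_number[SENTMAP_CAPACITY];
    size_t begin, end; /* ring indices; tracked slots sit in [begin, end) */
} ring_sentmap_t;

/* advance idx to the next slot that still holds an un-acked packet;
 * in the worst case this walks over a run of discarded entries,
 * which is the per-iteration cost a dlist-based sentmap avoids */
static size_t ring_sentmap_skip(const ring_sentmap_t *map, size_t idx)
{
    while (idx != map->end && map->packet_number[idx] == PN_DISCARDED)
        idx = (idx + 1) % SENTMAP_CAPACITY;
    return idx;
}
```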

That said, I would assume that this optimization becomes much less effective once we adopt #319. In the case of the testbed described above, we see the speed go up to 432 Mbps, as the number of ACK packets is reduced to 1/3.

Would you be interested in checking whether this PR still makes a difference when #319 is used as the base (i.e., comparing #319 vs. #319 + this PR)?

This improves the throughput of quicly by 5-7% in benchmarks. The
discard code used to be a bottleneck. This structure allows
removing packets from the sentmap in O(1) time.
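For context, the O(1) removal the commit message refers to amounts to a standard doubly-linked-list unlink; a sketch under the same assumed node layout as above (invented names, not the PR's literal code):

```c
/* Sketch only -- invented names, not the PR's actual node layout. */
#include <stdlib.h>

typedef struct st_sent_packet_t {
    struct st_sent_packet_t *prev, *next;
    /* per-packet data (packet number, frames, send time, ...) omitted */
} sent_packet_t;

/* unlink an acked packet and release it; no neighbouring entries are
 * walked, regardless of where in the list the packet sits */
static void sentmap_discard(sent_packet_t *p)
{
    p->prev->next = p->next;
    p->next->prev = p->prev;
    free(p); /* assumes each node was malloc'ed when the packet was sent */
}
```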
@AloysAugustin
Author

Thank you for taking the time to try this PR.

I ran more tests with and without #319, both with qperf and VPP.
What I noticed is that this PR makes little difference unless there are dropped packets in the test.
For instance with VPP, we reach 6.1 Gbps with the current master (which includes #319), and 6.3 Gbps with this PR on top of master, with ~100 packets dropped per second.
I think this is because drops cause out-of-order deletes in the sentmap, which require walking through the previous entries in the current implementation.

Overall I think this deserves more testing at this point: running tests while controlling the drop rate more accurately, and perhaps switching the congestion control algorithm to one that can sustain high throughput even in the presence of drops. I will let you know if I get to spend some time on it.

As a side note, I also tried keeping the allocated packets in a freelist in order to reduce the number of calls to malloc/free, but I did not observe a significant difference.
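A freelist of that kind could look roughly like the sketch below; the names are invented and this is not the code that was actually benchmarked. Released nodes are chained together and handed back out on the next allocation instead of going through free and malloc each time.

```c
/* Sketch only -- invented names, not the code that was benchmarked. */
#include <stdlib.h>

typedef struct st_sent_packet_t {
    struct st_sent_packet_t *prev, *next;
    /* per-packet data omitted */
} sent_packet_t;

static sent_packet_t *freelist; /* recycled nodes, chained through `next` */

static sent_packet_t *sent_packet_alloc(void)
{
    if (freelist != NULL) {
        sent_packet_t *p = freelist;
        freelist = p->next; /* reuse a recycled node instead of calling malloc */
        return p;
    }
    return malloc(sizeof(sent_packet_t));
}

static void sent_packet_release(sent_packet_t *p)
{
    p->next = freelist; /* push onto the freelist instead of calling free */
    freelist = p;
}
```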
