Sentmap refactorisation #314

AloysAugustin · 2020-04-03T13:22:55Z

Hi Kazuho,

I looked at the performance of Quicly yesterday and noticed a bottleneck in the sentmap at high throughputs. This PR proposes a new structure based on a dlist of packets for the sentmap, which improves the throughput by 5-7% in our tests.

kazuho · 2020-04-04T00:48:45Z

Thank you for the PR.

Am I correct to assume that the benchmark was conducted on a low-latency, high-bandwidth network, but still sees some congestions? I would like to reproduce the numbers you saw.

AloysAugustin · 2020-04-04T13:24:20Z

Yes, the tests were performed on a very low latency, high throughput network.
We tested with our VPP integration where this change allowed us to go from 5.9 to 6.3Gbit/s with 2000 bytes MTU.
The improvement can also be seen with qperf (https://github.com/rbruenig/qperf), which is probably simpler to use.

Here are the results I obtained with qperf on a skylake machine running Ubuntu 18.04, with all default options:
Quicly master:

~/qperf/build$ ./qperf -c localhost
running client with host=localhost and runtime=10s
connection establishment time: 5ms
time to first byte: 5ms
second 0: 2.02 gbit/s (271150745 bytes received)
second 1: 2.083 gbit/s (279641717 bytes received)
second 2: 2.141 gbit/s (287408614 bytes received)
second 3: 2.156 gbit/s (289314994 bytes received)
second 4: 2.145 gbit/s (287874893 bytes received)
second 5: 2.156 gbit/s (289403653 bytes received)
second 6: 2.181 gbit/s (292782550 bytes received)
second 7: 2.181 gbit/s (292740093 bytes received)
second 8: 2.165 gbit/s (290612150 bytes received)
second 9: 2.105 gbit/s (282572043 bytes received)
connection closed

With this change:

~/qperf/build$ ./qperf -c localhost
running client with host=localhost and runtime=10s
connection establishment time: 5ms
time to first byte: 5ms
second 0: 1021 mbit/s (133791763 bytes received)
second 1: 2.271 gbit/s (304856781 bytes received)
second 2: 2.312 gbit/s (310342380 bytes received)
second 3: 2.297 gbit/s (308330612 bytes received)
second 4: 2.308 gbit/s (309803357 bytes received)
second 5: 2.237 gbit/s (300284822 bytes received)
second 6: 2.301 gbit/s (308882886 bytes received)
second 7: 2.349 gbit/s (315291804 bytes received)
second 8: 2.393 gbit/s (321175290 bytes received)
second 9: 2.398 gbit/s (321790661 bytes received)
connection closed

kazuho · 2020-04-08T05:41:13Z

Thank you for elaborating.

I tested this PR using my testbed, saturating a 1Gbps link, by capping the CPU clock frequency of the sender machine to 400MHz. The receiver's CPU was running at much higher clock rate, and there was plenty of idle time.

master: 333.1Mbps
this PR: 330.3Mbps

As can be seen, I did not observe improvement.

The approach being adopted by this PR has the following trade-off. It reduces the cost of skipping 1 to 15 entries in sentmap, at the cost of using more memory calling malloc and free more frequently. I wonder if we could eliminate the skips without having the negative side-effect of the latter.

That said, I would assume that this optimization would become much less effective once we adopt #319. In case of the testbed explained above, we see the speed go up to 432Mbps, as the number of ACK packets get reduced to 1/3.

Would you be interested in checking if you still see this PR makes difference when #319 is used as the base (i.e. compare #319 vs. #319 + this PR)?

This improves the throughput of quicly by 5-7% in benchmarks. The discard code used to be a bottleneck. This structure allows removing packets from the sentmap in O(1) time.

AloysAugustin · 2020-04-13T15:19:29Z

Thank you for taking the time to try this PR.

I ran more tests with and without #319, both with qperf and VPP.
What I noticed is that this PR makes little difference unless there are dropped packets in the test.
For instance with VPP, we reach 6.1 Gbps with the current master (which includes #319), and 6.3 with this PR on top of master, with ~100 packets dropped per second.
I think this is due to the fact that drops cause out-of-order deletes in the sentmap, which requires walking through the previous entries with the current implementation.

Overall I think this deserves more testing at this point, to run tests while controlling more accurately the drop rate and maybe switch the congestion control algorithm to something that can sustain high throughputs even in the presence of drops. I will let you know if I get to spend some time on it.

As a side note, I also tried to keep the allocated packets in a freelist in order to reduce the amount of calls to malloc / free, but I did not observe a significant difference.

AloysAugustin added 3 commits April 13, 2020 17:00

prevent cwnd overflow

f0b5801

Use a dlist of packets for the sentmap.

e6c7895

This improves the throughput of quicly by 5-7% in benchmarks. The discard code used to be a bottleneck. This structure allows removing packets from the sentmap in O(1) time.

Fix comment, remove unnecessary parentheses

d89d28d

AloysAugustin force-pushed the sentmap_optim branch from 3d51b09 to d89d28d Compare April 13, 2020 15:00

kazuho mentioned this pull request Apr 28, 2020

Minimizing the cost of sentmap iteration #332

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sentmap refactorisation #314

Sentmap refactorisation #314

AloysAugustin commented Apr 3, 2020

kazuho commented Apr 4, 2020

AloysAugustin commented Apr 4, 2020 •

edited

Loading

kazuho commented Apr 8, 2020

AloysAugustin commented Apr 13, 2020

Sentmap refactorisation #314

Are you sure you want to change the base?

Sentmap refactorisation #314

Conversation

AloysAugustin commented Apr 3, 2020

kazuho commented Apr 4, 2020

AloysAugustin commented Apr 4, 2020 • edited Loading

kazuho commented Apr 8, 2020

AloysAugustin commented Apr 13, 2020

AloysAugustin commented Apr 4, 2020 •

edited

Loading