HTTP/2 + TLS spends a lot of time in recv (buffering issue) #13416

Open
laramiel opened this issue Apr 18, 2024 · 9 comments
@laramiel
Contributor

laramiel commented Apr 18, 2024

I did this

As in #13403

In tensorstore we (currently) use a single multi-handle to multiplex a lot of HTTP/2 transfers. Those transfers appear to be slower than expected.

The following benchmark, which downloads 1000 files of 64 MB each over HTTP/2 (limited to 32 concurrent transfers), demonstrates an issue with an excess of ::recvfrom calls:

At the lowest level of the stack, nw_in_read calls recv repeatedly with a sequence of (typically) <5>, <1403>, ... byte buffers.
There is special handling for "small" buffers (defined as < 1024 bytes) in cf_socket_recv, but it only applies when ctx->buffer_recv is set, and I cannot see how that is ever TRUE.

Relevant stack trace & buffer lengths.

nw_in_read (curl/src/lib/cf-socket.c:863)   len = 1403
cf_socket_recv (curl/src/lib/cf-socket.c:1370)   len = 1403
Curl_cf_def_recv (curl/src/lib/cfilters.c:103)
Curl_conn_cf_recv (curl/src/lib/cfilters.c:322)
ossl_bio_cf_in_read (curl/src/lib/vtls/openssl.c:763)

BIO_read (openssl/boringssl/src/crypto/bio/bio.c:139)   len = 1403
bssl::tls_read_buffer_extend_to(ssl_st*, unsigned long) (openssl/boringssl/src/ssl/ssl_buffer.cc:157)   buf.size = 5, buf.capacity = 1408
bssl::ssl_read_buffer_extend_to(ssl_st*, unsigned long) (openssl/boringssl/src/ssl/ssl_buffer.cc:196)
bssl::ssl_handle_open_record(ssl_st*, bool*, bssl::ssl_open_record_t, unsigned long, unsigned char) (openssl/boringssl/src/ssl/ssl_buffer.cc:222)
ssl_read_impl(ssl_st*) (openssl/boringssl/src/ssl/ssl_lib.cc:1019)    <no buffer size>
::SSL_peek(SSL *, void *, int) (openssl/boringssl/src/ssl/ssl_lib.cc:1053)  bufsize = 15914
::SSL_read(SSL *, void *, int) (openssl/boringssl/src/ssl/ssl_lib.cc:1033) 

ossl_recv (curl/src/lib/vtls/openssl.c:4670)   
ssl_cf_recv (curl/src/lib/vtls/vtls.c:1725)
Curl_cf_def_recv (curl/src/lib/cfilters.c:103)
Curl_conn_cf_recv (curl/src/lib/cfilters.c:322)
nw_in_reader (curl/src/lib/http2.c:362)     buflen = 16384
chunk_slurpn (curl/src/lib/bufq.c:109)
Curl_bufq_sipn (curl/src/lib/bufq.c:594)
bufq_slurpn (curl/src/lib/bufq.c:623)
Curl_bufq_slurp (curl/src/lib/bufq.c:655)
h2_progress_ingress (curl/src/lib/http2.c:1911)
cf_h2_recv (curl/src/lib/http2.c:1969)
Curl_cf_def_recv (curl/src/lib/cfilters.c:103)
Curl_conn_recv (curl/src/lib/cfilters.c:183)
Curl_read (curl/src/lib/sendf.c:813)
Curl_xfer_recv_resp (curl/src/lib/transfer.c:450)
readwrite_data (curl/src/lib/transfer.c:504)

Repro:

git clone https://github.com/google/tensorstore.git

./bazelisk.py build -c opt --copt=-g tensorstore/internal/benchmark:kvstore_benchmark

strace -e recvfrom -f \
./bazel-bin/tensorstore/internal/benchmark/kvstore_benchmark \
   --kvstore_spec='{"driver": "gcs", "bucket":"<MY_GCS_BUCKET>", "path":"default"}' \
   --chunk_size=67108864 \
   --total_bytes=67108864000 \
   --repeat_reads=1 \
   --repeat_writes=0
   

Example strace output:

[pid 3324193] recvfrom(46, "\27\3\3\5{", 5, 0, NULL, NULL) = 5
[pid 3324193] recvfrom(46, "\313\265\v?\343\346=?\272&e\220\342V\322\207\35X6C\203\203V\250\346\346\301\2067W\204\252"..., 1403, 0, NULL, NULL) = 1403
[pid 3324193] recvfrom(46, "\27\3\3\5{", 5, 0, NULL, NULL) = 5
[pid 3324193] recvfrom(46, "O\1\210\301\307x(f\321\337\3469f\323\nAUtS\326\234\364|\244\\\275\275)K.\250H"..., 1403, 0, NULL, NULL) = 1403
[pid 3324193] recvfrom(46, "\27\3\3\5{", 5, 0, NULL, NULL) = 5
[pid 3324193] recvfrom(46, "\224\255\317Uj\274W\310J\344\32\264\243\254\224(v\262\261\214Y\253}`\374R\323\316nj\222u"..., 1403, 0, NULL, NULL) = 1403
[pid 3324193] recvfrom(46, "\27\3\3\5{", 5, 0, NULL, NULL) = 5
[pid 3324193] recvfrom(46, "\"\21\3606\3106Y5Z/\233\236\211\261\305~\v\17\272<\351\225Qx\303\177p\331\2035o,"..., 1403, 0, NULL, NULL) = 1403
[pid 3324193] recvfrom(46, "\27\3\3\5{", 5, 0, NULL, NULL) = 5
[pid 3324193] recvfrom(46, "\326\352\224bU5\337\23\342p`\0342\327\31\313?R\33rb\34M\237)U\204\302\f\215'P"..., 1403, 0, NULL, NULL) = 1403
[pid 3324193] recvfrom(46, "\27\3\3\5\10", 5, 0, NULL, NULL) = 5
[pid 3324193] recvfrom(46, "\331\214\17\352C\23K\232\34\374\275\357\220\227\16\250\360\355e\342\22$j|\35\313>\274\237N\311\210"..., 1288, 0, NULL, NULL) = 1288
[pid 3324193] recvfrom(46, "\27\3\3\5{", 5, 0, NULL, NULL) = 5
[pid 3324193] recvfrom(46, "B\312\345 \314\7\302s\2346u>\262\r\226'\317\256\3515\311\26\2O\220fr\21\264-\345["..., 1403, 0, NULL, NULL) = 1403

Here we clearly see the 5-byte TLS record header and the record body being read as separate ::recvfrom calls.

I took a histogram of the recvfrom calls; here are all the buffer sizes with more than 500 occurrences in the download.
Notice that about half the calls are for 5 bytes (TLS record headers), while the rest vary, but a large number are for
actual 8k blocks. The average recvfrom size is about 4090 bytes, which follows from nearly every ~8218-byte record-body read being paired with a 5-byte header read.

count buffer_size
516 28
548 8203
552 8212
558 29
569 6980
664 8202
753 8201
988 8204
993 8206
1012 26
1140 34
1158 27
1653 8210
2476 8208
2496 1288
2750 8207
12715 1403
8183520 8218
8228072 5

I expected the following

Fewer calls to recvfrom; buffering at the lowest layer for fewer context switches.

curl/libcurl version

curl 8.7.1

operating system

5.10.0-26-cloud-amd64 #1 SMP Debian 5.10.197-1 (2023-09-29) x86_64 GNU/Linux

@bagder
Member

bagder commented Apr 19, 2024

@icing's comment when this was previously brought up on the mailing list: https://curl.se/mail/lib-2024-04/0000.html

@icing
Contributor

icing commented Apr 19, 2024

If I understand the reported issue correctly, you would prefer some receive buffering in the socket filter, so that the TLS filter above can do its small reads without involving socket operations. Is that correct?

@icing icing self-assigned this Apr 19, 2024
@laramiel
Contributor Author

While it's a tradeoff, and not exactly what the other thread is about, I believe that when TLS is used, the lack of buffering on recv is likely to increase latency quite a bit.

Since TLS issues a 5-byte record-header read and then the payload read, every record carries that extra syscall overhead. I think that having a buffer wired into the BIO_read call could really pay off.

If we look at the SSL_read call, the upper layer expects to read 16 KB (the H2 chunk size); that 16 KB requires at least 4 ::recvfrom calls, and possibly many more: header:5, body:8k, header:5, body:8k.

From my understanding, OpenSSL has a buffering BIO mechanism that could be added in; alternatively, there's curl's existing bufq.
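
To make the idea concrete, here is a minimal sketch (not a patch against curl, and assuming stock OpenSSL; BoringSSL may not ship BIO_f_buffer) of chaining a buffering filter BIO in front of whatever transport BIO the TLS filter installs, so that the header read and the body read are served from a userspace buffer:

/* Minimal sketch, not curl's code: insert OpenSSL's buffering filter BIO
 * between the SSL object and the transport BIO, so the 5-byte record
 * header read and the record body read hit a userspace buffer instead of
 * two recvfrom() calls per TLS record. */
#include <openssl/bio.h>
#include <openssl/ssl.h>

static int attach_buffered_bio(SSL *ssl, BIO *transport_bio)
{
  BIO *buf = BIO_new(BIO_f_buffer());
  if(!buf)
    return 0;
  BIO_set_buffer_size(buf, 17 * 1024);  /* fits a full TLS record + header */
  BIO_push(buf, transport_bio);         /* chain: SSL -> buffer -> transport */
  SSL_set_bio(ssl, buf, buf);
  return 1;
}

One caveat: a buffering BIO also buffers writes, so the send path would need BIO_flush() calls to avoid holding back outgoing records.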

@laramiel
Contributor Author

laramiel commented Apr 19, 2024

I think that the SSL_set_bio call in ossl_connect_step1 could use chained BIOs created via BIO_new_bio_pair, which has read/write buffers built in. It appears that grpc uses this mechanism for its SSL configuration.

https://github.com/grpc/grpc/blob/ba78e4ff475550e1fa3fd09fbcee326c3c6dfd5a/src/core/tsi/ssl_transport_security.cc#L1891

In BIO_new_bio_pair (crypto/bio/pair.c), 17 KB is allocated when creating the paired BIO buffers. Mimicking that should be enough of a buffer in curl, I'd imagine.
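
A rough sketch of how that could look (illustrative only, not curl's actual ossl_connect_step1; everything except the OpenSSL calls is made up, and the mirror-image send pump is omitted). The SSL object gets one end of the pair, and the transfer loop moves data between the other end and the socket in large chunks, so each TLS record no longer costs two recvfrom() calls:

#include <sys/types.h>
#include <sys/socket.h>
#include <openssl/bio.h>
#include <openssl/ssl.h>

#define PAIR_BUF_SIZE (17 * 1024)   /* mirrors the 17 KB noted above */

/* Give the SSL object one end of a BIO pair; the caller keeps the other
 * ("network") end and shuttles bytes between it and the socket. */
static int setup_bio_pair(SSL *ssl, BIO **network_bio)
{
  BIO *ssl_bio = NULL;
  if(!BIO_new_bio_pair(&ssl_bio, PAIR_BUF_SIZE, network_bio, PAIR_BUF_SIZE))
    return 0;
  SSL_set_bio(ssl, ssl_bio, ssl_bio);
  return 1;
}

/* Refill the SSL's incoming buffer with one large recv() instead of one
 * recv() per record header/body; call when the socket is readable,
 * before SSL_read(). */
static ssize_t pump_socket_to_ssl(int sock, BIO *network_bio)
{
  char buf[PAIR_BUF_SIZE];
  size_t room = BIO_ctrl_get_write_guarantee(network_bio);
  ssize_t n;
  if(!room)
    return 0;               /* pair is full; SSL_read() must drain it first */
  if(room > sizeof(buf))
    room = sizeof(buf);
  n = recv(sock, buf, room, 0);
  if(n > 0)
    BIO_write(network_bio, buf, (int)n);
  return n;
}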

@icing
Contributor

icing commented Apr 20, 2024

Citing https://google.github.io/tensorstore/

"Natively supports multiple storage drivers, including Google Cloud Storage"

We are talking about improving performance for Google Cloud Storage?

@laramiel
Contributor Author

laramiel commented Apr 22, 2024

Not really? Somewhat? The recvfrom calls are generic and all large transfers are affected.

tensorstore supports a lot of potential backends. The HTTP/2 stack is shared across our "http", "aws", and "gcs" access methods (it uses neither the google-cloud-cpp libraries nor the AWS C++ client libraries), and for most Google uses the gRPC backend performs better than the HTTP/2 backend. It just happens that testing against GCS is pretty convenient; I could alternatively test against AWS or a generic HTTP/2 dataset.

The project itself is developed for our connectomics research. It is open source so our collaborators have access to it, and it has uses beyond connectomics for many large datasets. See the blog post: https://research.google/blog/tensorstore-for-high-performance-scalable-array-storage

@jbms
Contributor

jbms commented Apr 22, 2024

To be clear, we are observing about 50% of the CPU time spent in calls to recv in profiles. Since ultimately this is just doing a memcpy from some kernel buffer, in principle it should be possible to reduce this to almost zero. The remainder of the time is split between decryption in OpenSSL and other memory copies. We therefore expect that single-thread throughput could be nearly 2x the current throughput if the excessive small recv calls were eliminated.

It would be better if the excessive recv calls could be avoided without additional memory copies, though it is not clear how easily that can be accomplished. But even if additional memory copies are needed, performance is still likely to improve, because a memcpy is much cheaper than a recv call. Indeed, failing to use buffered I/O when performing small reads from a file descriptor is a classic performance pitfall.
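
For illustration only (not curl code; names here are made up), this is the generic pattern being described: refill a userspace buffer with one large recv() so that subsequent small reads become memcpy()s instead of syscalls:

/* Illustrative only, not curl code: amortize many tiny reads from a file
 * descriptor with a userspace buffer refilled by one large recv(). */
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

struct recv_buf {
  char data[16 * 1024];
  size_t pos, len;
};

/* Unbuffered: even a 5-byte TLS header read costs a full recv() syscall. */
static ssize_t read_unbuffered(int fd, void *out, size_t n)
{
  return recv(fd, out, n, 0);
}

/* Buffered: one large recv() serves many subsequent small reads, trading
 * an extra memcpy for far fewer syscalls. */
static ssize_t read_buffered(int fd, struct recv_buf *rb, void *out, size_t n)
{
  if(rb->pos == rb->len) {              /* buffer empty: refill it */
    ssize_t got = recv(fd, rb->data, sizeof(rb->data), 0);
    if(got <= 0)
      return got;
    rb->pos = 0;
    rb->len = (size_t)got;
  }
  if(n > rb->len - rb->pos)
    n = rb->len - rb->pos;
  memcpy(out, rb->data + rb->pos, n);
  rb->pos += n;
  return (ssize_t)n;
}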

@icing
Contributor

icing commented Apr 23, 2024

I agree that curl can be improved here for better performance in high-traffic applications. I propose to add that to the project's TODO list, so that anyone willing to invest time can see it. Maybe such an individual will come forward in the future, or some organisation will decide to make a financial contribution to push this.

Does this sound good?

@laramiel
Contributor Author

laramiel commented Apr 29, 2024

At the line

/* buffering is currently disabled by default because we have stalls

if I change FALSE to TRUE here (ignoring the stalling issues), the results change from >50% of the time in recv (66 s) to ~30% (24 s), and throughput increases by about 50% (511 MB/s to 780 MB/s, single threaded, YMMV, etc.).
