HTTP/2 + TLS spends a lot of time in recv (buffering issue) #13416

Open
laramiel opened this issue Apr 18, 2024 · 9 comments
@laramiel
Contributor

laramiel commented Apr 18, 2024

I did this

As in #13403

In tensorstore we (currently) use a single multi-handle to multiplex a lot of HTTP/2 transfers. Those transfers appear to be slower than expected.

The following benchmark, which downloads 1000 files of 64 MB each over HTTP/2 (limited to 32 concurrent transfers), demonstrates an issue with an excess of ::recvfrom calls:

At the lowest level of the stack, nw_in_read calls recv repeatedly with a sequence of (typically) <5>, <1403>, ... byte buffers.
There is special handling for "small" buffers (defined as < 1024 bytes) in cf_socket_recv, but it only applies when ctx->buffer_recv is set, and I cannot see how that is ever TRUE.

Relevant stack trace & buffer lengths.

nw_in_read (curl/src/lib/cf-socket.c:863)   len = 1403
cf_socket_recv (curl/src/lib/cf-socket.c:1370)   len = 1403
Curl_cf_def_recv (curl/src/lib/cfilters.c:103)
Curl_conn_cf_recv (curl/src/lib/cfilters.c:322)
ossl_bio_cf_in_read (curl/src/lib/vtls/openssl.c:763)

BIO_read (openssl/boringssl/src/crypto/bio/bio.c:139)   len = 1403
bssl::tls_read_buffer_extend_to(ssl_st*, unsigned long) (openssl/boringssl/src/ssl/ssl_buffer.cc:157)   buf.size = 5, buf.capacity = 1408
bssl::ssl_read_buffer_extend_to(ssl_st*, unsigned long) (openssl/boringssl/src/ssl/ssl_buffer.cc:196)
bssl::ssl_handle_open_record(ssl_st*, bool*, bssl::ssl_open_record_t, unsigned long, unsigned char) (openssl/boringssl/src/ssl/ssl_buffer.cc:222)
ssl_read_impl(ssl_st*) (openssl/boringssl/src/ssl/ssl_lib.cc:1019)    <no buffer size>
::SSL_peek(SSL *, void *, int) (openssl/boringssl/src/ssl/ssl_lib.cc:1053)  bufsize = 15914
::SSL_read(SSL *, void *, int) (openssl/boringssl/src/ssl/ssl_lib.cc:1033) 

ossl_recv (curl/src/lib/vtls/openssl.c:4670)   
ssl_cf_recv (curl/src/lib/vtls/vtls.c:1725)
Curl_cf_def_recv (curl/src/lib/cfilters.c:103)
Curl_conn_cf_recv (curl/src/lib/cfilters.c:322)
nw_in_reader (curl/src/lib/http2.c:362)     buflen = 16384
chunk_slurpn (curl/src/lib/bufq.c:109)
Curl_bufq_sipn (curl/src/lib/bufq.c:594)
bufq_slurpn (curl/src/lib/bufq.c:623)
Curl_bufq_slurp (curl/src/lib/bufq.c:655)
h2_progress_ingress (curl/src/lib/http2.c:1911)
cf_h2_recv (curl/src/lib/http2.c:1969)
Curl_cf_def_recv (curl/src/lib/cfilters.c:103)
Curl_conn_recv (curl/src/lib/cfilters.c:183)
Curl_read (curl/src/lib/sendf.c:813)
Curl_xfer_recv_resp (curl/src/lib/transfer.c:450)
readwrite_data (curl/src/lib/transfer.c:504)

Repro:

git clone https://github.com/google/tensorstore.git

./bazelisk.py build -c opt --copt=-g tensorstore/internal/benchmark:kvstore_benchmark

strace -e recvfrom -f \
./bazel-bin/tensorstore/internal/benchmark/kvstore_benchmark \
   --kvstore_spec='{"driver": "gcs", "bucket":"<MY_GCS_BUCKET>", "path":"default"}' \
   --chunk_size=67108864 \
   --total_bytes=67108864000 \
   --repeat_reads=1 \
   --repeat_writes=0
   

Example strace output:

[pid 3324193] recvfrom(46, "\27\3\3\5{", 5, 0, NULL, NULL) = 5
[pid 3324193] recvfrom(46, "\313\265\v?\343\346=?\272&e\220\342V\322\207\35X6C\203\203V\250\346\346\301\2067W\204\252"..., 1403, 0, NULL, NULL) = 1403
[pid 3324193] recvfrom(46, "\27\3\3\5{", 5, 0, NULL, NULL) = 5
[pid 3324193] recvfrom(46, "O\1\210\301\307x(f\321\337\3469f\323\nAUtS\326\234\364|\244\\\275\275)K.\250H"..., 1403, 0, NULL, NULL) = 1403
[pid 3324193] recvfrom(46, "\27\3\3\5{", 5, 0, NULL, NULL) = 5
[pid 3324193] recvfrom(46, "\224\255\317Uj\274W\310J\344\32\264\243\254\224(v\262\261\214Y\253}`\374R\323\316nj\222u"..., 1403, 0, NULL, NULL) = 1403
[pid 3324193] recvfrom(46, "\27\3\3\5{", 5, 0, NULL, NULL) = 5
[pid 3324193] recvfrom(46, "\"\21\3606\3106Y5Z/\233\236\211\261\305~\v\17\272<\351\225Qx\303\177p\331\2035o,"..., 1403, 0, NULL, NULL) = 1403
[pid 3324193] recvfrom(46, "\27\3\3\5{", 5, 0, NULL, NULL) = 5
[pid 3324193] recvfrom(46, "\326\352\224bU5\337\23\342p`\0342\327\31\313?R\33rb\34M\237)U\204\302\f\215'P"..., 1403, 0, NULL, NULL) = 1403
[pid 3324193] recvfrom(46, "\27\3\3\5\10", 5, 0, NULL, NULL) = 5
[pid 3324193] recvfrom(46, "\331\214\17\352C\23K\232\34\374\275\357\220\227\16\250\360\355e\342\22$j|\35\313>\274\237N\311\210"..., 1288, 0, NULL, NULL) = 1288
[pid 3324193] recvfrom(46, "\27\3\3\5{", 5, 0, NULL, NULL) = 5
[pid 3324193] recvfrom(46, "B\312\345 \314\7\302s\2346u>\262\r\226'\317\256\3515\311\26\2O\220fr\21\264-\345["..., 1403, 0, NULL, NULL) = 1403

Here we clearly see the 5-byte TLS record header and the record body being read as separate ::recvfrom calls.

I took a histogram of the recvfrom calls; here are all the buffer sizes with more than 500 occurrences in the download.
Notice that about half the calls are for 5 bytes (TLS record headers), while the rest vary, but a large number are for
actual 8k blocks. The average recvfrom size is about 4090 bytes, which follows from nearly every ~8218-byte record-body read being paired with a 5-byte header read.

count buffer_size
516 28
548 8203
552 8212
558 29
569 6980
664 8202
753 8201
988 8204
993 8206
1012 26
1140 34
1158 27
1653 8210
2476 8208
2496 1288
2750 8207
12715 1403
8183520 8218
8228072 5

I expected the following

Fewer calls to recvfrom; buffering at the lowest layer for fewer context switches.

curl/libcurl version

curl 8.7.1

operating system

5.10.0-26-cloud-amd64 #1 SMP Debian 5.10.197-1 (2023-09-29) x86_64 GNU/Linux

@bagder
Member

bagder commented Apr 19, 2024

@icing's comment when this was previously brought up on the mailing list: https://curl.se/mail/lib-2024-04/0000.html

@icing
Contributor

icing commented Apr 19, 2024

If I understand the reported issue correctly, you would prefer some receive buffering in the socket filter, so that the TLS filter above can do its small reads without involving socket operations. Is that correct?

@icing icing self-assigned this Apr 19, 2024
@laramiel
Contributor Author

While it's a tradeoff, and not exactly what the other thread is about, I believe that when TLS is used, the lack of buffering on recv is likely to increase latency quite a bit.

Since TLS issues a 5-byte record-header read and then the payload read, every record carries that extra syscall overhead. I think that having a buffer wired into the BIO_read call could really pay off.

If we look at the SSL_read call, the upper layer expects to read 16 KB (the H2 chunk size); that 16 KB requires at least 4 ::recvfrom calls, and possibly many more: header:5, body:8k, header:5, body:8k.

From my understanding, OpenSSL has a buffering BIO mechanism that could be added in; alternatively, there's curl's existing bufq.
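
To make the idea concrete, here is a minimal sketch (not a patch against curl, and assuming stock OpenSSL; BoringSSL may not ship BIO_f_buffer) of chaining a buffering filter BIO in front of whatever transport BIO the TLS filter installs, so that the header read and the body read are served from a userspace buffer:

/* Minimal sketch, not curl's code: insert OpenSSL's buffering filter BIO
 * between the SSL object and the transport BIO, so the 5-byte record
 * header read and the record body read hit a userspace buffer instead of
 * two recvfrom() calls per TLS record. */
#include <openssl/bio.h>
#include <openssl/ssl.h>

static int attach_buffered_bio(SSL *ssl, BIO *transport_bio)
{
  BIO *buf = BIO_new(BIO_f_buffer());
  if(!buf)
    return 0;
  BIO_set_buffer_size(buf, 17 * 1024);  /* fits a full TLS record + header */
  BIO_push(buf, transport_bio);         /* chain: SSL -> buffer -> transport */
  SSL_set_bio(ssl, buf, buf);
  return 1;
}

One caveat: a buffering BIO also buffers writes, so the send path would need BIO_flush() calls to avoid holding back outgoing records.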

@laramiel
Contributor Author

laramiel commented Apr 19, 2024

I think that the SSL_set_bio call in ossl_connect_step1 could use chained BIOs created via BIO_new_bio_pair, which has read/write buffers built in. It appears that grpc uses this mechanism for its SSL configuration.

https://github.com/grpc/grpc/blob/ba78e4ff475550e1fa3fd09fbcee326c3c6dfd5a/src/core/tsi/ssl_transport_security.cc#L1891

In BIO_new_bio_pair (crypto/bio/pair.c), 17 KB is allocated when creating the paired BIO buffers. Mimicking that should be enough of a buffer in curl, I'd imagine.
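
A rough sketch of how that could look (illustrative only, not curl's actual ossl_connect_step1; everything except the OpenSSL calls is made up, and the mirror-image send pump is omitted). The SSL object gets one end of the pair, and the transfer loop moves data between the other end and the socket in large chunks, so each TLS record no longer costs two recvfrom() calls:

#include <sys/types.h>
#include <sys/socket.h>
#include <openssl/bio.h>
#include <openssl/ssl.h>

#define PAIR_BUF_SIZE (17 * 1024)   /* mirrors the 17 KB noted above */

/* Give the SSL object one end of a BIO pair; the caller keeps the other
 * ("network") end and shuttles bytes between it and the socket. */
static int setup_bio_pair(SSL *ssl, BIO **network_bio)
{
  BIO *ssl_bio = NULL;
  if(!BIO_new_bio_pair(&ssl_bio, PAIR_BUF_SIZE, network_bio, PAIR_BUF_SIZE))
    return 0;
  SSL_set_bio(ssl, ssl_bio, ssl_bio);
  return 1;
}

/* Refill the SSL's incoming buffer with one large recv() instead of one
 * recv() per record header/body; call when the socket is readable,
 * before SSL_read(). */
static ssize_t pump_socket_to_ssl(int sock, BIO *network_bio)
{
  char buf[PAIR_BUF_SIZE];
  size_t room = BIO_ctrl_get_write_guarantee(network_bio);
  ssize_t n;
  if(!room)
    return 0;               /* pair is full; SSL_read() must drain it first */
  if(room > sizeof(buf))
    room = sizeof(buf);
  n = recv(sock, buf, room, 0);
  if(n > 0)
    BIO_write(network_bio, buf, (int)n);
  return n;
}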

@icing
Contributor

icing commented Apr 20, 2024

Citing https://google.github.io/tensorstore/

"Natively supports multiple storage drivers, including Google Cloud Storage"

We are talking about improving performance for Google Cloud Storage?

@laramiel
Contributor Author

laramiel commented Apr 22, 2024

Not really? Somewhat? The recvfrom calls are generic and all large transfers are affected.

tensorstore supports a lot of potential backends. The HTTP/2 stack is shared across our "http", "aws", and "gcs" access methods (it uses neither the google-cloud-cpp libraries nor the AWS C++ client libraries), and for most Google uses the gRPC backend performs better than the HTTP/2 backend. It just happens that testing against GCS is pretty convenient; I could alternatively test against AWS or a generic HTTP/2 dataset.

The project itself is developed for our connectomics research. It is open source so our collaborators have access to it, and it has uses beyond connectomics for many large datasets. See the blog post: https://research.google/blog/tensorstore-for-high-performance-scalable-array-storage

@jbms
Contributor

jbms commented Apr 22, 2024

To be clear, we are observing about 50% of the CPU time spent in calls to recv in profiles. Since ultimately this is just doing a memcpy from some kernel buffer, in principle it should be possible to reduce this to almost zero. The remainder of the time is split between decryption in OpenSSL and other memory copies. We therefore expect that single-thread throughput could be nearly 2x the current throughput if the excessive small recv calls were eliminated.

It would be better if the excessive recv calls could be avoided without additional memory copies, though it is not clear how easily that can be accomplished. But even if additional memory copies are needed, performance is still likely to improve, because a memcpy is much cheaper than a recv call. Indeed, failing to use buffered I/O when performing small reads from a file descriptor is a classic performance pitfall.
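
For illustration only (not curl code; names here are made up), this is the generic pattern being described: refill a userspace buffer with one large recv() so that subsequent small reads become memcpy()s instead of syscalls:

/* Illustrative only, not curl code: amortize many tiny reads from a file
 * descriptor with a userspace buffer refilled by one large recv(). */
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

struct recv_buf {
  char data[16 * 1024];
  size_t pos, len;
};

/* Unbuffered: even a 5-byte TLS header read costs a full recv() syscall. */
static ssize_t read_unbuffered(int fd, void *out, size_t n)
{
  return recv(fd, out, n, 0);
}

/* Buffered: one large recv() serves many subsequent small reads, trading
 * an extra memcpy for far fewer syscalls. */
static ssize_t read_buffered(int fd, struct recv_buf *rb, void *out, size_t n)
{
  if(rb->pos == rb->len) {              /* buffer empty: refill it */
    ssize_t got = recv(fd, rb->data, sizeof(rb->data), 0);
    if(got <= 0)
      return got;
    rb->pos = 0;
    rb->len = (size_t)got;
  }
  if(n > rb->len - rb->pos)
    n = rb->len - rb->pos;
  memcpy(out, rb->data + rb->pos, n);
  rb->pos += n;
  return (ssize_t)n;
}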

@icing
Contributor

icing commented Apr 23, 2024

I agree that curl can be improved here for better performance in high-traffic applications. I propose to add that to the project's TODO list, so that anyone willing to invest time can see it. Maybe such an individual will come forward in the future, or some organisation will decide to make a financial contribution to push this.

Does this sound good?

@laramiel
Contributor Author

laramiel commented Apr 29, 2024

At the line

/* buffering is currently disabled by default because we have stalls

if I change FALSE to TRUE here (ignoring the stalling issues), the results change from >50% of the time in recv (66 s) to ~30% (24 s), and throughput increases by about 50% (511 MB/s to 780 MB/s, single threaded, YMMV, etc.).
