-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
allow setting buffer sizes on server_socket #2458
Conversation
This was discovered due to very poor throughput between a remote client with ~250 rtt and Redpanda: this transfer is receive window limited and benefits from buffers > 1 MB, but using such buffers (we set the configured buffer size on the |
auto buf_size = [](std::optional<int> buf_size) { | ||
listen_options lo{ | ||
.reuse_address = true, | ||
.lba = server_socket::load_balancing_algorithm::fixed, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we create multiple connections, we need to ensure we LB them all to shard 0, or else the accept() will just hang: this definite caused some confusion for a while.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does it hang?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test connects to the server socket and does an accept() on that socket (lines 280 + 281) expecting to get the connection we just made, all on shard 0, but if load balancing is on the second time through the connection will be send to shard 1, so the accept() on shard will not receive it and so will "hang".
All the other tests just happen to work I think since they only ever accept() a single connection on any server_socket and this first connection always balances to shard 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we had a discussion about this problem on the Seastar mailing list last year - see https://groups.google.com/g/seastar-dev/c/lMEXD_M04gE/m/hF41lhtDCAAJ, and the conclusions were that:
- This issue should be described in the woefully incomplete tutorial.md. Everybody who tried to write trivial servers in Seastar will encounter this problem and waste hours understanding what he did wrong - I did too.
- We have a document doc/network-connection-load-balancing.md
written by @gleb-cloudius which describes load_balancing_algorithm::fixed but is missing the newer set_fixed_cpu() function which is (supposedly?) easier to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, I just have a comment about a typo, and one nitpick about the test: I think it's better to test with different send and receive buffer sizes, to make sure you don't have copy-pasto bugs in the implementation (e.g., setting send buffer size from the receive buffer parameter).
include/seastar/net/api.hh
Outdated
@@ -389,13 +389,28 @@ public: | |||
|
|||
/// @} | |||
|
|||
/// Options for creating a listening socket. These options only have an effect | |||
/// when using the POSIX stack: all options are ignored on the native stack. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is surprising - if it's only for the POSIX stack, why is it in net/api.hh?
@avikivity @gleb-cloudius is this true, and if so, is this deliberate or a bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are waiting to be implemented. Implementing some of them are as simple as implementing UDP stack. Some are more complicated. On a serious note at least proto should be used for sanity check (error out on unsupported one).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, the comment should contain words like "currently", "TODO", "warning" and so on - the way it is phrased today make it sound like this is intentional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote this comment as while implementing the option for the POSIX stack I wanted to understand the implication on the native side, and found that we simply drop the options on that side.
I will make the comment less accepting of the status quo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the comment to indicate this is not the desired end state in 5b33a11.
include/seastar/net/api.hh
Outdated
|
||
/// If set, the SO_SNDBUF size will be set to the given value on the listening socket | ||
/// via setsockopt. This buffer size is inherited by the sockets returned by | ||
/// accept and is the preferred way to set the buffer size for this sockets since |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: should be "these" sockets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same typo in the copy-pasted comment below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eagle eye 🦅 ! I've fixed this in push 5b33a11.
auto buf_size = [](std::optional<int> buf_size) { | ||
listen_options lo{ | ||
.reuse_address = true, | ||
.lba = server_socket::load_balancing_algorithm::fixed, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we had a discussion about this problem on the Seastar mailing list last year - see https://groups.google.com/g/seastar-dev/c/lMEXD_M04gE/m/hF41lhtDCAAJ, and the conclusions were that:
- This issue should be described in the woefully incomplete tutorial.md. Everybody who tried to write trivial servers in Seastar will encounter this problem and waste hours understanding what he did wrong - I did too.
- We have a document doc/network-connection-load-balancing.md
written by @gleb-cloudius which describes load_balancing_algorithm::fixed but is missing the newer set_fixed_cpu() function which is (supposedly?) easier to use.
tests/unit/socket_test.cc
Outdated
.reuse_address = true, | ||
.lba = server_socket::load_balancing_algorithm::fixed, | ||
.so_sndbuf = buf_size, | ||
.so_rcvbuf = buf_size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: Why did you decide to check this function with just one parameter (buf_size) used for both sndbuf and rcvbuf? If you had a copy-pasto in the implementation where SNDBUF was taken from the rcvbuf parameter, your test wouldn't catch it.
I think it will be better to pass two options, sndbuf and rcvbuf, to this test function, and pass different sizes to it - not just two identical small and two identical large.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, I've updated this to use separate sizes in 5b33a11.
// explicit small buffer ends up smaller than the explicit big buffer, | ||
// and that both results are at least as large as the requested amount. | ||
// The latter condition could plausibly fail if the OS clamped the size | ||
// at a small amount, but this is unlikely for the chosen buffer sizes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, so strange. This bizarre doubling is even documented in socket(7).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nyh I think part/all of it is because people have certain expectations about what the socket buffer is, but Linux has a more expansive definition. E.g., Linux uses part of the buffer for things other than the data itself, like metadata. So a user who makes a 100KB socket may expect to write 100KB w/o blocking on an empty socket, but Linux only makes 50KB available for that by default (say). By doubling the size, this cancels out with the half taken away for internal purposes, so the expectations are met (though it is perhaps weird that it returns the doubled size in getsockopt calls).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are newer APIs (for CRIU) which round-trip the values.
We add two options to set the recv and send (SO_RCVBUF, ...) buffer sizes on a listening socket (server_socket). This is mostly useful to propagate said sizes to all sockets returned by accept(). It is already possible to set the socket option directly on the connected socket after it returned by accept() but experimentally this results in a socket with the specified buffer size but whose receive window will not be advertised to the client beyond the default (64K for current typical kernel defaults). So you get only some of the benefit of the larger buffer. Setting the buffer size on the listening socket, however, is mentioned as the correct approach in tcp(7) and does not suffer from the same limitation. A test is included which checks that the mechanism, including the inheritance, works.
fa9a3ad
to
5b33a11
Compare
@nyh wrote:
Thanks for the feedback: I believe I've addressed all of it in force: I also added a final test case which tests the "nullopt" sizes for both buffers: hard to validate that it DTRT but I check at least that the discovered size falls in the expected range. |
@nyh - all feedback addressed, please have a look when you have a chance, thanks! |
We add two options to set the recv and send (SO_RCVBUF, ...) buffer sizes on a listening socket (server_socket). This is mostly useful to propagate said sizes to all sockets returned by accept(). It is already possible to set the socket option directly on the connected socket after it returned by accept() but experimentally this results in a socket with the specified buffer size but whose receive window will not be advertised to the client beyond the default (64K for current typical kernel defaults). So you get only some of the benefit of the larger buffer. Setting the buffer size on the listening socket, however, is mentioned as the correct approach in tcp(7) and does not suffer from the same limitation. A test is included which checks that the mechanism, including the inheritance, works. Closes scylladb#2458 (cherry picked from commit 4cb7f8e)
We add two options to set the recv and send (SO_RCVBUF, ...) buffer sizes on a listening socket (server_socket). This is mostly useful to propagate said sizes to all sockets returned by accept(). It is already possible to set the socket option directly on the connected socket after it returned by accept() but experimentally this results in a socket with the specified buffer size but whose receive window will not be advertised to the client beyond the default (64K for current typical kernel defaults). So you get only some of the benefit of the larger buffer. Setting the buffer size on the listening socket, however, is mentioned as the correct approach in tcp(7) and does not suffer from the same limitation. A test is included which checks that the mechanism, including the inheritance, works. Closes scylladb#2458 (cherry picked from commit 4cb7f8e)
We add two options to set the recv and send (SO_RCVBUF, ...) buffer sizes on a listening socket (server_socket). This is mostly useful to propagate said sizes to all sockets returned by accept().
It is already possible to set the socket option directly on the connected socket after it returned by accept() but experimentally this results in a socket with the specified buffer size, but whose receive window will not be advertised to the client beyond the default (64K for current typical kernel defaults). So you get only some of the benefit of the larger buffer.
Setting the buffer size on the listening socket however, is mentioned as the correct approach in tcp(7) and does not suffer from the same limitation.
A test is included which checks that the mechanism, including the inheritance, works.