Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow setting buffer sizes on server_socket #2458

Closed

Conversation

travisdowns
Copy link
Contributor

We add two options to set the recv and send (SO_RCVBUF, ...) buffer sizes on a listening socket (server_socket). This is mostly useful to propagate said sizes to all sockets returned by accept().

It is already possible to set the socket option directly on the connected socket after it returned by accept() but experimentally this results in a socket with the specified buffer size, but whose receive window will not be advertised to the client beyond the default (64K for current typical kernel defaults). So you get only some of the benefit of the larger buffer.

Setting the buffer size on the listening socket however, is mentioned as the correct approach in tcp(7) and does not suffer from the same limitation.

A test is included which checks that the mechanism, including the inheritance, works.

@travisdowns
Copy link
Contributor Author

travisdowns commented Sep 29, 2024

This was discovered due to very poor throughput between a remote client with ~250 rtt and Redpanda: this transfer is receive window limited and benefits from buffers > 1 MB, but using such buffers (we set the configured buffer size on the connected_socket immediately after connection) had no effect despite the change taking effect per ss. The problem was that setting it "too late" prevents the receiving side from advertising the larger size in its receive window. Arguably a kernel flaw? The window scale was set high enough (128x) even when setting it in this way, so that wasn't preventing the window from scaling though this is often given as the reason why "late setting" the recv buffer size does not work.

auto buf_size = [](std::optional<int> buf_size) {
listen_options lo{
.reuse_address = true,
.lba = server_socket::load_balancing_algorithm::fixed,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we create multiple connections, we need to ensure we LB them all to shard 0, or else the accept() will just hang: this definite caused some confusion for a while.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does it hang?

Copy link
Contributor Author

@travisdowns travisdowns Oct 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test connects to the server socket and does an accept() on that socket (lines 280 + 281) expecting to get the connection we just made, all on shard 0, but if load balancing is on the second time through the connection will be send to shard 1, so the accept() on shard will not receive it and so will "hang".

All the other tests just happen to work I think since they only ever accept() a single connection on any server_socket and this first connection always balances to shard 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we had a discussion about this problem on the Seastar mailing list last year - see https://groups.google.com/g/seastar-dev/c/lMEXD_M04gE/m/hF41lhtDCAAJ, and the conclusions were that:

  1. This issue should be described in the woefully incomplete tutorial.md. Everybody who tried to write trivial servers in Seastar will encounter this problem and waste hours understanding what he did wrong - I did too.
  2. We have a document doc/network-connection-load-balancing.md
    written by @gleb-cloudius which describes load_balancing_algorithm::fixed but is missing the newer set_fixed_cpu() function which is (supposedly?) easier to use.

Copy link
Contributor

@nyh nyh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I just have a comment about a typo, and one nitpick about the test: I think it's better to test with different send and receive buffer sizes, to make sure you don't have copy-pasto bugs in the implementation (e.g., setting send buffer size from the receive buffer parameter).

@@ -389,13 +389,28 @@ public:

/// @}

/// Options for creating a listening socket. These options only have an effect
/// when using the POSIX stack: all options are ignored on the native stack.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is surprising - if it's only for the POSIX stack, why is it in net/api.hh?
@avikivity @gleb-cloudius is this true, and if so, is this deliberate or a bug?

Copy link
Contributor

@gleb-cloudius gleb-cloudius Oct 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are waiting to be implemented. Implementing some of them are as simple as implementing UDP stack. Some are more complicated. On a serious note at least proto should be used for sanity check (error out on unsupported one).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, the comment should contain words like "currently", "TODO", "warning" and so on - the way it is phrased today make it sound like this is intentional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote this comment as while implementing the option for the POSIX stack I wanted to understand the implication on the native side, and found that we simply drop the options on that side.

I will make the comment less accepting of the status quo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the comment to indicate this is not the desired end state in 5b33a11.


/// If set, the SO_SNDBUF size will be set to the given value on the listening socket
/// via setsockopt. This buffer size is inherited by the sockets returned by
/// accept and is the preferred way to set the buffer size for this sockets since
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: should be "these" sockets.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same typo in the copy-pasted comment below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eagle eye 🦅 ! I've fixed this in push 5b33a11.

include/seastar/net/api.hh Show resolved Hide resolved
auto buf_size = [](std::optional<int> buf_size) {
listen_options lo{
.reuse_address = true,
.lba = server_socket::load_balancing_algorithm::fixed,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we had a discussion about this problem on the Seastar mailing list last year - see https://groups.google.com/g/seastar-dev/c/lMEXD_M04gE/m/hF41lhtDCAAJ, and the conclusions were that:

  1. This issue should be described in the woefully incomplete tutorial.md. Everybody who tried to write trivial servers in Seastar will encounter this problem and waste hours understanding what he did wrong - I did too.
  2. We have a document doc/network-connection-load-balancing.md
    written by @gleb-cloudius which describes load_balancing_algorithm::fixed but is missing the newer set_fixed_cpu() function which is (supposedly?) easier to use.

.reuse_address = true,
.lba = server_socket::load_balancing_algorithm::fixed,
.so_sndbuf = buf_size,
.so_rcvbuf = buf_size
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: Why did you decide to check this function with just one parameter (buf_size) used for both sndbuf and rcvbuf? If you had a copy-pasto in the implementation where SNDBUF was taken from the rcvbuf parameter, your test wouldn't catch it.
I think it will be better to pass two options, sndbuf and rcvbuf, to this test function, and pass different sizes to it - not just two identical small and two identical large.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I've updated this to use separate sizes in 5b33a11.

// explicit small buffer ends up smaller than the explicit big buffer,
// and that both results are at least as large as the requested amount.
// The latter condition could plausibly fail if the OS clamped the size
// at a small amount, but this is unlikely for the chosen buffer sizes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, so strange. This bizarre doubling is even documented in socket(7).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nyh I think part/all of it is because people have certain expectations about what the socket buffer is, but Linux has a more expansive definition. E.g., Linux uses part of the buffer for things other than the data itself, like metadata. So a user who makes a 100KB socket may expect to write 100KB w/o blocking on an empty socket, but Linux only makes 50KB available for that by default (say). By doubling the size, this cancels out with the half taken away for internal purposes, so the expectations are met (though it is perhaps weird that it returns the doubled size in getsockopt calls).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are newer APIs (for CRIU) which round-trip the values.

We add two options to set the recv and send (SO_RCVBUF, ...) buffer
sizes on a listening socket (server_socket). This is mostly useful to
propagate said sizes to all sockets returned by accept().

It is already possible to set the socket option directly on the
connected socket after it returned by accept() but experimentally this
results in a socket with the specified buffer size but whose
receive window will not be advertised to the client beyond the default
(64K for current typical kernel defaults). So you get only some of the
benefit of the larger buffer.

Setting the buffer size on the listening socket, however, is mentioned
as the correct approach in tcp(7) and does not suffer from the same
limitation.

A test is included which checks that the mechanism, including the
inheritance, works.
@travisdowns
Copy link
Contributor Author

@nyh wrote:

Looks good, I just have a comment about a typo, and one nitpick about the test: I think it's better to test with different send and receive buffer sizes, to make sure you don't have copy-pasto bugs in the implementation (e.g., setting send buffer size from the receive buffer parameter).

Thanks for the feedback: I believe I've addressed all of it in force:

5b33a11

I also added a final test case which tests the "nullopt" sizes for both buffers: hard to validate that it DTRT but I check at least that the discovered size falls in the expected range.

@travisdowns
Copy link
Contributor Author

@nyh - all feedback addressed, please have a look when you have a chance, thanks!

@nyh nyh closed this in 4cb7f8e Oct 20, 2024
StephanDollberg pushed a commit to redpanda-data/seastar that referenced this pull request Oct 23, 2024
We add two options to set the recv and send (SO_RCVBUF, ...) buffer
sizes on a listening socket (server_socket). This is mostly useful to
propagate said sizes to all sockets returned by accept().

It is already possible to set the socket option directly on the
connected socket after it returned by accept() but experimentally this
results in a socket with the specified buffer size but whose
receive window will not be advertised to the client beyond the default
(64K for current typical kernel defaults). So you get only some of the
benefit of the larger buffer.

Setting the buffer size on the listening socket, however, is mentioned
as the correct approach in tcp(7) and does not suffer from the same
limitation.

A test is included which checks that the mechanism, including the
inheritance, works.

Closes scylladb#2458

(cherry picked from commit 4cb7f8e)
StephanDollberg pushed a commit to redpanda-data/seastar that referenced this pull request Oct 23, 2024
We add two options to set the recv and send (SO_RCVBUF, ...) buffer
sizes on a listening socket (server_socket). This is mostly useful to
propagate said sizes to all sockets returned by accept().

It is already possible to set the socket option directly on the
connected socket after it returned by accept() but experimentally this
results in a socket with the specified buffer size but whose
receive window will not be advertised to the client beyond the default
(64K for current typical kernel defaults). So you get only some of the
benefit of the larger buffer.

Setting the buffer size on the listening socket, however, is mentioned
as the correct approach in tcp(7) and does not suffer from the same
limitation.

A test is included which checks that the mechanism, including the
inheritance, works.

Closes scylladb#2458

(cherry picked from commit 4cb7f8e)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants