Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: use Linux's value for sysctl_nr_open_max #32740

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
30 changes: 14 additions & 16 deletions src/core/main.c
Original file line number Diff line number Diff line change
Expand Up @@ -1253,37 +1253,35 @@ static void bump_file_max_and_nr_open(void) {
#endif

#if BUMP_PROC_SYS_FS_NR_OPEN
int v = INT_MAX;

/* Argh! The kernel enforces maximum and minimum values on the fs.nr_open, but we don't really know
* what they are. The expression by which the maximum is determined is dependent on the architecture,
* and is something we don't really want to copy to userspace, as it is dependent on implementation
* details of the kernel. Since the kernel doesn't expose the maximum value to us, we can only try
* and hope. Hence, let's start with INT_MAX, and then keep halving the value until we find one that
* works. Ugly? Yes, absolutely, but kernel APIs are kernel APIs, so what do can we do... 馃く */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think something resembling this commit should be kept. I mean, what you are basically doing here is keeping the existing algorithm, but picking a better start value. Which I am fine with I guess.

I mean, the fact remains: the kernel does not have an api to query this. It really should, and it can change this any time it wants. And values like this do get changed every now and then.

hence, let's not claim that copying this code from the kernel would be equivalent to having an api. there really should be a better, safer, more reliable way to get this that cannot get out of sync with what the kernel does.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, the fact remains: the kernel does not have an api to query this. It really should, and it can change this any time it wants. And values like this do get changed every now and then.

I agree. I plan to open a patch to the linux project where I propose that this value be exposed to applications which consume the linux public API.

hence, let's not claim that copying this code from the kernel would be equivalent to having an api.

I agree.

there really should be a better, safer, more reliable way to get this that cannot get out of sync with what the kernel does.

I agree.

/* cf. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/file.c?h=v6.8#n27
* for the progeny of the below value for sysctl_nr_open_max. Note that the below logic was first
* introduced in git commit eceea0b3df05ed262ae32e0c6340cc7a3626632d of the linux kernel and was
* designed to prevent overflows when determining the maximum number of possible file descriptors. */
#define BITS_PER_LONG __WORDSIZE
#define __const_min(x, y) ((x) < (y) ? (x) : (y))
YHNdnzj marked this conversation as resolved.
Show resolved Hide resolved
unsigned v =
__const_min(INT_MAX, ~(size_t)0/sizeof(void *)) & -BITS_PER_LONG;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can someone explain to me what `& ~BITS_PER_LONG does, I am not sure I grok this.

please use SIZE_MAX instead of ~(size_t)0

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can someone explain to me what `& ~BITS_PER_LONG does, I am not sure I grok this.

It zeroes out the last ~log2(BITS_PER_LOG) bits from the value of INT_MAX. The effect is to ensure that there aren't any overflows for sysctl_nr_open_max. I really don't want to change this expression relative to the kernel source, especially since it hasn't changed in almost 16 years.

Cf. torvalds/linux@eceea0b

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use SIZE_MAX instead of ~(size_t)0

I would prefer to avoid diverging from kernel source. This expression exists as-is in fs/file.c and I'm extremely opposed to changing it. Please also see #32740 (comment) .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not so complex, please rewrite to make it follow our coding style and comment the original code from the kernel. How about something like the following?

/* From fs/file.c in the kernel,
 * unsigned sysctl_nr_open_max = __const_min(INT_MAX, ~(size_t)0/sizeof(void *)) & -BITS_PER_LONG; */
int v = CONST_MIN((size_t) INT_MAX, SIZE_MAX / sizeof(void*)) & ~(sizeof(long) * 8 - 1)

Please use int, rather than unsigned. It is limited to INT_MAX.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With all due respect, one of the concerns in adopting this code is that it runs the risk of diverging from the kernel source. So, as long as we're violating the soft guideline/principal which seeks to avoid vendoring expressions from foreign codebases, I'd prefer to keep it as consistent as possible against the kernel source.

If this change set will not be adopted unless I refuse to coerce the linux expression which has stood for almost 20 years into a shape that's compatible with systemd's opinionated style guide, then I'll coerce this expression (running the risk of having this break in unexpected ways which I'm not prepared to anticipate considering I can't confirm that your expression is equivalent to linux's for all architectures and all compilers). But, I'm strongly opposed to this change, in principal, so I'd only do this under duress.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't blindly copy, but respect our coding style. The suggested change is also way more readable than redefining all those macros and such.

There's no need to keep things in sync bit by bit.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't blindly copy.

I don't know what this means. But, if it means: "copy without consideration of what exactly it is that you're copying" then we're good - I didn't do that. I intentionally used exactly the same definitions from the kernel for consistency.

respect our coding style

I have been trying. That's why 4673cac, b4ed665 a989acf and others exist, now.

The suggested change is also way more readable than redefining all those macros and such.

That may be the case. But readability isn't my goal - consistency with the kernel source is my goal.

There's no need to keep things in sync bit by bit.

In the case of matching the kernel's definition of sysctl_nr_open_max I think there's a need to keep things in sync bit for bit. I don't want the new default value introduced by this change set to ever disagree with the kernel's - the bits should be the same. To that end, it makes sense that the definitions should correspond one for one.

I'm not a C++ expert. If you both tell me that

int v = CONST_MIN((size_t) INT_MAX, SIZE_MAX / sizeof(void*)) & ~(sizeof(long) * 8 - 1)

will always equal

unsigned sysctl_nr_open_max = __const_min(INT_MAX, ~(size_t)0/sizeof(void *)) & -BITS_PER_LONG; */

for all architectures and all compilers (with all possible combinations of flags passed to those compilers) then I'll make the change. But, considering the goal of this PR (to make the value match the kernel's) and considering my ignorance in this space and considering how reasonable assumptions are often invalidated when challenged by large parameter spaces, I'd much prefer to avoid doing that.

I'm hesitant to even consider this change since the type that I'm being asked to use (int) is even different than the kernel's (unsigned int). So, I'm skeptical that strong guarantees can be made about the relative consistency of this value across the two projects.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are equal (except for ~(sizeof(long) * 8 - 1) -> -(sizeof(long) * 8 - 1), should be a typo)

I'm hesitant to even consider this change since the type that I'm being asked to use (int) is even different than the kernel's (unsigned int).

It won't make any difference, since the value is bound to be smaller than INT_MAX (wrt CONST_MIN)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are equal (except for ~(sizeof(long) * 8 - 1) -> -(sizeof(long) * 8 - 1), should be a typo)

I intentionally changed that, and these should be effectively equivalent. I just want to not use negative value for unsigned type.


for (;;) {
int k;

v &= ~(__SIZEOF_POINTER__ - 1); /* Round down to next multiple of the pointer size */
if (v < 1024) {
log_warning("Can't bump fs.nr_open, value too small.");
break;
}

k = read_nr_open();
int k = read_nr_open();
if (k < 0) {
log_error_errno(k, "Failed to read fs.nr_open: %m");
break;
}
if (k >= v) { /* Already larger */

if ((unsigned)k >= v) { /* Already larger */
log_debug("Skipping bump, value is already larger.");
break;
}

r = sysctl_writef("fs/nr_open", "%i", v);
r = sysctl_writef("fs/nr_open", "%u", v);
if (r == -EINVAL) {
log_debug("Couldn't write fs.nr_open as %i, halving it.", v);
log_debug("Couldn't write fs.nr_open as %u, halving it.", v);
v /= 2;
continue;
}
Expand All @@ -1292,7 +1290,7 @@ static void bump_file_max_and_nr_open(void) {
break;
}

log_debug("Successfully bumped fs.nr_open to %i", v);
log_debug("Successfully bumped fs.nr_open to %u", v);
break;
}
#endif
Expand Down