New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: use Linux's value for sysctl_nr_open_max
#32740
base: main
Are you sure you want to change the base?
fix: use Linux's value for sysctl_nr_open_max
#32740
Conversation
Important An -rc1 tag has been created and a release is being prepared, so please note that PRs introducing new features and APIs will be held back until the new version has been released. |
3044d52
to
092b4aa
Compare
This comment was marked as duplicate.
This comment was marked as duplicate.
a7bb7cf
to
d43574e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please read https://systemd.io/CODING_STYLE/ first.
* and is something we don't really want to copy to userspace, as it is dependent on implementation | ||
* details of the kernel. Since the kernel doesn't expose the maximum value to us, we can only try | ||
* and hope. Hence, let's start with INT_MAX, and then keep halving the value until we find one that | ||
* works. Ugly? Yes, absolutely, but kernel APIs are kernel APIs, so what do can we do... 馃く */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think something resembling this commit should be kept. I mean, what you are basically doing here is keeping the existing algorithm, but picking a better start value. Which I am fine with I guess.
I mean, the fact remains: the kernel does not have an api to query this. It really should, and it can change this any time it wants. And values like this do get changed every now and then.
hence, let's not claim that copying this code from the kernel would be equivalent to having an api. there really should be a better, safer, more reliable way to get this that cannot get out of sync with what the kernel does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, the fact remains: the kernel does not have an api to query this. It really should, and it can change this any time it wants. And values like this do get changed every now and then.
I agree. I plan to open a patch to the linux
project where I propose that this value be exposed to applications which consume the linux
public API.
hence, let's not claim that copying this code from the kernel would be equivalent to having an api.
I agree.
there really should be a better, safer, more reliable way to get this that cannot get out of sync with what the kernel does.
I agree.
#define BITS_PER_LONG __WORDSIZE | ||
#define __const_min(x, y) ((x) < (y) ? (x) : (y)) | ||
unsigned int sysctl_nr_open_max = | ||
__const_min(INT_MAX, ~(size_t)0/sizeof(void *)) & -BITS_PER_LONG; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can someone explain to me what `& ~BITS_PER_LONG does, I am not sure I grok this.
please use SIZE_MAX instead of ~(size_t)0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can someone explain to me what `& ~BITS_PER_LONG does, I am not sure I grok this.
It zeroes out the last ~log2(BITS_PER_LOG)
bits from the value of INT_MAX
. The effect is to ensure that there aren't any overflows for sysctl_nr_open_max
. I really don't want to change this expression relative to the kernel source, especially since it hasn't changed in almost 16 years.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use SIZE_MAX instead of ~(size_t)0
I would prefer to avoid diverging from kernel source. This expression exists as-is in fs/file.c
and I'm extremely opposed to changing it. Please also see #32740 (comment) .
1c2bf32
to
0604d5d
Compare
Well, it's too late to read it first, I guess. But, I'm taking a look. Thanks! |
f2576fd
to
89ccd41
Compare
a62463c
to
7ee0a15
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basic concept looks good to me, but please follow our coding style.
#define BITS_PER_LONG __WORDSIZE | ||
#define __const_min(x, y) ((x) < (y) ? (x) : (y)) | ||
unsigned sysctl_nr_open_max = | ||
__const_min(INT_MAX, ~(size_t)0/sizeof(void *)) & -BITS_PER_LONG; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not so complex, please rewrite to make it follow our coding style and comment the original code from the kernel. How about something like the following?
/* From fs/file.c in the kernel,
* unsigned sysctl_nr_open_max = __const_min(INT_MAX, ~(size_t)0/sizeof(void *)) & -BITS_PER_LONG; */
int v = CONST_MIN((size_t) INT_MAX, SIZE_MAX / sizeof(void*)) & ~(sizeof(long) * 8 - 1)
Please use int
, rather than unsigned
. It is limited to INT_MAX.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With all due respect, one of the concerns in adopting this code is that it runs the risk of diverging from the kernel source. So, as long as we're violating the soft guideline/principal which seeks to avoid vendoring expressions from foreign codebases, I'd prefer to keep it as consistent as possible against the kernel source.
If this change set will not be adopted unless I refuse to coerce the linux
expression which has stood for almost 20 years into a shape that's compatible with systemd
's opinionated style guide, then I'll coerce this expression (running the risk of having this break in unexpected ways which I'm not prepared to anticipate considering I can't confirm that your expression is equivalent to linux
's for all architectures and all compilers). But, I'm strongly opposed to this change, in principal, so I'd only do this under duress.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't blindly copy, but respect our coding style. The suggested change is also way more readable than redefining all those macros and such.
There's no need to keep things in sync bit by bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't blindly copy.
I don't know what this means. But, if it means: "copy without consideration of what exactly it is that you're copying" then we're good - I didn't do that. I intentionally used exactly the same definitions from the kernel for consistency.
respect our coding style
I have been trying. That's why 4673cac, b4ed665 a989acf and others exist, now.
The suggested change is also way more readable than redefining all those macros and such.
That may be the case. But readability isn't my goal - consistency with the kernel source is my goal.
There's no need to keep things in sync bit by bit.
In the case of matching the kernel's definition of sysctl_nr_open_max
I think there's a need to keep things in sync bit for bit. I don't want the new default value introduced by this change set to ever disagree with the kernel's - the bits should be the same. To that end, it makes sense that the definitions should correspond one for one.
I'm not a C++ expert. If you both tell me that
int v = CONST_MIN((size_t) INT_MAX, SIZE_MAX / sizeof(void*)) & ~(sizeof(long) * 8 - 1)
will always equal
unsigned sysctl_nr_open_max = __const_min(INT_MAX, ~(size_t)0/sizeof(void *)) & -BITS_PER_LONG; */
for all architectures and all compilers (with all possible combinations of flags passed to those compilers) then I'll make the change. But, considering the goal of this PR (to make the value match the kernel's) and considering my ignorance in this space and considering how reasonable assumptions are often invalidated when challenged by large parameter spaces, I'd much prefer to avoid doing that.
I'm hesitant to even consider this change since the type that I'm being asked to use (int
) is even different than the kernel's (unsigned int
). So, I'm skeptical that strong guarantees can be made about the relative consistency of this value across the two projects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are equal (except for ~(sizeof(long) * 8 - 1)
-> -(sizeof(long) * 8 - 1)
, should be a typo)
I'm hesitant to even consider this change since the type that I'm being asked to use (int) is even different than the kernel's (unsigned int).
It won't make any difference, since the value is bound to be smaller than INT_MAX
(wrt CONST_MIN
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are equal (except for
~(sizeof(long) * 8 - 1)
->-(sizeof(long) * 8 - 1)
, should be a typo)
I intentionally changed that, and these should be effectively equivalent. I just want to not use negative value for unsigned type.
src/core/main.c
Outdated
unsigned sysctl_nr_open_max = | ||
__const_min(INT_MAX, ~(size_t)0/sizeof(void *)) & -BITS_PER_LONG; | ||
|
||
unsigned v = sysctl_nr_open_max; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is completely redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand. I don't normally write C/C++. How is this redundant and how should I initialize the value of v
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/core/main.c
Outdated
log_error_errno(k, "Failed to read fs.nr_open: %m"); | ||
int nr_open = read_nr_open(); | ||
if (nr_open < 0) { | ||
log_error_errno(nr_open, "Failed to read fs.nr_open: %m"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessary to update here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying you want me to pull this block above (outside of) the loop? I can do that, but this would change the behavior relative to what exists today in main
:
Lines 1274 to 1278 in ce78bae
k = read_nr_open(); | |
if (k < 0) { | |
log_error_errno(k, "Failed to read fs.nr_open: %m"); | |
break; | |
} |
It also would mean that this PR would do more than simply change the default value of v
and, potentially (although unlikely), would introduce a different failure mode than this proposed change set. I'd prefer that work to be separated into a different change set (PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, maybe I understand what you meant. I submitted 09700f2 to try to address this.
While Lennart moaned, a bit, in the comments about vendoring this expression and this value being unknowable, it hasn't changed in over 10 years (and even then the change was a refactor to define this value statically instead of defining it via delayed execution, cf. 7f4b36f9bb930b3b2105a9a2cb0121fa7028c432 of the kernel source) so it could be argued that it's pretty knowable (even if it isn't available through public kernel headers like it might ought to be). The previous implementation of `bump_file_max_and_nr_open` isn't great, since the initial value is a bit higher than the maximum allowed value by the kernel which triggers a cut down to half of the initial value. We ran into this in production while trying to increase system resource limits and found that PID 1 wasn't respecting our sysctl.nr_file values. Enabling debug logs using `systemd.log_level=debug` we observed ``` May 10 02:46:45 localhost systemd[1]: systemd 251.16 running in system mode (+PAM +AUDIT -SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 > May 10 02:46:45 localhost systemd[1]: Detected virtualization amazon. May 10 02:46:45 localhost systemd[1]: Detected architecture x86-64. May 10 02:46:45 localhost systemd[1]: Detected initialized system, this is not the first boot. May 10 02:46:45 localhost systemd[1]: Kernel version 5.15.119, our baseline is 4.15 May 10 02:46:45 localhost systemd[1]: No hostname configured, using default hostname. May 10 02:46:45 localhost systemd[1]: Hostname set to <localhost>. May 10 02:46:45 localhost systemd[1]: Successfully added address 127.0.0.1 to loopback interface May 10 02:46:45 localhost systemd[1]: Successfully added address ::1 to loopback interface May 10 02:46:45 localhost systemd[1]: Successfully brought loopback interface up May 10 02:46:45 localhost systemd[1]: Setting '/proc/sys/fs/file-max' to '9223372036854775807' May 10 02:46:45 localhost systemd[1]: Initial value of v is 2147483584. May 10 02:46:45 localhost systemd[1]: Setting '/proc/sys/fs/nr_open' to '2147483584' May 10 02:46:45 localhost systemd[1]: Successfully bumped fs.nr_open to 2147483584 May 10 02:46:45 localhost systemd[1]: No credentials passed via fw_cfg. ``` including this change while we observed ``` May 10 03:31:37 localhost systemd[1]: systemd 251.16 running in system mode (+PAM +AUDIT -SELINUX +APPARMOR +IMA > May 10 03:31:37 localhost systemd[1]: Detected virtualization amazon. May 10 03:31:37 localhost systemd[1]: Detected architecture x86-64. May 10 03:31:37 localhost systemd[1]: Detected initialized system, this is not the first boot. May 10 03:31:37 localhost systemd[1]: Kernel version 5.15.119, our baseline is 4.15 May 10 03:31:37 localhost systemd[1]: No hostname configured, using default hostname. May 10 03:31:37 localhost systemd[1]: Hostname set to <localhost>. May 10 03:31:37 localhost systemd[1]: Successfully added address 127.0.0.1 to loopback interface May 10 03:31:37 localhost systemd[1]: Successfully added address ::1 to loopback interface May 10 03:31:37 localhost systemd[1]: Successfully brought loopback interface up May 10 03:31:37 localhost systemd[1]: Setting '/proc/sys/fs/file-max' to '9223372036854775807' May 10 03:31:37 localhost systemd[1]: Setting '/proc/sys/fs/nr_open' to '2147483640' May 10 03:31:37 localhost systemd[1]: Couldn't write fs.nr_open as 2147483640, halving it. May 10 03:31:37 localhost systemd[1]: Setting '/proc/sys/fs/nr_open' to '1073741816' May 10 03:31:37 localhost systemd[1]: Successfully bumped fs.nr_open to 1073741816 May 10 03:31:37 localhost systemd[1]: No credentials passed via fw_cfg. ``` without this patch. We also observed ``` $ cat /proc/1/limits | grep files Max open files 2147483584 2147483584 files ``` with this patch, but ``` $ cat /proc/1/limits | grep files Max open files 1073741816 1073741816 files ``` without this patch. Below is a short program that anyone can use to determine what the value of `sysctl_nr_open_max` should be for their architecture. ```cpp // Type your code here, or load an example. /* sysctl_nr_open_max defines the largest value for which you can do something like ``` $ sysctl -w fs.nr_open=10000 fs.nr_open = 10000 ``` The value is apparently architecture-dependent and is located at https://github.com/torvalds/linux/blob/448b3fe5a0eab5b625a7e15c67c7972169e47ff8/fs/file.c#L27-L32 This little snippet, below, yields this value for different architectures. It's mostly an experiment with using Godbolt (C++ not C) to confirm certain Linux results. */ // cf. https://stackoverflow.com/a/66249936 extern "C" { const char *getBuild() { // Get current architecture, detectx nearly every // architecture. Coded by Freak return "x86_64"; return "x86_32"; return "ARM2"; return "ARM3"; return "ARM4T"; return "ARM5" return "ARM6T2"; defined(__ARM_ARCH_6K__) || defined(__ARM_ARCH_6Z__) || \ defined(__ARM_ARCH_6ZK__) return "ARM6"; defined(__ARM_ARCH_7R__) || defined(__ARM_ARCH_7M__) || \ defined(__ARM_ARCH_7S__) return "ARM7"; defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7S__) return "ARM7A"; defined(__ARM_ARCH_7S__) return "ARM7R"; return "ARM7M"; return "ARM7S"; return "ARM64"; return "MIPS"; return "SUPERH"; defined(__POWERPC__) || defined(__ppc__) || defined(__PPC__) || \ defined(_ARCH_PPC) return "POWERPC"; return "POWERPC64"; return "SPARC"; return "M68K"; return "UNKNOWN"; } } // __WORDSIZE defined by GCC/llvm void fn() { const char *system = getBuild(); std::printf("==== Results for %s ====\n\n", system); std::printf("INT_MAX (%s): %x (%d)\n", system, INT_MAX, INT_MAX); unsigned long native_size = ~(size_t)0 / sizeof(void *); std::printf("~(size_t)0/sizeof(void *): %lx (%ld)\n", native_size, native_size); unsigned long min = __const_min(INT_MAX, ~(size_t)0 / sizeof(void *)); std::printf("Minimum of those two things: %lx (%ld)\n", min, min); std::cout << "BITS_PER_LONG: " << BITS_PER_LONG << std::endl; unsigned int sysctl_nr_open_max = __const_min(INT_MAX, ~(size_t)0 / sizeof(void *)) & -BITS_PER_LONG; printf("sysctl_nr_open_max: %d", sysctl_nr_open_max); } int main() { fn(); return 0; } ```
feae976
to
19e7859
Compare
f1f10b0
to
09700f2
Compare
While Lennart discussed, a bit, in the comments about vendoring this expression and this value being unknowable, it hasn't changed in over 10 years (and even then the change was a refactor to define this value statically instead of defining it via delayed execution, cf. torvalds/linux@7f4b36f) so it could be argued that it's pretty knowable (even if it isn't available through public kernel headers like it might ought to be). The previous implementation of
bump_file_max_and_nr_open
isn't great, since the initial value is a bit higher than the maximum allowed value by the kernel which triggers a cut down to half of the initial value. We ran into this in production while trying to increase system resource limits and found that PID 1 wasn't respecting oursysctl
nr_file
values. Enabling debug logs usingsystemd.log_level=debug
we observedincluding this change while we observed
without this patch. We also observed
with this patch, but
without this patch.
Please consider this patch for acceptance despite past trepidation to perform this logic! 馃檹