Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: use Linux's value for sysctl_nr_open_max #32740

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

johnrichardrinehart
Copy link

@johnrichardrinehart johnrichardrinehart commented May 10, 2024

While Lennart discussed, a bit, in the comments about vendoring this expression and this value being unknowable, it hasn't changed in over 10 years (and even then the change was a refactor to define this value statically instead of defining it via delayed execution, cf. torvalds/linux@7f4b36f) so it could be argued that it's pretty knowable (even if it isn't available through public kernel headers like it might ought to be). The previous implementation of bump_file_max_and_nr_open isn't great, since the initial value is a bit higher than the maximum allowed value by the kernel which triggers a cut down to half of the initial value. We ran into this in production while trying to increase system resource limits and found that PID 1 wasn't respecting our sysctl nr_file values. Enabling debug logs using systemd.log_level=debug we observed

May 10 02:46:45 localhost systemd[1]: systemd 251.16 running in system mode (+PAM +AUDIT -SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 >
May 10 02:46:45 localhost systemd[1]: Detected virtualization amazon.
May 10 02:46:45 localhost systemd[1]: Detected architecture x86-64.
May 10 02:46:45 localhost systemd[1]: Detected initialized system, this is not the first boot.
May 10 02:46:45 localhost systemd[1]: Kernel version 5.15.119, our baseline is 4.15
May 10 02:46:45 localhost systemd[1]: No hostname configured, using default hostname.
May 10 02:46:45 localhost systemd[1]: Hostname set to <localhost>.
May 10 02:46:45 localhost systemd[1]: Successfully added address 127.0.0.1 to loopback interface
May 10 02:46:45 localhost systemd[1]: Successfully added address ::1 to loopback interface
May 10 02:46:45 localhost systemd[1]: Successfully brought loopback interface up
May 10 02:46:45 localhost systemd[1]: Setting '/proc/sys/fs/file-max' to '9223372036854775807'
May 10 02:46:45 localhost systemd[1]: Initial value of v is 2147483584.
May 10 02:46:45 localhost systemd[1]: Setting '/proc/sys/fs/nr_open' to '2147483584'
May 10 02:46:45 localhost systemd[1]: Successfully bumped fs.nr_open to 2147483584
May 10 02:46:45 localhost systemd[1]: No credentials passed via fw_cfg.

including this change while we observed

May 10 03:31:37 localhost systemd[1]: systemd 251.16 running in system mode (+PAM +AUDIT -SELINUX +APPARMOR +IMA >
May 10 03:31:37 localhost systemd[1]: Detected virtualization amazon.
May 10 03:31:37 localhost systemd[1]: Detected architecture x86-64.
May 10 03:31:37 localhost systemd[1]: Detected initialized system, this is not the first boot.
May 10 03:31:37 localhost systemd[1]: Kernel version 5.15.119, our baseline is 4.15
May 10 03:31:37 localhost systemd[1]: No hostname configured, using default hostname.
May 10 03:31:37 localhost systemd[1]: Hostname set to <localhost>.
May 10 03:31:37 localhost systemd[1]: Successfully added address 127.0.0.1 to loopback interface
May 10 03:31:37 localhost systemd[1]: Successfully added address ::1 to loopback interface
May 10 03:31:37 localhost systemd[1]: Successfully brought loopback interface up
May 10 03:31:37 localhost systemd[1]: Setting '/proc/sys/fs/file-max' to '9223372036854775807'
May 10 03:31:37 localhost systemd[1]: Setting '/proc/sys/fs/nr_open' to '2147483640'
May 10 03:31:37 localhost systemd[1]: Couldn't write fs.nr_open as 2147483640, halving it.
May 10 03:31:37 localhost systemd[1]: Setting '/proc/sys/fs/nr_open' to '1073741816'
May 10 03:31:37 localhost systemd[1]: Successfully bumped fs.nr_open to 1073741816
May 10 03:31:37 localhost systemd[1]: No credentials passed via fw_cfg.

without this patch. We also observed

$ cat /proc/1/limits | grep files
Max open files            2147483584           2147483584           files

with this patch, but

$ cat /proc/1/limits | grep files
Max open files            1073741816           1073741816           files

without this patch.

Please consider this patch for acceptance despite past trepidation to perform this logic! 馃檹

@github-actions github-actions bot added the please-review PR is ready for (re-)review by a maintainer label May 10, 2024
Copy link

Important

An -rc1 tag has been created and a release is being prepared, so please note that PRs introducing new features and APIs will be held back until the new version has been released.

@johnrichardrinehart johnrichardrinehart force-pushed the jrinehart/match-kernel-ulimit-fs.nr_open branch from 3044d52 to 092b4aa Compare May 10, 2024 04:01

This comment was marked as duplicate.

@johnrichardrinehart johnrichardrinehart force-pushed the jrinehart/match-kernel-ulimit-fs.nr_open branch 5 times, most recently from a7bb7cf to d43574e Compare May 10, 2024 04:35
Copy link
Member

@YHNdnzj YHNdnzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/core/main.c Outdated Show resolved Hide resolved
src/core/main.c Outdated Show resolved Hide resolved
src/core/main.c Show resolved Hide resolved
@YHNdnzj YHNdnzj added pid1 reviewed/needs-rework 馃敤 PR has been reviewed and needs another round of reworks and removed please-review PR is ready for (re-)review by a maintainer labels May 10, 2024
* and is something we don't really want to copy to userspace, as it is dependent on implementation
* details of the kernel. Since the kernel doesn't expose the maximum value to us, we can only try
* and hope. Hence, let's start with INT_MAX, and then keep halving the value until we find one that
* works. Ugly? Yes, absolutely, but kernel APIs are kernel APIs, so what do can we do... 馃く */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think something resembling this commit should be kept. I mean, what you are basically doing here is keeping the existing algorithm, but picking a better start value. Which I am fine with I guess.

I mean, the fact remains: the kernel does not have an api to query this. It really should, and it can change this any time it wants. And values like this do get changed every now and then.

hence, let's not claim that copying this code from the kernel would be equivalent to having an api. there really should be a better, safer, more reliable way to get this that cannot get out of sync with what the kernel does.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, the fact remains: the kernel does not have an api to query this. It really should, and it can change this any time it wants. And values like this do get changed every now and then.

I agree. I plan to open a patch to the linux project where I propose that this value be exposed to applications which consume the linux public API.

hence, let's not claim that copying this code from the kernel would be equivalent to having an api.

I agree.

there really should be a better, safer, more reliable way to get this that cannot get out of sync with what the kernel does.

I agree.

#define BITS_PER_LONG __WORDSIZE
#define __const_min(x, y) ((x) < (y) ? (x) : (y))
unsigned int sysctl_nr_open_max =
__const_min(INT_MAX, ~(size_t)0/sizeof(void *)) & -BITS_PER_LONG;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can someone explain to me what `& ~BITS_PER_LONG does, I am not sure I grok this.

please use SIZE_MAX instead of ~(size_t)0

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can someone explain to me what `& ~BITS_PER_LONG does, I am not sure I grok this.

It zeroes out the last ~log2(BITS_PER_LOG) bits from the value of INT_MAX. The effect is to ensure that there aren't any overflows for sysctl_nr_open_max. I really don't want to change this expression relative to the kernel source, especially since it hasn't changed in almost 16 years.

Cf. torvalds/linux@eceea0b

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use SIZE_MAX instead of ~(size_t)0

I would prefer to avoid diverging from kernel source. This expression exists as-is in fs/file.c and I'm extremely opposed to changing it. Please also see #32740 (comment) .

src/core/main.c Outdated Show resolved Hide resolved
src/core/main.c Outdated Show resolved Hide resolved
src/core/main.c Outdated Show resolved Hide resolved
@github-actions github-actions bot added please-review PR is ready for (re-)review by a maintainer and removed reviewed/needs-rework 馃敤 PR has been reviewed and needs another round of reworks labels May 10, 2024
@johnrichardrinehart johnrichardrinehart force-pushed the jrinehart/match-kernel-ulimit-fs.nr_open branch 2 times, most recently from 1c2bf32 to 0604d5d Compare May 10, 2024 18:23
@johnrichardrinehart
Copy link
Author

Please read https://systemd.io/CODING_STYLE/ first.

Well, it's too late to read it first, I guess. But, I'm taking a look. Thanks!

@johnrichardrinehart johnrichardrinehart force-pushed the jrinehart/match-kernel-ulimit-fs.nr_open branch from f2576fd to 89ccd41 Compare May 10, 2024 18:25
@johnrichardrinehart johnrichardrinehart force-pushed the jrinehart/match-kernel-ulimit-fs.nr_open branch 3 times, most recently from a62463c to 7ee0a15 Compare May 10, 2024 18:45
Copy link
Member

@yuwata yuwata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basic concept looks good to me, but please follow our coding style.

src/core/main.c Outdated Show resolved Hide resolved
#define BITS_PER_LONG __WORDSIZE
#define __const_min(x, y) ((x) < (y) ? (x) : (y))
unsigned sysctl_nr_open_max =
__const_min(INT_MAX, ~(size_t)0/sizeof(void *)) & -BITS_PER_LONG;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not so complex, please rewrite to make it follow our coding style and comment the original code from the kernel. How about something like the following?

/* From fs/file.c in the kernel,
 * unsigned sysctl_nr_open_max = __const_min(INT_MAX, ~(size_t)0/sizeof(void *)) & -BITS_PER_LONG; */
int v = CONST_MIN((size_t) INT_MAX, SIZE_MAX / sizeof(void*)) & ~(sizeof(long) * 8 - 1)

Please use int, rather than unsigned. It is limited to INT_MAX.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With all due respect, one of the concerns in adopting this code is that it runs the risk of diverging from the kernel source. So, as long as we're violating the soft guideline/principal which seeks to avoid vendoring expressions from foreign codebases, I'd prefer to keep it as consistent as possible against the kernel source.

If this change set will not be adopted unless I refuse to coerce the linux expression which has stood for almost 20 years into a shape that's compatible with systemd's opinionated style guide, then I'll coerce this expression (running the risk of having this break in unexpected ways which I'm not prepared to anticipate considering I can't confirm that your expression is equivalent to linux's for all architectures and all compilers). But, I'm strongly opposed to this change, in principal, so I'd only do this under duress.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't blindly copy, but respect our coding style. The suggested change is also way more readable than redefining all those macros and such.

There's no need to keep things in sync bit by bit.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't blindly copy.

I don't know what this means. But, if it means: "copy without consideration of what exactly it is that you're copying" then we're good - I didn't do that. I intentionally used exactly the same definitions from the kernel for consistency.

respect our coding style

I have been trying. That's why 4673cac, b4ed665 a989acf and others exist, now.

The suggested change is also way more readable than redefining all those macros and such.

That may be the case. But readability isn't my goal - consistency with the kernel source is my goal.

There's no need to keep things in sync bit by bit.

In the case of matching the kernel's definition of sysctl_nr_open_max I think there's a need to keep things in sync bit for bit. I don't want the new default value introduced by this change set to ever disagree with the kernel's - the bits should be the same. To that end, it makes sense that the definitions should correspond one for one.

I'm not a C++ expert. If you both tell me that

int v = CONST_MIN((size_t) INT_MAX, SIZE_MAX / sizeof(void*)) & ~(sizeof(long) * 8 - 1)

will always equal

unsigned sysctl_nr_open_max = __const_min(INT_MAX, ~(size_t)0/sizeof(void *)) & -BITS_PER_LONG; */

for all architectures and all compilers (with all possible combinations of flags passed to those compilers) then I'll make the change. But, considering the goal of this PR (to make the value match the kernel's) and considering my ignorance in this space and considering how reasonable assumptions are often invalidated when challenged by large parameter spaces, I'd much prefer to avoid doing that.

I'm hesitant to even consider this change since the type that I'm being asked to use (int) is even different than the kernel's (unsigned int). So, I'm skeptical that strong guarantees can be made about the relative consistency of this value across the two projects.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are equal (except for ~(sizeof(long) * 8 - 1) -> -(sizeof(long) * 8 - 1), should be a typo)

I'm hesitant to even consider this change since the type that I'm being asked to use (int) is even different than the kernel's (unsigned int).

It won't make any difference, since the value is bound to be smaller than INT_MAX (wrt CONST_MIN)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are equal (except for ~(sizeof(long) * 8 - 1) -> -(sizeof(long) * 8 - 1), should be a typo)

I intentionally changed that, and these should be effectively equivalent. I just want to not use negative value for unsigned type.

src/core/main.c Outdated
unsigned sysctl_nr_open_max =
__const_min(INT_MAX, ~(size_t)0/sizeof(void *)) & -BITS_PER_LONG;

unsigned v = sysctl_nr_open_max;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is completely redundant.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand. I don't normally write C/C++. How is this redundant and how should I initialize the value of v?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to address this in feae976 but I'm not exactly sure what you want @yuwata .

src/core/main.c Outdated
log_error_errno(k, "Failed to read fs.nr_open: %m");
int nr_open = read_nr_open();
if (nr_open < 0) {
log_error_errno(nr_open, "Failed to read fs.nr_open: %m");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary to update here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you saying you want me to pull this block above (outside of) the loop? I can do that, but this would change the behavior relative to what exists today in main:

systemd/src/core/main.c

Lines 1274 to 1278 in ce78bae

k = read_nr_open();
if (k < 0) {
log_error_errno(k, "Failed to read fs.nr_open: %m");
break;
}

It also would mean that this PR would do more than simply change the default value of v and, potentially (although unlikely), would introduce a different failure mode than this proposed change set. I'd prefer that work to be separated into a different change set (PR).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, maybe I understand what you meant. I submitted 09700f2 to try to address this.

@yuwata yuwata added reviewed/needs-rework 馃敤 PR has been reviewed and needs another round of reworks and removed please-review PR is ready for (re-)review by a maintainer labels May 11, 2024
@github-actions github-actions bot added please-review PR is ready for (re-)review by a maintainer and removed reviewed/needs-rework 馃敤 PR has been reviewed and needs another round of reworks labels May 11, 2024
While Lennart moaned, a bit, in the comments about vendoring this
expression and this value being unknowable, it hasn't changed in over 10
years (and even then the change was a refactor to define this value
statically instead of defining it via delayed execution, cf.
7f4b36f9bb930b3b2105a9a2cb0121fa7028c432 of the kernel source) so it
could be argued that it's pretty knowable (even if it isn't available
through public kernel headers like it might ought to be). The previous
implementation of `bump_file_max_and_nr_open` isn't great, since the
initial value is a bit higher than the maximum allowed value by the
kernel which triggers a cut down to half of the initial value. We ran
into this in production while trying to increase system resource limits
and found that PID 1 wasn't respecting our sysctl.nr_file values.
Enabling debug logs using `systemd.log_level=debug` we observed
```
May 10 02:46:45 localhost systemd[1]: systemd 251.16 running in system mode (+PAM +AUDIT -SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 >
May 10 02:46:45 localhost systemd[1]: Detected virtualization amazon.
May 10 02:46:45 localhost systemd[1]: Detected architecture x86-64.
May 10 02:46:45 localhost systemd[1]: Detected initialized system, this is not the first boot.
May 10 02:46:45 localhost systemd[1]: Kernel version 5.15.119, our baseline is 4.15
May 10 02:46:45 localhost systemd[1]: No hostname configured, using default hostname.
May 10 02:46:45 localhost systemd[1]: Hostname set to <localhost>.
May 10 02:46:45 localhost systemd[1]: Successfully added address 127.0.0.1 to loopback interface
May 10 02:46:45 localhost systemd[1]: Successfully added address ::1 to loopback interface
May 10 02:46:45 localhost systemd[1]: Successfully brought loopback interface up
May 10 02:46:45 localhost systemd[1]: Setting '/proc/sys/fs/file-max' to '9223372036854775807'
May 10 02:46:45 localhost systemd[1]: Initial value of v is 2147483584.
May 10 02:46:45 localhost systemd[1]: Setting '/proc/sys/fs/nr_open' to '2147483584'
May 10 02:46:45 localhost systemd[1]: Successfully bumped fs.nr_open to 2147483584
May 10 02:46:45 localhost systemd[1]: No credentials passed via fw_cfg.
```
including this change while we observed
```
May 10 03:31:37 localhost systemd[1]: systemd 251.16 running in system mode (+PAM +AUDIT -SELINUX +APPARMOR +IMA >
May 10 03:31:37 localhost systemd[1]: Detected virtualization amazon.
May 10 03:31:37 localhost systemd[1]: Detected architecture x86-64.
May 10 03:31:37 localhost systemd[1]: Detected initialized system, this is not the first boot.
May 10 03:31:37 localhost systemd[1]: Kernel version 5.15.119, our baseline is 4.15
May 10 03:31:37 localhost systemd[1]: No hostname configured, using default hostname.
May 10 03:31:37 localhost systemd[1]: Hostname set to <localhost>.
May 10 03:31:37 localhost systemd[1]: Successfully added address 127.0.0.1 to loopback interface
May 10 03:31:37 localhost systemd[1]: Successfully added address ::1 to loopback interface
May 10 03:31:37 localhost systemd[1]: Successfully brought loopback interface up
May 10 03:31:37 localhost systemd[1]: Setting '/proc/sys/fs/file-max' to '9223372036854775807'
May 10 03:31:37 localhost systemd[1]: Setting '/proc/sys/fs/nr_open' to '2147483640'
May 10 03:31:37 localhost systemd[1]: Couldn't write fs.nr_open as 2147483640, halving it.
May 10 03:31:37 localhost systemd[1]: Setting '/proc/sys/fs/nr_open' to '1073741816'
May 10 03:31:37 localhost systemd[1]: Successfully bumped fs.nr_open to 1073741816
May 10 03:31:37 localhost systemd[1]: No credentials passed via fw_cfg.
```
without this patch. We also observed
```
$ cat /proc/1/limits | grep files
Max open files            2147483584           2147483584           files
```
with this patch, but
```
$ cat /proc/1/limits | grep files
Max open files            1073741816           1073741816           files
```
without this patch. Below is a short program that anyone can use to
determine what the value of `sysctl_nr_open_max` should be for their
architecture.
```cpp
// Type your code here, or load an example.

/*
sysctl_nr_open_max defines the largest value for which you can do something like
```
$ sysctl -w fs.nr_open=10000
fs.nr_open = 10000
```

The value is apparently architecture-dependent and is located at
https://github.com/torvalds/linux/blob/448b3fe5a0eab5b625a7e15c67c7972169e47ff8/fs/file.c#L27-L32

This little snippet, below, yields this value for different architectures.

It's mostly an experiment with using Godbolt (C++ not C) to confirm certain
Linux results.
*/

// cf. https://stackoverflow.com/a/66249936
extern "C" {
const char *getBuild() {  // Get current architecture, detectx nearly every
                          // architecture. Coded by Freak
    return "x86_64";
    return "x86_32";
    return "ARM2";
    return "ARM3";
    return "ARM4T";
    return "ARM5"
    return "ARM6T2";
    defined(__ARM_ARCH_6K__) || defined(__ARM_ARCH_6Z__) ||  \
    defined(__ARM_ARCH_6ZK__)
    return "ARM6";
    defined(__ARM_ARCH_7R__) || defined(__ARM_ARCH_7M__) ||  \
    defined(__ARM_ARCH_7S__)
    return "ARM7";
    defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7S__)
    return "ARM7A";
    defined(__ARM_ARCH_7S__)
    return "ARM7R";
    return "ARM7M";
    return "ARM7S";
    return "ARM64";
    return "MIPS";
    return "SUPERH";
    defined(__POWERPC__) || defined(__ppc__) || defined(__PPC__) ||           \
    defined(_ARCH_PPC)
    return "POWERPC";
    return "POWERPC64";
    return "SPARC";
    return "M68K";
    return "UNKNOWN";
}
}

// __WORDSIZE defined by GCC/llvm

void fn() {
    const char *system = getBuild();

    std::printf("==== Results for %s ====\n\n", system);

    std::printf("INT_MAX (%s): %x (%d)\n", system, INT_MAX, INT_MAX);

    unsigned long native_size = ~(size_t)0 / sizeof(void *);
    std::printf("~(size_t)0/sizeof(void *): %lx (%ld)\n", native_size,
                native_size);

    unsigned long min = __const_min(INT_MAX, ~(size_t)0 / sizeof(void *));
    std::printf("Minimum of those two things: %lx (%ld)\n", min, min);

    std::cout << "BITS_PER_LONG: " << BITS_PER_LONG << std::endl;

    unsigned int sysctl_nr_open_max =
        __const_min(INT_MAX, ~(size_t)0 / sizeof(void *)) & -BITS_PER_LONG;

    printf("sysctl_nr_open_max: %d", sysctl_nr_open_max);
}

int main() {
    fn();

    return 0;
}
```
@johnrichardrinehart johnrichardrinehart force-pushed the jrinehart/match-kernel-ulimit-fs.nr_open branch from feae976 to 19e7859 Compare May 11, 2024 01:02
@johnrichardrinehart johnrichardrinehart force-pushed the jrinehart/match-kernel-ulimit-fs.nr_open branch from f1f10b0 to 09700f2 Compare May 11, 2024 04:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pid1 please-review PR is ready for (re-)review by a maintainer squash-on-merge
Development

Successfully merging this pull request may close these issues.

None yet

4 participants