Skip to content

Commit

Permalink
fix: use Linux's value for sysctl_nr_open_max
Browse files Browse the repository at this point in the history
While Lennart moaned, a bit, in the comments about vendoring this
expression and this value being unknowable, it hasn't changed in over 10
years (and even then the change was a refactor to define this value
statically instead of defining it via delayed execution, cf.
7f4b36f9bb930b3b2105a9a2cb0121fa7028c432 of the kernel source) so it
could be argued that it's pretty knowable (even if it isn't available
through public kernel headers like it might ought to be). The previous
implementation of `bump_file_max_and_nr_open` isn't great, since the
initial value is a bit higher than the maximum allowed value by the
kernel which triggers a cut down to half of the initial value. We ran
into this in production while trying to increase system resource limits
and found that PID 1 wasn't respecting our sysctl.nr_file values.
Enabling debug logs using `systemd.log_level=debug` we observed
```
May 10 02:46:45 localhost systemd[1]: systemd 251.16 running in system mode (+PAM +AUDIT -SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 >
May 10 02:46:45 localhost systemd[1]: Detected virtualization amazon.
May 10 02:46:45 localhost systemd[1]: Detected architecture x86-64.
May 10 02:46:45 localhost systemd[1]: Detected initialized system, this is not the first boot.
May 10 02:46:45 localhost systemd[1]: Kernel version 5.15.119, our baseline is 4.15
May 10 02:46:45 localhost systemd[1]: No hostname configured, using default hostname.
May 10 02:46:45 localhost systemd[1]: Hostname set to <localhost>.
May 10 02:46:45 localhost systemd[1]: Successfully added address 127.0.0.1 to loopback interface
May 10 02:46:45 localhost systemd[1]: Successfully added address ::1 to loopback interface
May 10 02:46:45 localhost systemd[1]: Successfully brought loopback interface up
May 10 02:46:45 localhost systemd[1]: Setting '/proc/sys/fs/file-max' to '9223372036854775807'
May 10 02:46:45 localhost systemd[1]: Initial value of v is 2147483584.
May 10 02:46:45 localhost systemd[1]: Setting '/proc/sys/fs/nr_open' to '2147483584'
May 10 02:46:45 localhost systemd[1]: Successfully bumped fs.nr_open to 2147483584
May 10 02:46:45 localhost systemd[1]: No credentials passed via fw_cfg.
```
including this change while we observed
```
May 10 03:31:37 localhost systemd[1]: systemd 251.16 running in system mode (+PAM +AUDIT -SELINUX +APPARMOR +IMA >
May 10 03:31:37 localhost systemd[1]: Detected virtualization amazon.
May 10 03:31:37 localhost systemd[1]: Detected architecture x86-64.
May 10 03:31:37 localhost systemd[1]: Detected initialized system, this is not the first boot.
May 10 03:31:37 localhost systemd[1]: Kernel version 5.15.119, our baseline is 4.15
May 10 03:31:37 localhost systemd[1]: No hostname configured, using default hostname.
May 10 03:31:37 localhost systemd[1]: Hostname set to <localhost>.
May 10 03:31:37 localhost systemd[1]: Successfully added address 127.0.0.1 to loopback interface
May 10 03:31:37 localhost systemd[1]: Successfully added address ::1 to loopback interface
May 10 03:31:37 localhost systemd[1]: Successfully brought loopback interface up
May 10 03:31:37 localhost systemd[1]: Setting '/proc/sys/fs/file-max' to '9223372036854775807'
May 10 03:31:37 localhost systemd[1]: Setting '/proc/sys/fs/nr_open' to '2147483640'
May 10 03:31:37 localhost systemd[1]: Couldn't write fs.nr_open as 2147483640, halving it.
May 10 03:31:37 localhost systemd[1]: Setting '/proc/sys/fs/nr_open' to '1073741816'
May 10 03:31:37 localhost systemd[1]: Successfully bumped fs.nr_open to 1073741816
May 10 03:31:37 localhost systemd[1]: No credentials passed via fw_cfg.
```
without this patch. We also observed
```
$ cat /proc/1/limits | grep files
Max open files            2147483584           2147483584           files
```
with this patch, but
```
$ cat /proc/1/limits | grep files
Max open files            1073741816           1073741816           files
```
without this patch. Below is a short program that anyone can use to
determine what the value of `sysctl_nr_open_max` should be for their
architecture.
```cpp
// Type your code here, or load an example.

/*
sysctl_nr_open_max defines the largest value for which you can do something like
```
$ sysctl -w fs.nr_open=10000
fs.nr_open = 10000
```

The value is apparently architecture-dependent and is located at
https://github.com/torvalds/linux/blob/448b3fe5a0eab5b625a7e15c67c7972169e47ff8/fs/file.c#L27-L32

This little snippet, below, yields this value for different architectures.

It's mostly an experiment with using Godbolt (C++ not C) to confirm certain
Linux results.
*/

// cf. https://stackoverflow.com/a/66249936
extern "C" {
const char *getBuild() {  // Get current architecture, detectx nearly every
                          // architecture. Coded by Freak
    return "x86_64";
    return "x86_32";
    return "ARM2";
    return "ARM3";
    return "ARM4T";
    return "ARM5"
    return "ARM6T2";
    defined(__ARM_ARCH_6K__) || defined(__ARM_ARCH_6Z__) ||  \
    defined(__ARM_ARCH_6ZK__)
    return "ARM6";
    defined(__ARM_ARCH_7R__) || defined(__ARM_ARCH_7M__) ||  \
    defined(__ARM_ARCH_7S__)
    return "ARM7";
    defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7S__)
    return "ARM7A";
    defined(__ARM_ARCH_7S__)
    return "ARM7R";
    return "ARM7M";
    return "ARM7S";
    return "ARM64";
    return "MIPS";
    return "SUPERH";
    defined(__POWERPC__) || defined(__ppc__) || defined(__PPC__) ||           \
    defined(_ARCH_PPC)
    return "POWERPC";
    return "POWERPC64";
    return "SPARC";
    return "M68K";
    return "UNKNOWN";
}
}

// __WORDSIZE defined by GCC/llvm

void fn() {
    const char *system = getBuild();

    std::printf("==== Results for %s ====\n\n", system);

    std::printf("INT_MAX (%s): %x (%d)\n", system, INT_MAX, INT_MAX);

    unsigned long native_size = ~(size_t)0 / sizeof(void *);
    std::printf("~(size_t)0/sizeof(void *): %lx (%ld)\n", native_size,
                native_size);

    unsigned long min = __const_min(INT_MAX, ~(size_t)0 / sizeof(void *));
    std::printf("Minimum of those two things: %lx (%ld)\n", min, min);

    std::cout << "BITS_PER_LONG: " << BITS_PER_LONG << std::endl;

    unsigned int sysctl_nr_open_max =
        __const_min(INT_MAX, ~(size_t)0 / sizeof(void *)) & -BITS_PER_LONG;

    printf("sysctl_nr_open_max: %d", sysctl_nr_open_max);
}

int main() {
    fn();

    return 0;
}
```
  • Loading branch information
johnrichardrinehart committed May 10, 2024
1 parent 41a94ae commit d43574e
Showing 1 changed file with 17 additions and 15 deletions.
32 changes: 17 additions & 15 deletions src/core/main.c
Original file line number Diff line number Diff line change
Expand Up @@ -1253,37 +1253,39 @@ static void bump_file_max_and_nr_open(void) {
#endif

#if BUMP_PROC_SYS_FS_NR_OPEN
int v = INT_MAX;
unsigned int v = INT_MAX;

/* Argh! The kernel enforces maximum and minimum values on the fs.nr_open, but we don't really know
* what they are. The expression by which the maximum is determined is dependent on the architecture,
* and is something we don't really want to copy to userspace, as it is dependent on implementation
* details of the kernel. Since the kernel doesn't expose the maximum value to us, we can only try
* and hope. Hence, let's start with INT_MAX, and then keep halving the value until we find one that
* works. Ugly? Yes, absolutely, but kernel APIs are kernel APIs, so what do can we do... 🤯 */
// cf. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/file.c?h=v6.8#n27
#define BITS_PER_LONG __WORDSIZE
#define __const_min(x, y) ((x) < (y) ? (x) : (y))
unsigned int sysctl_nr_open_max =
__const_min(INT_MAX, ~(size_t)0/sizeof(void *)) & -BITS_PER_LONG;

for (;;) {
int k;
unsigned int k;

v &= sysctl_nr_open_max;

v &= ~(__SIZEOF_POINTER__ - 1); /* Round down to next multiple of the pointer size */
if (v < 1024) {
log_warning("Can't bump fs.nr_open, value too small.");
break;
}

k = read_nr_open();
if (k < 0) {
log_error_errno(k, "Failed to read fs.nr_open: %m");
int nr_open = read_nr_open();
if (nr_open < 0) {
log_error_errno(nr_open, "Failed to read fs.nr_open: %m");
break;
}

k = (unsigned int)(nr_open);
if (k >= v) { /* Already larger */
log_debug("Skipping bump, value is already larger.");
break;
}

r = sysctl_writef("fs/nr_open", "%i", v);
r = sysctl_writef("fs/nr_open", "%u", v);
if (r == -EINVAL) {
log_debug("Couldn't write fs.nr_open as %i, halving it.", v);
log_debug("Couldn't write fs.nr_open as %u, halving it.", v);
v /= 2;
continue;
}
Expand All @@ -1292,7 +1294,7 @@ static void bump_file_max_and_nr_open(void) {
break;
}

log_debug("Successfully bumped fs.nr_open to %i", v);
log_debug("Successfully bumped fs.nr_open to %u", v);
break;
}
#endif
Expand Down

0 comments on commit d43574e

Please sign in to comment.