
Significant performance overhead when timing requires a syscall on Linux #270

J08nY opened this issue Oct 6, 2023 · 18 comments


J08nY commented Oct 6, 2023

Hi, I recently saw a performance regression while using pyinstrument. It seems to be connected to the 4.5.1 -> 4.5.3 update, but it might also be related to the Python version (3.9.17 to 3.9.18).
Update: this is likely related to use of the sympy library.

I have two GitHub CI runs which use pyinstrument to profile some code that uses sympy.

Good run

https://github.com/J08nY/pyecsca/actions/runs/6039852294/job/16389425413

In the good run, pyinstrument is 4.5.1 and Python is 3.9.17, and the logs look reasonable
(look here for an example; it takes about 1.2 seconds).

Bad run

https://github.com/J08nY/pyecsca/actions/runs/6247569280/job/16960430761

In the bad run, pyinstrument is 4.5.3 and Python is 3.9.18, and the logs look very long
(look here for an example; it now takes 13.6 seconds). I understand the logs are quite verbose since I have the show_all option
enabled, but that does not explain the runtime increase.

Local test

Note that some additional calls to sympy were added in commits between the two runs.
However, I believe these are not the cause of the performance regression, since I can observe the regression
locally.

Running the code without pyinstrument, I get a roughly 5-second runtime:

start = datetime.now()
for _ in range(operations):
    one_point = dbl(p256.curve.prime, one_point, **p256.curve.parameters)[0]
end = datetime.now()
print(end - start)
Profiling 5000 256-bit doubling formula (dbl2015rcb) executions...
0:00:05.042373

Running with pyinstrument, I get about 90 seconds:

Profiling 5000 256-bit doubling formula (dbl2015rcb) executions...

  _     ._   __/__   _ _  _  _ _/_   Recorded: 17:00:13  Samples:  90372
 /_//_/// /_\ / //_// / //_'/ //     Duration: 90.888    CPU time: 90.885
/   _/                      v4.5.3

Program: pyecsca/test/ec/perf_formula.py

90.888 main  test/ec/perf_formula.py:12
└─ 90.859 DoublingEFDFormula.__call__  pyecsca/ec/formula.py:223
   ├─ 85.403 DoublingEFDFormula.__validate_assumptions  pyecsca/ec/formula.py:154
   │  ├─ 42.555 sympify  sympy/core/sympify.py:101
   │  │     [30 frames hidden]  sympy, <string>, <built-in>, ast
   │  ├─ 20.666 Add.subs  sympy/core/basic.py:838
   │  │     [21 frames hidden]  sympy
   │  ├─ 5.970 Poly.__new__  sympy/polys/polytools.py:162
   │  │     [13 frames hidden]  sympy
   │  ├─ 5.280 FiniteField.__call__  sympy/polys/domains/domain.py:379
   │  │     [6 frames hidden]  sympy
   │  ├─ 5.184 ground_roots  sympy/polys/polytools.py:3718
   │  │     [11 frames hidden]  sympy
   │  ├─ 2.467 resolve  pyecsca/ec/formula.py:197
   │  │  └─ 1.241 resolve  pyecsca/ec/formula.py:197
   │  └─ 1.135 FiniteField.__init__  sympy/polys/domains/finitefield.py:114
   ├─ 3.505 CodeOp.__call__  pyecsca/ec/op.py:146
   │  └─ 2.558 <module>  None
   │     └─ 1.786 GMPMod.method  pyecsca/ec/mod.py:115
   └─ 1.075 [self]  pyecsca/ec/formula.py

Note that it seems the pyinstrument version does not matter, as I get roughly 90 seconds even
if I use 4.3.0. I am using Python 3.11.5 (main, Sep 2 2023, 14:16:33) [GCC 13.2.1 20230801] locally.


joerick commented Oct 6, 2023

I'm a bit confused: do you think the CPython upgrade caused the performance regression, or was it always there?


joerick commented Oct 6, 2023

Another thing that has changed between these runs is the GitHub Actions runner image. But there could also be other factors, like the runner on GHA that actually executed the job. If you can give me a local recreation of the performance issue, I can debug it, though. If it's 1.5s when running with python and 90s when running with pyinstrument, it would indeed seem to be a bug.


J08nY commented Oct 6, 2023

I did some more digging and the culprit seems to be the sympy calls.
The situation looks something like this (pyinstrument 4.5.3, sympy 1.12, Python 3.11.5):

                   No sympy   sympy
No pyinstrument    5 s        5 s
pyinstrument       5 s        90 s

The sympy calls are really not that performance-intensive; in the non-instrumented run they take almost no time at all.


J08nY commented Oct 6, 2023

> If you can give me a local recreation of the performance issue, I can debug it though. If it's 1.5s when running with python and 90s when running with pyinstrument, it would indeed seem to be a bug.

I will try to create a simple reproducer. It should not be too hard.


J08nY commented Oct 6, 2023

All right, here is a reproducer (sympy 1.12, pyinstrument 4.5.3, Python 3.11.5).
With python sympy_instrument.py, it prints This took 0:00:00.636278.
With pyinstrument sympy_instrument.py, it prints This took 0:00:12.355938.

from datetime import datetime

from sympy import sympify, FF, symbols, Poly, Rational


def do_thing():
    # Some elliptic curve crypto stuff that is not important
    field = 0xFFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFF
    params = {
        "a": 0xFFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFC,
        "b": 0x5AC635D8AA3A93E7B3EBBD55769886BC651D06B0CC53B0F63BCE3C3E27D2604B,
    }
    k = FF(field)
    expr = sympify(f"3*b - b3", evaluate=False)
    for curve_param, value in params.items():
        expr = expr.subs(curve_param, k(value))
    param = str(expr.free_symbols.pop())

    def resolve(expression, k):
        if not expression.args:
            return expression
        args = []
        for arg in expression.args:
            if isinstance(arg, Rational):
                a = arg.p
                b = arg.q
                res = k(a) / k(b)
            else:
                res = resolve(arg, k)
            args.append(res)
        return expression.func(*args)

    expr = resolve(expr, k)
    poly = Poly(expr, symbols(param), domain=k)
    roots = poly.ground_roots()
    for root in roots:
        params[param] = int(root)
        break


if __name__ == "__main__":
    start = datetime.now()
    for _ in range(1000):
        do_thing()
    end = datetime.now()
    print("This took", end - start)

J08nY changed the title from "Performance regression with 4.5.3 (or cPython?)" to "Performance regression with sympy" on Oct 6, 2023

J08nY commented Oct 6, 2023

The reproducer above, run with a varying number of iterations, leads to the runtime graph below.
I also ran it with cProfile, which shows a similar slowdown. Then I tried an empty sys.setprofile function,
which showed no statistically significant overhead (so it is not on the plot). Finally, I tried a sys.setprofile
function that only calls time.time(), and that seems to match the cProfile/pyinstrument slowdown well
(a minimal sketch of this experiment follows the figure).

My conclusion from that is:

  • Sympy triggers a ton of profiling events.
  • To know when to sample, pyinstrument needs to know the time, so every profiling event needs to call some timer.
  • With a large number of function calls that are very fast (so lots of profiling events), this overhead of getting the time (and then skipping the profiling event) is large.

[Figure_1: runtime vs. number of iterations for the reproducer under the different setups described above]
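For reference, here is a minimal sketch of that last experiment (not the exact script I used): it times a call-heavy loop with no profiler, with an empty sys.setprofile callback, and with a callback that only reads time.time().

import sys
import time
from datetime import datetime

def func():
    pass

def busy(n=1_000_000):
    # many cheap calls -> many call/return profiling events
    for _ in range(n):
        func()

def measure(callback):
    sys.setprofile(callback)
    start = datetime.now()
    busy()
    sys.setprofile(None)
    return datetime.now() - start

if __name__ == "__main__":
    print("no profiler:         ", measure(None))
    print("empty callback:      ", measure(lambda frame, event, arg: None))
    print("time.time() callback:", measure(lambda frame, event, arg: time.time()))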


joerick commented Oct 7, 2023

This is highly platform-dependent, I think. Running 1000 iterations on my Mac, I get

$ python examples/sympy_perf.py
This took 0:00:00.403314
$ pyinstrument examples/sympy_perf.py
This took 0:00:00.845693

That is admittedly a 100% overhead, but nothing like the kind of deltas you're getting here.

What platform are you running on? We have had reports of stuff like this in Docker containers, for example.


joerick commented Oct 7, 2023

Having this reproducer is really useful, thank you for that. I think there are some low-level things that can improve performance, once I know what platform you're on.


joerick commented Oct 8, 2023

I'm assuming you're running this on Linux? I want to test a theory; could you run this code and tell me the output?

#include <stdio.h>
#include <elf.h>
#include <sys/auxv.h>

int main() {
    unsigned long vdso_start = getauxval(AT_SYSINFO_EHDR);
    if (vdso_start) {
        printf("vDSO is active at address: %lx\n", vdso_start);
    } else {
        printf("vDSO does not seem to be active.\n");
    }
    return 0;
}

Save that to a file vdso.c, compile it with clang vdso.c (you might have to apt install clang or similar), then run it with ./a.out.

The vDSO is a mechanism where the kernel makes timing functions like gettimeofday really fast by mapping code into userspace to avoid a syscall. But I wonder whether that doesn't happen in some cases, leading to these performance issues.
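(As a sanity check without a C compiler, the same thing can be read from Python by looking for the [vdso] mapping; this assumes Linux's /proc layout.)

# Check whether the vDSO is mapped into this process (Linux /proc layout assumed).
with open("/proc/self/maps") as maps:
    has_vdso = any("[vdso]" in line for line in maps)
print("vDSO mapped:", has_vdso)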


J08nY commented Oct 8, 2023

> What platform are you running on? We have had reports of stuff like this in Docker containers, for example.

> Having this reproducer is really useful, thank you for that. I think there are some low-level things that can improve performance, once I know what platform you're on.

Interesting. I am running Arch Linux (kernel 6.5.5) on x86_64, but I also saw this in GitHub CI. I am not sure what other details are important.

By the way, I think a simple reproducer is the following:

from datetime import datetime

def func():
    pass

def do_func_loop():
    for _ in range(6000):
        func()

if __name__ == "__main__":
    start = datetime.now()
    for _ in range(1000):
        do_func_loop()
    end = datetime.now()
    print((end - start).total_seconds())

I get:

$ python repro.py
0.185707
$ pyinstrument repro.py
15.766407

> Save that to a file vdso.c, compile it with clang vdso.c (you might have to apt install clang or similar), then run it with ./a.out.

I did this; it said vDSO is active at address: 7fffff7af000.

I could perhaps try measuring the performance of gettimeofday in C to confirm it is indeed the culprit.


joerick commented Oct 8, 2023

> I could perhaps try measuring performance of the gettimeofday in C to confirm it is indeed the culprit.

Yeah, this would be the next step, I suppose. There are also other ways to get the time on POSIX, like clock_gettime(). I'll see about writing a script that compares them.


joerick commented Oct 8, 2023

Here's a script:

#include <time.h>
#include <stdio.h>
#include <sys/time.h>

typedef enum {
    CT_BASELINE,
    CT_GETTIMEOFDAY,
    CT_MONOTONIC,
    CT_MONOTONIC_RAW,
    CT_CLOCK,
    CT_REALTIME,
    CT_MONOTONIC_COARSE,
} ClockType;

static volatile double dummy_clock = 0.0;

static double
floatclock(ClockType clock_type)
{
    switch (clock_type) {
    case CT_BASELINE:
        return dummy_clock;
    case CT_GETTIMEOFDAY: {
        struct timeval t;
        gettimeofday(&t, (struct timezone *)NULL);
        return (double)t.tv_sec + t.tv_usec*0.000001;
    }
    case CT_MONOTONIC: {
        struct timespec t;
        clock_gettime(CLOCK_MONOTONIC, &t);
        return (double)t.tv_sec + t.tv_nsec*0.000000001;
    }
    case CT_MONOTONIC_RAW: {
        struct timespec t;
        clock_gettime(CLOCK_MONOTONIC_RAW, &t);
        return (double)t.tv_sec + t.tv_nsec*0.000000001;
    }
    case CT_CLOCK:
        return (double)clock() / CLOCKS_PER_SEC;
    case CT_REALTIME: {
        struct timespec t;
        clock_gettime(CLOCK_REALTIME, &t);
        return (double)t.tv_sec + t.tv_nsec*0.000000001;
    }
    case CT_MONOTONIC_COARSE: {
#ifdef CLOCK_MONOTONIC_COARSE
        struct timespec t;
        clock_gettime(CLOCK_MONOTONIC_COARSE, &t);
        return (double)t.tv_sec + t.tv_nsec*0.000000001;
#endif
    }
    }
    return 0.0;
}

static double timing_overhead(ClockType clock_type) {
    int NUM_ITERATIONS = 1000000;
    double counter = 0.0;
    double start = floatclock(CT_MONOTONIC);
    for (int i = 0; i < NUM_ITERATIONS; i++) {
        counter += floatclock(clock_type);
    }
    double end = floatclock(CT_MONOTONIC);
    if (counter == 0.0) {
        printf("!no-op! ");
    }
    return end - start;
}

int main()
{
    printf("Baseline: ");
    printf("%.3f\n", timing_overhead(CT_BASELINE));
    printf("gettimeofday: ");
    printf("%.3f\n", timing_overhead(CT_GETTIMEOFDAY));
    printf("clock_gettime(CLOCK_MONOTONIC): ");
    printf("%.3f\n", timing_overhead(CT_MONOTONIC));
    printf("clock_gettime(CLOCK_MONOTONIC_RAW): ");
    printf("%.3f\n", timing_overhead(CT_MONOTONIC_RAW));
    // clock isn't interesting imo
    // printf("clock(): ");
    // printf("%.3f\n", timing_overhead(CT_CLOCK));
    printf("clock_gettime(CLOCK_REALTIME): ");
    printf("%.3f\n", timing_overhead(CT_REALTIME));
    printf("clock_gettime(CLOCK_MONOTONIC_COARSE): ");
    printf("%.3f\n", timing_overhead(CT_MONOTONIC_COARSE));
    return 0;
}

This is what it prints on my system (macOS 13.4, arm64 M1 Max).

$  ./a.out
Baseline: !no-op! 0.007
gettimeofday: 0.030
clock_gettime(CLOCK_MONOTONIC): 0.031
clock_gettime(CLOCK_MONOTONIC_RAW): 0.023
clock(): 0.288
clock_gettime(CLOCK_REALTIME): 0.022
clock_gettime(CLOCK_MONOTONIC_COARSE): !no-op! 0.003

In an aarch64 Ubuntu Docker container on the same machine:

Baseline: !no-op! 0.006
gettimeofday: 0.054
clock_gettime(CLOCK_MONOTONIC): 0.031
clock_gettime(CLOCK_MONOTONIC_RAW): 0.025
clock(): 0.423
clock_gettime(CLOCK_REALTIME): 0.025
clock_gettime(CLOCK_MONOTONIC_COARSE): 0.005


joerick commented Oct 8, 2023

I've tried the above script on a few machines. My conclusion is that it's not as simple as the existence of the vDSO; it always seems to be there in my testing. But the vDSO will make a syscall if it doesn't support the current clock source, and if it makes a syscall, it's really slow.

You can inspect the clock source like this:

$ # get the clock source
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
arch_sys_counter

$ # get available clock sources on this machine
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource 
arch_sys_counter 

$ # I believe you can set the current clock source using this:
$ # echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource

In my testing, running Docker containers on arm64 uses the clock source arch_sys_counter, which is supported by the vDSO, but running Docker containers on x86_64 gives hyperv_clocksource_tsc_page, which isn't supported by the vDSO, making it slow. I suspect something similar is happening to you, @J08nY. The surefire way to test this theory is to run the program above with strace, e.g. strace -c ./a.out (maybe reducing the iterations a bit). You can see which clock sources hit syscalls.
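For a quick Python-only check of the same thing, something like this should work (the sysfs path is Linux-specific, and the per-call cost is only a rule of thumb: a vDSO-backed clock usually reads in tens of nanoseconds, while a syscall-backed one is closer to a microsecond).

import time

# Linux-specific sysfs path; may not exist in every environment.
path = "/sys/devices/system/clocksource/clocksource0/current_clocksource"
try:
    with open(path) as f:
        print("clock source:", f.read().strip())
except OSError:
    print("clock source: <unavailable>")

# Rough per-call cost of the monotonic clock.
n = 1_000_000
t0 = time.perf_counter()
for _ in range(n):
    time.monotonic()
t1 = time.perf_counter()
print(f"time.monotonic(): ~{(t1 - t0) / n * 1e9:.0f} ns per call")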


J08nY commented Oct 8, 2023

> This is what it prints on my system (macOS 13.4, arm64 M1 Max).

I get the following:

Baseline: !no-op! 0.007
gettimeofday: 1.213
clock_gettime(CLOCK_MONOTONIC): 1.217
clock_gettime(CLOCK_MONOTONIC_RAW): 1.218
clock_gettime(CLOCK_REALTIME): 1.215
clock_gettime(CLOCK_MONOTONIC_COARSE): 0.007

> You can inspect the clock source like this:

I get:

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
hpet acpi_pm

> The surefire way to test this theory is to run the program above with strace, e.g. strace -c ./a.out (maybe reducing the iterations a bit). You can see which clock sources hit syscalls.

I tested this with just 1000 iterations; I get:

Baseline: !no-op! 0.000
gettimeofday: 0.049
clock_gettime(CLOCK_MONOTONIC): 0.045
clock_gettime(CLOCK_MONOTONIC_RAW): 0.053
clock_gettime(CLOCK_REALTIME): 0.046
clock_gettime(CLOCK_MONOTONIC_COARSE): 0.000
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 72,20    0,017519           5      3012           clock_gettime
 23,34    0,005664           5      1000           gettimeofday
  2,20    0,000533         533         1           execve
  0,78    0,000189          23         8           mmap
  0,40    0,000096          16         6           write
  0,21    0,000052          17         3           newfstatat
  0,20    0,000049          24         2           pread64
  0,16    0,000040          20         2           openat
  0,09    0,000023           7         3           mprotect
  0,07    0,000018          18         1           munmap
  0,07    0,000017           5         3           brk
  0,05    0,000013          13         1         1 access
  0,05    0,000012          12         1           read
  0,05    0,000012           6         2           close
  0,05    0,000012           6         2         1 arch_prctl
  0,02    0,000004           4         1           getrandom
  0,01    0,000003           3         1           set_tid_address
  0,01    0,000003           3         1           prlimit64
  0,01    0,000003           3         1           rseq
  0,01    0,000002           2         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100,00    0,024264           5      4052         2 total

So it seems everything except MONOTONIC_COARSE is going through a syscall.

I found two nice blog posts discussing the situation; apparently it has affected quite a few projects before. This post has some code for benchmarking the different clock sources. With it I get:

clocks.c
                    clock	       res (ns)	           secs	          nsecs
             gettimeofday	          1,000	  1,696,777,115	    253,319,000
           CLOCK_REALTIME	              1	  1,696,777,115	    253,328,015
    CLOCK_REALTIME_COARSE	      3,333,333	  1,696,777,115	    250,281,928
          CLOCK_MONOTONIC	              1	        257,505	    149,471,014
      CLOCK_MONOTONIC_RAW	              1	        257,501	    989,271,174
   CLOCK_MONOTONIC_COARSE	      3,333,333	        257,505	    146,418,780

ClockBench.cpp
                   Method	samples	   min	    max	    avg	 median	  stdev
           CLOCK_REALTIME	1023	 907.	84578.	1297.50	42742.50  2635.12
    CLOCK_REALTIME_COARSE	1023	   0.	    0.	   0.	    0.	     0.
          CLOCK_MONOTONIC	1023	 838.	 1467.	1093.37	 1152.50   216.66
      CLOCK_MONOTONIC_RAW	1023	 907.	 1467.	1204.10	 1187.00   237.25
   CLOCK_MONOTONIC_COARSE	1023	   0.	    0.	   0.	    0.	     0.
              cpuid+rdtsc	1023	  51.	   68.	  57.88	   59.50     8.34
                   rdtscp	1023	  17.	   34.	  28.33	   25.50     8.01
                    rdtsc	1023	   1.	   17.	  15.24	    9.00     4.85
Using CPU frequency = 1.000000

The COARSE clocks seem unreasonably fast, but if I understand their vDSO implementation correctly, it amounts to basically just a memory read, so these could be the real numbers. They also have low resolution.
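The same trade-off is visible from Python via time.clock_getres / time.clock_gettime (CLOCK_MONOTONIC_COARSE is Linux-only, hence the guard); this is just an illustrative sketch:

import time

for name in ("CLOCK_MONOTONIC", "CLOCK_MONOTONIC_COARSE"):
    clk = getattr(time, name, None)
    if clk is None:
        print(f"{name}: not available on this platform")
        continue
    res_ns = time.clock_getres(clk) * 1e9
    n = 1_000_000
    t0 = time.perf_counter()
    for _ in range(n):
        time.clock_gettime(clk)
    t1 = time.perf_counter()
    print(f"{name}: resolution {res_ns:.0f} ns, ~{(t1 - t0) / n * 1e9:.0f} ns per call")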


J08nY commented Oct 8, 2023

I tried changing the gettimeofday call in pyinstrument into a clock_gettime call with CLOCK_MONOTONIC_COARSE, and the graph below shows the times I got for the reproducer. I think it may be worthwhile to make pyinstrument more robust to slow clocks by having some mechanism that tests the available clocks and picks one that works (perhaps warning the user if the chosen clock cannot satisfy the requested profiling interval), roughly as in the sketch after the figure.

[Figure_2: reproducer runtimes with pyinstrument using gettimeofday vs. CLOCK_MONOTONIC_COARSE]
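To illustrate the idea (this is only a sketch of a hypothetical pick_clock helper, not pyinstrument's actual API): test each candidate clock's resolution and per-read cost, prefer the cheapest one whose resolution is finer than the requested interval, and warn if even that one is expensive.

import time
import warnings

def pick_clock(interval=0.001, max_cost_ns=200, samples=100_000):
    """Hypothetical clock selection: cheapest clock whose resolution
    is finer than the requested profiling interval."""
    best = None
    for name in ("CLOCK_MONOTONIC", "CLOCK_MONOTONIC_COARSE"):
        clk = getattr(time, name, None)
        if clk is None:
            continue
        res = time.clock_getres(clk)
        t0 = time.perf_counter()
        for _ in range(samples):
            time.clock_gettime(clk)
        cost_ns = (time.perf_counter() - t0) / samples * 1e9
        if res < interval and (best is None or cost_ns < best[2]):
            best = (name, res, cost_ns)
    if best is None:
        warnings.warn(f"no clock satisfies a {interval}s profiling interval")
        return None
    name, res, cost_ns = best
    if cost_ns > max_cost_ns:
        warnings.warn(f"{name} costs ~{cost_ns:.0f} ns per read; overhead may be significant")
    return name

if __name__ == "__main__":
    print(pick_clock())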

joerick changed the title from "Performance regression with sympy" to "Significant performance overhead when timing requires a syscall on Linux" on Oct 14, 2023

joerick commented Oct 14, 2023

@J08nY I've started some work to improve this in #273.

It's still very much a draft, but on the off-chance that you have time, I'm currently hitting a problem on Linux that it would be great to get some debugging help on. On macOS I can test bits of the C API from ctypes using ctypes.pythonapi, but it doesn't work on Linux for some reason. If you get a chance to debug that, it would help speed things along :)


joerick commented Oct 14, 2023

Perhaps ctypes.CDLL(None) would work, or maybe the linker hasn't actually loaded the library yet, since we don't actually call into it before we try to load the symbol.


J08nY commented Oct 14, 2023

I will have a look at that PR.
