
Performance variability #3879

Open
juanjosegarciaripoll opened this issue Jan 1, 2023 · 13 comments

@juanjosegarciaripoll

I am observing a lot of performance variability for matrix multiplication at sizes ranging from ~100 to ~1000, and I have been investigating this without much success. The timings can be up to twice as large, depending on the order in which I run the benchmarks. However, for any chosen order, the timings themselves are highly reproducible.

I am a bit at a loss here. Because the order only influences where in memory the benchmark input data ends up, I am inclined to think that this may be a memory alignment issue. The matrices on the heap are aligned only to 16 bytes (double-precision floats, the alignment imposed by C++), but not to any larger boundary, so perhaps some SIMD code path behaves differently depending on that alignment and disturbs my benchmarks?

I have tested this with the serial, OpenMP and pthreads versions, on Linux and on Windows, with similar outcomes. This is an AMD Ryzen processor, but I have witnessed even greater variability on Intel.
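
For reference, the timing loop is essentially shaped like this (a minimal sketch rather than my actual code; matrix contents, sizes and repetition counts are illustrative). It also prints where the heap places each operand relative to a 64-byte boundary, since that is the alignment effect I suspect:

// Minimal sketch of the kind of timing loop described above (not the actual code).
// The matrices come from plain new[], so they only get the ~16-byte alignment that
// C++ guarantees; the printout shows each operand's offset from a 64-byte boundary.
// Build with something like: g++ -O2 bench.cpp -lopenblas
#include <cblas.h>
#include <chrono>
#include <cstdint>
#include <cstdio>

int main() {
    for (int n : {100, 200, 400, 800}) {
        double *A = new double[n * n], *B = new double[n * n], *C = new double[n * n];
        for (int i = 0; i < n * n; ++i) { A[i] = B[i] = 1.0 / (i + 1); C[i] = 0.0; }
        // one untimed call so the timed iterations do not pay cold-start costs
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                    1.0, A, n, B, n, 0.0, C, n);
        const int reps = 100;
        auto t0 = std::chrono::steady_clock::now();
        for (int r = 0; r < reps; ++r)
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                        1.0, A, n, B, n, 0.0, C, n);
        auto t1 = std::chrono::steady_clock::now();
        double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / reps;
        std::printf("n=%4d  %12.2f us/call  A%%64=%llu B%%64=%llu C%%64=%llu\n", n, us,
                    (unsigned long long)(reinterpret_cast<std::uintptr_t>(A) % 64),
                    (unsigned long long)(reinterpret_cast<std::uintptr_t>(B) % 64),
                    (unsigned long long)(reinterpret_cast<std::uintptr_t>(C) % 64));
        delete[] A; delete[] B; delete[] C;
    }
}
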

@brada4 (Contributor) commented Jan 1, 2023

Your CPU is equipped with caches that are 5-10x faster than RAM and range between roughly 120 kB and 8 MB.
Some concrete numbers, such as library versions and CPU identification, are needed from you to make something actionable out of what is so far an anecdotal story.

@martin-frbg (Collaborator)

Are you using a recent version of OpenBLAS? And the benchmarks as distributed with it, or ones you wrote yourself?
CPU frequency settings (BIOS, vendor tools, choice of CPUfreq governor on Linux) may play a role. Do you see this only with OpenBLAS, or with comparable libraries as well?

@juanjosegarciaripoll (Author)

This is a machine running Windows and WSL Debian. It uses OpenBLAS 0.3.21 on Windows and 0.3.13 on Debian, both from stock distributions (vcpkg, mingw+ucrt64, Debian...). Below is what Debian reports for the CPU.

I am aware of the issues with cache sizes and how they may affect reproducibility, but precisely because of this I warm up the function calls with a memory allocation, write and deletion of size comparable to the dataset to be used. My expectation was that all function invocations should then produce comparable timings when averaged over hundreds to thousands of executions.
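
The warm-up pass is essentially this (a simplified sketch; the function name and the size argument are illustrative, not taken from my library):

// Simplified sketch of the warm-up described above: allocate, write to, and then
// release a scratch buffer of roughly the same size as the benchmark data, so that
// every timed invocation starts from a comparable cache state.
#include <cstddef>
#include <cstring>
#include <vector>

void warm_up(std::size_t dataset_bytes) {
    std::vector<unsigned char> scratch(dataset_bytes);
    std::memset(scratch.data(), 0xAB, scratch.size());  // touch every cache line
}   // scratch is freed here, before the timed GEMM calls start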

I attach a capture of a microbenchmark of matrix multiplication using the C++ library I develop, where the fluctuations are evident. It compares Numpy (using MKL, but slower for unrelated reasons), Linux, Msys2 (GCC+W64) and MSVC. On Linux I have tried the OpenMP and pthreads versions of OpenBLAS without any significant change.

In any case, this strange jump is not something I have experienced before with OpenBLAS on the same platforms, which is why I reported it. Any help on how to debug this is very much appreciated.

processor       : 0-31
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 113
model name      : AMD Ryzen 9 3950X 16-Core Processor
stepping        : 0
microcode       : 0xffffffff
cpu MHz         : 3493.456
cache size      : 512 KB
physical id     : 0
siblings        : 32
core id         : 15
cpu cores       : 16
apicid          : 31
initial apicid  : 31
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip rdpid
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed
bogomips        : 6986.91
TLB size        : 3072 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management:

[attached plot: matrix multiplication benchmark comparing Numpy (MKL), Linux, Msys2 (GCC+W64) and MSVC]

@brada4 (Contributor) commented Jan 3, 2023

This could be compiler-assisted Spectre mitigations.
On a logarithmic scale the graphs are quite close to n^3 (i.e. the logarithm of the time grows three times as fast as the logarithm of the size).
The tooth sits around reading 6 kB + 6 kB and writing 6 kB, which does not round to any cache or page size. Maybe it is the threading threshold.

Regarding "stock distributions (vcpkg, mingw+ucrt64, Debian...)": none of these guarantees a reproducible binary that can be repeatedly downloaded or rebuilt from the same source. Do you mean that the Debian (or is it Ubuntu?) distro version works best and is the most stable?

@martin-frbg (Collaborator)

I am not immediately aware of any change between 0.3.13(?) and the current develop branch that would account for this. What exactly is "Old Tensor" in your graph (the one with the best and most consistent performance): is it some distribution-supplied binary, or something you built from source (and if so, from which)? If it is a distribution binary, it may be capped at a low number of threads, which might reduce overhead for small cases, as OpenBLAS' approach to threading is all-or-nothing.

@juanjosegarciaripoll (Author)

Hi, I am showing versions built on different platforms, which are mostly identical, but "Old Tensor" uses Debian's default OpenBLAS as selected by autoconf (I believe it is the serial one), while the other plots use a CMake-built version of my library that favors openblas-pthread or openblas-openmp. The vcpkg (MSVC) build is the threaded version as well.

In any case, those spikes fluctuate a lot even within the same platform. Here is a re-run of the same benchmarks on all platforms:
[attached plot: re-run of the matrix multiplication benchmarks on all platforms]

If I run the tests with real matrices instead of complex ones, the results look different: the serial version from "Old Tensor" is more stable, and the spikes move to larger matrix dimensions.
[attached plot: the same benchmarks with real instead of complex matrices]

As for the libraries, vcpkg builds its own version, while msys2 (in the ucrt64 runtime, which I am using) and Debian both provide pre-built binaries that I use.

@martin-frbg (Collaborator)

So in essence your code makes some BLAS call(s) for which OpenBLAS seems to be switching to multithreading too early (at too small matrix sizes, as evidenced by the poorer performance compared to what is assumed to be a single-threaded build). And there appears to be some other effect sometimes overlaid on top, where even the serial code takes a lot longer than expected at some specific matrix sizes? (That could be interference from unrelated system jobs, and/or the process getting moved to another core/chiplet.)
Now, what BLAS calls does your test code make: is it only GEMM, or others as well?

@juanjosegarciaripoll (Author)

I am only using gemm, with different flags for transposed or non-transposed matrices, via OpenBLAS's C interface.
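
A typical call looks roughly like this (a simplified sketch, not the code from my library; the function name, sizes and column-major layout are illustrative):

// Simplified sketch of the kind of call the library makes: complex double GEMM
// through the CBLAS interface, optionally transposing the first operand.
#include <cblas.h>
#include <complex>

void multiply(const std::complex<double> *A, const std::complex<double> *B,
              std::complex<double> *C, int m, int n, int k, bool transpose_a) {
    const std::complex<double> alpha(1.0, 0.0), beta(0.0, 0.0);
    cblas_zgemm(CblasColMajor,
                transpose_a ? CblasTrans : CblasNoTrans, CblasNoTrans,
                m, n, k,
                &alpha,
                A, transpose_a ? k : m,   // lda depends on the transposition flag
                B, k,
                &beta,
                C, m);
}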

Some of the versions above rely on shared pointers and could internally use atomics, but those would be invoked only once before calling gemm, so I do not see why they would interfere with OpenBLAS. This led me to think that matrix alignment might be at play here, but I am no cache expert and do not understand OpenBLAS well enough to judge the impact of memory alignment.

As for the environment, I am not doing anything else while these programs are running, but the benchmarks run on actual Windows computers, not virtual machines, so the environment is hard to control. I will try our Slurm cluster next, though it is difficult to tell whether that would provide better isolation.

@martin-frbg (Collaborator)

Currently, the transition to multithreading happens (in interface/gemm.c) at M * N * K > 262140, where 262140 is 65535 times the compile-time constant GEMM_MULTITHREAD_THRESHOLD from Makefile.rule. It might be worthwhile to try building your own copy of OpenBLAS from source with GEMM_MULTITHREAD_THRESHOLD set to 16 (or even higher), and/or to run your test with OPENBLAS_NUM_THREADS=1 to force single-threading across the entire size range of interest.
There is some infrastructure already in place for specialized "small matrix" GEMM kernels, but such kernels are currently only implemented for the SKYLAKEX and POWER10 targets.
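
For reference, here is roughly where that crossover falls for square matrices, and how the thread count can be pinned from inside a test program (a minimal sketch; openblas_set_num_threads/openblas_get_num_threads are the runtime counterparts of the environment variable, the loop itself is only illustrative):

// Minimal sketch: for square matrices the M*N*K > 262140 condition flips between
// n=63 (250047) and n=64 (262144), so the multithreaded path kicks in around n=64.
#include <cblas.h>
#include <cstdio>

int main() {
    for (int n = 62; n <= 66; ++n) {
        long long mnk = 1LL * n * n * n;
        std::printf("n=%2d  M*N*K=%7lld  -> %s\n", n, mnk,
                    mnk > 262140 ? "multithreaded path" : "single-threaded path");
    }
    openblas_set_num_threads(1);  // force serial GEMM for the whole size range
    std::printf("OpenBLAS now using %d thread(s)\n", openblas_get_num_threads());
    return 0;
}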

@juanjosegarciaripoll (Author)

I do not think this is related to multithreading. I have set OPENBLAS_NUM_THREADS=1 and get the same results. Please compare the red and purple lines in this plot: the red uses no special settings with MSYS2's default OpenBLAS, while the purple line sets the number of threads to 1.
[attached plot: MSYS2 default OpenBLAS (red) vs. the same build with OPENBLAS_NUM_THREADS=1 (purple)]

@brada4 (Contributor) commented Jan 13, 2023

The numpy cflags are here:
https://github.com/conda/conda-build/blob/main/conda_build/jinja_context.py#L422
They are unchanged. Check the dll/so imports with Dependency Walker or ldd to see whether any of msomp/gomp/iomp is pulled in; the only build variations are OpenMP vs. pthreads, the thread count fixed at build time, and the interface bitness:
https://github.com/conda-forge/openblas-feedstock/blob/main/recipe/build.sh

@martin-frbg (Collaborator)

My most recent tests also show worse timings for GEMM when the multithread threshold is increased, so that was probably a bad suggestion anyway. However, getting identical performance curves for supposedly single- and multithreaded runs looks a bit suspicious, as the switchover should occur somewhere around M = N = K = 60.
Any bad effects peculiar to your hardware (or WSL), like cache eviction or competing processes, should appear non-deterministically at any size, so I am running out of ideas for now. Similarly for data alignment: I do not see why it would have pronounced bad effects in a particular size range only. Maybe you could try whether building with USE_SIMPLE_THREADED_LEVEL3=1 has any influence; that would at least exercise a different code path.

@martin-frbg (Collaborator)

Still have not had time to really look into this, but the ZEN target is probably missing some recent improvements made to the technically similar HASWELL target, so benchmarking a TARGET=HASWELL build on your Ryzen may also provide insights.
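
Something along these lines could also confirm which build your runs actually load at runtime, i.e. the detected core, threading model and thread count (a minimal sketch using the extension functions declared in OpenBLAS's cblas.h):

// Minimal sketch: query the OpenBLAS build that is actually loaded, to confirm
// the detected core (e.g. ZEN vs HASWELL), the threading model and the thread count.
#include <cblas.h>
#include <cstdio>

int main() {
    std::printf("config:   %s\n", openblas_get_config());    // version, target, build flags
    std::printf("corename: %s\n", openblas_get_corename());  // kernel set selected at runtime
    std::printf("parallel: %d\n", openblas_get_parallel());  // 0 = serial, 1 = pthreads, 2 = OpenMP
    std::printf("threads:  %d\n", openblas_get_num_threads());
    return 0;
}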
