Performance variability #3879
Your CPUs are equipped with caches, 5-10x faster than RAM and ranging between 120 kB and 8 MB.
Are you using a recent version of OpenBLAS? And are the benchmarks the ones distributed with it, or ones you wrote yourself?
This is a machine running Windows and WSL Debian. It uses OpenBLAS 0.3.21 on Windows and 0.3.13 on Debian, from stock distributions (vcpkg, mingw+ucrt64, Debian...). Below is what Debian reports for the CPU.

I know about the issues with cache sizes and how they may affect reproducibility, but precisely because of this I have been warming up the function calls with some memory allocation + writing + deletion of comparable size to the dataset to be used. My expectation was that all function invocations should produce comparable data when averaged over hundreds to thousands of executions.

I attach a capture of a microbenchmark of matrix multiplication using the C++ library I develop, where the fluctuations are evident. It shows a comparison between NumPy (using MKL, but slower for other reasons), Linux, Msys2 (GCC+W64) and MSVC. On Linux I have tried the OpenMP / threaded versions of OpenBLAS without any significant change. In any case, this strange jump is not something I have experienced before with OpenBLAS on the same platforms, which is why I reported it. Any help on how to debug this is very much appreciated.
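For reference, a minimal sketch of the kind of warm-up described (names and sizes are illustrative, not the reporter's actual code):

```cpp
#include <cstddef>
#include <vector>

// Touch a scratch buffer of roughly the same size as the benchmark data
// before each timed run, so caches and the allocator start from a
// comparable state.
static char warm_up(std::size_t bytes) {
    std::vector<char> scratch(bytes);
    char sink = 0;
    for (std::size_t i = 0; i < scratch.size(); i += 64) {  // one write per cache line
        scratch[i] = static_cast<char>(i);
        sink ^= scratch[i];
    }
    return sink;  // returned so the loop is not optimized away; scratch is freed here
}
```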
Could be compiler-assisted Spectre mitigations.
None mentions recoverable source binary can be repeatedly downloaded/built.
I am not immediately aware of any change between 0.3.13(?) and the current develop branch that would account for this. What exactly is "Old Tensor" in your graph (the one with the best and most consistent performance) - is it some distribution-supplied binary, or something you built yourself (and from which source)? (If distribution-supplied, it may be capped at a low number of threads, which might reduce overhead for small cases, as OpenBLAS' approach to threading is all-or-nothing.)
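One way to confirm which binary and thread count are actually in play: OpenBLAS exposes a few query helpers through its cblas.h (their availability can vary with version and build), e.g.:

```cpp
#include <cstdio>
#include <cblas.h>   // OpenBLAS's cblas.h declares these extensions

// Print the build configuration, detected core and thread count of the
// OpenBLAS the program actually linked against.
int main() {
    std::printf("config:  %s\n", openblas_get_config());
    std::printf("core:    %s\n", openblas_get_corename());
    std::printf("threads: %d\n", openblas_get_num_threads());
    return 0;
}
```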
So in essence your code makes some BLAS call(s) for which OpenBLAS seems to be switching to multithreading too early.
I am only using gemm with different flags for transposed or non-transposed matrices, via OpenBLAS's C interface. Some of the versions above rely on shared pointers and could internally use atomics, but those would be invoked only once before calling gemm, so I do not see why they would interfere with OpenBLAS. This led me to think that matrix alignment might be at stake here, but I am no cache expert and do not understand OpenBLAS well enough to judge the impact of memory alignment.

As for the environment, I am not doing anything else while these programs are running, but these benchmarks are run on actual Windows computers, not virtual machines, so it is hard to control. I will try using our Slurm cluster next, but it is difficult to tell whether that would lead to better isolation.
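For concreteness, a sketch of the kind of call involved, assuming row-major double matrices (the reporter's actual storage order, types and leading dimensions may differ):

```cpp
#include <cblas.h>

// C = alpha * op(A) * op(B) + beta * C, with op() selected by the
// transpose flags. Only this call enters OpenBLAS; any atomics in the
// surrounding shared_ptr bookkeeping happen before it.
void multiply(const double* A, const double* B, double* C,
              int M, int N, int K, bool transA, bool transB) {
    cblas_dgemm(CblasRowMajor,
                transA ? CblasTrans : CblasNoTrans,
                transB ? CblasTrans : CblasNoTrans,
                M, N, K,
                1.0,
                A, transA ? M : K,   // lda: leading dimension of A as stored
                B, transB ? K : N,   // ldb: leading dimension of B as stored
                0.0,
                C, N);               // ldc
}
```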
Currently, the transition to multithreading happens (in interface/gemm.c) at
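Roughly, the check compares M*N*K against a compile-time threshold. A self-contained illustration, assuming the usual defaults of SMP_THRESHOLD_MIN = 65536.0 and GEMM_MULTITHREAD_THRESHOLD = 4 (both can differ between builds):

```cpp
#include <cstdio>

// Illustration of the heuristic, not the library source: stay
// single-threaded while M*N*K is below SMP_THRESHOLD_MIN *
// GEMM_MULTITHREAD_THRESHOLD (assumed defaults 65536.0 and 4 here).
static bool gemm_goes_multithreaded(int m, int n, int k) {
    const double kSmpThresholdMin = 65536.0;
    const double kGemmMultithreadThreshold = 4.0;
    const double mnk = double(m) * double(n) * double(k);
    return mnk > kSmpThresholdMin * kGemmMultithreadThreshold;
}

int main() {
    // Cube root of 65536 * 4 is about 64, i.e. the switchover near
    // M = N = K ~ 60 mentioned further down in the thread.
    const int sizes[] = {32, 64, 128, 256};
    for (int s : sizes)
        std::printf("%4d^3 -> %s\n", s,
                    gemm_goes_multithreaded(s, s, s) ? "multithreaded"
                                                     : "single-threaded");
}
```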
numpy cflags are here: |
My most recent tests also show worse timings for GEMM when the multithreading threshold is increased, so that was probably a bad suggestion anyway. However, getting identical performance curves for supposedly single- and multithreaded runs looks a bit suspicious, as the switchover should occur somewhere around M, N, K = 60.
Still have not had time to really look into this, but the ZEN target is probably missing some recent improvements present in the technically similar HASWELL target. So benchmarking a TARGET=HASWELL build on your Ryzen may also provide insights.
I am observing a lot of performance variability for matrix multiplication at sizes ranging from ~100 to ~1000, and I have been investigating this without much success. The timing can be up to twice as large depending on the order in which I run the benchmarks; however, for any chosen order the timings are highly repeatable.
I am a bit at a loss here. Because the order only influences the position in memory of the benchmarks' input data, I am inclined to think this may be a memory alignment issue. The matrices on the heap are aligned only to 16 bytes (double floats, the alignment imposed by C++), not to any larger boundary, and perhaps some SIMD code path is doing some kind of ugly magic that disturbs my benchmarks?
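One way to test that hypothesis (illustrative, not part of the library): hand OpenBLAS matrices aligned to a cache line and compare the variability against plain 16-byte-aligned allocations. On MSVC, _aligned_malloc/_aligned_free would replace std::aligned_alloc/std::free.

```cpp
#include <cstdlib>
#include <cstring>
#include <new>

// Allocate a rows x cols matrix of doubles on an `alignment`-byte boundary
// (64 = typical cache line). Release with std::free.
double* alloc_matrix(std::size_t rows, std::size_t cols,
                     std::size_t alignment = 64) {
    const std::size_t bytes = rows * cols * sizeof(double);
    // std::aligned_alloc requires the size to be a multiple of the alignment.
    const std::size_t padded = (bytes + alignment - 1) / alignment * alignment;
    double* p = static_cast<double*>(std::aligned_alloc(alignment, padded));
    if (!p) throw std::bad_alloc{};
    std::memset(p, 0, padded);
    return p;
}
```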
I have tested this with serial, openmp and pthreads versions, on Linux and on Windows, with similar outcomes. This is an AMD Ryzen processor, but I have witnessed even greater variability on Intel.
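As an additional isolation step (my suggestion, not something from the report), the thread count can be pinned at runtime to check whether the jumps survive a strictly serial run:

```cpp
#include <cblas.h>

// Force the serial path and re-run the same sizes; if the variability
// persists with one thread, the threading heuristics can be ruled out.
// The OPENBLAS_NUM_THREADS environment variable has the same effect, and
// the OpenMP build follows omp_set_num_threads() instead.
void force_serial_openblas() {
    openblas_set_num_threads(1);
}
```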