[Rust Bindings] Poor performance VS ndarray (BLAS) and optimized iteration impls #107
Hi @ChillFish8! Which version of SimSIMD are you using? AVX2 for
I can't currently give access to the project this is run on, but I can give a copy of the benchmark file minus some of the custom AVX stuff, but realistically it is probably best to just worry about
To be more specific, the numbers
The SimSIMD repository contains Rust benchmarks against native implementations. Maybe they are poorly implemented... Can you try cloning the SimSIMD repository and then running the benchmarks with `cargo bench`, as described in the CONTRIBUTING.md? Please check out the
Using the repo benches, by default I get:
If I use the changes in PR #108 I get the following:
The compiler command being run when compiling the C code is:
If we tell the compiler that
Is that all still on the same Ryzen CPU, @ChillFish8? I was just refreshing the ParallelReductionsBenchmark and added a loop-unrolled variant with scalar code in the C++ layer. It still loses to SIMD even for

```
$ build_release/reduce_bench
You did not feed the size of arrays, so we will use a 1GB array!
2024-05-06T00:11:14+00:00
Running build_release/reduce_bench
Run on (160 X 2100 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x160)
  L1 Instruction 32 KiB (x160)
  L2 Unified 4096 KiB (x80)
  L3 Unified 16384 KiB (x2)
Load Average: 3.23, 19.01, 13.71
----------------------------------------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------
unrolled<f32>/min_time:10.000/real_time     149618549 ns    149615366 ns           95 bytes/s=7.17653G/s error,%=50
unrolled<f64>/min_time:10.000/real_time     146594731 ns    146593719 ns           95 bytes/s=7.32456G/s error,%=0
avx2<f32>/min_time:10.000/real_time         110796474 ns    110794861 ns          127 bytes/s=9.69112G/s error,%=50
avx2<f32kahan>/min_time:10.000/real_time    134144762 ns    134137771 ns          105 bytes/s=8.00435G/s error,%=0
avx2<f64>/min_time:10.000/real_time         115791797 ns    115790878 ns          121 bytes/s=9.27304G/s error,%=0
```

You can find more results in that repo's README.
Hey, yes, but it is worth noting that in my last comment, what is happening under the hood is that LLVM is auto-vectorizing that loop and using FMA instructions, because it has been allowed to assume AVX2 and FMA support.
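To make that concrete, here is a minimal sketch with std::arch intrinsics, shown only to illustrate what "assuming AVX2 and FMA" buys; this is not the benchmark code from this thread. The auto-vectorized multiply-accumulate loop essentially boils down to:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// What an auto-vectorized multiply-accumulate loop lowers to on AVX2+FMA:
/// eight f32 lanes per register and one fused multiply-add per chunk.
/// Assumes `a.len() == b.len()` and that the length is a multiple of 8.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_fma_kernel(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = _mm256_setzero_ps();
    for i in (0..a.len()).step_by(8) {
        let va = _mm256_loadu_ps(a.as_ptr().add(i));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i));
        acc = _mm256_fmadd_ps(va, vb, acc); // acc += va * vb, fused
    }
    // Horizontal reduction of the eight partial sums.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    lanes.iter().sum()
}
```

Without those target features enabled at compile time, the default x86_64 baseline is SSE2, so LLVM cannot produce 256-bit FMA code of this shape from a plain scalar loop at all.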
I believe this is related to #148 and can be improved with the next PR 🤗
Hey, @ChillFish8! Are you observing the same performance issues with the most recent 5.0.1 release as well?
I can add it back to our benchmarks and give it a test; I will let you know shortly.
Adding simsimd back to our benchmarks on the distance functions, it seems better, but there is definitely something wrong with
On AVX512 Zen4 it behaves effectively as expected:
Which machine are these numbers coming from? Is that an Arm machine? Is there SVE available?
They are on a Ryzen Zen3 chip
I'm not sure if it is any help, but the behaviour the
In some cases, on older AMD CPUs, the latency of some instructions was too high and the compilers preferred using serial code. I think for now we can close this issue, but it's good to keep those differences in mind for future benchmarks. Thank you, @ChillFish8!
While I think that assumption is wrong, ultimately it is your choice. Regardless, I think it may be worth making a note of this performance footgun in the library. Generally speaking, this library becomes unusable for anyone running on most AMD server hardware, and likely any other CPU with only AVX2 and FMA (AWS and GCP general-compute instances, for example).
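For context, the usual way Rust code sidesteps that footgun without requiring global RUSTFLAGS is per-function target features plus runtime detection. A minimal sketch with hypothetical names (this is not SimSIMD's API):

```rust
/// Hypothetical AVX2+FMA path: `#[target_feature]` lets LLVM assume those
/// instruction sets for this one function, so `mul_add` lowers to hardware
/// FMA here even if the rest of the crate is built for the SSE2 baseline.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for i in 0..a.len().min(b.len()) {
        acc = a[i].mul_add(b[i], acc);
    }
    acc
}

/// Portable scalar fallback.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for i in 0..a.len().min(b.len()) {
        acc += a[i] * b[i];
    }
    acc
}

/// Runtime dispatch: take the AVX2+FMA path only when the CPU reports support.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: the required CPU features were just detected.
            return unsafe { dot_avx2(a, b) };
        }
    }
    dot_scalar(a, b)
}
```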
Recently we've been implementing some spatial distance functions and benchmarking them against some existing libraries. When testing with high-dimensional data (1024 dims) we observe `simsimd` taking on average `619ns` per vector, compared to ndarray (when backed by OpenBLAS) taking `43ns`, or an optimized bit of pure Rust taking `234ns` and `95ns` with ffast-math-like intrinsics disabled/enabled respectively.

These benchmarks are taken with Criterion doing 1,000 vector ops per iteration in order to account for any clock-accuracy issues due to the low ns times.
Notes
- AMD Ryzen 9 5900X 12-Core Processor, 3701 MHz, 12 Core(s), 24 Logical Processor(s)
- simsimd `0.5.1`, OpenBLAS `0.3.25`
- `RUSTFLAGS="-C target-feature=+avx2,+fma"`
- `RUSTFLAGS="-C target-cpu=native"`
Loose benchmark structure (within Criterion)
There is a bit too much code to paste the exact benchmarks, but each step is the following:
Pure Rust impl
Below is a fallback impl I've made. For simplicity I've removed the generic which was used to replace regular math operations with their ffast-math equivalents when running the `dot fallback 1024 fma` benchmark; however, the asm for `dot fallback 1024 nofma` is identical.

Notes
The dimensions are assumed to be a multiple of `8`, so we don't have an additional loop to do the remainder if `DIMS` were not a multiple of `8`; that being said, even with that final loop, the difference is minimal.
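As a rough sketch of the kind of unrolled fallback described above (my reconstruction under the stated assumption that `DIMS` is a multiple of 8, with hypothetical names, and not the author's exact code):

```rust
/// Sketch of an unrolled dot-product fallback over fixed-size vectors.
/// Eight independent accumulators make the reduction order explicit, so the
/// compiler is free to keep the partial sums in SIMD registers when AVX2/FMA
/// may be assumed at compile time.
pub fn dot_fallback<const DIMS: usize>(a: &[f32; DIMS], b: &[f32; DIMS]) -> f32 {
    debug_assert_eq!(DIMS % 8, 0, "this sketch assumes DIMS is a multiple of 8");
    let mut acc = [0.0f32; 8];
    let mut i = 0;
    while i < DIMS {
        for lane in 0..8 {
            acc[lane] += a[i + lane] * b[i + lane];
        }
        i += 8;
    }
    acc.iter().sum()
}
```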