[Rust Bindings] Poor performance VS ndarray (BLAS) and optimized iteration impls #107

Open
ChillFish8 opened this issue Apr 5, 2024 · 10 comments
Labels
invalid This doesn't seem right

Comments

@ChillFish8
Contributor

Recently we've been implementing some spatial distance functions and benchmarking them against existing libraries. When testing with high-dimensional data (1024 dims), we observe simsimd taking on average 619ns per vector, compared to 43ns for ndarray (when backed by OpenBLAS), and 234ns or 95ns for an optimized bit of pure Rust with ffast-math-like intrinsics disabled or enabled, respectively.

These benchmarks are taken with Criterion, doing 1,000 vector ops per iteration to account for any clock accuracy issues at such low ns times, so each µs figure below equals the per-op time in ns.

dot ndarray 1024 auto   time:   [43.270 µs 43.285 µs 43.302 µs]
Found 17 outliers among 500 measurements (3.40%)
  5 (1.00%) high mild
  12 (2.40%) high severe

Benchmarking dot simsimd 1024 auto: Warming up for 3.0000 s
Warning: Unable to complete 500 samples in 60.0s. You may wish to increase target time to 77.7s, enable flat sampling, or reduce sample count to 310.
dot simsimd 1024 auto   time:   [618.85 µs 619.93 µs 621.15 µs]
Found 43 outliers among 500 measurements (8.60%)
  7 (1.40%) low mild
  17 (3.40%) high mild
  19 (3.80%) high severe

dot fallback 1024 nofma time:   [232.92 µs 234.19 µs 235.76 µs]
Found 16 outliers among 500 measurements (3.20%)
  11 (2.20%) high mild
  5 (1.00%) high severe

dot fallback 1024 fma   time:   [95.456 µs 95.586 µs 95.729 µs]
Found 19 outliers among 500 measurements (3.80%)
  17 (3.40%) high mild
  2 (0.40%) high severe
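
As a sanity check on the per-vector figures quoted at the top, here is a minimal sketch of the conversion from a Criterion batch time to a per-call time. The helper name `per_op_ns` is hypothetical and only for illustration; the inputs are the median times reported above.

```rust
/// Converts a batch time in microseconds into nanoseconds per call,
/// given how many calls were made inside one measured iteration.
fn per_op_ns(batch_us: f64, ops_per_iter: u32) -> f64 {
    batch_us * 1_000.0 / f64::from(ops_per_iter)
}

/// Ratio between two per-call times (how many times slower `slow` is).
fn slowdown(slow_ns: f64, fast_ns: f64) -> f64 {
    slow_ns / fast_ns
}
```

With 1,000 ops per iteration, `per_op_ns(619.93, 1000)` comes out at about 620 ns for simsimd and `per_op_ns(43.285, 1000)` at about 43 ns for ndarray, a gap of roughly 14x.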

Notes

  • CPU: AMD Ryzen 9 5900X 12-Core Processor, 3701 MHz, 12 Core(s), 24 Logical Processor(s)
  • Benchmarked with Criterion 0.5.1, OpenBLAS 0.3.25
  • Compiled with RUSTFLAGS="-C target-feature=+avx2,+fma"
    • Results can also be replicated via RUSTFLAGS="-C target-cpu=native"
  • We only measure ndarray for the dot product, as there are no BLAS-specific ops for Euclidean or Cosine distance, but a similar performance difference can be observed between the pure Rust and simsimd versions for those additional distance measures.

Loose benchmark structure (within Criterion)

There is a bit too much code to paste the exact benchmarks, but each one has the following shape:

fn bench_me(a: &[f32], b: &[f32]) {
   for _ in 0..1_000 {
       black_box(implementation_dot(black_box(a), black_box(b)));
   }
}

Pure Rust impl

Below is a fallback impl I've made. For simplicity, I've removed the generic that was used to replace regular math operations with their ffast-math equivalents when running the dot fallback 1024 fma benchmark; however, the asm for dot fallback 1024 nofma is identical.

Notes

  • We only target vectors whose length is a multiple of 8, so there is no additional loop to handle the remainder when DIMS is not a multiple of 8. That said, even with that final loop, the difference is minimal.
unsafe fn fallback_dot_product_demo<const DIMS: usize>(
    a: &[f32],
    b: &[f32],
) -> f32 {
    debug_assert_eq!(
        b.len(),
        DIMS,
        "Improper implementation detected, vectors must match constant"
    );
    debug_assert_eq!(
        a.len(),
        DIMS,
        "Improper implementation detected, vectors must match constant"
    );
    debug_assert_eq!(
        DIMS % 8,
        0,
        "DIMS must be able to fit entirely into chunks of 8 lanes."
    );

    let mut i = 0;

    // We do this manual unrolling to allow the compiler to vectorize
    // the loop and avoid some branching even if we're not doing it explicitly.
    // This made a significant difference in benchmarking ~4x
    let mut acc1 = 0.0;
    let mut acc2 = 0.0;
    let mut acc3 = 0.0;
    let mut acc4 = 0.0;
    let mut acc5 = 0.0;
    let mut acc6 = 0.0;
    let mut acc7 = 0.0;
    let mut acc8 = 0.0;

    while i < a.len() {
        let a1 = *a.get_unchecked(i);
        let a2 = *a.get_unchecked(i + 1);
        let a3 = *a.get_unchecked(i + 2);
        let a4 = *a.get_unchecked(i + 3);
        let a5 = *a.get_unchecked(i + 4);
        let a6 = *a.get_unchecked(i + 5);
        let a7 = *a.get_unchecked(i + 6);
        let a8 = *a.get_unchecked(i + 7);

        let b1 = *b.get_unchecked(i);
        let b2 = *b.get_unchecked(i + 1);
        let b3 = *b.get_unchecked(i + 2);
        let b4 = *b.get_unchecked(i + 3);
        let b5 = *b.get_unchecked(i + 4);
        let b6 = *b.get_unchecked(i + 5);
        let b7 = *b.get_unchecked(i + 6);
        let b8 = *b.get_unchecked(i + 7);

        acc1 = acc1 + (a1 * b1);
        acc2 = acc2 + (a2 * b2);
        acc3 = acc3 + (a3 * b3);
        acc4 = acc4 + (a4 * b4);
        acc5 = acc5 + (a5 * b5);
        acc6 = acc6 + (a6 * b6);
        acc7 = acc7 + (a7 * b7);
        acc8 = acc8 + (a8 * b8);

        i += 8;
    }

    acc1 = acc1 + acc2;
    acc3 = acc3 + acc4;
    acc5 = acc5 + acc6;
    acc7 = acc7 + acc8;
    
    acc1 = acc1 + acc3;
    acc5 = acc5 + acc7;

    acc1 + acc5
}
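
For readers who want to try the same idea without `unsafe`, the following is a minimal sketch of an eight-accumulator dot product built on `chunks_exact`, including the remainder loop mentioned in the note above. It is not the code that was benchmarked in this issue, the name `dot_chunked` is hypothetical, and whether it vectorizes as well as the `get_unchecked` version has not been measured here.

```rust
/// Dot product with eight independent accumulators and a scalar tail.
/// Panics if the two slices differ in length.
fn dot_chunked(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");

    let a_chunks = a.chunks_exact(8);
    let b_chunks = b.chunks_exact(8);
    // Elements left over after the last full chunk of 8.
    let a_tail = a_chunks.remainder();
    let b_tail = b_chunks.remainder();

    // One accumulator per lane, so the additions do not form a single
    // serial dependency chain.
    let mut acc = [0.0f32; 8];
    for (ca, cb) in a_chunks.zip(b_chunks) {
        for lane in 0..8 {
            acc[lane] += ca[lane] * cb[lane];
        }
    }

    // Pairwise reduction, in the same order as the version above.
    let mut sum = ((acc[0] + acc[1]) + (acc[2] + acc[3]))
        + ((acc[4] + acc[5]) + (acc[6] + acc[7]));

    // Remainder loop for lengths that are not a multiple of 8.
    for (x, y) in a_tail.iter().zip(b_tail) {
        sum += x * y;
    }
    sum
}
```

Because the chunk length is fixed at 8, the bounds checks inside the inner loop are usually easy for the compiler to remove, but that should be confirmed by inspecting the generated asm rather than assumed.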
@ashvardanian
Owner

Hi @ChillFish8! Which version of SimSIMD are you using?

AVX2 for float32 is practically the only SIMD+datatype combo we don't implement, as that's the only one that compilers vectorize well 😆 But your result is still very weird. Do you have a project I can clone and run to reproduce that?

@ashvardanian ashvardanian added the invalid This doesn't seem right label Apr 5, 2024
@ChillFish8
Contributor Author

I can't currently give access to the project this is run on, but I can give a copy of the benchmark file minus some of the custom AVX stuff. Realistically, it is probably best to just worry about simsimd vs Rust vs BLAS for this issue.

Cargo.toml

[package]
name = "benchmark-demo"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]

[dev-dependencies]
rand = "0.8.5"
simsimd = "4.2.2"

criterion = { version = "0.5.1", features = ["html_reports"] }

[target.'cfg(unix)'.dev-dependencies]
ndarray = { version = "0.15.6", features = ["blas"] }
blas-src = { version = "0.8", features = ["openblas"] }
openblas-src = { version = "0.10", features = ["cblas", "system"] }

[target.'cfg(not(unix))'.dev-dependencies]
ndarray = "0.15.6"

bench_dot_product.rs

#[cfg(unix)]
extern crate blas_src;

use std::hint::black_box;
use std::time::Duration;

use criterion::{criterion_group, criterion_main, Criterion};
use simsimd::SpatialSimilarity;

fn simsimd_dot(a: &[f32], b: &[f32]) -> f32 {
    f32::dot(a, b).unwrap_or_default() as f32
}

fn ndarray_dot(a: &ndarray::Array1<f32>, b: &ndarray::Array1<f32>) -> f32 {
    a.dot(b)
}

unsafe fn fallback_dot_product_demo<const DIMS: usize>(
    a: &[f32],
    b: &[f32],
) -> f32 {
    debug_assert_eq!(
        b.len(),
        DIMS,
        "Improper implementation detected, vectors must match constant"
    );
    debug_assert_eq!(
        a.len(),
        DIMS,
        "Improper implementation detected, vectors must match constant"
    );
    debug_assert_eq!(
        DIMS % 8,
        0,
        "DIMS must be able to fit entirely into chunks of 8 lanes."
    );

    let mut i = 0;

    // We do this manual unrolling to allow the compiler to vectorize
    // the loop and avoid some branching even if we're not doing it explicitly.
    // This made a significant difference in benchmarking ~4x
    let mut acc1 = 0.0;
    let mut acc2 = 0.0;
    let mut acc3 = 0.0;
    let mut acc4 = 0.0;
    let mut acc5 = 0.0;
    let mut acc6 = 0.0;
    let mut acc7 = 0.0;
    let mut acc8 = 0.0;

    while i < a.len() {
        let a1 = *a.get_unchecked(i);
        let a2 = *a.get_unchecked(i + 1);
        let a3 = *a.get_unchecked(i + 2);
        let a4 = *a.get_unchecked(i + 3);
        let a5 = *a.get_unchecked(i + 4);
        let a6 = *a.get_unchecked(i + 5);
        let a7 = *a.get_unchecked(i + 6);
        let a8 = *a.get_unchecked(i + 7);

        let b1 = *b.get_unchecked(i);
        let b2 = *b.get_unchecked(i + 1);
        let b3 = *b.get_unchecked(i + 2);
        let b4 = *b.get_unchecked(i + 3);
        let b5 = *b.get_unchecked(i + 4);
        let b6 = *b.get_unchecked(i + 5);
        let b7 = *b.get_unchecked(i + 6);
        let b8 = *b.get_unchecked(i + 7);

        acc1 = acc1 + (a1 * b1);
        acc2 = acc2 + (a2 * b2);
        acc3 = acc3 + (a3 * b3);
        acc4 = acc4 + (a4 * b4);
        acc5 = acc5 + (a5 * b5);
        acc6 = acc6 + (a6 * b6);
        acc7 = acc7 + (a7 * b7);
        acc8 = acc8 + (a8 * b8);

        i += 8;
    }

    acc1 = acc1 + acc2;
    acc3 = acc3 + acc4;
    acc5 = acc5 + acc6;
    acc7 = acc7 + acc8;

    acc1 = acc1 + acc3;
    acc5 = acc5 + acc7;

    acc1 + acc5
}

macro_rules! repeat {
    ($n:expr, $val:block) => {{
        for _ in 0..$n {
            black_box($val);
        }
    }};
}

fn criterion_benchmark(c: &mut Criterion) {
    // Hey, this benchmark behaves drastically differently depending on whether you are on Windows or unix.
    // This is because on unix we do a more realistic benchmark and compare ndarray backed
    // by openblas rather than with the standard rust impl.
    c.bench_function("dot ndarray 1024 auto", |b| {
        use ndarray::Array1;

        let mut v1 = Vec::new();
        let mut v2 = Vec::new();
        for _ in 0..1024 {
            v1.push(rand::random());
            v2.push(rand::random());
        }

        let v1 = Array1::from_shape_vec((1024,), v1).unwrap();
        let v2 = Array1::from_shape_vec((1024,), v2).unwrap();

        b.iter(|| repeat!(1000, { ndarray_dot(black_box(&v1), black_box(&v2)) }))
    });    
    c.bench_function("dot simsimd 1024 auto", |b| {
        let mut v1 = Vec::new();
        let mut v2 = Vec::new();
        for _ in 0..1024 {
            v1.push(rand::random());
            v2.push(rand::random());
        }

        b.iter(|| repeat!(1000, { simsimd_dot(black_box(&v1), black_box(&v2)) }))
    });
    c.bench_function("dot fallback 1024 nofma", |b| {
        let mut v1 = Vec::new();
        let mut v2 = Vec::new();
        for _ in 0..1024 {
            v1.push(rand::random());
            v2.push(rand::random());
        }

        b.iter(|| repeat!(1000, { 
            unsafe { fallback_dot_product_demo::<1024>(black_box(&v1), black_box(&v2)) }
        }))
    });
}

criterion_group!(
    name = benches;
    config = Criterion::default()
        .measurement_time(Duration::from_secs(60))
        .sample_size(500);
    targets = criterion_benchmark
);
criterion_main!(benches);

@ChillFish8
Contributor Author

To be more specific, the numbers simsimd is getting for AVX2 and f32 values seem to be more or less in line with iterating through the two vectors and getting the dot product, but without the compiler being able to correctly vectorize the loop. So maybe the compiler for simsimd is not actually vectorizing the loop fully or at all.
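
To make the "plain iteration" baseline concrete, here is a minimal sketch of a single-accumulator loop, plus a small demonstration of why such a loop tends to stay scalar. Floating-point addition is not associative, so without fast-math-style permission the compiler has to keep the additions in source order and cannot spread them across SIMD lanes. Whether this is what actually happens inside the SimSIMD build is not established here; the function names are hypothetical.

```rust
/// Straightforward dot product: one accumulator, strict left-to-right order.
fn naive_dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = 0.0f32;
    for i in 0..a.len() {
        // Each addition depends on the previous one.
        acc += a[i] * b[i];
    }
    acc
}

/// The same three values summed with two different groupings.
/// Returns ((x + y) + z, x + (y + z)).
fn grouping_demo() -> (f32, f32) {
    let (x, y, z) = (1.0e8f32, -1.0e8f32, 1.0f32);
    // In f32, `y + z` rounds back to -1.0e8, so the second grouping loses `z`.
    ((x + y) + z, x + (y + z))
}
```

`grouping_demo()` returns `(1.0, 0.0)`: regrouping the sum changes the result, which is exactly the transformation a vectorized reduction needs. Manually splitting the sum across several accumulators, as in the fallback impl above, makes that regrouping explicit in the source.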

@ashvardanian
Owner

The SimSIMD repository contains Rust benchmarks against native implementations. Maybe they are poorly implemented... Can you try cloning the SimSIMD repository and then running the benchmarks, as described in CONTRIBUTING.md?

cargo bench

Please check out the main branch version and the main-dev. I'd be happy to optimize the kernels further, but I am not sure that is possible. If the issue persists, it might be related to compilation settings 🤗
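
Since compilation settings are a suspect, here is a minimal Rust sketch of the two ways a SIMD target can be enabled: crate-wide at compile time (as with the `RUSTFLAGS="-C target-feature=+avx2,+fma"` setting from the original report) or per function with a runtime CPU check. It illustrates the general technique using standard library facilities only. It is not SimSIMD's dispatch code, and the function names are hypothetical.

```rust
/// True when this crate was *compiled* with AVX2 and FMA enabled,
/// for example through RUSTFLAGS.
fn compiled_with_avx2_fma() -> bool {
    cfg!(all(target_feature = "avx2", target_feature = "fma"))
}

/// Plain scalar loop. When inlined into a function that has extra target
/// features enabled, the compiler may use those features to vectorize it.
#[inline(always)]
fn dot_body(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Same loop, compiled with AVX2 and FMA enabled for this function only.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2_fma(a: &[f32], b: &[f32]) -> f32 {
    dot_body(a, b)
}

/// Picks the AVX2+FMA build when the running CPU supports it,
/// otherwise falls back to the baseline build.
fn dot_dispatch(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: both features were confirmed at runtime just above.
            return unsafe { dot_avx2_fma(a, b) };
        }
    }
    dot_body(a, b)
}
```

Printing `compiled_with_avx2_fma()` from a benchmark binary is a cheap way to confirm which flags actually reached the Rust side. It says nothing about how the C code was compiled, which has its own flags.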

@ChillFish8
Contributor Author

Using the repo benches, by default I get:

SIMD Cosine/SimSIMD/0   time:   [990.33 ns 991.20 ns 992.21 ns]
                        change: [-0.5469% -0.1196% +0.1941%] (p = 0.62 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
SIMD Cosine/Rust Native/0
                        time:   [997.99 ns 1.0023 µs 1.0066 µs]
                        change: [+0.8535% +1.1800% +1.5240%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
SIMD Cosine/SimSIMD/1   time:   [1.0071 µs 1.0112 µs 1.0159 µs]
                        change: [-0.5979% -0.0751% +0.4182%] (p = 0.77 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild
SIMD Cosine/Rust Native/1
                        time:   [995.26 ns 997.31 ns 999.95 ns]
                        change: [-3.4249% -2.3587% -1.4896%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
SIMD Cosine/SimSIMD/2   time:   [992.49 ns 993.86 ns 995.36 ns]
                        change: [-0.6670% -0.3172% +0.0164%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
SIMD Cosine/Rust Native/2
                        time:   [999.39 ns 1.0017 µs 1.0040 µs]
                        change: [+0.8312% +1.0924% +1.3528%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  7 (7.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe
SIMD Cosine/SimSIMD/3   time:   [999.12 ns 1.0029 µs 1.0071 µs]
                        change: [-0.8765% -0.3084% +0.1971%] (p = 0.28 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
SIMD Cosine/Rust Native/3
                        time:   [995.69 ns 997.72 ns 999.69 ns]
                        change: [+0.6852% +0.9139% +1.1508%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  5 (5.00%) high mild
SIMD Cosine/SimSIMD/4   time:   [989.46 ns 991.39 ns 993.36 ns]
                        change: [-2.4808% -1.7419% -1.1702%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
SIMD Cosine/Rust Native/4
                        time:   [984.42 ns 985.22 ns 986.16 ns]
                        change: [-1.9665% -1.4544% -0.9763%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe
SIMD Cosine/SimSIMD/5   time:   [984.21 ns 985.94 ns 987.71 ns]
                        change: [-1.6544% -1.1956% -0.8287%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
SIMD Cosine/Rust Native/5
                        time:   [987.03 ns 988.30 ns 989.81 ns]
                        change: [+1.0143% +1.1866% +1.3575%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

     Running rust/benches/sqeuclidean.rs (target/release/deps/sqeuclidean-1c498acee1c38350)
Gnuplot not found, using plotters backend
SIMD SqEuclidean/SimSIMD/0
                        time:   [964.05 ns 967.69 ns 971.67 ns]
                        change: [-1.6473% -1.2355% -0.8248%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  9 (9.00%) high mild
  1 (1.00%) high severe
SIMD SqEuclidean/Rust Native/0
                        time:   [973.53 ns 975.20 ns 977.10 ns]
                        change: [+186.66% +187.37% +188.16%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
SIMD SqEuclidean/SimSIMD/1
                        time:   [952.89 ns 954.25 ns 955.68 ns]
                        change: [-2.9500% -2.5561% -2.2074%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
SIMD SqEuclidean/Rust Native/1
                        time:   [973.70 ns 975.53 ns 977.30 ns]
                        change: [+186.14% +186.69% +187.28%] (p = 0.00 < 0.05)
                        Performance has regressed.
SIMD SqEuclidean/SimSIMD/2
                        time:   [965.95 ns 968.58 ns 971.30 ns]
                        change: [-1.8963% -1.5119% -1.1299%] (p = 0.00 < 0.05)
                        Performance has improved.
SIMD SqEuclidean/Rust Native/2
                        time:   [971.81 ns 973.68 ns 975.83 ns]
                        change: [+181.90% +183.47% +184.85%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
SIMD SqEuclidean/SimSIMD/3
                        time:   [957.05 ns 958.81 ns 960.71 ns]
                        change: [-3.2849% -2.8105% -2.3846%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
SIMD SqEuclidean/Rust Native/3
                        time:   [971.49 ns 972.77 ns 974.15 ns]
                        change: [+177.36% +179.33% +181.00%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe
SIMD SqEuclidean/SimSIMD/4
                        time:   [958.75 ns 962.49 ns 966.77 ns]
                        change: [-2.8413% -2.4086% -2.0098%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
SIMD SqEuclidean/Rust Native/4
                        time:   [977.67 ns 981.15 ns 984.38 ns]
                        change: [+183.37% +184.79% +186.12%] (p = 0.00 < 0.05)
                        Performance has regressed.
SIMD SqEuclidean/SimSIMD/5
                        time:   [957.25 ns 959.29 ns 961.63 ns]
                        change: [-3.4224% -3.1009% -2.8216%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
SIMD SqEuclidean/Rust Native/5
                        time:   [977.04 ns 979.62 ns 982.15 ns]
                        change: [+182.34% +184.11% +185.86%] (p = 0.00 < 0.05)
                        Performance has regressed.

@ChillFish8
Contributor Author

If I use the changes in PR #108 I get the following:

SIMD Cosine/SimSIMD/0   time:   [995.61 ns 997.99 ns 1.0008 µs]
                        change: [+0.1468% +0.4000% +0.6799%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  8 (8.00%) high mild
  5 (5.00%) high severe
SIMD Cosine/Rust Native/0
                        time:   [755.37 ns 758.73 ns 764.37 ns]
                        change: [-24.342% -24.086% -23.766%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
SIMD Cosine/SimSIMD/1   time:   [985.11 ns 986.34 ns 987.60 ns]
                        change: [-2.4883% -2.1633% -1.8513%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe
SIMD Cosine/Rust Native/1
                        time:   [752.29 ns 754.33 ns 757.00 ns]
                        change: [-25.113% -24.900% -24.675%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe
SIMD Cosine/SimSIMD/2   time:   [987.52 ns 988.61 ns 989.83 ns]
                        change: [-0.5561% -0.3441% -0.1497%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  7 (7.00%) high mild
  4 (4.00%) high severe
SIMD Cosine/Rust Native/2
                        time:   [751.62 ns 752.32 ns 753.19 ns]
                        change: [-25.024% -24.896% -24.770%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe
SIMD Cosine/SimSIMD/3   time:   [987.02 ns 988.13 ns 989.34 ns]
                        change: [-1.7928% -1.4180% -1.0880%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe
SIMD Cosine/Rust Native/3
                        time:   [751.43 ns 751.82 ns 752.29 ns]
                        change: [-25.020% -24.925% -24.828%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
SIMD Cosine/SimSIMD/4   time:   [989.97 ns 990.71 ns 991.66 ns]
                        change: [-0.0446% +0.1065% +0.2536%] (p = 0.17 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe
SIMD Cosine/Rust Native/4
                        time:   [750.46 ns 751.02 ns 751.60 ns]
                        change: [-23.947% -23.833% -23.728%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe
SIMD Cosine/SimSIMD/5   time:   [988.47 ns 989.15 ns 989.97 ns]
                        change: [+0.4132% +0.5962% +0.7772%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
SIMD Cosine/Rust Native/5
                        time:   [751.42 ns 752.31 ns 753.38 ns]
                        change: [-24.095% -23.966% -23.843%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

     Running rust/benches/sqeuclidean.rs (target/release/deps/sqeuclidean-1c498acee1c38350)
Gnuplot not found, using plotters backend
SIMD SqEuclidean/SimSIMD/0
                        time:   [954.47 ns 956.11 ns 957.70 ns]
                        change: [-1.1162% -0.7026% -0.3014%] (p = 0.00 < 0.05)
                        Change within noise threshold.
SIMD SqEuclidean/Rust Native/0
                        time:   [366.84 ns 367.18 ns 367.53 ns]
                        change: [-62.453% -62.353% -62.261%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
SIMD SqEuclidean/SimSIMD/1
                        time:   [946.73 ns 947.48 ns 948.28 ns]
                        change: [-0.9722% -0.8084% -0.6503%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
SIMD SqEuclidean/Rust Native/1
                        time:   [365.67 ns 365.83 ns 366.01 ns]
                        change: [-62.469% -62.396% -62.323%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe
SIMD SqEuclidean/SimSIMD/2
                        time:   [947.38 ns 949.31 ns 951.74 ns]
                        change: [-2.0238% -1.7564% -1.4912%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  9 (9.00%) high mild
  4 (4.00%) high severe
SIMD SqEuclidean/Rust Native/2
                        time:   [365.85 ns 366.11 ns 366.40 ns]
                        change: [-62.605% -62.540% -62.476%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe
SIMD SqEuclidean/SimSIMD/3
                        time:   [952.75 ns 954.40 ns 956.08 ns]
                        change: [-0.7782% -0.3103% +0.1819%] (p = 0.25 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
SIMD SqEuclidean/Rust Native/3
                        time:   [367.71 ns 368.50 ns 369.52 ns]
                        change: [-62.255% -62.179% -62.096%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
SIMD SqEuclidean/SimSIMD/4
                        time:   [946.24 ns 947.68 ns 949.34 ns]
                        change: [-1.3054% -0.9476% -0.5958%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
SIMD SqEuclidean/Rust Native/4
                        time:   [368.88 ns 370.15 ns 371.65 ns]
                        change: [-62.285% -62.067% -61.779%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe
SIMD SqEuclidean/SimSIMD/5
                        time:   [954.79 ns 955.77 ns 956.94 ns]
                        change: [-0.1110% +0.1493% +0.4162%] (p = 0.26 > 0.05)
                        No change in performance detected.
SIMD SqEuclidean/Rust Native/5
                        time:   [366.50 ns 366.84 ns 367.23 ns]
                        change: [-62.811% -62.688% -62.566%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  9 (9.00%) high mild
  3 (3.00%) high severe

@ChillFish8
Contributor Author

The compiler command being run to compile the C code is:

"cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-m64" "-I" "include" "-O3" "-std=c99" "-pedantic" "-DSIMSIMD_NATIVE_F16=0" "-DSIMSIMD_DYNAMIC_DISPATCH=1" "-DSIMSIMD_TARGET_SAPPHIRE=0" "-o" "/home/personal/simsimd/target/release/build/simsimd-be318405a648c44f/out/c/lib.o" "-c" "c/lib.c"

@ChillFish8
Contributor Author

If we tell the compiler that avx2 and fma can be targeted, we get an even faster version of the native Rust code, but no effect on the C side:

RUSTFLAGS="-C target-feature=+avx2,+fma" cargo bench -- --nocapture
SIMD Cosine/SimSIMD/0   time:   [981.74 ns 983.39 ns 985.48 ns]
                        change: [-1.5396% -1.2668% -0.9837%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) low mild
  5 (5.00%) high mild
  6 (6.00%) high severe
SIMD Cosine/Rust Native/0
                        time:   [130.86 ns 130.95 ns 131.06 ns]
                        change: [-82.739% -82.683% -82.640%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe
SIMD Cosine/SimSIMD/1   time:   [983.62 ns 985.05 ns 987.02 ns]
                        change: [-0.5092% -0.3685% -0.2163%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe
SIMD Cosine/Rust Native/1
                        time:   [131.07 ns 131.21 ns 131.34 ns]
                        change: [-82.568% -82.529% -82.498%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 20 outliers among 100 measurements (20.00%)
  6 (6.00%) low severe
  9 (9.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
SIMD Cosine/SimSIMD/2   time:   [981.05 ns 982.28 ns 983.70 ns]
                        change: [-1.0060% -0.8903% -0.7706%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
SIMD Cosine/Rust Native/2
                        time:   [131.01 ns 131.09 ns 131.17 ns]
                        change: [-82.575% -82.548% -82.516%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
SIMD Cosine/SimSIMD/3   time:   [980.46 ns 981.49 ns 982.76 ns]
                        change: [-0.2110% -0.0435% +0.1324%] (p = 0.64 > 0.05)
                        No change in performance detected.
SIMD Cosine/Rust Native/3
                        time:   [130.89 ns 131.03 ns 131.24 ns]
                        change: [-82.550% -82.529% -82.510%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe
SIMD Cosine/SimSIMD/4   time:   [978.19 ns 978.80 ns 979.51 ns]
                        change: [-1.2474% -1.1591% -1.0734%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe
SIMD Cosine/Rust Native/4
                        time:   [131.07 ns 131.18 ns 131.28 ns]
                        change: [-82.580% -82.562% -82.546%] (p = 0.00 < 0.05)
                        Performance has improved.
SIMD Cosine/SimSIMD/5   time:   [982.41 ns 982.88 ns 983.39 ns]
                        change: [-0.9772% -0.8781% -0.7844%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
SIMD Cosine/Rust Native/5
                        time:   [132.08 ns 132.25 ns 132.44 ns]
                        change: [-82.460% -82.416% -82.372%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

     Running rust/benches/sqeuclidean.rs (target/release/deps/sqeuclidean-789b6d1bba04e87b)
Gnuplot not found, using plotters backend
SIMD SqEuclidean/SimSIMD/0
                        time:   [953.51 ns 955.58 ns 957.60 ns]
                        change: [-0.6461% -0.4286% -0.2139%] (p = 0.00 < 0.05)
                        Change within noise threshold.
SIMD SqEuclidean/Rust Native/0
                        time:   [117.68 ns 120.45 ns 123.87 ns]
                        change: [-68.098% -67.815% -67.421%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) high mild
  7 (7.00%) high severe
SIMD SqEuclidean/SimSIMD/1
                        time:   [955.73 ns 963.38 ns 973.22 ns]
                        change: [+0.4900% +0.8694% +1.4353%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe
SIMD SqEuclidean/Rust Native/1
                        time:   [116.90 ns 117.05 ns 117.22 ns]
                        change: [-67.916% -67.849% -67.782%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe
SIMD SqEuclidean/SimSIMD/2
                        time:   [948.83 ns 949.71 ns 950.67 ns]
                        change: [+0.2005% +0.3694% +0.5291%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
SIMD SqEuclidean/Rust Native/2
                        time:   [117.09 ns 117.52 ns 117.91 ns]
                        change: [-68.257% -68.178% -68.101%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
SIMD SqEuclidean/SimSIMD/3
                        time:   [965.79 ns 968.94 ns 972.52 ns]
                        change: [+1.0966% +1.6960% +2.2373%] (p = 0.00 < 0.05)
                        Performance has regressed.
SIMD SqEuclidean/Rust Native/3
                        time:   [118.14 ns 118.67 ns 119.21 ns]
                        change: [-68.157% -68.036% -67.887%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild
SIMD SqEuclidean/SimSIMD/4
                        time:   [959.39 ns 962.01 ns 965.08 ns]
                        change: [+1.2580% +1.6979% +2.1558%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
SIMD SqEuclidean/Rust Native/4
                        time:   [116.25 ns 116.36 ns 116.47 ns]
                        change: [-68.894% -68.668% -68.507%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
SIMD SqEuclidean/SimSIMD/5
                        time:   [948.41 ns 949.47 ns 950.65 ns]
                        change: [-1.5866% -1.3651% -1.1355%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  8 (8.00%) high mild
  4 (4.00%) high severe
SIMD SqEuclidean/Rust Native/5
                        time:   [116.15 ns 116.26 ns 116.38 ns]
                        change: [-68.397% -68.363% -68.331%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  6 (6.00%) high mild

@ashvardanian
Owner

ashvardanian commented May 6, 2024

Is that all still on the same Ryzen CPU, @ChillFish8?

I was just refreshing the ParallelReductionsBenchmark and added a loop-unrolled variant with scalar code in the C++ layer. It still loses to SIMD, even for f32:

$ build_release/reduce_bench
You did not feed the size of arrays, so we will use a 1GB array!
2024-05-06T00:11:14+00:00
Running build_release/reduce_bench
Run on (160 X 2100 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x160)
  L1 Instruction 32 KiB (x160)
  L2 Unified 4096 KiB (x80)
  L3 Unified 16384 KiB (x2)
Load Average: 3.23, 19.01, 13.71
----------------------------------------------------------------------------------------------------------------
Benchmark                                                      Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------
unrolled<f32>/min_time:10.000/real_time                149618549 ns    149615366 ns           95 bytes/s=7.17653G/s error,%=50
unrolled<f64>/min_time:10.000/real_time                146594731 ns    146593719 ns           95 bytes/s=7.32456G/s error,%=0
avx2<f32>/min_time:10.000/real_time                    110796474 ns    110794861 ns          127 bytes/s=9.69112G/s error,%=50
avx2<f32kahan>/min_time:10.000/real_time               134144762 ns    134137771 ns          105 bytes/s=8.00435G/s error,%=0
avx2<f64>/min_time:10.000/real_time                    115791797 ns    115790878 ns          121 bytes/s=9.27304G/s error,%=0

You can find more results in that repo's README.
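For readers wondering why the `f32kahan` row reports `error,%=0` while the plain `f32` rows report `error,%=50`: one plausible reading (an assumption, not confirmed in this thread) is that an uncompensated f32 accumulator stops absorbing small addends once the running sum grows large. A minimal Rust sketch of naive versus Kahan-compensated summation follows; the function names `sum_naive` and `sum_kahan` are hypothetical and are not taken from the benchmark's C++ sources.

```rust
/// Plain left-to-right f32 accumulation (hypothetical helper for illustration).
pub fn sum_naive(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

/// Kahan (compensated) summation: `c` carries the low-order bits
/// that were lost when the previous addend was folded into `sum`.
pub fn sum_kahan(xs: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    let mut c = 0.0f32;
    for &x in xs {
        let y = x - c; // re-apply the error lost in the previous step
        let t = sum + y; // low-order bits of `y` may be rounded away here
        c = (t - sum) - y; // recover what was rounded away
        sum = t;
    }
    sum
}
```

Summing twenty million `1.0f32` values shows the effect: the naive sum stalls at 2^24 = 16777216 because `16777216.0 + 1.0` rounds back to `16777216.0`, while the compensated sum reaches 20000000.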

@ChillFish8
Contributor Author

Hey, yes. It is worth noting what is happening under the hood in my last comment, though: LLVM is autovectorizing that loop and emitting FMA instructions because it has been allowed to assume AVX2 and FMA support.
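To illustrate the kind of loop being described, here is a minimal sketch of a pure-Rust dot product that LLVM can typically autovectorize. It is not the code from the benchmarks above: the function name `dot`, the eight-lane accumulator layout, and the use of `f32::mul_add` are assumptions made for the example. It is meant to be built with something like `RUSTFLAGS="-C target-feature=+avx2,+fma"` (or `-C target-cpu=native`) so the compiler may assume AVX2 and FMA.

```rust
/// Dot product with eight independent accumulators (hypothetical sketch).
/// The independent lanes avoid a single serial dependency chain, which is
/// what usually lets LLVM map the inner loop onto one 256-bit vector register.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    const LANES: usize = 8;
    let mut acc = [0.0f32; LANES];

    let a_chunks = a.chunks_exact(LANES);
    let b_chunks = b.chunks_exact(LANES);
    // Leftover elements when the length is not a multiple of LANES.
    let a_tail = a_chunks.remainder();
    let b_tail = b_chunks.remainder();

    for (ca, cb) in a_chunks.zip(b_chunks) {
        for i in 0..LANES {
            // With the `fma` target feature enabled this can lower to a fused
            // multiply-add instruction; without it, `mul_add` falls back to a
            // slower software routine.
            acc[i] = ca[i].mul_add(cb[i], acc[i]);
        }
    }

    // Horizontal reduction of the lanes, then the scalar tail.
    let mut sum: f32 = acc.iter().sum();
    for (x, y) in a_tail.iter().zip(b_tail) {
        sum = x.mul_add(*y, sum);
    }
    sum
}
```

Because the lanes are summed in a different order than a plain sequential loop, results can differ in the last bits from a strictly ordered reduction. Whether the loop is actually vectorized depends on the compiler version and flags, so it is worth checking the generated assembly.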
