WIP Switch to a full bitwidth h2 #513

matthieu-m · 2024-03-24T16:53:15Z

Changes:

Use all values of h2, not just 130 of them.
Convert SSE2 implementation for benchmarking.
Add generic tests to verify that Group functions are correctly implemented.

Motivation:

Using 256 values instead of 130 could theoretically lower the number of false-positive residual matches by close to 50%.

On the other hand, it does make h2 slightly more complicated to compute, and possibly to operate on.
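
The "close to 50%" figure can be checked with a little arithmetic. The sketch below assumes that h2 is uniformly distributed, that 130 byte values in use means 128 usable h2 values plus EMPTY and DELETED, and that the full-width scheme leaves 254 usable values; the function names are illustrative only and are not part of hashbrown.

```rust
/// Probability that a non-matching full slot still matches on h2,
/// assuming h2 is uniform over `usable_values` distinct values.
pub fn false_positive_rate(usable_values: u32) -> f64 {
    1.0 / f64::from(usable_values)
}

/// Relative reduction in false positives when going from `old` to `new`
/// usable h2 values (0.0 = no change, 1.0 = all false positives removed).
pub fn relative_reduction(old: u32, new: u32) -> f64 {
    1.0 - false_positive_rate(new) / false_positive_rate(old)
}
```

Under these assumptions, `relative_reduction(128, 254)` evaluates to 1 - 128/254, just under one half.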

Limitations:

Only the SSE2 implementation is ported at first, not the generic or NEON ones, to gauge whether the performance looks worth it.

Design:

The values for EMPTY and DELETED are chosen so as to play well with SSE2, which does not have unsigned vectors. By using the top of the signed range, operations to distinguish between special and non-special are reduced to a single comparison, whereas using the middle of the range would require 2.

On the other hand, the convert_special_to_empty_and_full_to_deleted method is more complicated.
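
The single-comparison property can be sketched in scalar form. The values 127 and 126 are the ones mentioned later in this thread; the second function uses 0xFE/0xFF purely as a hypothetical "middle of the signed range" layout for contrast, and none of these helpers are hashbrown's actual API.

```rust
// Special control bytes at the top of the signed i8 range.
pub const EMPTY: u8 = 0b0111_1111; // 127 as i8
pub const DELETED: u8 = 0b0111_1110; // 126 as i8

/// Top-of-range layout: one signed comparison separates special from full.
pub fn is_special(ctrl: u8) -> bool {
    (ctrl as i8) >= (DELETED as i8)
}

/// Hypothetical layout with specials at 0xFE and 0xFF (-2 and -1 as i8),
/// i.e. in the middle of the signed range: two signed comparisons are needed.
pub fn is_special_mid_range(ctrl: u8) -> bool {
    let signed = ctrl as i8;
    signed >= -2 && signed < 0
}

/// EMPTY is the largest signed value, so an equality test is enough.
pub fn is_empty(ctrl: u8) -> bool {
    ctrl == EMPTY
}
```

In the vector version each scalar comparison corresponds to one signed byte-compare instruction, which is why the top-of-range layout is cheaper to test.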

The results, on my machine, are not encouraging, but my machine is noisy so conducting proper benchmarking is tough.

matthieu-m · 2024-03-25T17:11:35Z

@Amanieu Would it be possible to benchmark on SSE2?

I'd like to see if it's worth it before trying to make the generic and neon targets pass (especially the generic one, all that bit fiddling is a bit involved).

Amanieu · 2024-03-30T21:54:01Z

Here are my benchmark results:

 name                         old.txt ns/iter  new.txt ns/iter  diff ns/iter   diff %  speedup 
 clone_from_large             5,064            5,170                     106    2.09%   x 0.98 
 clone_from_small             46               46                          0    0.00%   x 1.00 
 clone_large                  5,141            5,220                      79    1.54%   x 0.98 
 clone_small                  61               62                          1    1.64%   x 0.98 
 grow_insert_ahash_highbits   19,525           20,297                    772    3.95%   x 0.96 
 grow_insert_ahash_random     19,889           20,877                    988    4.97%   x 0.95 
 grow_insert_ahash_serial     19,530           20,794                  1,264    6.47%   x 0.94 
 grow_insert_std_highbits     36,006           36,842                    836    2.32%   x 0.98 
 grow_insert_std_random       35,998           36,841                    843    2.34%   x 0.98 
 grow_insert_std_serial       35,786           36,865                  1,079    3.02%   x 0.97 
 insert                       13,911           8,763                  -5,148  -37.01%   x 1.59 
 insert_ahash_highbits        17,971           18,048                     77    0.43%   x 1.00 
 insert_ahash_random          18,071           17,950                   -121   -0.67%   x 1.01 
 insert_ahash_serial          17,980           17,867                   -113   -0.63%   x 1.01 
 insert_erase_ahash_highbits  18,830           19,945                  1,115    5.92%   x 0.94 
 insert_erase_ahash_random    19,060           19,897                    837    4.39%   x 0.96 
 insert_erase_ahash_serial    18,426           19,221                    795    4.31%   x 0.96 
 insert_erase_std_highbits    32,468           33,627                  1,159    3.57%   x 0.97 
 insert_erase_std_random      33,186           34,270                  1,084    3.27%   x 0.97 
 insert_erase_std_serial      32,487           33,589                  1,102    3.39%   x 0.97 
 insert_std_highbits          22,173           22,509                    336    1.52%   x 0.99 
 insert_std_random            22,249           22,673                    424    1.91%   x 0.98 
 insert_std_serial            22,203           22,440                    237    1.07%   x 0.99 
 insert_unique_unchecked      5,300            5,614                     314    5.92%   x 0.94 
 iter_ahash_highbits          928              927                        -1   -0.11%   x 1.00 
 iter_ahash_random            932              926                        -6   -0.64%   x 1.01 
 iter_ahash_serial            932              916                       -16   -1.72%   x 1.02 
 iter_std_highbits            930              938                         8    0.86%   x 0.99 
 iter_std_random              931              912                       -19   -2.04%   x 1.02 
 iter_std_serial              946              920                       -26   -2.75%   x 1.03 
 lookup_ahash_highbits        3,938            3,989                      51    1.30%   x 0.99 
 lookup_ahash_random          4,039            4,208                     169    4.18%   x 0.96 
 lookup_ahash_serial          3,829            3,929                     100    2.61%   x 0.97 
 lookup_fail_ahash_highbits   3,267            3,389                     122    3.73%   x 0.96 
 lookup_fail_ahash_random     3,316            3,502                     186    5.61%   x 0.95 
 lookup_fail_ahash_serial     3,367            3,332                     -35   -1.04%   x 1.01 
 lookup_fail_std_highbits     9,224            9,561                     337    3.65%   x 0.96 
 lookup_fail_std_random       9,337            9,620                     283    3.03%   x 0.97 
 lookup_fail_std_serial       9,246            9,318                      72    0.78%   x 0.99 
 lookup_std_highbits          9,996            10,278                    282    2.82%   x 0.97 
 lookup_std_random            10,077           10,681                    604    5.99%   x 0.94 
 lookup_std_serial            10,002           10,569                    567    5.67%   x 0.95 
 rehash_in_place              180,307          187,097                 6,790    3.77%   x 0.96

Overall, this seems like a loss. My guess is that the extra logic needed when handling h2 values is slowing things down.

matthieu-m · 2024-03-31T14:00:19Z

One thing I wonder is how many collisions there are in the first place.

Perfect hashes should be akin to uniformly spread values across the 0-127 range today. Given the usual birthday-paradox approximation, the expected number of colliding pairs across 16 elements is 16 * 15 / (2 * 128) = 0.9375. That figure is also an upper bound on the probability of at least one collision; the exact probability is about 62%.

Thus even with only 128 values to choose from, the chance of two given elements having the same residual h2 is pretty low in the first place (1 in 128). Even in a full group -- which won't be the case most of the time -- we expect fewer than one colliding pair on average... and if there's a single collision, it means 14 elements (out of 16) have no collision.
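
As a sanity check on the birthday-paradox estimate, here is a small sketch that computes both the expected number of colliding pairs and the exact probability of at least one collision, assuming uniformly distributed h2 values. The function names are made up for this example.

```rust
/// Expected number of pairs sharing an h2 value among `n` elements,
/// with `values` equally likely h2 values: n * (n - 1) / (2 * values).
pub fn expected_colliding_pairs(n: u32, values: u32) -> f64 {
    f64::from(n * n.saturating_sub(1)) / (2.0 * f64::from(values))
}

/// Exact probability that at least two of `n` elements share an h2 value.
pub fn collision_probability(n: u32, values: u32) -> f64 {
    // Probability that all n elements are pairwise distinct.
    let all_distinct: f64 = (0..n)
        .map(|i| 1.0 - f64::from(i) / f64::from(values))
        .product();
    1.0 - all_distinct
}
```

For a full group of 16 with 128 values, the expected pair count is 0.9375 while the probability of at least one collision is roughly 62%; with 254 values that probability drops to roughly 38%.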

This means paying for the extra complexity for every element, but rarely ever needing it.

I tried my best to keep the cost low, but computing h2 is slightly harder, and the "magic" remapping on rehash is definitely not optimal. Perhaps someone with better insight could pick better values, and better instructions.

JustForFun88 · 2024-04-26T18:40:33Z

@matthieu-m It seems to me that you did not take into account that, on the x64 platform, the expression let top8 = hash >> (MIN_HASH_LEN * 8 - 7) is a truncation that keeps only the top 7 bits. The implementation is therefore slightly incorrect, which would explain the poor performance.

In addition, I suggest also considering the following possible implementation (godbolt):

const EMPTY: u8 = 0b1111_1111;   // 255
const DELETED: u8 = 0b1111_1110; // 254

pub fn h2(hash: u64) -> u8 {
    let bit = hash as u8;
    bit >> ((bit > 253) as u8)
}
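
The suggestion above is small enough to check exhaustively over all 256 low-byte values. The sketch below reuses the proposed `h2` unchanged; `h2_histogram` is a helper invented for this check.

```rust
pub const EMPTY: u8 = 0b1111_1111; // 255
pub const DELETED: u8 = 0b1111_1110; // 254

/// The proposed h2: the low byte of the hash, shifted right by one bit
/// when it would otherwise equal DELETED (254) or EMPTY (255).
pub fn h2(hash: u64) -> u8 {
    let bit = hash as u8;
    bit >> ((bit > 253) as u8)
}

/// Counts how many of the 256 possible low bytes map to each h2 value.
pub fn h2_histogram() -> [u32; 256] {
    let mut counts = [0u32; 256];
    for low in 0..=255u64 {
        counts[h2(low) as usize] += 1;
    }
    counts
}
```

The histogram shows that no input produces 254 or 255, and that 254 and 255 both fold onto 127, so that one value is slightly over-represented.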

matthieu-m · 2024-04-27T10:36:46Z

> @matthieu-m It seems to me that you did not take into account that, on the x64 platform, the expression let top8 = hash >> (MIN_HASH_LEN * 8 - 7) is a truncation that keeps only the top 7 bits. The implementation is therefore slightly incorrect, which would explain the poor performance.

I realize that this implementation is also faulty: I was aiming for the top 8 bits, but it seems we only get the top 7 here.

> In addition, I suggest also considering the following possible implementation (godbolt):
>
> const EMPTY: u8 = 0b1111_1111;   // 255
> const DELETED: u8 = 0b1111_1110; // 254
>
> #[no_mangle]
> pub fn h2(hash: u64) -> u8 {
>     let bit = hash as u8;
>     bit >> ((bit > 253) as u8)
> }

This switches from the top 8 to the bottom 8 bits, and will overlap with h1. Is there a reason to prefer the bottom 8 (and adjust h1) over the top 8?

The trick to shift by 1 to avoid colliding with special values is interesting, and may save cycles compared to an array lookup.

JustForFun88 · 2024-04-27T14:53:28Z

> This switches from the top 8 to the bottom 8 bits, and will overlap with h1. Is there a reason to prefer the bottom 8 (and adjust h1) over the top 8?

To be honest, I don’t know why the top bits should be used. We just need some bits to skip obvious hash (value) mismatches. Since we use a good hasher, it seems we can take any bits for this purpose, or maybe I’m missing something.

The h1 function is used for indexing and I think it doesn’t matter that it overlaps with h2. On x64 h1 simply returns the provided value (u64).

JustForFun88 · 2024-04-27T21:53:18Z

src/raw/mod.rs

    // value, some hash functions (such as FxHash) produce a usize result
    // instead, which means that the top 32 bits are 0 on 32-bit platforms.
    // So we use MIN_HASH_LEN constant to handle this.
-    let top7 = hash >> (MIN_HASH_LEN * 8 - 7);
-    (top7 & 0x7f) as u8 // truncation
+    let top8 = hash >> (MIN_HASH_LEN * 8 - 7);


Suggested change

-    let top8 = hash >> (MIN_HASH_LEN * 8 - 7);
+    let top8 = hash >> (MIN_HASH_LEN * 8 - 8);

Or just:

/// Secondary hash function, saved in the control byte.
#[inline]
#[allow(clippy::cast_possible_truncation)]
fn h2(hash: u64) -> u8 {
    // Grab the low 8 bits of the hash. If the byte would collide with a
    // special value (EMPTY or DELETED), shift it right by 1 bit.
    let bit = hash as u8;
    bit >> ((bit as i8 > (DELETED as i8 - 1)) as u8)
}
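
For reference, the effect of the two shift amounts in the suggested change can be compared directly. This sketch assumes a 64-bit target where MIN_HASH_LEN is 8 bytes, and declares the constant locally as a `u32` for convenience; the function names are illustrative.

```rust
// Assumption for this sketch: 64-bit target, so MIN_HASH_LEN is 8 bytes.
pub const MIN_HASH_LEN: u32 = 8;

/// Shift by 57: only the top 7 bits of the hash survive.
pub fn top_bits_shift_7(hash: u64) -> u8 {
    (hash >> (MIN_HASH_LEN * 8 - 7)) as u8
}

/// Shift by 56: the full top byte of the hash survives.
pub fn top_bits_shift_8(hash: u64) -> u8 {
    (hash >> (MIN_HASH_LEN * 8 - 8)) as u8
}
```

With the 7-bit shift the result can never exceed 127, so half of the byte range is unreachable no matter how good the hash is.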

JustForFun88 · 2024-04-27T22:08:18Z

@matthieu-m I carefully looked at the implementation and you are right, it is not possible to use values other than 127_i8 and 126_i8. In addition to your improvements, I also suggest changing the erase function:

const EMPTY: u8 = 0b0111_1111;   // 127
const DELETED: u8 = 0b0111_1110; // 126

impl RawTableInner {
    #[inline]
    unsafe fn erase(&mut self, index: usize) {
        debug_assert!(self.is_bucket_full(index));

        let index_before = index.wrapping_sub(Group::WIDTH) & self.bucket_mask;
        let empty_before = Group::load(self.ctrl(index_before)).match_empty();
        let empty_after = Group::load(self.ctrl(index)).match_empty();

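        // Branchless replacement for the `if`: `empty_group` is 1 when the slot
        // can be marked EMPTY and 0 when it must be marked DELETED, so
        // DELETED + empty_group yields the right control byte (EMPTY == DELETED + 1).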
        let empty_group =
            (empty_before.leading_zeros() + empty_after.trailing_zeros() < Group::WIDTH) as u8;
        let ctrl = DELETED + empty_group;
        self.growth_left += empty_group as usize;

        self.set_ctrl(index, ctrl);
        self.items -= 1;
    }
}

It looks like it will improve removal performance: https://godbolt.org/z/5P9oTYe5z
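
The equivalence behind this suggestion can be checked in isolation. In the sketch below the branching form is what I understand the existing `if`-based logic to be (an assumption, not copied from the repository), `GROUP_WIDTH` is fixed at 16 as for SSE2, and both helpers return the new control byte together with the increment to `growth_left`.

```rust
pub const EMPTY: u8 = 0b0111_1111; // 127
pub const DELETED: u8 = 0b0111_1110; // 126
pub const GROUP_WIDTH: usize = 16; // assumed SSE2 group width

/// Branching form: choose the control byte with an `if`.
pub fn erase_ctrl_branching(leading: usize, trailing: usize) -> (u8, usize) {
    if leading + trailing >= GROUP_WIDTH {
        (DELETED, 0)
    } else {
        (EMPTY, 1)
    }
}

/// Branchless form from the suggestion: relies on EMPTY == DELETED + 1.
pub fn erase_ctrl_branchless(leading: usize, trailing: usize) -> (u8, usize) {
    let empty_group = (leading + trailing < GROUP_WIDTH) as u8;
    (DELETED + empty_group, empty_group as usize)
}
```

Both forms agree for every combination of leading and trailing counts, so the trick only changes how the compiler can lower the code, not the result.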

JustForFun88 · 2024-04-28T06:45:14Z

src/raw/sse2.rs

+            let empty = x86::_mm_set1_epi8(EMPTY as i8);
+            let deleted = x86::_mm_set1_epi8(DELETED as i8);
+
+            let is_full = x86::_mm_cmplt_epi8(self.0, deleted);
+            let is_special = x86::_mm_cmpeq_epi8(is_full, x86::_mm_set1_epi8(0));
+


Suggested change

-            let empty = x86::_mm_set1_epi8(EMPTY as i8);
-            let deleted = x86::_mm_set1_epi8(DELETED as i8);
-            let is_full = x86::_mm_cmplt_epi8(self.0, deleted);
-            let is_special = x86::_mm_cmpeq_epi8(is_full, x86::_mm_set1_epi8(0));
+            // Find all special bytes. A byte is EMPTY or DELETED if it is greater than or equal to DELETED.
+            let is_special = x86::_mm_cmpgt_epi8(self.0, x86::_mm_set1_epi8(DELETED as i8 - 1));
+            // Computes the bitwise OR between array of EMPTY bytes (that represents special bytes)
+            // and array of DELETED bytes. The logic is based on manipulating by the low bit of the byte:
+            //
+            // - If the byte was equal to EMPTY (0111_1111), then bitwise OR with DELETED will
+            //   not change its value (0111_1111 | 0111_1110 = 0111_1111);
+            // - If the byte was FULL (0000_0000), then bitwise OR with DELETED will make it
+            //   DELETED (0000_0000 | 0111_1110 = 0111_1110)

JustForFun88 · 2024-04-28T06:45:37Z

src/raw/sse2.rs

+                x86::_mm_and_si128(is_full, deleted),
+                x86::_mm_and_si128(is_special, empty),


Suggested change

-                x86::_mm_and_si128(is_full, deleted),
-                x86::_mm_and_si128(is_special, empty),
+                // Converting all bytes that represent special bytes (`1111_1111`) to EMPTY `0111_1111`
+                // 1111_1111 & 0111_1111 = 0111_1111
+                // 0000_0000 & 0111_1111 = 0000_0000
+                x86::_mm_and_si128(is_special, x86::_mm_set1_epi8(EMPTY as i8)),
+                // Array of DELETED bytes
+                x86::_mm_set1_epi8(DELETED as i8),

This reduces the number of function calls from 8 to 6.
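
The two suggestions combine into a compare, an AND and an OR. A scalar, per-byte sketch of that sequence is below; it mirrors the intrinsics one byte at a time and is not the vector code itself, and `convert_byte` is a name invented for this example.

```rust
pub const EMPTY: u8 = 0b0111_1111; // 127
pub const DELETED: u8 = 0b0111_1110; // 126

/// Per-byte model of the suggested sequence:
/// special (EMPTY or DELETED) -> EMPTY, full -> DELETED.
pub fn convert_byte(ctrl: u8) -> u8 {
    // Models _mm_cmpgt_epi8: all ones if the byte is special, else all zeros.
    let is_special: u8 = if (ctrl as i8) > (DELETED as i8 - 1) { 0xFF } else { 0x00 };
    // Models _mm_and_si128 with EMPTY, then _mm_or_si128 with DELETED.
    (is_special & EMPTY) | DELETED
}
```

Because EMPTY and DELETED differ only in the low bit, OR-ing with DELETED leaves EMPTY untouched and turns a zeroed byte into DELETED.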

matthieu-m · 2024-04-28T10:18:30Z

@JustForFun88 You seem to have many (good!) suggestions, and unfortunately I don't have much bandwidth right now.

I can probably get to them eventually (I may have some time in 2-3 weeks), but that's a bit of an eternity momentum-wise.

So, if I may suggest, why don't you check out the branch, apply your suggested changes, and run the benchmarks? You'll see quickly if it works out or not, and you'll be able to try further ideas as well.

matthieu-m · 2024-04-28T10:33:49Z

> > This switches from the top 8 to the bottom 8 bits, and will overlap with h1. Is there a reason to prefer the bottom 8 (and adjust h1) over the top 8?
>
> To be honest, I don’t know why the top bits should be used. We just need some bits to skip obvious hash (value) mismatches. Since we use a good hasher, it seems we can take any bits for this purpose, or maybe I’m missing something.
>
> The h1 function is used for indexing and I think it doesn’t matter that it overlaps with h2. On x64 h1 simply returns the provided value (u64).

You're correct. h1 is used for indexing via % table length which takes the bottom bits, and Swiss Table (Abseil's implementation) uses the top bits for h2 (the residual).

Now, imagine a table of 256 slots:

index = h1 % 256 = hash % 256.
h2 = hash % 256.

And therein lies the rub. The goal of h2 is to provide a quick filter across the elements of a group of 16 (contiguous) elements: if you use h2 to decide which group the element goes in, then the h2s of the elements within the group are not uniformly distributed -- at all -- and thus it becomes a very poor filter, and performance will likely suffer.

Thus it's important to try and source h1 and h2 from different, non-overlapping bits, as much as possible. And taking top and bottom is the easiest way to do so. Which of the two takes top and which takes bottom shouldn't matter -- with a 64-bit hash -- so if it's faster for h2 to take the bottom bits, then defining h1 as hash >> 8 would work.
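
The 256-slot example can be made concrete with a small experiment. The sketch below generates pseudo-random hashes with SplitMix64 (chosen here only as a convenient deterministic generator), keeps those that land on one slot of a 256-slot table, and counts how many distinct h2 values they carry under each scheme. All names are illustrative, and it looks at a single slot rather than a whole group to keep things short.

```rust
use std::collections::HashSet;

/// SplitMix64 step, used only to get deterministic, well-mixed 64-bit values.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// h2 taken from the same low bits that pick the slot.
pub fn h2_low(hash: u64) -> u8 {
    (hash % 256) as u8
}

/// h2 taken from the top byte, disjoint from the index bits.
pub fn h2_top(hash: u64) -> u8 {
    (hash >> 56) as u8
}

/// Number of distinct h2 values among sampled hashes with index == `slot`
/// in a table of 256 slots (index = hash % 256).
pub fn distinct_h2_in_slot(slot: u64, samples: usize, h2: fn(u64) -> u8) -> usize {
    let mut state = 0x1234_5678_9ABC_DEF0u64;
    let mut seen = HashSet::new();
    for _ in 0..samples {
        let hash = splitmix64(&mut state);
        if hash % 256 == slot {
            seen.insert(h2(hash));
        }
    }
    seen.len()
}
```

With overlapping bits every element that targets the slot carries the same h2, so the filter rejects nothing; with disjoint bits the h2 values spread over many distinct values.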

morrisonlevi · 2024-05-22T22:14:13Z

> The h1 function is used for indexing and I think it doesn’t matter that it overlaps with h2. On x64 h1 simply returns the provided value (u64).

Yes, because only roughly 48 bits are actually addressable on x86_64 presently. And I'm not sure of any systems which even approach that size... so when the h1 gets truncated to the bucket mask, those upper bits aren't looked at. It's not worth doing any work to shrink down to the lower 57 bits because those will get filtered by the mask anyway.
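
A tiny sketch of the masking argument: two hashes that differ only in their top 7 bits select the same starting bucket once the bucket mask is applied. The `h1` here follows the description in this thread (it returns the hash unchanged); `start_index` and the table size are made up for the example.

```rust
/// h1 as described above for 64-bit targets: the hash value itself.
pub fn h1(hash: u64) -> usize {
    hash as usize
}

/// Starting bucket for a probe: h1 truncated by the bucket mask.
/// `bucket_mask` is (number of buckets - 1) for a power-of-two table.
pub fn start_index(hash: u64, bucket_mask: usize) -> usize {
    h1(hash) & bucket_mask
}
```

For example, with a mask of `(1 << 20) - 1`, setting or clearing bits 57 to 63 of the hash leaves `start_index` unchanged, so those bits are free to feed h2.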

bors · 2024-09-17T21:26:07Z

☔ The latest upstream changes (presumably #558) made this pull request unmergeable. Please resolve the merge conflicts.

matthieu-m force-pushed the performance/full-bitwidth-h2 branch from 61ed56b to 2098bd4 Compare March 24, 2024 16:54

matthieu-m mentioned this pull request Apr 6, 2024

Was swap-remove behavior ever considered when removing entries? #503

Open
