Use SIMD to speed up clamping #2020

Shnatsel · 2023-09-29T19:23:06Z

Leverage autovectorization to speed up clamping.

Improves end-to-end WebP decoding performance by 2% on the x86_64 baseline; it should do even better with more recent instruction sets, and even benefits from AVX-512.

It's not much of an improvement, because Huffman decoding takes up so much time that everything else is trivial by comparison (see image-rs/image-webp#55), but it will become significant once Huffman decoding is sped up.

Fixes #2019

I'm opening this to get feedback on the basic direction. If you're OK with the approach I can add it to other parts of the codebase for a few % speedups in other places too.

Shnatsel · 2023-09-29T19:28:49Z

Godbolt version instantiated with concrete types so you can see the assembly at various ISA levels: https://godbolt.org/z/dvaMTbPqq
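For readers without the Godbolt link at hand, here is a minimal sketch of the kind of loop that tends to autovectorize. It is not the code from this PR: the name `clamp_to_u8` and the `i32` input type are assumptions chosen for illustration. The idea is to run the clamp over whole slices with no bounds checks or early exits in the loop body, so that LLVM is free to turn it into packed min/max and narrowing instructions.

```rust
/// Hypothetical helper (not the PR's function): clamp every value in `src`
/// to the range 0..=255 and store it as a byte in `dst`.
///
/// `zip` stops at the shorter of the two slices, so the loop body needs no
/// bounds checks, which is usually what lets the compiler vectorize it.
pub fn clamp_to_u8(src: &[i32], dst: &mut [u8]) {
    for (out, &value) in dst.iter_mut().zip(src.iter()) {
        // After clamping, the value fits in a byte, so the cast is lossless.
        *out = value.clamp(0, 255) as u8;
    }
}
```

Whether SIMD instructions are actually emitted depends on the compiler version, the target features, and how the function gets inlined, so the generated assembly is worth checking for each call site.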

fintelia · 2023-09-29T22:04:08Z

I've been in the process of forking the WebP codec out into its own crate here: https://github.com/image-rs/image-webp, so that would be a better place for non-urgent changes to the codec. There's also an in-progress draft PR that (among many other things) incidentally modifies fill_single to have bit-exact output matching libwebp.

Overall, I'd say I'm hesitant but not opposed to making the code messier in order to make it vectorize better.

Shnatsel · 2023-09-30T23:19:45Z

I see that image-rs/image-webp#2 refactors this function. Should I make this change to the version currently in master, or build on top of that PR?

fintelia · 2023-10-02T02:52:34Z

I finished that other PR, so you can now work directly against the main branch there.

Though one thing to point out is that lossless and lossy WebP are basically separate formats and have different performance bottlenecks (though the alpha channel of lossy WebP can be encoded with the lossless format).

Lossy involves:

VP8 decoding - Haven't worked on this part of the code, so I'm not actually sure how optimal this is.
YUV -> RGB - This is where clamping happens. I believe the reference encoder does this entirely with SIMD.
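To make it concrete why clamping shows up in the YUV -> RGB step, here is a rough sketch of a per-pixel conversion. It assumes the textbook full-range BT.601 coefficients in 8-bit fixed point, which is not necessarily the arithmetic libwebp or this crate uses; `yuv_to_rgb` and `clamp_u8` are hypothetical names. The point is only that the intermediate sums can fall outside 0..=255, so every channel has to be clamped.

```rust
/// Hypothetical helper: saturate an intermediate value to a byte.
fn clamp_u8(v: i32) -> u8 {
    v.clamp(0, 255) as u8
}

/// Sketch of a YUV -> RGB conversion for one pixel.
/// Assumption: full-range BT.601 coefficients scaled by 256
/// (1.402 ~ 359/256, 0.344 ~ 88/256, 0.714 ~ 183/256, 1.772 ~ 454/256).
pub fn yuv_to_rgb(y: u8, u: u8, v: u8) -> [u8; 3] {
    let y = i32::from(y);
    let d = i32::from(u) - 128; // blue-difference chroma, centered on zero
    let e = i32::from(v) - 128; // red-difference chroma, centered on zero

    // These sums can go below 0 or above 255, hence the clamps below.
    let r = y + ((359 * e) >> 8);
    let g = y - ((88 * d + 183 * e) >> 8);
    let b = y + ((454 * d) >> 8);

    [clamp_u8(r), clamp_u8(g), clamp_u8(b)]
}
```

For example, a bright pixel with a strong red-difference component (`y = 255`, `v = 255`) produces a red value far above 255 before clamping.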

Lossless has a few bottlenecks. They'd all need to be addressed to make decoding truly fast:

Read bits - The current strategy is pointlessly inefficient, this blog describes how to do better.
Huffman decoding - Decoding tables will help, but the format allows images to have thousands of Huffman trees, so if we aren't careful the precomputations could make things worse rather than better.
Predictor - This is analogous to PNG's unfiltering. Unfortunately WebP lets images switch predictors for each 4 pixel block, so using SIMD may be even harder (the reference decoder doesn't even try).
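As an illustration of the "Read bits" item, one common way to make bit reading cheaper is to keep a 64-bit buffer and refill it in bulk rather than touching the input for every bit. The sketch below is an assumption about the general technique, not necessarily what the linked blog post describes or what the crate does; `BitReader` and its methods are hypothetical names. It reads bits least-significant-first.

```rust
/// Hypothetical LSB-first bit reader backed by a 64-bit buffer.
pub struct BitReader<'a> {
    data: &'a [u8],
    pos: usize, // index of the next unread byte in `data`
    buf: u64,   // unread bits, lowest bit is the next one to return
    nbits: u32, // number of valid bits currently in `buf`
}

impl<'a> BitReader<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        Self { data, pos: 0, buf: 0, nbits: 0 }
    }

    /// Top up the buffer with whole bytes while there is room for one.
    fn refill(&mut self) {
        while self.nbits <= 56 && self.pos < self.data.len() {
            self.buf |= u64::from(self.data[self.pos]) << self.nbits;
            self.pos += 1;
            self.nbits += 8;
        }
    }

    /// Read `n` bits (at most 32); returns `None` if the input runs out.
    pub fn read_bits(&mut self, n: u32) -> Option<u32> {
        debug_assert!(n <= 32);
        if self.nbits < n {
            self.refill();
            if self.nbits < n {
                return None;
            }
        }
        let mask = (1u64 << n) - 1; // n <= 32, so the shift cannot overflow
        let value = (self.buf & mask) as u32;
        self.buf >>= n;
        self.nbits -= n;
        Some(value)
    }
}
```

With this layout most calls to `read_bits` only do a mask and a shift; the input slice is touched only when the buffer runs low. A production version would likely refill with a single unaligned 8-byte load instead of a byte loop.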

Shnatsel · 2023-10-03T14:45:11Z

In my benchmarks the lossy decoding is 2x slower than dwebp with assembly optimizations disabled, or 4x slower than dwebp with assembly optimizations enabled.

Fortunately WebP images are usually rather small (unlike JPEG which is commonly used for large photos), so it is not very noticeable in practice in interactive use cases.

Shnatsel · 2023-10-07T12:35:58Z

The rewritten YUV to RGB conversion is not so amenable to optimization, probably due to the additional branches introduced there. When inlined, clamping does not emit SIMD instructions, and once I had applied enough transformations to the code to make it emit them again, llvm-mca claimed I had actually made the code slower.

Shnatsel added 2 commits September 29, 2023 20:06

Add a SIMD clamping function

16c923e

Change vp8 decoder to use SIMD clamping

4aa41dd

Shnatsel requested a review from fintelia September 29, 2023 19:23

Shnatsel mentioned this pull request Sep 29, 2023

Use SIMD via autovec in macroblock_filter #2021

Closed

Shnatsel closed this Oct 7, 2023
