Use SIMD to speed up clamping #2020

Closed
wants to merge 2 commits into from

Conversation

Shnatsel
Contributor

Leverage autovectorization to speed up clamping.

Improves end-to-end WebP decoding performance by 2% on x86_64 baseline; should be even better with more recent instructions, and even benefits from AVX-512.

Not a whole lot of an improvement because Huffman decoding takes up so much time that everything else is trivial by comparison, see image-rs/image-webp#55; but this will become a significant improvement in the future once Huffman is sped up.

Fixes #2019

I'm opening this to get feedback on the basic direction. If you're OK with the approach I can add it to other parts of the codebase for a few % speedups in other places too.

@Shnatsel
Contributor Author

Godbolt version instantiated with concrete types so you can see the assembly at various ISA levels: https://godbolt.org/z/dvaMTbPqq
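For readers without the Godbolt link open, here is a minimal sketch of the kind of slice-level clamp that LLVM can usually autovectorize. The function name `clamp_to_u8` and the `i32` to `u8` types are assumptions chosen for illustration; this is not the code from the PR.

```rust
/// Hypothetical helper (not from the PR): clamp every `i32` in `src`
/// to the range 0..=255 and store the result as `u8` in `dst`.
pub fn clamp_to_u8(src: &[i32], dst: &mut [u8]) {
    // Checking the lengths once up front lets the loop body run without
    // per-element bounds checks, which helps the vectorizer.
    assert_eq!(src.len(), dst.len());
    for (d, &s) in dst.iter_mut().zip(src.iter()) {
        // `i32::clamp` is a min followed by a max with no data-dependent
        // branch, so the loop can map onto packed SIMD min/max operations.
        *d = s.clamp(0, 255) as u8;
    }
}
```

Whether SIMD instructions are actually emitted depends on the target features and on how the function is inlined, so the generated assembly should be checked rather than assumed.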

@fintelia
Contributor

I've been in the process of forking the WebP codec out into its own crate here: https://github.com/image-rs/image-webp, so that would be a better place for non-urgent changes to the codec. There's also an in-progress draft PR that (among many other things) incidentally modifies fill_single to have bit-exact output matching libwebp.

Overall, I'd say I'm hesitant but not opposed to making the code messier in order to make it vectorize better.

@Shnatsel
Contributor Author

I see that image-rs/image-webp#2 refactors this function. Should I make this change to the version currently in master, or build on top of that PR?

@fintelia
Contributor

fintelia commented Oct 2, 2023

I finished that other PR, so you can now work directly against the main branch there.

Though one thing to point out is that lossless and lossy WebP are basically separate formats and have different performance bottlenecks (though the alpha channel of lossy WebP can be encoded with the lossless format).

Lossy involves:

  1. VP8 decoding - Haven't worked on this part of the code, so I'm not actually sure how optimal this is.
  2. YUV -> RGB - This is where clamping happens. I believe the reference decoder does this entirely with SIMD.
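
To make the second step concrete, here is a minimal scalar sketch of a YUV -> RGB conversion showing why clamping is needed. It assumes the widely used 8-bit fixed-point approximation of BT.601 limited-range conversion; the constants and the function names `yuv_to_rgb` and `clamp_u8` are illustrative assumptions, not the coefficients used by image-webp or libwebp, so the output is not bit-exact with either.

```rust
/// Clamp an intermediate value into the `u8` range.
#[inline]
fn clamp_u8(v: i32) -> u8 {
    v.clamp(0, 255) as u8
}

/// Hypothetical conversion of one pixel (assumed BT.601, limited range),
/// using the common 8-bit fixed-point integer approximation.
pub fn yuv_to_rgb(y: u8, u: u8, v: u8) -> [u8; 3] {
    let c = i32::from(y) - 16;
    let d = i32::from(u) - 128;
    let e = i32::from(v) - 128;
    // The intermediate results can fall below 0 or above 255,
    // which is why every channel has to be clamped.
    let r = (298 * c + 409 * e + 128) >> 8;
    let g = (298 * c - 100 * d - 208 * e + 128) >> 8;
    let b = (298 * c + 516 * d + 128) >> 8;
    [clamp_u8(r), clamp_u8(g), clamp_u8(b)]
}
```

For example, `yuv_to_rgb(255, 128, 128)` produces an intermediate value of 278 per channel before clamping, so the clamp runs on every pixel and sits on the hot path.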

Lossless has a few bottlenecks. They'd all need to be addressed to make decoding truly fast:

  1. Read bits - The current strategy is pointlessly inefficient; this blog describes how to do better.
  2. Huffman decoding - Decoding tables will help, but the format allows images to have thousands of Huffman trees, so if we aren't careful the precomputations could make things worse rather than better.
  3. Predictor - This is analogous to PNG's unfiltering. Unfortunately WebP lets images switch predictors for each 4 pixel block, so using SIMD may be even harder (the reference decoder doesn't even try).
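
As an illustration of the first item, here is a minimal sketch of a buffered bit reader that keeps unread bits in a 64-bit word, so most reads are a mask and a shift. It assumes LSB-first bit order. The `BitReader` type and its methods are hypothetical, and this is not necessarily the scheme from the linked blog or the one used in image-webp.

```rust
/// Hypothetical LSB-first bit reader backed by a 64-bit buffer.
pub struct BitReader<'a> {
    data: &'a [u8],
    pos: usize, // index of the next unread byte in `data`
    buf: u64,   // unread bits, least significant bit first
    nbits: u32, // number of valid bits currently in `buf`
}

impl<'a> BitReader<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        BitReader { data, pos: 0, buf: 0, nbits: 0 }
    }

    /// Top up the buffer byte by byte while there is room for a whole byte
    /// and input remains. A tuned version would load several bytes at once.
    fn refill(&mut self) {
        while self.nbits <= 56 && self.pos < self.data.len() {
            self.buf |= u64::from(self.data[self.pos]) << self.nbits;
            self.pos += 1;
            self.nbits += 8;
        }
    }

    /// Read `n` bits (`n <= 32`). Returns `None` if fewer than `n` bits remain.
    pub fn read_bits(&mut self, n: u32) -> Option<u32> {
        assert!(n <= 32);
        if self.nbits < n {
            self.refill();
            if self.nbits < n {
                return None;
            }
        }
        let mask = (1u64 << n) - 1;
        let value = (self.buf & mask) as u32;
        self.buf >>= n;
        self.nbits -= n;
        Some(value)
    }
}
```

The refill only runs when the buffer cannot satisfy a request, so its cost is amortized over many small reads.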

@Shnatsel
Contributor Author

Shnatsel commented Oct 3, 2023

In my benchmarks the lossy decoding is 2x slower than dwebp with assembly optimizations disabled, or 4x slower than dwebp with assembly optimizations enabled.

Fortunately WebP images are usually rather small (unlike JPEG which is commonly used for large photos), so it is not very noticeable in practice in interactive use cases.

@Shnatsel
Contributor Author

Shnatsel commented Oct 7, 2023

The rewritten YUV to RGB conversion is not so amenable to optimization, probably due to the additional branches introduced there. When inlined, clamping does not emit SIMD instructions; and once I did enough transforms to the code to make it emit them again, llvm-mca claimed I actually made the code slower.

@Shnatsel Shnatsel closed this Oct 7, 2023
Successfully merging this pull request may close these issues.

image::utils::clamp could be better optimized
2 participants