Use SIMD to speed up clamping #2020
Godbolt version instantiated with concrete types so you can see the assembly at various ISA levels: https://godbolt.org/z/dvaMTbPqq
I've been in the process of forking the WebP codec out into its own crate here: https://github.com/image-rs/image-webp, so that would be a better place for non-urgent changes to the codec. There's also an in-progress draft PR that (among many other things) incidentally modifies Overall, I'd say I'm hesitant but not opposed to making the code messier in order to make it vectorize better.
I see that image-rs/image-webp#2 refactors this function. Should I make this change to the version currently in master, or build on top of that PR?
I finished that other PR, so you can now work directly against the Though one thing to point out is that lossless and lossy WebP are basically separate formats and have different performance bottlenecks (though the alpha channel of lossy WebP can be encoded with the lossless format). Lossy involves:
Lossless has a few bottlenecks. They'd all need to be addressed to make decoding truly fast:
In my benchmarks the lossy decoding is 2x slower than Fortunately WebP images are usually rather small (unlike JPEG which is commonly used for large photos), so it is not very noticeable in practice in interactive use cases.
The rewritten YUV to RGB conversion is not so amenable to optimization, probably due to the additional branches introduced there. When inlined, clamping does not emit SIMD instructions; and once I had applied enough transformations to the code to make it emit them again, llvm-mca claimed I had actually made the code slower.
Leverage autovectorization to speed up clamping.
Improves end-to-end WebP decoding performance by 2% on x86_64 baseline; should be even better with more recent instructions, and even benefits from AVX-512.
Not a whole lot of an improvement because Huffman decoding takes up so much time that everything else is trivial by comparison, see image-rs/image-webp#55; but this will become a significant improvement in the future once Huffman is sped up.
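To illustrate the general idea of autovectorization-friendly clamping, here is a minimal sketch. It is not the code from this PR: the function name `clamp_to_u8` and the choice of `i32` input are assumptions made for illustration only. The loop processes whole slices with a branch-free clamp, a shape that LLVM can often turn into SIMD min/max and narrowing instructions, though whether it actually does so depends on the compiler version and target features.

```rust
/// Clamp every value to the 0..=255 range and narrow it to `u8`.
///
/// Hypothetical helper for illustration; not taken from the image crate.
fn clamp_to_u8(input: &[i32], output: &mut [u8]) {
    // Zipping the two slices avoids per-element bounds checks, which
    // would otherwise add branches inside the loop body.
    // If the slices differ in length, only the common prefix is written.
    for (out, &value) in output.iter_mut().zip(input.iter()) {
        // `clamp` on integers typically lowers to min/max operations
        // rather than branches, keeping the loop body straight-line.
        *out = value.clamp(0, 255) as u8;
    }
}
```

For example, calling it on `[-20, 0, 128, 255, 300]` with a five-element output buffer fills the buffer with `[0, 0, 128, 255, 255]`. Inspecting the generated assembly (for instance on Godbolt, as linked above) is the only reliable way to confirm that a given compiler and target actually vectorize the loop.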
Fixes #2019
I'm opening this to get feedback on the basic direction. If you're OK with the approach I can add it to other parts of the codebase for a few % speedups in other places too.