attempt to force fastmath at the kernel level #550
base: main
Conversation
So far, this is what I have, but I don't know what … and I also don't know what to ….
I am working off of this: … to figure out what the arguments should be.
Thank you! Here is an example of a kernel that should return a different result when fastmath is turned on:
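A minimal sketch of such a kernel (illustrative only, not the exact snippet from this thread): a sequential Float32 accumulation whose result changes if the reduction gets reassociated. The kernel name, array shapes, and the CPU backend are assumptions here.

```julia
# Sketch of a reduction whose Float32 result depends on evaluation order.
# Strict sequential accumulation absorbs the tiny terms into the huge one;
# a fastmath-reassociated (or vectorized) sum does not.
using KernelAbstractions

@kernel function colsum!(out, @Const(x))
    i = @index(Global)
    acc = 0.0f0
    for j in axes(x, 1)
        acc += x[j, i]   # the order of these additions determines the rounding
    end
    out[i] = acc
end

backend = CPU()
x = vcat(fill(1.0f8, 1, 4), fill(1.0f0, 10_000, 4))   # one huge value, many tiny ones
out = KernelAbstractions.zeros(backend, Float32, size(x, 2))
colsum!(backend)(out, x; ndrange = size(x, 2))
KernelAbstractions.synchronize(backend)
# Strict order gives 1.0f8 (every +1.0f0 is rounded away); summing the small
# terms first, as a reassociated reduction may do, gives roughly 1.0001f8.
```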
The `@fastmath` macro is a purely syntactic transformation: it only rewrites the calls it can see in the expression it is applied to. So you would need to wrap the entire code block inside the macro with `@fastmath`.
I, personally, have only had bad experiences with fastmath... In fact, for @evelyne-ringoot's example I would prefer an annotation to mark the reduction chain as reassociatable.
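For reference, Base's `@simd` is an existing example of that kind of per-loop annotation: it licenses reordering of the reduction variable in one loop, without enabling fastmath anywhere else. A small sketch (the function is illustrative):

```julia
# Mark only this reduction as reassociatable: @simd asserts that the
# floating-point accumulation in this loop may be reordered/vectorized,
# while the rest of the program keeps strict IEEE semantics.
function sumsq(x::Vector{Float32})
    acc = 0.0f0
    @inbounds @simd for i in eachindex(x)
        acc += x[i] * x[i]   # may be split into several partial sums
    end
    return acc
end
```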
tbh, I also have only had bad experiences with fastmath, so I am not sure if we should merge this in the end. I figured I would just get it out in the world and then we could discuss it further. That said, it is an argument that is available from the ….

"So you would need to wrap the entire code block inside the macro with `@fastmath`"

If not, I guess we can: ….
It actually doesn't set fastmath on the IR level, but as a compiler flag (see JuliaGPU/GPUCompiler.jl#492), which then tells codegen to set flush-to-zero and prec_sqrt. So that's quite a different beast than `@fastmath`. This variant seems more like https://github.com/JuliaGPU/CUDA.jl/blob/4e9513b8a4e56629a236b58504d609b1775a8236/src/CUDAKernels.jl#L18
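For concreteness, a sketch of what that compiler-flag flavour looks like from the CUDA.jl side, assuming a CUDA.jl version that passes the GPUCompiler `fastmath` option through `@cuda` (the kwarg name and its availability are assumptions; check the docs for your version):

```julia
# Compiler-level fastmath: the flag applies to the whole kernel's codegen
# (flush-to-zero, approximate sqrt/div), with no per-expression control.
using CUDA

function rsqrt_kernel!(y, x)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(x)
        @inbounds y[i] = 1.0f0 / sqrt(x[i])
    end
    return nothing
end

x = CUDA.rand(Float32, 1024)
y = similar(x)
@cuda threads=256 blocks=4 fastmath=true rsqrt_kernel!(y, x)   # `fastmath` kwarg assumed
```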
I guess there is no fastmath flag equivalent for the Metal / parallel CPU backends, so it is hard to set this in a generic way for everyone. Maybe we should just ask @evelyne-ringoot (and other potential users) if it's fine to just use `@fastmath` line by line inside the kernel.
From my side, I'm mostly looking to access the nvcc --use_fast_math equivalent in CUDA through KernelAbstractions; I am not sure whether that would work with the line-by-line solution? Also, I am after the non-associative properties (I kind of assumed this would come with nvcc --use_fast_math, but am not sure).
I think we can straightforwardly expose the compiler semantics, but matching the language semantics is much harder.
I've got an example where the Julia compiler is not optimizing as much as nvcc, and where I'm not sure how to prompt it to do things differently. This is the C++ version, and the Julia version and Julia PTX. Julia speed is approximately equal to C++ speed without fastmath, but fastmath increases speed by about a factor of 1.5x. I believe this has to do with the associative sums: in a previous version of the code, I had a split-k set-up, which improved performance in Julia (PTX) but did not make a difference in C++ (also on the godbolt link). I am also interested in the fast sqrt for a different kernel in the same code! Generally, nvcc --use_fast_math is an easily accessible flag (even if dangerous), while figuring out which parts of the code it actually optimizes is less trivial. Perhaps a 'level' of fastmath could be a way to maintain code integrity while giving the user flexibility in how much accuracy/safety they are willing to sacrifice for 'quick and dirty' performance?
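To make the associativity point concrete, here is a hypothetical CPU-side sketch of the split-k idea: one sequential accumulation is split into independent partial sums, which hands the compiler the reassociation it would otherwise need fastmath to assume. The function and the split factor are illustrative, not the actual kernel.

```julia
# Manual "split-k": four independent accumulators instead of one chain.
# The final combination order differs from the strictly sequential sum,
# which is exactly the freedom fastmath would otherwise have to grant.
function dot_split4(a::Vector{Float32}, b::Vector{Float32})
    @assert length(a) == length(b) && length(a) % 4 == 0
    s1 = s2 = s3 = s4 = 0.0f0
    @inbounds for i in 1:4:length(a)
        s1 += a[i]   * b[i]
        s2 += a[i+1] * b[i+1]
        s3 += a[i+2] * b[i+2]
        s4 += a[i+3] * b[i+3]
    end
    return (s1 + s2) + (s3 + s4)
end
```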
I had a request to do #429 for fastmath, so here is my attempt. #431 is also related.

Two issues:

1. I don't know what a good test should be, so I don't know whether it works or not. Is there a good demo of some difference in precision between `@fastmath` and normal execution?
2. I am also not sure what the equivalent lines are here (https://github.com/JuliaGPU/KernelAbstractions.jl/blob/main/src/macros.jl#L126): `Expr(:fastmath, true)` is not valid.
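For context on the second point, there is no fastmath node at the `Expr` level: `@fastmath` is a purely syntactic rewrite that swaps calls for their `Base.FastMath` counterparts, which `macroexpand` makes visible:

```julia
julia> macroexpand(Main, :(@fastmath a + b * sqrt(c)))
:(Base.FastMath.add_fast(a, Base.FastMath.mul_fast(b, Base.FastMath.sqrt_fast(c))))
# (exact printed form may vary slightly across Julia versions)
```

So at the macro level the closest analogue to "setting fastmath" is walking the kernel body and substituting calls (what `Base.FastMath.make_fastmath` does internally), rather than emitting a dedicated `Expr` head.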