
attempt to force fastmath at the kernel level #550

Draft · wants to merge 1 commit into main

Conversation

@leios (Contributor) commented Dec 14, 2024:

I had a request to do #429 for fastmath, so here is my attempt. #431 is also related.

Two issues:

  1. I don't know what a good test should be, so I can't tell whether this works or not. Is there a good demo of a precision difference between @fastmath and normal execution? (A rough sketch is at the end of this comment.)

  2. I am also not sure what the equivalent of these lines would be (https://github.com/JuliaGPU/KernelAbstractions.jl/blob/main/src/macros.jl#L126):

    if force_inbounds
        push!(new_stmts, Expr(:inbounds, true))
    end

Expr(:fastmath, true) is not valid.
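
One possible demo might be something along these lines (untested sketch, plain CPU Julia, no KA involved): a strictly sequential Float32 sum absorbs small terms into a large accumulator, while @fastmath allows LLVM to re-associate (and typically vectorize) the reduction, which can change the result.

    function seq_sum(xs)
        s = 0.0f0
        for x in xs
            s += x
        end
        return s
    end

    function fast_sum(xs)
        s = 0.0f0
        @fastmath for x in xs   # rewrites the += to Base.FastMath.add_fast
            s += x
        end
        return s
    end

    xs = fill(1.0f0, 10^5)
    xs[1] = 1.0f8

    seq_sum(xs)   # 1.0f8: each `+ 1` rounds away against the large accumulator
    fast_sum(xs)  # may differ (≈ 1.0001f8) if the re-associated loop vectorizes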

@leios (Contributor, Author) commented Dec 14, 2024:

So for transform_cpu(...), it should be something like...

    if force_fastmath
        push!(new_stmts, Expr(:macrocall, :@fastmath, arg2, arg3))
    end

But I don't know what arg2 and arg3 are.

I also don't know what the equivalent of :pop would be, to mimic:

    if force_inbounds
        push!(new_stmts, Expr(:inbounds, :pop))
    end

I am working off of this:

julia> ex = :(@fastmath sqrt(1+1))
:(#= REPL[6]:1 =# @fastmath sqrt(1 + 1))

julia> ex.args
3-element Vector{Any}:
 Symbol("@fastmath")
 :(#= REPL[6]:1 =#)
 :(sqrt(1 + 1))

julia> ex.head
:macrocall

to figure out what the arguments should be.
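
Reading that dump, args[1] is Symbol("@fastmath"), args[2] is a LineNumberNode, and args[3] is the expression being wrapped. So arg2 / arg3 would presumably be something like (untested guess):

    # Untested: hand-building the same :macrocall Expr; the second argument
    # needs to be a LineNumberNode, mirroring the REPL dump above.
    ex2 = Expr(:macrocall, Symbol("@fastmath"),
               LineNumberNode(@__LINE__, Symbol(@__FILE__)),
               :(sqrt(1 + 1)))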

@evelyne-ringoot (Contributor) commented:

Thank you! Here is an example of a kernel that should return a different result when fastmath is turned on:

using KernelAbstractions

input = Float32.(ones(10^5))   # 10^5 - 1 small values plus one huge outlier
input[1] = 1e8

@kernel function nonassociative_sum!(input, output)
    sum = 0.0f0
    for i in 1:length(input)
        @inbounds sum += input[i]
    end
    output[1] = sum
end

function my_nonassociative_sum!(input)
    backend = get_backend(input)
    output = KernelAbstractions.zeros(backend, Float32, 1)

    kernel = nonassociative_sum!(backend)
    kernel(input, output, ndrange = 1)
    KernelAbstractions.synchronize(backend)
    return output
end

my_nonassociative_sum!(input)  # 1e8: the sequential Float32 sum absorbs the 1f0 terms
sum(input)                     # ≈ 1e8 + 1e5: Base's pairwise summation keeps them

@vchuravy (Member) commented:

The fastmath macro is sadly not region based like inbounds, but rather statement based...

@macroexpand @fastmath 1 + 1
:(Base.FastMath.add_fast(1, 1))

So you would need to wrap the entire code block in the @fastmath macro.
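
For illustration, @fastmath does recurse through whatever expression it is handed, so wrapping a whole block rewrites every supported operator inside it (expansion shown roughly, line-number comments omitted):

    @macroexpand @fastmath begin
        s = 0.0f0
        for i in 1:length(x)
            s += x[i]
        end
    end
    # quote
    #     s = 0.0f0
    #     for i = 1:length(x)
    #         s = Base.FastMath.add_fast(s, x[i])
    #     end
    # end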

I, personally, have only had bad experiences with fastmath...

In fact, for @evelyne-ringoot's example:

@kernel function nonassociative_sum!(input, output)
    sum = 0.0f0
    for i in 1:length(input)
        @inbounds sum += input[i]
    end
    output[1] = sum
end

I would prefer something like:

@kernel function nonassociative_sum!(input, output)
    sum = 0.0f0
    @simd for i in 1:length(input)
        @inbounds sum += input[i]
    end
    output[1] = sum
end

to mark the reduction chain as safe to re-associate.

@leios (Contributor, Author) commented Dec 16, 2024:

tbh, I also have only had bad experiences with fastmath, so I am not sure if we should merge this in the end. I figured I would just get it out in the world and then we could discuss it further.

That said, it is an argument that is available from the @cuda launch macro that we do not have in KA.

"So you would need to wrap the entire code blocks inside the macro with fastmath"
Like this on the GPU side?

    if force_fastmath
        body = quote
            @fastmath $(body)
        end
    end

If not, I guess we can:

  1. Dig into CUDA a bit to see how this is done there, though I think it happens on the NVIDIA side rather than the Julia side, so I don't think we can learn much from it.
  2. Ask users to just put @fastmath in front of the specific lines they want fastmath for. I am a little afraid of something going awry elsewhere in the kernel if fastmath is enabled for the whole kernel.

@vchuravy (Member) commented:

So @cuda fastmath=true is a weird beast.

  • fastmath: use less precise square roots and flush denormals

It actually doesn't set fastmath at the IR level, but as a compiler flag (see JuliaGPU/GPUCompiler.jl#492), which then tells codegen to set flush-to-zero and prec_sqrt.

So that's quite a different beast from @fastmath in Julia. And indeed, users can use the two independently of each other, since CUDA.jl also supports @fastmath.

So this variant seems more like https://github.com/JuliaGPU/CUDA.jl/blob/4e9513b8a4e56629a236b58504d609b1775a8236/src/CUDAKernels.jl#L18
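
To make the two knobs concrete, a rough illustration (hypothetical kernel, not code from CUDA.jl or this PR):

    using CUDA

    function my_kernel!(out, x)
        i = threadIdx().x
        # Language-level knob: @fastmath rewrites this statement to
        # Base.FastMath calls, independent of how the kernel is launched.
        @inbounds @fastmath out[i] = sqrt(x[i]) + x[i]
        return nothing
    end

    x = CUDA.rand(Float32, 256)
    out = CUDA.zeros(Float32, 256)

    # Compiler-level knob: `fastmath=true` at launch tells codegen to use the
    # less precise sqrt and flush denormals for the whole kernel, without
    # touching any Julia-level operators.
    @cuda threads=256 fastmath=true my_kernel!(out, x)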

@leios (Contributor, Author) commented Dec 16, 2024:

I guess there is no equivalent fastmath flag for the Metal / parallel CPU backends, so it is hard to set this in a generic way for everyone.

Maybe we should just ask @evelyne-ringoot (and other potential users) if it's fine to just put @fastmath in front of specific lines instead of enabling it at the kernel level.
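
The line-level alternative, applied to the summation example above, would look something like this (untested sketch):

    using KernelAbstractions

    @kernel function nonassociative_sum!(input, output)
        sum = 0.0f0
        for i in 1:length(input)
            # fastmath is opted into only for this statement,
            # not for the whole kernel body
            @inbounds @fastmath sum += input[i]
        end
        output[1] = sum
    end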

@evelyne-ringoot (Contributor) commented Dec 16, 2024:

From my side, I'm mostly looking to access the nvcc --use_fast_math equivalent in CUDA through KernelAbstractions, and I am not sure whether that would work with the line-by-line solution. I'm also interested in the non-associativity optimizations (I kind of assumed these would come with nvcc --use_fast_math, but am not sure).

@vchuravy (Member) commented:

nvcc --use_fast_math is ill-defined. It is both a language-level semantic change and a backend-compiler change.

I think we can straightforwardly expose the compiler semantics, but matching the language semantics is much harder. Maybe we could gather examples with CUDA C on godbolt.org, and then look at what Julia + CUDA.jl generate and whether the semantics match.

@evelyne-ringoot (Contributor) commented:

I've got an example where the Julia compiler is not optimizing as much as nvcc, and where I'm not sure how to prompt it to do something different. This is the C++ version, and the Julia version and Julia PTX. Julia speed is approximately equal to C++ speed without fastmath, but with fastmath the C++ version is about 1.5x faster.

I believe this has to do with the associative sums: in a previous version of the code, I had a split-k set-up, which improved performance in Julia (PTX), but did not make a difference in C++ (also on the godbolt link).

Also, I am interested in the fast sqrt for a different kernel in the same code! Generally, nvcc --use_fast_math is an easily accessible flag (even if dangerous), while figuring out which parts of the code are optimized by it is less trivial. Perhaps a 'level' of fastmath could be a way to maintain code integrity while giving the user flexibility on how much accuracy/safety they are willing to sacrifice for 'quick and dirty' performance?
