
attempt to force fastmath at the kernel level #550

Draft · wants to merge 1 commit into main

Conversation

@leios (Contributor) commented Dec 14, 2024:

I had a request to do #429 for fastmath, so here is my attempt. #431 is also related.

Two issues:

  1. I don't know what a good test should be, so I can't tell whether this works or not. Is there a good demo of a precision difference between @fastmath and normal execution? (A rough sketch is at the end of this comment.)

  2. I am also not sure what the equivalent of these lines would be (https://github.com/JuliaGPU/KernelAbstractions.jl/blob/main/src/macros.jl#L126):

    if force_inbounds
        push!(new_stmts, Expr(:inbounds, true))
    end

Expr(:fastmath, true) is not valid.
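
One possible demo might be something along these lines (untested sketch, plain CPU Julia, no KA involved): a strictly sequential Float32 sum absorbs small terms into a large accumulator, while @fastmath allows LLVM to re-associate (and typically vectorize) the reduction, which can change the result.

    function seq_sum(xs)
        s = 0.0f0
        for x in xs
            s += x
        end
        return s
    end

    function fast_sum(xs)
        s = 0.0f0
        @fastmath for x in xs   # rewrites the += to Base.FastMath.add_fast
            s += x
        end
        return s
    end

    xs = fill(1.0f0, 10^5)
    xs[1] = 1.0f8

    seq_sum(xs)   # 1.0f8: each `+ 1` rounds away against the large accumulator
    fast_sum(xs)  # may differ (≈ 1.0001f8) if the re-associated loop vectorizes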

@leios (Contributor, Author) commented Dec 14, 2024:

So for transform_cpu(...), it should be something like...

    if force_fastmath
        push!(new_stmts, Expr(:macrocall, :@fastmath, arg2, arg3))
    end

But I don't know what arg2 and arg3 are.

I also don't know what the equivalent of :pop would be, to mimic:

    if force_inbounds
        push!(new_stmts, Expr(:inbounds, :pop))
    end

I am working off of this:

julia> ex = :(@fastmath sqrt(1+1))
:(#= REPL[6]:1 =# @fastmath sqrt(1 + 1))

julia> ex.args
3-element Vector{Any}:
 Symbol("@fastmath")
 :(#= REPL[6]:1 =#)
 :(sqrt(1 + 1))

julia> ex.head
:macrocall

to figure out what the arguments should be.
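
Reading that dump, args[1] is Symbol("@fastmath"), args[2] is a LineNumberNode, and args[3] is the expression being wrapped. So arg2 / arg3 would presumably be something like (untested guess):

    # Untested: hand-building the same :macrocall Expr; the second argument
    # needs to be a LineNumberNode, mirroring the REPL dump above.
    ex2 = Expr(:macrocall, Symbol("@fastmath"),
               LineNumberNode(@__LINE__, Symbol(@__FILE__)),
               :(sqrt(1 + 1)))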

@evelyne-ringoot (Contributor) commented:

Thank you! Here is an example of a kernel that should return a different result when fastmath is turned on:

using KernelAbstractions

input = Float32.(ones(10^5))   # 10^5 - 1 small values plus one huge outlier
input[1] = 1e8

@kernel function nonassociative_sum!(input, output)
    sum = 0.0f0
    for i in 1:length(input)
        @inbounds sum += input[i]
    end
    output[1] = sum
end

function my_nonassociative_sum!(input)
    backend = get_backend(input)
    output = KernelAbstractions.zeros(backend, Float32, 1)

    kernel = nonassociative_sum!(backend)
    kernel(input, output, ndrange = 1)
    KernelAbstractions.synchronize(backend)
    return output
end

my_nonassociative_sum!(input)  # 1e8: the sequential Float32 sum absorbs the 1f0 terms
sum(input)                     # ≈ 1e8 + 1e5: Base's pairwise summation keeps them

@vchuravy (Member) commented:

The fastmath macro is sadly not region based like inbounds, but rather statement based...

@macroexpand @fastmath 1 + 1
:(Base.FastMath.add_fast(1, 1))

So you would need to wrap the entire code block in the @fastmath macro.
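
For illustration, @fastmath does recurse through whatever expression it is handed, so wrapping a whole block rewrites every supported operator inside it (expansion shown roughly, line-number comments omitted):

    @macroexpand @fastmath begin
        s = 0.0f0
        for i in 1:length(x)
            s += x[i]
        end
    end
    # quote
    #     s = 0.0f0
    #     for i = 1:length(x)
    #         s = Base.FastMath.add_fast(s, x[i])
    #     end
    # end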

I, personally, have only had bad experiences with fastmath...

In fact, for @evelyne-ringoot's example:

@kernel function nonassociative_sum!(input, output)
    sum = 0.0f0
    for i in 1:length(input)
        @inbounds sum += input[i]
    end
    output[1] = sum
end

I would prefer something like:

@kernel function nonassociative_sum!(input, output)
    sum = 0.0f0
    @simd for i in 1:length(input)
        @inbounds sum += input[i]
    end
    output[1] = sum
end

to mark the reduction chain as safe to re-associate.

@leios (Contributor, Author) commented Dec 16, 2024:

tbh, I also have only had bad experiences with fastmath, so I am not sure if we should merge this in the end. I figured I would just get it out in the world and then we could discuss it further.

That said, it is an argument that is available from the @cuda launch macro that we do not have in KA.

"So you would need to wrap the entire code blocks inside the macro with fastmath"
Like this on the GPU side?

    if force_fastmath
        body = quote
            @fastmath $(body)
        end
    end

If not, I guess we can:

  1. Dig into CUDA a bit to see how this is done there, though I think it happens on the NVIDIA side rather than the Julia side, so I don't think we can learn much from it.
  2. Ask users to just put @fastmath in front of the specific lines they want fastmath for. I am a little afraid of something going awry elsewhere in the kernel if fastmath is enabled for the whole kernel.

@vchuravy (Member) commented:

So @cuda fastmath=true is a weird beast.

  • fastmath: use less precise square roots and flush denormals

It actually doesn't set fastmath at the IR level, but as a compiler flag (see JuliaGPU/GPUCompiler.jl#492), which then tells codegen to set flush-to-zero and prec_sqrt.

So that's quite a different beast from @fastmath in Julia. And indeed, users can use the two independently of each other, since CUDA.jl also supports @fastmath.

So this variant seems more like https://github.com/JuliaGPU/CUDA.jl/blob/4e9513b8a4e56629a236b58504d609b1775a8236/src/CUDAKernels.jl#L18
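
To make the two knobs concrete, a rough illustration (hypothetical kernel, not code from CUDA.jl or this PR):

    using CUDA

    function my_kernel!(out, x)
        i = threadIdx().x
        # Language-level knob: @fastmath rewrites this statement to
        # Base.FastMath calls, independent of how the kernel is launched.
        @inbounds @fastmath out[i] = sqrt(x[i]) + x[i]
        return nothing
    end

    x = CUDA.rand(Float32, 256)
    out = CUDA.zeros(Float32, 256)

    # Compiler-level knob: `fastmath=true` at launch tells codegen to use the
    # less precise sqrt and flush denormals for the whole kernel, without
    # touching any Julia-level operators.
    @cuda threads=256 fastmath=true my_kernel!(out, x)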

@leios (Contributor, Author) commented Dec 16, 2024:

I guess there is no equivalent fastmath flag for the Metal / parallel CPU backends, so it is hard to set this in a generic way for everyone.

Maybe we should just ask @evelyne-ringoot (and other potential users) if it's fine to just put @fastmath in front of specific lines instead of enabling it at the kernel level.
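
The line-level alternative, applied to the summation example above, would look something like this (untested sketch):

    using KernelAbstractions

    @kernel function nonassociative_sum!(input, output)
        sum = 0.0f0
        for i in 1:length(input)
            # fastmath is opted into only for this statement,
            # not for the whole kernel body
            @inbounds @fastmath sum += input[i]
        end
        output[1] = sum
    end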

@evelyne-ringoot (Contributor) commented Dec 16, 2024:

From my side, I'm mostly looking to access the nvcc --use_fast_math equivalent in CUDA through KernelAbstractions, and I am not sure whether that would work with the line-by-line solution. I'm also interested in the non-associativity optimizations (I kind of assumed these would come with nvcc --use_fast_math, but am not sure).

@vchuravy (Member) commented:

nvcc --use_fast_math is ill-defined. It is both a language-level semantic change and a backend-compiler change.

I think we can straightforwardly expose the compiler semantics, but matching the language semantics is much harder. Maybe we could gather examples with CUDA C on godbolt.org, and then look at what Julia + CUDA.jl generate and whether the semantics match.

@evelyne-ringoot (Contributor) commented:

I've got an example where the Julia compiler is not optimizing as much as nvcc, and where I'm not sure how to prompt it to do something different. This is the C++ version, and the Julia version and Julia PTX. Julia speed is approximately equal to C++ speed without fastmath, but with fastmath the C++ version is about 1.5x faster.

I believe this has to do with the associative sums: in a previous version of the code, I had a split-k set-up, which improved performance in Julia (PTX), but did not make a difference in C++ (also on the godbolt link).

Also, I am interested in the fast sqrt for a different kernel in the same code! Generally, nvcc --use_fast_math is an easily accessible flag (even if dangerous), while figuring out which parts of the code are optimized by it is less trivial. Perhaps a 'level' of fastmath could be a way to maintain code integrity while giving the user flexibility on how much accuracy/safety they are willing to sacrifice for 'quick and dirty' performance?
