Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AMD] Handle denorms properly for exp2 and exp #3816

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

zhanglx13
Copy link
Collaborator

@zhanglx13 zhanglx13 commented May 2, 2024

This PR enables denorm flushing for tl.math.exp2 and preserves denorms for tl.math.exp, which match their behaviors on Nvidia backend.

More specifically,

  • denorm flushing for tl.math.exp2 with f32 inputs is controlled by __CUDA_FTZ or __HIP_FTZ and the default is set to flushing denorm. These flags can be set by developers, but are not exposed as kernel argument.
tl.math.exp2(f32) NV NV AMD AMD
control flag __CUDA_FTZ=1 (default) __CUDA_FTZ=0 __HIP_FTZ=1 (default) __HIP_FTZ=0
device lib __nv_exp2f __nv_exp2f
llvm intrinsics llvm.nvvm.ex2.approx.ftz.f llvm.nvvm.ex2.approx.f llvm.amdgcn.exp2.f32 llvm.exp2.f32
ptx ex2.approx.ftz.f32 ex2.approx.f32    
sass/amdgcn MUFU.EX2 MUFU.EX2
and instructions to
check and adjust for
denorms
v_exp_f32 v_exp_f32
and instructions
to check and
adjust for
denorms
  • denorms are preserved for tl.math.exp2 with f64 inputs
tl.math.exp2(f64) NV AMD
device lib __nv_exp2 __ocml_exp2_f64
  • denorms are preserved for tl.math.exp with both f32 and f64 inputs. Note that tl.math.exp(f32) on nv path is lowered with inline ptx directly without the .ftz flag.
tl.math.exp(f32) NV AMD
llvm intrinsics   llvm.exp2.f32
ptx ex2.approx.f32  
tl.math.exp(f64) NV AMD
device lib __nv_exp __ocml_exp_f64

@zhanglx13 zhanglx13 force-pushed the exp2_ftz branch 3 times, most recently from 21f3f15 to 66231b2 Compare May 2, 2024 19:41
@zhanglx13 zhanglx13 changed the title [WIP] [AMD] Enable denorm flushing for exp2 [AMD] Handle denorms properly for exp2 and exp May 3, 2024
@zhanglx13 zhanglx13 marked this pull request as ready for review May 3, 2024 02:56
@zhanglx13 zhanglx13 requested a review from zahimoud May 3, 2024 02:56
// of the value of kernel arg `allow_flush_denorm`.
// 2. If __HIP_FTZ = 0, whether exp2 flushes denorms in input and output
// depends on the value of kernel arg `allow_flush_denorm`.
// 3. __HIP_FTZ is default to 1 and not exposed as a kernel argument.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least for clang, this is subtarget dependent. Anything gfx9+ defaults to not flushing f32

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I set the default to flushing to match the behavior on the NV path.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you check with different SM version targets? I think they also switch the default at some level

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From NV libdevice, they use __CUDA_FTZ to control whether .ftz should be used. I checked the ptx and sass on A100. I can check H100 to make sure this is also the case there.

@zhanglx13 zhanglx13 marked this pull request as draft May 3, 2024 18:01
@zhanglx13
Copy link
Collaborator Author

Converting to draft after offline discussion with Matt. We may need to figure out a way to have good perf without "breaking" the exp/exp2 ops.

@arsenm
Copy link

arsenm commented May 3, 2024

The global __HIP_FTZ/__CUDA_FTZ should imply the global FP mode setting. If those are enabled you should be able to set denormal-fp-math-f32=preserve-sign,preserve-sign on every function and the backend will handle this. You only need additional complexity if you need to contextually ignore denormal handling with standard IEEE handling

Copy link
Collaborator

@antiagainst antiagainst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks to really digging into all the details about this, @zhanglx13! Much appreciated. I have some comments inlined around a few issues. Aside from them, could you also update the language doc for exp2 to be clear on the denorm flushing behavior?

@zhanglx13 zhanglx13 force-pushed the exp2_ftz branch 2 times, most recently from ee196f4 to 19fe675 Compare May 27, 2024 02:53
@zhanglx13
Copy link
Collaborator Author

@antiagainst Re adding doc for these functions. If you are referring to the official doc here, I'm wondering if we want to add denorm handling information for ALL the math functions at once? If so, it may take longer, since we haven't figured out the behavior for all the functions yet.

Copy link
Collaborator

@antiagainst antiagainst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks! Just two more nits. LGTM.

// For non-FP32 input, call __ocml_exp2_f64 for higher-precision calculation
if (elemTy.getIntOrFloatBitWidth() != 32)
return {};

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here worth a comment saying the former flushes denorm values and the later expands to LLVM instructions to handle denorm values + the former.

patterns.add<ExpOpConversionApprox>(typeConverter, axisInfoAnalysis, benefit);
// Exp2OpConversion will use llvm.exp2.f32 or llvm.amdgcn.exp2.f32
// based on the __HIP_FTZ flag if the input type is FP32. For FP64 input,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace __HIP_FTZ with ftz here--ftz is a parameter to the function so it's clear what we are referring to.

@antiagainst antiagainst marked this pull request as ready for review May 29, 2024 03:37
@antiagainst
Copy link
Collaborator

@antiagainst Re adding doc for these functions. If you are referring to the official doc here, I'm wondering if we want to add denorm handling information for ALL the math functions at once? If so, it may take longer, since we haven't figured out the behavior for all the functions yet.

I think we can just do exp/exp2 for now. It's fine to have it as a follow up pull request though.

@zhanglx13
Copy link
Collaborator Author

@antiagainst Re adding doc for these functions. If you are referring to the official doc here, I'm wondering if we want to add denorm handling information for ALL the math functions at once? If so, it may take longer, since we haven't figured out the behavior for all the functions yet.

I think we can just do exp/exp2 for now. It's fine to have it as a follow up pull request though.

I'll do it in a follow up PR with an update for all the math function in the doc.

Copy link

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this supposed tested? I don't see any lit tests

## depends on the value of kernel arg `allow_flush_denorm`.
## 3. __HIP_FTZ is default to 1 and not exposed as a kernel argument.
## For now it is used as a controller for developers only.
__HIP_FTZ = True
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having this be specifically control exp makes no sense. The name, by lack of exp, and similarity to the module level __CUDA_FTZ, would imply this is changing the global floating point environment denormal mode. There should be no special case modes for a specific operation

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The plan is to use this __HIP_FTZ to control denorm handling behavior for all related math functions. This PR serves as a first step and selects exp2 because it affects performance of FA kernels.
__CUDA_FTZ is used only in the libdevice.bc. We don't have the DAZ_OPT flag in AMD ocml.bc anymore, so I exposed this flag here.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose splitting it this way:

  1. Fix this to start directly emitting the llvm intrinsic directly
  2. Holistically change the mode to use FTZ, then you don't need to special case anything

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Fix this to start directly emitting the llvm intrinsic directly

In this PR, exp2 and exp are lowered with llvm intrinsics directly for f32 inputs. For f64 inputs, I assume we cannot use llvm intrinsics, right? In follow up PR, we can fix other math functions by using llvm intrinsics directly.

  1. Holistically change the mode to use FTZ, then you don't need to special case anything

Do you mean we should just use denorm-fp-math-f32 to control whether denorms are flushed for exp2, as with all other valu operations? Unfortunately, what we are trying to achieve here is to only flush denorms for exp2, which is the case on the nvidia side.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this PR, exp2 and exp are lowered with llvm intrinsics directly for f32 inputs. For f64 inputs, I assume we cannot use llvm intrinsics, right? In follow up PR, we can fix other math functions by using llvm intrinsics directly.

Correct

Do you mean we should just use denorm-fp-math-f32 to control whether denorms are flushed for exp2, as with all other valu operations?

More precisely, this changes the default floating point mode.

Unfortunately, what we are trying to achieve here is to only flush denorms for exp2, which is the case on the nvidia side.

Sounds like an Nvidia bug to me, unless this is a specific "fast" exp function. You shouldn't be touching any global module flag for this

@zhanglx13
Copy link
Collaborator Author

How is this supposed tested? I don't see any lit tests

I don't see any lit tests for denorm handling on the NV side. So I assume we don't need to check for instructions to handle denorms.
And this PR is aiming to improve the performance of FA kernels. And we have verified offline.

@arsenm
Copy link

arsenm commented May 30, 2024

How is this supposed tested? I don't see any lit tests

I don't see any lit tests for denorm handling on the NV side.

Bad test coverage elsewhere isn't a good reason to not have tests. Any compiler codegen change should have lit testing

So I assume we don't need to check for instructions to handle denorms.

There is an output change, it should be tested

@zhanglx13
Copy link
Collaborator Author

@arsenm I added lit tests for exp2 and exp to make sure the correct llvm intrinsics are used in different mode.
Regarding the design of using __HIP_FTZ, what do you think @antiagainst?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants