-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AMD] Handle denorms properly for exp2 and exp #3816
base: main
Are you sure you want to change the base?
Conversation
21f3f15
to
66231b2
Compare
// of the value of kernel arg `allow_flush_denorm`. | ||
// 2. If __HIP_FTZ = 0, whether exp2 flushes denorms in input and output | ||
// depends on the value of kernel arg `allow_flush_denorm`. | ||
// 3. __HIP_FTZ is default to 1 and not exposed as a kernel argument. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least for clang, this is subtarget dependent. Anything gfx9+ defaults to not flushing f32
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I set the default to flushing to match the behavior on the NV path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you check with different SM version targets? I think they also switch the default at some level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From NV libdevice, they use __CUDA_FTZ
to control whether .ftz
should be used. I checked the ptx and sass on A100. I can check H100 to make sure this is also the case there.
Converting to draft after offline discussion with Matt. We may need to figure out a way to have good perf without "breaking" the exp/exp2 ops. |
The global __HIP_FTZ/__CUDA_FTZ should imply the global FP mode setting. If those are enabled you should be able to set denormal-fp-math-f32=preserve-sign,preserve-sign on every function and the backend will handle this. You only need additional complexity if you need to contextually ignore denormal handling with standard IEEE handling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks to really digging into all the details about this, @zhanglx13! Much appreciated. I have some comments inlined around a few issues. Aside from them, could you also update the language doc for exp2 to be clear on the denorm flushing behavior?
ee196f4
to
19fe675
Compare
@antiagainst Re adding doc for these functions. If you are referring to the official doc here, I'm wondering if we want to add denorm handling information for ALL the math functions at once? If so, it may take longer, since we haven't figured out the behavior for all the functions yet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks! Just two more nits. LGTM.
// For non-FP32 input, call __ocml_exp2_f64 for higher-precision calculation | ||
if (elemTy.getIntOrFloatBitWidth() != 32) | ||
return {}; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here worth a comment saying the former flushes denorm values and the later expands to LLVM instructions to handle denorm values + the former.
patterns.add<ExpOpConversionApprox>(typeConverter, axisInfoAnalysis, benefit); | ||
// Exp2OpConversion will use llvm.exp2.f32 or llvm.amdgcn.exp2.f32 | ||
// based on the __HIP_FTZ flag if the input type is FP32. For FP64 input, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace __HIP_FTZ
with ftz
here--ftz
is a parameter to the function so it's clear what we are referring to.
I think we can just do exp/exp2 for now. It's fine to have it as a follow up pull request though. |
I'll do it in a follow up PR with an update for all the math function in the doc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this supposed tested? I don't see any lit tests
## depends on the value of kernel arg `allow_flush_denorm`. | ||
## 3. __HIP_FTZ is default to 1 and not exposed as a kernel argument. | ||
## For now it is used as a controller for developers only. | ||
__HIP_FTZ = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having this be specifically control exp makes no sense. The name, by lack of exp, and similarity to the module level __CUDA_FTZ, would imply this is changing the global floating point environment denormal mode. There should be no special case modes for a specific operation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The plan is to use this __HIP_FTZ
to control denorm handling behavior for all related math functions. This PR serves as a first step and selects exp2 because it affects performance of FA kernels.
__CUDA_FTZ
is used only in the libdevice.bc. We don't have the DAZ_OPT
flag in AMD ocml.bc anymore, so I exposed this flag here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose splitting it this way:
- Fix this to start directly emitting the llvm intrinsic directly
- Holistically change the mode to use FTZ, then you don't need to special case anything
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Fix this to start directly emitting the llvm intrinsic directly
In this PR, exp2 and exp are lowered with llvm intrinsics directly for f32 inputs. For f64 inputs, I assume we cannot use llvm intrinsics, right? In follow up PR, we can fix other math functions by using llvm intrinsics directly.
- Holistically change the mode to use FTZ, then you don't need to special case anything
Do you mean we should just use denorm-fp-math-f32
to control whether denorms are flushed for exp2, as with all other valu operations? Unfortunately, what we are trying to achieve here is to only flush denorms for exp2, which is the case on the nvidia side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this PR, exp2 and exp are lowered with llvm intrinsics directly for f32 inputs. For f64 inputs, I assume we cannot use llvm intrinsics, right? In follow up PR, we can fix other math functions by using llvm intrinsics directly.
Correct
Do you mean we should just use denorm-fp-math-f32 to control whether denorms are flushed for exp2, as with all other valu operations?
More precisely, this changes the default floating point mode.
Unfortunately, what we are trying to achieve here is to only flush denorms for exp2, which is the case on the nvidia side.
Sounds like an Nvidia bug to me, unless this is a specific "fast" exp function. You shouldn't be touching any global module flag for this
I don't see any lit tests for denorm handling on the NV side. So I assume we don't need to check for instructions to handle denorms. |
Bad test coverage elsewhere isn't a good reason to not have tests. Any compiler codegen change should have lit testing
There is an output change, it should be tested |
@arsenm I added lit tests for exp2 and exp to make sure the correct llvm intrinsics are used in different mode. |
This PR enables denorm flushing for
tl.math.exp2
and preserves denorms fortl.math.exp
, which match their behaviors on Nvidia backend.More specifically,
__CUDA_FTZ
or__HIP_FTZ
and the default is set to flushing denorm. These flags can be set by developers, but are not exposed as kernel argument.and instructions to
check and adjust for
denorms
and instructions
to check and
adjust for
denorms
.ftz
flag.