Enable remarks for ttgir lowering with SourceMgrDiagnosticHandler #3835
Conversation
I am thinking about using remarks to capture key optimization information, which could then be integrated into a performance analysis tool (Proton may be a good choice). CC @Jokeren
Remarks can show locations for both the .py source and the .ttgir file. I haven't figured out how to filter diagnostics for SourceMgrDiagnosticHandler, so currently we also get some warnings:
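Until filtering is wired into the diagnostic handler itself, one workaround is to post-filter the captured diagnostic text. This is a minimal sketch, not Triton or MLIR API: it assumes the `file:line:col: severity: message` format shown in the example output later in this thread and keeps only `remark:` lines, dropping warnings.

```python
import re

# Matches MLIR-style diagnostics of the form `file:line:col: kind: message`.
# This regex and the helper below are illustrative, not an existing API.
DIAG_RE = re.compile(r"^(.+?):(\d+):(\d+): (remark|warning|error|note): .*$")

def keep_remarks(diagnostics):
    """Return only the lines whose severity is `remark`."""
    kept = []
    for line in diagnostics.splitlines():
        m = DIAG_RE.match(line)
        if m and m.group(4) == "remark":
            kept.append(line)
    return kept

text = "01-vector-add.py:48:16: remark: vec = 4\nfoo.py:10:4: warning: something"
print(keep_remarks(text))  # ['01-vector-add.py:48:16: remark: vec = 4']
```

A cleaner long-term fix would be to filter by severity inside the registered diagnostic handler, but post-filtering keeps the compiler untouched while experimenting.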
This is interesting. I need to think more about it, though. Could we keep it as a draft PR and come back to discuss it later? I promise that I won't forget :)
Saved it to my github folder.
Hi @manman-ren, I had a discussion with @htyu and understand the purpose of the PR. I'm very supportive of providing more metadata. I have two suggestions.
I may spend some time working on this soon :) Happy to chat if you have any thoughts.
Thanks for taking the time!
Using a separate pass to get the information is better than instrumenting existing compiler passes. But still, I'm worried about parsing the remark. If instead we parse the GPU IR by calling
The design would be
Yeah, I added a separate pass, but some decisions are made during the lowering phase, such as vectorization and whether a load instruction will be coalesced. For those, I am hoping to use optimization remarks.
// Check to see if this op is coalesced. Depending on vectorization, we
// can have a varying number of instructions to perform the load. Each
// instruction handles vec elements. Each thread handles numElems, and
// the next thread handles vec elements starting with sizePerThread.
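The arithmetic described in the comment above can be illustrated in a few lines. The names mirror the comment but this is not a real Triton helper: each thread owns `numElems` contiguous elements, and each vectorized load instruction moves `vec` of them, so the per-thread load count is `numElems / vec`.

```python
# Illustrative arithmetic only (names follow the comment above, not a real
# Triton API): number of load instructions each thread issues for its
# `num_elems` elements when each instruction moves `vec` elements.
def loads_per_thread(num_elems, vec):
    assert num_elems % vec == 0, "vec must evenly divide numElems"
    return num_elems // vec

# With sizePerThread = [4]: fully vectorized (vec = 4) needs one load,
# vec = 2 needs two loads, and scalar (vec = 1) needs four.
print(loads_per_thread(4, 4), loads_per_thread(4, 2), loads_per_thread(4, 1))  # 1 2 4
```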
Yes
Discussions are now in the new PR: #3922
This will print remarks like the following:
01-vector-add.py:48:16: remark: vec = 4
x = tl.load(x_ptr + offsets, mask=mask)
^
With printOpOnDiagnostic set to true, it will show:
01-vector-add.py:49:16: remark: vec = 4
y = tl.load(y_ptr + offsets, mask=mask)
^
01-vector-add.py:49:16: note: see current operation: %332 = "tt.load"(%315, %165) <{cache = 1 : i32, evict = 1 : i32, isVolatile = false, operandSegmentSizes = array<i32: 1, 1, 0>}> : (tensor<1024x!tt.ptr<f32, 1>, #triton_gpu.blocked<{sizePerThread = [4], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>>, tensor<1024xi1, #triton_gpu.blocked<{sizePerThread = [4], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>>) -> tensor<1024xf32, #triton_gpu.blocked<{sizePerThread = [4], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>>
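For the Proton integration floated earlier in the thread, a consumer would need to turn each remark line into a structured record keyed by source location. This is a hypothetical sketch: the `Remark` record and `parse_remark` helper are illustrative, not an existing API; only the `file:line:col: remark: message` shape is taken from the output above.

```python
import re
from collections import namedtuple

# Illustrative record for a parsed remark; not part of Triton or Proton.
Remark = namedtuple("Remark", ["file", "line", "col", "message"])

REMARK_RE = re.compile(r"^(.+?):(\d+):(\d+): remark: (.*)$")

def parse_remark(text):
    """Parse one `file:line:col: remark: message` line, or return None."""
    m = REMARK_RE.match(text)
    if not m:
        return None
    return Remark(m.group(1), int(m.group(2)), int(m.group(3)), m.group(4))

print(parse_remark("01-vector-add.py:48:16: remark: vec = 4"))
# Remark(file='01-vector-add.py', line=48, col=16, message='vec = 4')
```

Keying records by (file, line, col) would let a tool attribute lowering decisions like `vec = 4` back to the originating `tl.load` in the .py source.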