
Enable remarks for ttgir lowering with SourceMgrDiagnosticHandler #3835

Closed
wants to merge 2 commits

Conversation


@manman-ren manman-ren commented May 3, 2024

This will print remarks like the one below:
01-vector-add.py:48:16: remark: vec = 4
x = tl.load(x_ptr + offsets, mask=mask)
^
With printOpOnDiagnostic set to true, it will show:
01-vector-add.py:49:16: remark: vec = 4
y = tl.load(y_ptr + offsets, mask=mask)
^
01-vector-add.py:49:16: note: see current operation: %332 = "tt.load"(%315, %165) <{cache = 1 : i32, evict = 1 : i32, isVolatile = false, operandSegmentSizes = array<i32: 1, 1, 0>}> : (tensor<1024x!tt.ptr<f32, 1>, #triton_gpu.blocked<{sizePerThread = [4], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>>, tensor<1024xi1, #triton_gpu.blocked<{sizePerThread = [4], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>>) -> tensor<1024xf32, #triton_gpu.blocked<{sizePerThread = [4], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>>
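Downstream tools would need to parse such remark lines. As a hedged sketch (not part of this PR; the function name is made up for illustration), a line like `01-vector-add.py:48:16: remark: vec = 4` can be split with a small regex:

```python
import re

# Matches diagnostic lines of the form:
#   01-vector-add.py:48:16: remark: vec = 4
# Only "remark" severity is accepted; warnings and notes are skipped.
REMARK_RE = re.compile(
    r"^(?P<file>[^:]+):(?P<line>\d+):(?P<col>\d+): remark: (?P<msg>.*)$"
)

def parse_remark(line):
    """Return (file, line, col, message) for a remark line, else None."""
    m = REMARK_RE.match(line)
    if m is None:
        return None
    return (m.group("file"), int(m.group("line")), int(m.group("col")),
            m.group("msg"))
```

For example, `parse_remark("01-vector-add.py:48:16: remark: vec = 4")` yields `("01-vector-add.py", 48, 16, "vec = 4")`, while a warning line yields `None`.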

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
@manman-ren manman-ren requested a review from ptillet as a code owner May 3, 2024 17:11
@manman-ren manman-ren marked this pull request as draft May 3, 2024 17:11
@manman-ren
Collaborator Author

I am thinking about using remarks to capture key optimization information, which could then be integrated into a performance analysis tool (Proton may be a good choice). CC @Jokeren
I am interested in the following information, though I am not sure which parts would come from the compiler and which from Proton:

  • the number of layout conversions, and the memory transfer size for each conversion
  • the number of loads/stores, and for each load or store, whether it is coalesced and whether it is vectorized
  • for each loop, whether software pipelining (SWP) is enabled effectively
  • which mma instruction is used for tl.dot, and whether transposing would make tl.dot eligible for mma v3

Remarks can show locations in both the .py source and the .ttgir file.

I haven't figured out how to filter diagnostics for SourceMgrDiagnosticHandler. Currently we also get some warnings:
01-vector-add.py:28:0: warning: Unhandled parameter attribute 'tt.divisibility'
If printOpOnDiagnostic is on, it will dump the whole function.

@manman-ren manman-ren requested a review from htyu May 3, 2024 17:21
@Jokeren
Contributor

Jokeren commented May 3, 2024

This is interesting. I need to think more about it though.

Could we keep it as a draft PR and come back to discuss it later? I promise that I won't forget :)

@Jokeren
Contributor

Jokeren commented May 3, 2024

Saved it to my GitHub folder.

@Jokeren
Contributor

Jokeren commented May 7, 2024

Hi @manman-ren , I had a discussion with @htyu and understand the purpose of the PR. I'm very supportive of providing more metadata.

I have two suggestions.

  1. We could provide the kernel source file path to Proton and let it analyze the Triton GPU IR using existing passes to recover information.
  2. Once we have all the information, we could first reuse our existing viewer to view at most two metrics in the terminal. Next, maybe it's time to start integrating with some existing visualizers, because we need to view multiple metrics and source lines.

I may spend some time working on this soon :) Happy to chat if you have any thoughts

@manman-ren
Collaborator Author

> Hi @manman-ren , I had a discussion with @htyu and understand the purpose of the PR. I'm very supportive of providing more metadata.
>
> I have two suggestions.
>
>   1. We could provide the kernel source file path to Proton and let it analyze the Triton GPU IR using existing passes to recover information.
>   2. Once we have all the information, we could first reuse our existing viewer to view at most two metrics in the terminal. Next, maybe it's time to start integrating with some existing visualizers, because we need to view multiple metrics and source lines.
>
> I may spend some time working on this soon :) Happy to chat if you have any thoughts

Thanks for taking the time!
For 1: My current plan is to emit optimization remarks (each will come with a source line and a ttgir line). Maybe Proton can parse the remarks; I am not sure how easy it would be to let Proton run passes on the GPU IR. I have a follow-up patch that adds a pass to collect information and emit optimization remarks while lowering ttgir to LLVM.
For 2: It would be great to have a visualizer that shows profiling data together with analysis data, so we can have one tool for AMD/NV and also for PT2/Triton.

@Jokeren
Contributor

Jokeren commented May 7, 2024

> For 1: My current plan is to emit optimization remarks (each will come with a source line and a ttgir line). Maybe Proton can parse the remarks; I am not sure how easy it would be to let Proton run passes on the GPU IR. I have a follow-up patch that adds a pass to collect information and emit optimization remarks while lowering ttgir to LLVM.

Using a separate pass to get the information is better than instrumenting existing compiler passes.

But still, I'm worried about parsing the remarks.

If instead we parse the GPU IR by calling module = ir.parse_mlir_module(full_name, context) and use an analysis function to analyze the IR, we get the following benefits:

  1. This function can utilize all existing analysis modules.
  2. Proton can get the parsed information in memory without parsing a file.
  3. The analysis function can be exposed as a Python interface of Triton and benefit other profilers and analyzers.

The design would be:

class result:
    file: str
    line_no: int
    column_no: int
    remarks: dict[str, problem]

class func_result:
    results: list[result]

class module_result:
    func_results: dict[str, func_result]

module = ir.parse_mlir_module(<gpu ir>)
analysis_results = libtriton.analyze_module(module)
libproton.inspect(analysis_results)
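To make that sketch concrete, here is a hedged, runnable Python version using dataclasses, with a stub standing in for libtriton.analyze_module (the stub, the kernel name, and the remark contents are made up for illustration; a real implementation would walk the parsed ttgir):

```python
from dataclasses import dataclass, field

# Hypothetical result structures mirroring the sketch above.
@dataclass
class Result:
    file: str
    line_no: int
    column_no: int
    remarks: dict = field(default_factory=dict)      # remark name -> detail

@dataclass
class FuncResult:
    results: list = field(default_factory=list)      # list of Result

@dataclass
class ModuleResult:
    func_results: dict = field(default_factory=dict)  # func name -> FuncResult

def analyze_module_stub(module_text):
    # Stand-in for libtriton.analyze_module: returns one hard-coded
    # remark so the data flow (module -> func -> per-location remarks)
    # is visible end to end.
    r = Result("01-vector-add.py", 48, 16, {"vectorization": "vec = 4"})
    return ModuleResult(func_results={"add_kernel": FuncResult(results=[r])})
```

With this shape, a consumer such as a profiler can index results by function name and then by source location, which matches the per-function, per-line reporting the remarks produce.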

@manman-ren
Collaborator Author

Yeah, I added a separate pass, but some decisions are made during the lowering phase, such as vectorization and whether a load instruction will be coalesced. For those, I am hoping to use optimization remarks.
#3853 adds a perf-collection pass. It is in draft mode, mostly for discussion purposes.

// Check to see if this op is coalesced. Depending on vectorization, we
// can have varying number of instructions to perform the load. Each
// instruction handles vec elements. Each thread handles numElems, and
// next thread handles vec elements starting with sizePerThread.
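As a numeric illustration of that comment (not code from the PR; the helper name is made up): with the blocked layout from the remarks above (sizePerThread = [4], threadsPerWarp = [32], warpsPerCTA = [4]) over a 1024-element tensor, each of the 128 threads owns 8 elements, and with vec = 4 each thread issues 2 load instructions:

```python
def loads_per_thread(tensor_size, warps_per_cta, threads_per_warp, vec):
    # Hypothetical helper: count load instructions per thread for a
    # 1-D blocked layout. Each thread owns tensor_size // threads
    # elements, and each load instruction moves vec contiguous elements.
    threads = warps_per_cta * threads_per_warp
    num_elems = tensor_size // threads
    assert num_elems % vec == 0
    return num_elems // vec

# 1024 elements, 4 warps x 32 threads, vec = 4:
#   1024 // 128 = 8 elements per thread, 8 // 4 = 2 load instructions
```

Larger vec means fewer, wider instructions, which is what the vectorization remark (`vec = 4`) is reporting per load site.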
Collaborator Author


@Jokeren @htyu Is it possible to get this information at ttgir level?

Contributor


Yes

@manman-ren
Collaborator Author

Discussions have moved to the new PR: #3922.
Closing this.

@manman-ren manman-ren closed this May 21, 2024