
Enable remarks for ttgir lowering with SourceMgrDiagnosticHandler #3835

Closed
wants to merge 2 commits

Conversation


@manman-ren manman-ren commented May 3, 2024

This will print remarks like the one below:
01-vector-add.py:48:16: remark: vec = 4
x = tl.load(x_ptr + offsets, mask=mask)
^
With printOpOnDiagnostic set to true, it will show:
01-vector-add.py:49:16: remark: vec = 4
y = tl.load(y_ptr + offsets, mask=mask)
^
01-vector-add.py:49:16: note: see current operation: %332 = "tt.load"(%315, %165) <{cache = 1 : i32, evict = 1 : i32, isVolatile = false, operandSegmentSizes = array<i32: 1, 1, 0>}> : (tensor<1024x!tt.ptr<f32, 1>, #triton_gpu.blocked<{sizePerThread = [4], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>>, tensor<1024xi1, #triton_gpu.blocked<{sizePerThread = [4], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>>) -> tensor<1024xf32, #triton_gpu.blocked<{sizePerThread = [4], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>>
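Downstream tools would need to parse such remark lines. As a hedged sketch (not part of this PR; the function name is made up for illustration), a line like `01-vector-add.py:48:16: remark: vec = 4` can be split with a small regex:

```python
import re

# Matches diagnostic lines of the form:
#   01-vector-add.py:48:16: remark: vec = 4
# Only "remark" severity is accepted; warnings and notes are skipped.
REMARK_RE = re.compile(
    r"^(?P<file>[^:]+):(?P<line>\d+):(?P<col>\d+): remark: (?P<msg>.*)$"
)

def parse_remark(line):
    """Return (file, line, col, message) for a remark line, else None."""
    m = REMARK_RE.match(line)
    if m is None:
        return None
    return (m.group("file"), int(m.group("line")), int(m.group("col")),
            m.group("msg"))
```

For example, `parse_remark("01-vector-add.py:48:16: remark: vec = 4")` yields `("01-vector-add.py", 48, 16, "vec = 4")`, while a warning line yields `None`.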

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
@manman-ren manman-ren requested a review from ptillet as a code owner May 3, 2024 17:11
@manman-ren manman-ren marked this pull request as draft May 3, 2024 17:11
@manman-ren
Collaborator Author

I am thinking about using remarks to capture key optimization information, which could then be integrated into a performance analysis tool (Proton may be a good choice). CC @Jokeren
I am interested in the following information, though I am not sure which parts would come from the compiler and which from Proton:

  • the number of layout conversions, and the memory transfer size for each conversion
  • the number of loads/stores, and for each load or store, whether it is coalesced and whether it is vectorized
  • for each loop, whether software pipelining (SWP) is enabled effectively
  • which mma instruction is used for tl.dot, and whether transposing would make tl.dot eligible for mma v3

Remarks can show locations in both the .py source and the .ttgir file.

I haven't figured out how to filter diagnostics for SourceMgrDiagnosticHandler. Currently we also get some warnings:
01-vector-add.py:28:0: warning: Unhandled parameter attribute 'tt.divisibility'
If printOpOnDiagnostic is on, it will dump the whole function.

@manman-ren manman-ren requested a review from htyu May 3, 2024 17:21
@Jokeren
Contributor

Jokeren commented May 3, 2024

This is interesting. I need to think more about it though.

Could we keep it as a draft PR and come back to discuss it later? I promise that I won't forget :)

@Jokeren
Contributor

Jokeren commented May 3, 2024

Saved it to my GitHub folder.

@Jokeren
Contributor

Jokeren commented May 7, 2024

Hi @manman-ren , I had a discussion with @htyu and understand the purpose of the PR. I'm very supportive of providing more metadata.

I have two suggestions.

  1. We could provide the kernel source file path to Proton and let it analyze the Triton GPU IR using existing passes to recover information.
  2. Once we have all the information, we could first reuse our existing viewer to view at most two metrics in the terminal. Next, maybe it's time to start integrating with some existing visualizers, because we need to view multiple metrics and source lines.

I may spend some time working on this soon :) Happy to chat if you have any thoughts

@manman-ren
Collaborator Author

> Hi @manman-ren , I had a discussion with @htyu and understand the purpose of the PR. I'm very supportive of providing more metadata.
>
> I have two suggestions.
>
>   1. We could provide the kernel source file path to Proton and let it analyze the Triton GPU IR using existing passes to recover information.
>   2. Once we have all the information, we could first reuse our existing viewer to view at most two metrics in the terminal. Next, maybe it's time to start integrating with some existing visualizers, because we need to view multiple metrics and source lines.
>
> I may spend some time working on this soon :) Happy to chat if you have any thoughts

Thanks for taking the time!
For 1: My current plan is to emit optimization remarks (each will come with a source line and a ttgir line). Maybe Proton can parse the remarks; I am not sure how easy it would be to let Proton run passes on the GPU IR. I have a follow-up patch that adds a pass to collect information and emit optimization remarks while lowering ttgir to LLVM.
For 2: It would be great to have a visualizer that shows profiling data together with analysis data, so we can have one tool for AMD/NV and also for PT2/Triton.

@Jokeren
Contributor

Jokeren commented May 7, 2024

> For 1: My current plan is to emit optimization remarks (each will come with a source line and a ttgir line). Maybe Proton can parse the remarks; I am not sure how easy it would be to let Proton run passes on the GPU IR. I have a follow-up patch that adds a pass to collect information and emit optimization remarks while lowering ttgir to LLVM.

Using a separate pass to get the information is better than instrumenting existing compiler passes.

But still, I'm worried about parsing the remarks.

If instead we parse the GPU IR by calling module = ir.parse_mlir_module(full_name, context) and use an analysis function to analyze the IR, we get the following benefits:

  1. This function can utilize all existing analysis modules.
  2. Proton can get the parsed information in memory without parsing a file.
  3. The analysis function can be exposed as a Python interface of Triton and benefit other profilers and analyzers.

The design would be:

class result:
    file: str
    line_no: int
    column_no: int
    remarks: dict[str, problem]

class func_result:
    results: list[result]

class module_result:
    func_results: dict[str, func_result]

module = ir.parse_mlir_module(<gpu ir>)
analysis_results = libtriton.analyze_module(module)
libproton.inspect(analysis_results)
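To make that sketch concrete, here is a hedged, runnable Python version using dataclasses, with a stub standing in for libtriton.analyze_module (the stub, the kernel name, and the remark contents are made up for illustration; a real implementation would walk the parsed ttgir):

```python
from dataclasses import dataclass, field

# Hypothetical result structures mirroring the sketch above.
@dataclass
class Result:
    file: str
    line_no: int
    column_no: int
    remarks: dict = field(default_factory=dict)      # remark name -> detail

@dataclass
class FuncResult:
    results: list = field(default_factory=list)      # list of Result

@dataclass
class ModuleResult:
    func_results: dict = field(default_factory=dict)  # func name -> FuncResult

def analyze_module_stub(module_text):
    # Stand-in for libtriton.analyze_module: returns one hard-coded
    # remark so the data flow (module -> func -> per-location remarks)
    # is visible end to end.
    r = Result("01-vector-add.py", 48, 16, {"vectorization": "vec = 4"})
    return ModuleResult(func_results={"add_kernel": FuncResult(results=[r])})
```

With this shape, a consumer such as a profiler can index results by function name and then by source location, which matches the per-function, per-line reporting the remarks produce.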

@manman-ren
Collaborator Author

Yeah, I added a separate pass, but some decisions are made during the lowering phase, such as vectorization and whether a load instruction will be coalesced. For those, I am hoping to use optimization remarks.
#3853 adds a perf-collection pass. It is in draft mode, mostly for discussion purposes.

// Check to see if this op is coalesced. Depending on vectorization, we
// can have varying number of instructions to perform the load. Each
// instruction handles vec elements. Each thread handles numElems, and
// next thread handles vec elements starting with sizePerThread.
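As a numeric illustration of that comment (not code from the PR; the helper name is made up): with the blocked layout from the remarks above (sizePerThread = [4], threadsPerWarp = [32], warpsPerCTA = [4]) over a 1024-element tensor, each of the 128 threads owns 8 elements, and with vec = 4 each thread issues 2 load instructions:

```python
def loads_per_thread(tensor_size, warps_per_cta, threads_per_warp, vec):
    # Hypothetical helper: count load instructions per thread for a
    # 1-D blocked layout. Each thread owns tensor_size // threads
    # elements, and each load instruction moves vec contiguous elements.
    threads = warps_per_cta * threads_per_warp
    num_elems = tensor_size // threads
    assert num_elems % vec == 0
    return num_elems // vec

# 1024 elements, 4 warps x 32 threads, vec = 4:
#   1024 // 128 = 8 elements per thread, 8 // 4 = 2 load instructions
```

Larger vec means fewer, wider instructions, which is what the vectorization remark (`vec = 4`) is reporting per load site.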
Collaborator Author


@Jokeren @htyu Is it possible to get this information at ttgir level?

Contributor


Yes

@manman-ren
Collaborator Author

Discussions have moved to the new PR: #3922.
Closing this.

@manman-ren manman-ren closed this May 21, 2024