[Feature Discussion] Make tb_plugin support larger traces #760

Open
lh-ycx opened this issue May 18, 2023 · 2 comments
Labels: enhancement (New feature or request), plugin (PyTorch Profiler TensorBoard Plugin related)

Comments

@lh-ycx
Contributor

lh-ycx commented May 18, 2023

Hi guys, I've recently been working on profiling LLM (e.g., GPT-3) workloads using the PyTorch profiler. When I tried to visualize the trace in TensorBoard, the major problem was that the trace is too large (typically > 1 GB for a single step), causing the browser to become unresponsive.
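
For context, a typical setup that produces such traces profiles a training step with torch.profiler and writes the result via tensorboard_trace_handler so tb_plugin can pick it up. The sketch below is only an illustrative placeholder (the model, inputs, and log directory are not from this issue), not the actual workload:

```python
import torch
from torch.profiler import (ProfilerActivity, profile, schedule,
                            tensorboard_trace_handler)

# Placeholder model and inputs; a real LLM training step emits far more events.
model = torch.nn.Linear(1024, 1024)
inputs = torch.randn(8, 1024)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=1),
    on_trace_ready=tensorboard_trace_handler("./log/profile"),  # read by tb_plugin
    with_stack=True,  # stack capture inflates the trace size considerably
) as prof:
    for _ in range(3):
        model(inputs).sum().backward()
        prof.step()
```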

I would like to work on this problem to make the tb_plugin support larger traces. Below are a few of my concerns:

  • I cannot find documentation for tb_plugin (covering the code structure and the plugin's architecture), which makes it difficult for me to get started.
  • Could you provide some insight into this problem? For example, where is the likely bottleneck (the original TensorBoard, the trace analysis server, the frontend browser, or something else)?
  • I also noticed that "monitoring daemon for larger scale deployments" is in progress. Is this solving the same problem? If so, can I get involved?

Thanks : )

@aaronenyeshi aaronenyeshi added the plugin (PyTorch Profiler TensorBoard Plugin related) label Jun 23, 2023
@aaronenyeshi
Member

Hi @lh-ycx, please feel free to contribute to the tb_plugin directory (the code is outdated and in maintenance mode only).

  • As far as I know, the documentation can be found here: https://github.com/pytorch/kineto/blob/main/tb_plugin/README.md.
  • This is also a problem for Chrome traces produced with the export_chrome_trace API. One workaround is to move the traceEvents into a root JSON dict and then open the file in Perfetto. Another workaround is to set with_stack=False, which reduces the number of events significantly (see the sketch after this list).
  • My intuition is that there are too many events for the frontend browser to render. One potential solution is to sample events when zoomed out. This idea comes from the observation that if you delete a portion of the events from a Chrome trace, it starts loading in the frontend.
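
As a rough illustration of those two workarounds, here is a minimal sketch; the file names and the profiled region are placeholders, and the traceEvents-wrapping step is only one reading of the first workaround:

```python
import json

import torch
from torch.profiler import ProfilerActivity, profile

# Workaround: disable stack capture so far fewer events are recorded.
with profile(activities=[ProfilerActivity.CPU], with_stack=False) as prof:
    torch.randn(1024, 1024) @ torch.randn(1024, 1024)  # placeholder workload
prof.export_chrome_trace("trace.json")

# Workaround: if the exported trace is a bare list of events, wrap it in a
# root JSON dict under "traceEvents" so Perfetto can open it.
with open("trace.json") as f:
    data = json.load(f)
if isinstance(data, list):
    with open("trace_perfetto.json", "w") as f:
        json.dump({"traceEvents": data}, f)
```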

@aaronenyeshi aaronenyeshi added the bug (Something isn't working) and enhancement (New feature or request) labels and removed the bug (Something isn't working) label Jun 23, 2023
@lmelinda

I also wonder how much work it would be to incorporate the Perfetto UI into the trace tab. All of my large trace files load quite fast in the Perfetto UI and do not crash the way the TensorBoard profiler tab does. If it isn't too much work, this could resolve a lot of problems.

facebook-github-bot pushed a commit that referenced this issue Jul 31, 2023
Summary:
Enhancement for Issue #760.

Hey guys, I've optimized the speed of the memory view using LTTB sampling (Largest-Triangle-Three-Buckets sampling, which downsamples time-series-like data while retaining the overall shape and variability of the data); a sketch of the idea follows after this commit summary.

I've tested this with a 2 GB PyTorch profiler trace: the memory view page no longer crashes, and scaling/zooming is smooth and acceptable.

Pull Request resolved: #776

Reviewed By: chaekit

Differential Revision: D47850048

Pulled By: aaronenyeshi

fbshipit-source-id: 4d32666f972c7f1b5d18817f69c3266bcb619d92
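
For readers unfamiliar with LTTB, here is a minimal, self-contained sketch of the downsampling idea described in that commit summary; it is not the code from the referenced pull request, and the point format ((timestamp, value) pairs sorted by timestamp) is an assumption for illustration:

```python
def lttb(points, threshold):
    """Largest-Triangle-Three-Buckets downsampling.

    points: list of (x, y) pairs sorted by x (e.g., (timestamp, bytes)).
    threshold: number of points to keep (>= 3).
    Returns a subset of points that preserves the overall shape of the series.
    """
    n = len(points)
    if threshold >= n or threshold < 3:
        return list(points)

    sampled = [points[0]]                    # always keep the first point
    bucket_size = (n - 2) / (threshold - 2)  # spread interior points over equal buckets
    prev = points[0]

    for i in range(threshold - 2):
        # Boundaries of the current bucket.
        start = int(i * bucket_size) + 1
        end = int((i + 1) * bucket_size) + 1

        # Average point of the *next* bucket (the last point for the final bucket).
        nxt = points[end:min(int((i + 2) * bucket_size) + 1, n)] or [points[-1]]
        avg_x = sum(p[0] for p in nxt) / len(nxt)
        avg_y = sum(p[1] for p in nxt) / len(nxt)

        # Keep the point in the current bucket that forms the largest triangle
        # with the previously kept point and the next bucket's average.
        best, best_area = None, -1.0
        for p in points[start:end]:
            area = abs((prev[0] - avg_x) * (p[1] - prev[1])
                       - (prev[0] - p[0]) * (avg_y - prev[1])) / 2.0
            if area > best_area:
                best, best_area = p, area
        sampled.append(best)
        prev = best

    sampled.append(points[-1])               # always keep the last point
    return sampled


# Example: reduce millions of memory samples to ~1,000 points for plotting.
# downsampled = lttb(memory_samples, 1000)
```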