Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add event sampling option to profiler #1209

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

cristianlume
Copy link

[this is a draft request intended as an FYI and to seek initial feedback]

We are proposing to extend the NCCL profiler code with a new sampled mode to selectively save events. Sampling a small subset, rather than collecting all NCCL proxy events, preserves the statistical properties of the unsampled data, at little computational cost and large savings in storage.

All changes are in the profiler.cc code and we are not proposing any external profiler API change. We describe them in detail below:

Enable sampling

  • Add two ncclParams to enable sampling and specify the sampling rate
  • Enabling sampling does not change the default profiler operation, i.e. we still record all events until the profilingEvents buffer is full. While the buffer is not full, we use it to store the current event to be sampled. When the buffer becomes full, we allocate a new sampledEvent memory location to store the event currently being sampled.
  • For every new event (i.e., on ncclProxyBegin state) we check whether the event should be sampled or not. At the end of the event (i.e., on ncclProxyEnd state), we save the event only if it was selected to be sampled.

Saving sampled data

  • We write the sampled event info to an eBPF map. Similar to files, eBPF maps can be operated on with simple syscalls with no other external dependencies. An external process (not part of NCCL) can find and read eBPF map via syscalls without requiring any NCCL bootstrapping.

@vjardin
Copy link

vjardin commented Mar 4, 2024

What is the benefit of a eBPF Map here instead of using mapped shared memory ?

@sjeaugey
Copy link
Member

sjeaugey commented Mar 5, 2024

Note that the profiler code in NCCL is there only to serve as an example of things one might want to profile. It's code I usually hack on the spot when I need it, then remove once I'm done. I left it there for the next time, and also in case it could benefit others. It is not even fully functional (may crash or corrupt data when events are full) so we could remove it entirely, or replace it by something more advanced if needed.

I would actually think that an external plugin would make sense here, to allow integration with a variety of tools, but we don't have much time to dedicate to that at the moment.

@cristianlume
Copy link
Author

What is the benefit of a eBPF Map here instead of using mapped shared memory ?

Working with a bpf map is easier and less complex than working with shared memory. With bpf, all we need is the map_id to attach to an existing map and we do not need to be root for this. With shared memory, we may have to implement process synchronization separately. We also would have to share the mapped path, which means additional work to take care of permissions and synchronizing mount points between writer and reader.

@vjardin
Copy link

vjardin commented Mar 5, 2024

Thanks for this ack't.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants