-
Related to this topic, I had a discussion in the past around NICs that support caching atomic operations and other transfers. NICs with that ability could potentially speed up collective operations by avoiding writes through to host memory on each update. That discussion led to a similar problem related to target-side completions. An app at the target would need to take additional steps to flush cached data from the NIC to memory prior to trying to access the data. This behavior is actually captured in the FI_DELIVERY_COMPLETE description, which uses this wishy-washy wording:
Though no mechanism was ever defined to handle this case. (It was theoretical at the time.)
-
Condensed notes from OFIWG Feb 23, 2021:
- If the initiator requests FI_DELIVERY_COMPLETE semantics, the target should see that same semantic. This should hold for any type of transfer. If needed, the provider may have to perform a flush operation at the target when a transfer lands in device memory.
- A suggestion was made that an app could specify a completion semantic at the receive side. Today, posted receive operations complete using FI_DELIVERY_COMPLETE semantics by default, independent of the sender's completion semantic. That would remain in place and be defined in the man pages. If a receive buffer references device memory, the provider is responsible for performing flush operations prior to writing a completion.
- However, an application can specify FI_TRANSMIT_COMPLETE when posting a receive buffer. This indicates that the app will be responsible for performing a flush prior to accessing the buffer. This gives the app more control and could allow batching completions (see the sketch below).
- The most immediate need for hmem is to support 2-sided transfers (send/receive). The above proposal appears to be sufficient for this case, requires minimal impact to the API, and possibly no changes to the apps.
- For 1-sided transfers that target a different memory domain from the CQ, there is still a gap. The target must be notified somehow that a transfer has occurred. Two options are receiving a message or reading a completion with CQ data sent with the transfer. In both cases, the completion goes to host memory, which may require performing a flush operation. In this case, provider software may not be aware of the RMA operation having occurred. This likely requires that the app handle the situation, with the man pages simply giving guidance on what steps the app may need to take.
- Related: a flush operation could be added to the API, the app could be required to use a device API directly, or both options could be supported. If flush is supported, an app could override it through the hmem ops override mechanism. I think the general feeling was to let the app call the device API directly in situations where it is needed. Providers would call flush/sync internally to support the desired API semantics.
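To make the receive-side idea concrete, here is a minimal sketch of what posting a receive with FI_TRANSMIT_COMPLETE could look like under that proposal. The endpoint, GPU buffer, and MR descriptor names are placeholders, and accepting this flag on a posted receive is the proposed behavior being discussed, not something providers are guaranteed to support today.

```c
/* Sketch only: post a receive that opts out of the default
 * FI_DELIVERY_COMPLETE semantic so the app can batch device flushes.
 * 'ep', 'gpu_buf', 'gpu_desc' (MR descriptor registered with FI_HMEM),
 * and 'ctx' are assumed to already exist.
 */
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

static ssize_t post_recv_transmit_complete(struct fid_ep *ep, void *gpu_buf,
					   size_t len, void *gpu_desc,
					   void *ctx)
{
	struct iovec iov = { .iov_base = gpu_buf, .iov_len = len };
	struct fi_msg msg = {
		.msg_iov = &iov,
		.desc = &gpu_desc,
		.iov_count = 1,
		.addr = FI_ADDR_UNSPEC,
		.context = ctx,
	};

	/* FI_TRANSMIT_COMPLETE here would mean: the completion may be
	 * written before the data is visible in device memory; the
	 * application is responsible for a device-level flush before
	 * touching gpu_buf. */
	return fi_recvmsg(ep, &msg, FI_COMPLETION | FI_TRANSMIT_COMPLETE);
}
```

The appeal is that an app could post many such receives and issue a single device sync before walking the group of buffers, rather than paying for a flush per completion.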
-
Here's a pointer to where the OFI NCCL plugin implements a flush using
-
From March 9, 2021 ofiwg:
- Man pages need to be updated to define 'memory domain'. Data and message ordering applies to transfers targeting the same memory domain. Though there's still an issue that a completion may be written to another domain.
- The best we can do now is document that an application may be required to take some other action to ensure that memory is consistent if transfers target a memory domain separate from where completions are written. However, FI_DELIVERY_COMPLETE semantics should imply global visibility, meaning that the app shouldn't need to take additional actions.
- If a message follows an RMA and they both go to the same memory domain, the message can indicate that the RMA data is present (assuming SAW ordering). But SAW and other ordering is not guaranteed between memory domains.
- On a related topic, see if we can simply add the completion flags to the fi_getinfo caps bits. This allows an app to specify which completion semantics it requires up front, and the provider to report what it can support optimally if none are set. The completion flags are already defined within the caps flag space, so adding them is possible (see the sketch below).
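A rough sketch of the caps-bit idea, assuming completion flags are accepted in hints->caps, which is exactly the open proposal here and not guaranteed fi_getinfo behavior today. The version number and capability set are illustrative only.

```c
/* Sketch: request FI_DELIVERY_COMPLETE as a capability at fi_getinfo()
 * time so the provider knows the required completion semantic up front.
 * Treating completion flags as caps bits is the proposal under
 * discussion, not current behavior. */
#include <rdma/fabric.h>

struct fi_info *get_hmem_dc_info(void)
{
	struct fi_info *hints, *info = NULL;

	hints = fi_allocinfo();
	if (!hints)
		return NULL;

	/* Ask for HMEM plus the stronger completion semantic up front. */
	hints->caps = FI_MSG | FI_RMA | FI_HMEM | FI_DELIVERY_COMPLETE;
	hints->ep_attr->type = FI_EP_RDM;

	if (fi_getinfo(FI_VERSION(1, 12), NULL, NULL, 0, hints, &info))
		info = NULL;

	fi_freeinfo(hints);
	return info;
}
```

If no completion bits are set, the provider would be free to report whichever semantic it can support most efficiently, which matches the "report what it can support optimally" wording above.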
-
PR #6629 contains a novella that attempts to document HMEM requirements in the man pages. The plot's a little weak, as I had trouble trying to identify the antagonist.
-
The goal of this discussion is to update the API and man pages so that HMEM support is more fully defined, with clear requirements on the steps that applications and providers need to take to ensure that the correct semantics are met.
The API attributes were defined assuming that data transfers were to/from host memory. Specifically, attributes such as message ordering, max RMA order size, completion semantics, etc. have not been adjusted to handle heterogeneous memory. As a result, when HMEM is enabled, applications may assume that the same attributes hold true, when they may not, depending on the implementation.
For example, message ordering may indicate that write-after-write ordering is maintained. However, if the first write targets GPU memory and the second write targets host memory, it's possible for a host process at the target to see the results of the second write before it can see the results of the first write. The OFI NCCL provider was actually coded knowing that this would be an issue: it posts a read operation targeting the GPU buffer after receiving an RMA write in order to flush the GPU buffers and ensure that memory is consistent prior to posting computational kernels to the GPU.
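For reference, here is a rough sketch of that flush-by-read approach, under the assumption that the GPU region is registered for RMA and the endpoint can issue a loopback read against itself; the names (ep, cq, host_bounce, gpu_addr, rkey) are all hypothetical, and a real implementation would match completions by context rather than draining the CQ blindly.

```c
/* Sketch: after an RMA write lands in GPU memory, issue a small read
 * from the GPU buffer into a host bounce buffer and wait for its
 * completion. The read forces the pending writes to the GPU buffer to
 * become visible before the read data is returned. */
#include <rdma/fabric.h>
#include <rdma/fi_rma.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

static int flush_gpu_write(struct fid_ep *ep, struct fid_cq *cq,
			   void *host_bounce, void *bounce_desc,
			   fi_addr_t self_addr, uint64_t gpu_addr,
			   uint64_t rkey)
{
	struct fi_cq_entry entry;
	ssize_t ret;

	/* Read a single byte of the GPU buffer back into host memory. */
	ret = fi_read(ep, host_bounce, 1, bounce_desc, self_addr,
		      gpu_addr, rkey, NULL);
	if (ret)
		return (int) ret;

	/* Spin until the read completes; its completion implies the
	 * earlier RMA write data is visible. */
	do {
		ret = fi_cq_read(cq, &entry, 1);
	} while (ret == -FI_EAGAIN);

	return ret < 0 ? (int) ret : 0;
}
```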
It was mentioned in the OFIWG that true one-sided operations (e.g. SHMEM) performing writes to different memory domains at the target could run into similar issues. An RMA write could be performed to GPU memory. A separate write could be made between processes on the same node. But if the other process tries to access the GPU memory, it may not find the updated data without first performing some sort of sync/flush operation.
It's possible in certain cases that the provider can perform the sync/flush call, but that could impose performance penalties. For example, the initiator of an RMA write doesn't necessarily know that the write is going to GPU memory. The initiator doesn't even need to have HMEM support enabled.
In addition to message ordering problems, the completion semantics may need to be expanded as well. The completion semantics are defined with respect to the initiator of the transfer, not the target. There is an additional problem in that the FI_DELIVERY_COMPLETE semantic, which indicates that the data has been placed into the destination buffer (such that any application that tries to read from the buffer will see the updated data), is not easily implemented by hardware. Most current NICs are limited to meeting the FI_TRANSMIT_COMPLETE semantic, or lower.
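For completeness, this is what explicitly requesting the stronger semantic per operation looks like at the initiator; a provider whose NIC only supports FI_TRANSMIT_COMPLETE would have to emulate it in software (e.g. with an extra acknowledgment). The buffer, descriptor, and address names are assumptions.

```c
/* Sketch: send with FI_DELIVERY_COMPLETE so the completion is not
 * written until the data is visible at the target. 'ep', 'buf', 'desc',
 * 'dest', and 'ctx' are assumed to already exist. */
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

static ssize_t send_delivery_complete(struct fid_ep *ep, const void *buf,
				      size_t len, void *desc,
				      fi_addr_t dest, void *ctx)
{
	struct iovec iov = { .iov_base = (void *) buf, .iov_len = len };
	struct fi_msg msg = {
		.msg_iov = &iov,
		.desc = &desc,
		.iov_count = 1,
		.addr = dest,
		.context = ctx,
	};

	return fi_sendmsg(ep, &msg, FI_COMPLETION | FI_DELIVERY_COMPLETE);
}
```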