-
Related to this topic, I had a discussion in the past around NICs that support caching atomic operations and other transfers. NICs with that ability could potentially speed up collective operations by avoiding writes through to host memory on each update. That discussion led to a similar problem related to target-side completions. An app at the target would need to take additional steps to flush cached data from the NIC to memory prior to trying to access the data. This behavior is actually captured in the FI_DELIVERY_COMPLETE description, which uses this wishy-washy wording:
Though no mechanism was ever defined to handle this case. (It was theoretical at the time.)
-
Condensed notes from OFIWG Feb 23, 2021:
- If the initiator requests FI_DELIVERY_COMPLETE semantics, the target should see that same semantic. This should hold for any type of transfer. If needed, the provider may have to perform a flush operation at the target when a transfer lands in device memory.
- A suggestion was made that an app could specify a completion semantic at the receive side. Today, posted receive operations complete using FI_DELIVERY_COMPLETE semantics by default, independent of the sender's completion semantic. That would remain in place and be defined in the man pages. If a receive buffer references device memory, the provider is responsible for performing flush operations prior to writing a completion.
- However, an application can specify FI_TRANSMIT_COMPLETE when posting a receive buffer. This indicates that the app will be responsible for performing a flush prior to accessing the buffer. This gives the app more control and could allow batching completions (see the sketch below).
- The most immediate need for hmem is to support 2-sided transfers (send/receive). The above proposal appears to be sufficient for this case, requires minimal impact to the API, and possibly no changes to the apps.
- For 1-sided transfers that target a different memory domain from the CQ, there is still a gap. The target must be notified somehow that a transfer has occurred. Two options are receiving a message or reading a completion with CQ data sent with the transfer. In both cases, the completion goes to host memory, which may require performing a flush operation. In this case, provider software may not be aware of the RMA operation having occurred. This likely requires that the app handle the situation, with the man pages simply giving guidance on what steps the app may need to take.
- Related: a flush operation could be added to the API, the app could be required to use a device API directly, or both options could be supported. If flush is supported, an app could override it through the hmem ops override mechanism. I think the general feeling was to let the app call the device API directly in situations where it is needed. Providers would call flush/sync internally to support the desired API semantics.
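To make the receive-side idea concrete, here is a minimal sketch of what posting a receive with FI_TRANSMIT_COMPLETE could look like under that proposal. The endpoint, GPU buffer, and MR descriptor names are placeholders, and accepting this flag on a posted receive is the proposed behavior being discussed, not something providers are guaranteed to support today.

```c
/* Sketch only: post a receive that opts out of the default
 * FI_DELIVERY_COMPLETE semantic so the app can batch device flushes.
 * 'ep', 'gpu_buf', 'gpu_desc' (MR descriptor registered with FI_HMEM),
 * and 'ctx' are assumed to already exist.
 */
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

static ssize_t post_recv_transmit_complete(struct fid_ep *ep, void *gpu_buf,
					   size_t len, void *gpu_desc,
					   void *ctx)
{
	struct iovec iov = { .iov_base = gpu_buf, .iov_len = len };
	struct fi_msg msg = {
		.msg_iov = &iov,
		.desc = &gpu_desc,
		.iov_count = 1,
		.addr = FI_ADDR_UNSPEC,
		.context = ctx,
	};

	/* FI_TRANSMIT_COMPLETE here would mean: the completion may be
	 * written before the data is visible in device memory; the
	 * application is responsible for a device-level flush before
	 * touching gpu_buf. */
	return fi_recvmsg(ep, &msg, FI_COMPLETION | FI_TRANSMIT_COMPLETE);
}
```

The appeal is that an app could post many such receives and issue a single device sync before walking the group of buffers, rather than paying for a flush per completion.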
-
Here's a pointer to where the OFI NCCL plugin implements a flush using
-
From March 9, 2021 ofiwg:
- Man pages need to be updated to define 'memory domain'. Data and message ordering applies to transfers targeting the same memory domain. Though there's still an issue that a completion may be written to another domain.
- The best we can do now is document that an application may be required to take some other action to ensure that memory is consistent if transfers target a memory domain separate from where completions are written. However, FI_DELIVERY_COMPLETE semantics should imply global visibility, meaning that the app shouldn't need to take additional actions.
- If a message follows an RMA and they both go to the same memory domain, the message can indicate that the RMA data is present (assuming SAW ordering). But SAW and other ordering is not guaranteed between memory domains.
- On a related topic, see if we can simply add the completion flags to the fi_getinfo caps bits. This allows an app to specify which completion semantics it requires up front, and the provider to report what it can support optimally if none are set. The completion flags are already defined within the caps flag space, so adding them is possible (see the sketch below).
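A rough sketch of the caps-bit idea, assuming completion flags are accepted in hints->caps, which is exactly the open proposal here and not guaranteed fi_getinfo behavior today. The version number and capability set are illustrative only.

```c
/* Sketch: request FI_DELIVERY_COMPLETE as a capability at fi_getinfo()
 * time so the provider knows the required completion semantic up front.
 * Treating completion flags as caps bits is the proposal under
 * discussion, not current behavior. */
#include <rdma/fabric.h>

struct fi_info *get_hmem_dc_info(void)
{
	struct fi_info *hints, *info = NULL;

	hints = fi_allocinfo();
	if (!hints)
		return NULL;

	/* Ask for HMEM plus the stronger completion semantic up front. */
	hints->caps = FI_MSG | FI_RMA | FI_HMEM | FI_DELIVERY_COMPLETE;
	hints->ep_attr->type = FI_EP_RDM;

	if (fi_getinfo(FI_VERSION(1, 12), NULL, NULL, 0, hints, &info))
		info = NULL;

	fi_freeinfo(hints);
	return info;
}
```

If no completion bits are set, the provider would be free to report whichever semantic it can support most efficiently, which matches the "report what it can support optimally" wording above.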
-
PR #6629 contains a novella that attempts to document HMEM requirements in the man pages. The plot's a little weak, as I had trouble trying to identify the antagonist.
-
The goal of this discussion is to update the API and man pages so that HMEM support is more fully defined, with clear requirements on the steps that applications and providers need to take to ensure that the correct semantics are met.
The API attributes were defined assuming that data transfers were to/from host memory. Specifically, attributes such as message ordering, max RMA order size, completion semantics, etc. have not been adjusted to handle heterogeneous memory. As a result, when HMEM is enabled, applications may assume that the same attributes hold true, when they may not, depending on the implementation.
For example, message ordering may indicate that write-after-write ordering is maintained. However, if the first write targets GPU memory and the second write targets host memory, it's possible for a host process at the target to see the results of the second write before it can see the results of the first write. The OFI NCCL provider was actually coded knowing that this would be an issue: it posts a read operation targeting the GPU buffer after receiving an RMA write in order to flush the GPU buffers and ensure that memory is consistent prior to posting computational kernels to the GPU.
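For reference, here is a rough sketch of that flush-by-read approach, under the assumption that the GPU region is registered for RMA and the endpoint can issue a loopback read against itself; the names (ep, cq, host_bounce, gpu_addr, rkey) are all hypothetical, and a real implementation would match completions by context rather than draining the CQ blindly.

```c
/* Sketch: after an RMA write lands in GPU memory, issue a small read
 * from the GPU buffer into a host bounce buffer and wait for its
 * completion. The read forces the pending writes to the GPU buffer to
 * become visible before the read data is returned. */
#include <rdma/fabric.h>
#include <rdma/fi_rma.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

static int flush_gpu_write(struct fid_ep *ep, struct fid_cq *cq,
			   void *host_bounce, void *bounce_desc,
			   fi_addr_t self_addr, uint64_t gpu_addr,
			   uint64_t rkey)
{
	struct fi_cq_entry entry;
	ssize_t ret;

	/* Read a single byte of the GPU buffer back into host memory. */
	ret = fi_read(ep, host_bounce, 1, bounce_desc, self_addr,
		      gpu_addr, rkey, NULL);
	if (ret)
		return (int) ret;

	/* Spin until the read completes; its completion implies the
	 * earlier RMA write data is visible. */
	do {
		ret = fi_cq_read(cq, &entry, 1);
	} while (ret == -FI_EAGAIN);

	return ret < 0 ? (int) ret : 0;
}
```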
It was mentioned in the OFIWG that true one-sided operations (e.g. SHMEM) performing writes to different memory domains at the target could run into similar issues. An RMA write could be performed to GPU memory. A separate write could be made between processes on the same node. But if the other process tries to access the GPU memory, it may not find the updated data without first performing some sort of sync/flush operation.
It's possible in certain cases that the provider can perform the sync/flush call, but that could impose performance penalties. For example, the initiator of an RMA write doesn't necessarily know that the write is going to GPU memory. The initiator doesn't even need to have HMEM support enabled.
In addition to message ordering problems, the completion semantics may need to be expanded as well. The completion semantics are defined with respect to the initiator of the transfer, not the target. There is an additional problem in that the FI_DELIVERY_COMPLETE semantic, which indicates that the data has been placed into the destination buffer (such that any application that tries to read from the buffer will see the updated data), is not easily implemented by hardware. Most current NICs are limited to meeting the FI_TRANSMIT_COMPLETE semantic, or lower.
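For completeness, this is what explicitly requesting the stronger semantic per operation looks like at the initiator; a provider whose NIC only supports FI_TRANSMIT_COMPLETE would have to emulate it in software (e.g. with an extra acknowledgment). The buffer, descriptor, and address names are assumptions.

```c
/* Sketch: send with FI_DELIVERY_COMPLETE so the completion is not
 * written until the data is visible at the target. 'ep', 'buf', 'desc',
 * 'dest', and 'ctx' are assumed to already exist. */
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

static ssize_t send_delivery_complete(struct fid_ep *ep, const void *buf,
				      size_t len, void *desc,
				      fi_addr_t dest, void *ctx)
{
	struct iovec iov = { .iov_base = (void *) buf, .iov_len = len };
	struct fi_msg msg = {
		.msg_iov = &iov,
		.desc = &desc,
		.iov_count = 1,
		.addr = dest,
		.context = ctx,
	};

	return fi_sendmsg(ep, &msg, FI_COMPLETION | FI_DELIVERY_COMPLETE);
}
```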