This issue is not entirely UCC related, but relates to UCC integration into an MPI library (in this case Open MPI).
Trigger: If an application calls MPI_Finalize without freeing all of its communicators, there is no way to properly release UCC resources.
The root cause is the way UCC tracks the destruction of OMPI communicators: not by refcounting the collective referenced by the communicator, but by adding an attribute (COMM_ATTR) on each communicator and expecting the attribute to be properly deleted (which calls into ucc_comm_attr_del_fn). This is what happens when the user frees a communicator via MPI_Comm_free, but the MPI standard does not require attributes of non-freed communicators to be released in MPI_Finalize (at least not for attributes not associated with MPI_COMM_WORLD). The outcome is a disconnect between OMPI, where all communicators are released, and UCC, where they are not.
A proper solution could be complicated to implement. It would require either refcounting all the collectives used by OMPI, or tracking all communicators internally and releasing them when the attribute deletion function (ucc_comm_attr_del_fn) is called on MPI_COMM_WORLD.
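To make the second option concrete, here is a minimal sketch of the tracking idea, written as a standalone C model rather than against the real MPI or UCC APIs. Every name in it (`team_t`, `track_team`, `release_team`, `comm_attr_del`, `demo_leaked_after_finalize`) is hypothetical and stands in for whatever the integration actually uses. It only illustrates the control flow: each per-communicator object is registered at creation, and the deletion callback sweeps whatever is left when it fires for the world communicator.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TRACKED 64

/* Hypothetical stand-in for the per-communicator state held on the UCC side. */
typedef struct {
    int comm_id;   /* stand-in for a communicator handle */
    int released;  /* set once the resources have been freed */
} team_t;

/* Internal registry of every team that still holds resources. */
team_t *tracked[MAX_TRACKED];
size_t  n_tracked = 0;

/* Called when a communicator is set up: remember its team. */
void track_team(team_t *t)
{
    assert(n_tracked < MAX_TRACKED);
    t->released = 0;
    tracked[n_tracked++] = t;
}

/* Free one team's resources and drop it from the registry. */
void release_team(team_t *t)
{
    if (t->released) {
        return; /* already released, nothing to do */
    }
    t->released = 1;
    for (size_t i = 0; i < n_tracked; i++) {
        if (tracked[i] == t) {
            tracked[i] = tracked[--n_tracked]; /* swap-remove */
            break;
        }
    }
}

/* Model of the attribute deletion callback (assumption: is_world is set
 * only when the callback fires for the world communicator at finalize). */
void comm_attr_del(team_t *t, int is_world)
{
    if (!is_world) {
        release_team(t); /* normal path: the user freed this communicator */
        return;
    }
    /* Finalize path: sweep every team whose communicator was never freed. */
    while (n_tracked > 0) {
        release_team(tracked[n_tracked - 1]);
    }
}

/* One freed communicator, one leaked one, then finalize.
 * Returns the number of teams still holding resources afterwards. */
int demo_leaked_after_finalize(void)
{
    team_t world = {0, 0}, a = {1, 0}, b = {2, 0};
    track_team(&world);
    track_team(&a);
    track_team(&b);
    comm_attr_del(&a, 0);     /* user freed communicator a */
    comm_attr_del(&world, 1); /* finalize: only world's callback fires */
    return (int)n_tracked;
}
```

In this model the communicator `b` is never freed by the application, yet its team is still released because the callback for the world communicator empties the registry. A real implementation would additionally need to handle ordering and thread-safety concerns that this sketch ignores.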
Technically yes, but this is not completely over yet for UCC and HCOL. In open-mpi/ompi#12429 I have not removed the attribute on MPI_COMM_WORLD that both libraries are using to detect MPI_Finalize. I don't think it is needed anymore, but such a change must come from y'all.