Roughly speaking, there are two completion semantics for a send operation:
case 1: the send operation is considered completed when the send buffer can be reused.
case 2: the send operation is considered completed when the completion of its corresponding receive no longer depends on calling LCI_progress on this side.
Currently, LCI leans toward case 1: LCI_sends, LCI_sendm, and LCI_sendmn do not take a completion object because their send buffers can be reused immediately; the completion semantics of LCI_sendl depend on the rendezvous protocol (case 1 completion for the "writeimm" protocol and case 2 completion for the "write" protocol).
The lack of case 2 completion semantics can cause a hang at the very end of a run: some processes send their last messages and exit, so the other processes never receive them.
I think the proper way to handle this is to track if any sends aren't completed at the network level and block in LCI_finalize until they are complete, calling LCI_progress as necessary.
This should be simple to track, I think: keep atomic counters for "started" and "completed" sends, and call LCI_progress in a loop during finalize until they are equal. Or keep a single counter of "in progress" sends and compare it to 0, but that maybe adds slightly more cache invalidations? Not sure how much it matters, really.
I don't think we need to explicitly expose the fact that sends aren't complete at the hardware level (due to buffering) to the user.
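The two-counter idea above could look roughly like the following C11 sketch. Every name here (`on_send_posted`, `on_send_completed`, `sends_outstanding`, `drain_sends`, `simulated_progress`) is hypothetical and not part of the LCI API; `simulated_progress` merely stands in for a call to LCI_progress, and the sketch assumes no new sends are posted once draining begins.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping, not LCI code: two monotonically increasing
 * counters, one bumped when a send is posted, one when the network
 * reports that send complete. */
static atomic_ulong sends_started;
static atomic_ulong sends_completed;

/* Call when a send is handed to the network. */
static void on_send_posted(void) {
  atomic_fetch_add(&sends_started, 1);
}

/* Call when the network-level completion for a send is observed. */
static void on_send_completed(void) {
  atomic_fetch_add(&sends_completed, 1);
}

/* True while at least one posted send has not completed.
 * "completed" is read before "started" so that a racing completion
 * can only make this report true spuriously, never false spuriously. */
static bool sends_outstanding(void) {
  unsigned long completed = atomic_load(&sends_completed);
  unsigned long started = atomic_load(&sends_started);
  return started != completed;
}

/* Signature of whatever drives the network; a stand-in for LCI_progress. */
typedef void (*progress_fn)(void *ctx);

/* What a finalize routine might do: keep making progress until every
 * posted send has completed at the network level. */
static void drain_sends(progress_fn progress, void *ctx) {
  assert(progress != NULL);
  while (sends_outstanding()) {
    progress(ctx);
  }
}

/* Simulation only: pretend each progress call completes one send. */
static void simulated_progress(void *ctx) {
  (void)ctx;
  if (sends_outstanding()) {
    on_send_completed();
  }
}
```

A finalize path would then call something like `drain_sends(simulated_progress, NULL)` (with the real progress function in place of the simulation) before tearing down the network. The single "in progress" counter variant would replace the two counters with one that is incremented on post and decremented on completion, at the cost of both paths writing the same cache line.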