Performance issue with ttnn::queue_synchronize and ttnn::event_synchronize, especially with multiple producers. #10332
Comments
Hi @tt-asaigal. Can you check this issue?
Hey, for the concern raised about `ttnn::event_synchronize`: we can move the stall from the worker thread to the application thread. This will free up the worker thread to pick up tasks pushed by another application thread. For the issue regarding `ttnn::queue_synchronize`: is each application thread tied to a specific CQ? If not, such a setup can non-deterministically enter an invalid state.
Fundamentally, if each thread is tied to a specific CQ, we don't have to worry about queue flushes accounting for tasks pushed by more than one application thread. Additionally, we propose the use of on-device event synchronization using the event APIs.
Use of Event APIs as a Workaround
I used the code above instead of Software Queue and Hardware Queue: https://community.amd.com/t5/opencl/how-to-use-opencl-multiple-command-queues/m-p/599573
Answers to your questions:
Suppose that there are 4 threads, and each thread has its own software queue.
This is the case where multiple app threads use the same CQ.
As explained, this is the case where multiple app threads use the same CQ.
It means that the ordering of commands in the same software queue should be guaranteed, but the ordering of commands across different software queues does not need to be guaranteed.
- Add multi-threaded event sync test
Hey @namhyeong-kim, thanks for the explanation. I understand your use case better now.
Yes, the host-side queue is the worker queue. Given that you're using event synchronization APIs to ensure ordering across software queues and worker threads, I have a change staged here that moves the stall to the application thread. I've added a test here to verify functionality similar to what was described above. This should take care of the first issue you mentioned, with the work-executor stalling and not being able to pick up work from other application threads if one issues a host-side sync.
Closing the issue and will re-open later if required.
There are performance issues with the two synchronization ttnn APIs when `WorkExecutorMode::AsyncMode` is set and multiple producer threads use the `worker_queue`. Let's call the two producer threads app1 and app2.

ttnn::event_synchronize(Device* device, std::shared_ptr<Event> event)
Link to code
Logic of ttnn::event_synchronize:
1. App1 calls `ttnn::event_synchronize`. `EventSynchronize(event)` is enqueued into the `worker_queue`.
2. `work_executor` calls `EventSynchronize(event)` and is blocked.
3. App2 enqueues W2~W4 into the `worker_queue`.
4. App2's works cannot be enqueued into `cq2` because `work_executor` is blocked until the event is completed.
   i. This degrades performance.
Expected behavior
`work_executor` should not be blocked by `EventSynchronize(event)`; only app1 should be blocked. `work_executor` should enqueue W2~W4 into `cq2` while app1 waits for the event.

ttnn::queue_synchronize(CommandQueue& cq)
Link to Code
Logic of ttnn::queue_synchronize
`ttnn::queue_synchronize(CommandQueue& cq)` calls `cq.device()->synchronize()` to wait for the `worker_queue` to be flushed, and then calls `Finish(cq)` to wait for the works to be completed. How does it wait for the `worker_queue`? It uses a polling mechanism: it checks whether the `worker_queue` is empty; if so, it breaks out of the while loop; otherwise, it sleeps for 10 us and goes back to the while loop. This polling mechanism raises a performance issue.
Link to code
cq.device()->synchronize() problem.
Let's assume that app1 called `ttnn::queue_synchronize(cq1)` and is currently sleeping.
1. App1 enqueues W1 into the `worker_queue` with `cq1` and calls `ttnn::queue_synchronize(cq1)`; `work_executor` pops W1 and flushes it into `cq1`.
2. App2 enqueues W2~W4 into the `worker_queue` with `cq1` while app1 sleeps.
3. App1 wakes up and checks whether the `worker_queue` is empty. The `worker_queue` still has 3 works (W2~W4), so app1 sleeps for 10 us again.
   i. This degrades performance. App1 should not sleep again; app1 should wait only for W1 to be enqueued.
As in the above scenario, app1 sleeps again and again as long as app2 keeps enqueuing works into the `worker_queue`.

Finish(cq) problem.
`Finish(cq1)` waits for all works flushed into `cq1` by `work_executor` to be completed.
   i. This degrades performance: app1 should wait only for W1 to be completed, not for W2~W4.
Expected behavior
App1 should wait only until 'W1 is flushed into `cq1` and completed'. Generally:
- In `cq.device()->synchronize()`, app1 should not wait for works enqueued after calling `ttnn::queue_synchronize` to be flushed in the `worker_queue`.
- In `Finish(cq)`, app1 should not wait for works enqueued after calling `ttnn::queue_synchronize` to be completed in the `CommandQueue`.