
Performance issue with ttnn::queue_synchronize and ttnn::event_synchronize, especially with multiple producers. #10332

Closed
namhyeong-kim opened this issue Jul 16, 2024 · 5 comments
Labels
moreh contribution, P2

Comments

@namhyeong-kim
Contributor

namhyeong-kim commented Jul 16, 2024

There are performance issues with the two ttnn synchronization APIs when WorkExecutorMode::AsyncMode is set and multiple producer threads use the worker_queue. Call the two producer threads app1 and app2.

ttnn::event_synchronize(Device* device, std::shared_ptr&lt;Event&gt; event)

Link to code

void event_synchronize(Device* device, std::shared_ptr<Event> event) {
    device->push_work([event] () {
        EventSynchronize(event);
    });
    device->synchronize();
}
(diagram: ttnn_event_synchronize)
  1. app1 calls ttnn::event_synchronize. EventSynchronize(event) is enqueued into worker_queue.
  2. work_executor calls EventSynchronize(event) and is blocked.
  3. app2 enqueues W2~W4 into worker_queue.
  4. W2~W4 cannot be enqueued into cq2 because work_executor is blocked until the event completes.
    i. This degrades performance.

Expected behavior

work_executor should not be blocked by EventSynchronize(event). Only app1 should block. work_executor should enqueue W2~W4 into cq2 while app1 waits for the event.

ttnn::queue_synchronize(CommandQueue& cq)

Link to Code

void queue_synchronize(CommandQueue& cq) {
    // Ensure that all work pushed to async engine has been passed
    // off to device CQ
    cq.device()->synchronize();
    // Wait for device CQ to finish
    Finish(cq);
}

Logic of ttnn::queue_synchronize

ttnn::queue_synchronize(CommandQueue& cq) calls cq.device()->synchronize() to wait for the worker_queue to be flushed, and then calls Finish(cq) to wait for the enqueued work to complete. How does it wait for the worker_queue? It uses a polling mechanism: it checks whether the worker_queue is empty; if so, it breaks out of the while loop, otherwise it sleeps for 10 us and checks again. This polling mechanism raises a performance issue.
Link to code

    inline void synchronize() {
        if (this->work_executor_mode == WorkExecutorMode::ASYNCHRONOUS and
            not(std::hash<std::thread::id>{}(std::this_thread::get_id()) == worker_queue.worker_thread_id.load())) {
            // Blocking = wait for queue flushed. Worker thread cannot explicitly insert a synchronize, otherwise we have
            // a deadlock.
            this->worker_queue.push([]() {});  // Send flush command (i.e. empty function)
            {
                std::lock_guard lock(this->cv_mutex);
                cv.notify_one();
            }
            // Wait for queue empty, i.e. flush command picked up
            while (not this->worker_queue.empty()) {
                std::this_thread::sleep_for(std::chrono::microseconds(10));
            };
        }
    }
(diagram: ttnn_queue_synchronize)

cq.device()->synchronize() problem.

Let's assume that app1 called ttnn::queue_synchronize(cq1) and is currently sleeping.

  1. app1 waits for W1 to be enqueued into cq1.
  2. W1 is enqueued by work_executor.
  3. app2 enqueues 3 works into worker_queue with cq1 while app1 sleeps.
  4. app1 wakes up and checks whether worker_queue is empty. worker_queue still holds 3 works, so app1 sleeps for 10 us again.
    i. This degrades performance. app1 should not sleep again; it should wait only for W1 to be enqueued.

As the scenario above shows, app1 keeps sleeping as long as app2 continues to enqueue works into the worker_queue.

Finish(cq) problem.

  1. W2~W4 are enqueued into cq1 by work_executor.
  2. app1 waits not only for W1 but also for W2~W4 to complete.
    i. This degrades performance. app1 should wait only for W1 to complete, not for W2~W4.

Expected behavior

  1. app1 should wait only for W1 to be enqueued into cq1 and completed.
  2. app1 should not wait for W2~W4 to be enqueued and completed.

Generally,

  1. In cq.device()->synchronize(), app1 should not wait for works enqueued into the worker_queue after the call to ttnn::queue_synchronize to be flushed.
  2. In Finish(cq), app1 should not wait for works enqueued into the CommandQueue after the call to ttnn::queue_synchronize to be completed.
@namhyeong-kim namhyeong-kim added the moreh contribution label Jul 16, 2024
@namhyeong-kim
Contributor Author

Hi @tt-asaigal . Can you check this issue?

@razorback3 razorback3 added the P2 label Jul 16, 2024
@tt-asaigal
Contributor

Hey, for the concern raised about ttnn::event_synchronize, this API can be modified to block only in the calling application thread, i.e.:

void event_synchronize(std::shared_ptr<Event> event) {
        EventSynchronize(event);
}

This will free up the worker thread to pick up tasks pushed by another application thread.

For the issue regarding ttnn::queue_synchronize, it would be great if I could understand the use case better. Here are the questions I have:

  1. What’s the use case for multiple app threads using the same CQ? We were under the impression that each thread would be tied to a specific CQ with events used for synchronizing the tasks.

  2. Is a Host Side Queue Flush needed during workload execution? We have record_event and wait_for_event APIs that allow device to sync CQs, without host in the loop. These will provide better performance, as application threads will not get blocked.

  3. If multiple threads push to the same CQ, how do we guarantee ordering? Does the application use a mutex + some ordering mechanism to ensure that a CQ gets commands in order?

If not, such a setup can non-deterministically enter an invalid state.

If so, threads are completely serialized -> multi-threaded setup with single threaded performance.

Fundamentally, if each thread is tied to a specific CQ, we don't have to worry about queue flushes accounting for tasks pushed by more than one application thread.

Additionally, we propose the use of on-device event synchronization using record_event and wait_for_event until the host absolutely needs to block, since these APIs will provide better performance.

@namhyeong-kim
Contributor Author

Use of Event APIs as a Workaround

void event_synchronize(std::shared_ptr<Event> event) {
        EventSynchronize(event);
}

I used the code above instead of queue_synchronize. But your commit changed it so that work_executor waits for the event.

Software Queue and Hardware Queue

https://community.amd.com/t5/opencl/how-to-use-opencl-multiple-command-queues/m-p/599573
In the link, the concepts of software queue and hardware queue are explained.

  1. In CUDA and HIP, streams are the software queues.
  2. On the other hand, there is no explicit software-queue concept in the tt-metal repository. The CQ is the hardware queue.
  3. Instead, worker_queue with WorkExecutorMode::AsyncMode and WorkerQueueMode::LOCKBASED currently plays the role of the software queue.
    i. We requested a software queue and you provided the ttnn async runtime.

Answers to your question.

What’s the use case for multiple app threads using the same CQ?

Suppose that there are 4 threads, and each thread has its own software queue.

  1. WorkExecutorMode::AsyncMode and WorkerQueueMode::LOCKBASED are turned on.
  2. Assign CQ1 to two threads' software queues, and CQ2 to the other threads' software queues.
  3. Each thread pushes commands into worker_queue with its own cq-id.

This is the case when multiple app threads use the same CQ.

Is a Host Side Queue Flush needed during workload execution? We have record_event and wait_for_event APIs that allow device to sync CQs, without host in the loop. These will provide better performance, as application threads will not get blocked.

Is the host-side queue the worker_queue? If so, it does not need to be flushed. As I said at the top, I did not use queue_synchronize; I used the event APIs as you suggested.

If multiple threads push to the same CQ, how do we guarantee ordering? Does the application use a mutex + some ordering mechanism to ensure that a CQ gets commands in order?

  1. The ordering of commands within the same software queue must be guaranteed.
  2. The ordering between commands in different software queues does not need to be guaranteed.
  3. Synchronization between software queues is up to the application, using the event APIs.

As explained, this is how multiple app threads end up using the same CQ:

  1. Each thread has its own software queue.
  2. The software queues just happen to be assigned the same CQ.

This means the ordering of commands within the same software queue must be guaranteed, but the ordering of commands across different software queues does not need to be.
When we need synchronization across software queues, we use event APIs such as record_event, wait_for_event, and event_synchronize.

@tt-asaigal
Contributor

Hey @namhyeong-kim thanks for the explanation. I understand your use case better now.

Is the host-side queue the worker_queue? If so, it does not need to be flushed. As I said at the top, I did not use queue_synchronize; I used the event APIs as you suggested.

Yes, the host-side queue is the worker_queue. ttnn::queue_synchronize will flush all tasks enqueued to the work executor.

Given that you're using the event synchronization APIs to ensure ordering across software queues and worker threads, I have a change staged here that moves the stall to the application thread. I've added a test here to verify functionality similar to what was described above.

This should take care of the first issue you mentioned, with the work-executor stalling and not being able to pick up work from other application threads, if one issues a host side sync.

tt-asaigal added a commit that referenced this issue Aug 20, 2024
tt-asaigal added a commit that referenced this issue Aug 21, 2024
tt-asaigal added a commit that referenced this issue Aug 23, 2024
tt-asaigal added a commit that referenced this issue Aug 23, 2024
tt-asaigal added a commit that referenced this issue Aug 23, 2024
jaykru-tt pushed a commit that referenced this issue Aug 29, 2024
@razorback3
Copy link
Contributor

Closing the issue and will re-open later if required.
