Conductor HA race-condition for tasks completing within milliseconds #2333

guru1306 · 2021-06-24T16:28:31Z

guru1306
Jun 24, 2021

Hi All

We are using AMQP Message Queues to run our tasks with Dynomite as the backend. The queue is used to send tasks to the workers and receive completion notification for the conductor.

Here is the flow.
As part of scheduling , the Conductor is doing the below steps in one thread.
Thread 1:
Step 1: Workflow Executor's schedule is Publishing the task. The type of task is an event and async complete is set to true
Step 2: Update the task to in progress as the async complete is set to true

This is the code corresponding to the above steps.

conductor/core/src/main/java/com/netflix/conductor/core/execution/WorkflowExecutor.java

Line 1566 in b8ca41c

workflowSystemTask.start(workflow, task, this);

When it receives a task completion notification on the AMQP message queue, here are the steps.
Thread 2:
Step 1: SimpleEventProcessor class receives task completion notification
Step 2: Update the task to completion state.

The complete task method updates the task to complete state

conductor/core/src/main/java/com/netflix/conductor/core/events/SimpleActionProcessor.java

Line 77 in b8ca41c

    
           private Map<String, Object> completeTask(Action action, Object payload, TaskDetails taskDetails, Status status,

Now we are facing a problem due to the above-mentioned logic. Some of the tasks get completed in milliseconds and the task completion notification comes within milliseconds.
Intermittently , the update to completion by T2 happens before the update to the in-progress state by T1 as there are no locks. The end state becomes in progress state. The decider finally incorrectly marks the task status as completed with errors due to time out. There are no common locks to prevent the incorrect update arising out of race conditions between two threads.

Proposed solutions:
Solution 1: Use the GET SET command in the Redis to get the task before the update. The conductor can check if the in-progress update has not happened after the status is set to complete state. If the status was already complete, the Redis DAO would set it back to a complete state. But this would lead to adding intelligence at the DAO layer so that the status cannot change from complete to in-progress.
Redis GET SET is atomic. For more details please refer to this link.
https://redis.io/commands/GETSET

Solution 2: Use the same strategy as Solution 1 but move the logic of retrying to WorkflowExecutor. Add another method in ExecutionDAO to return the value stored for the task before updating using GET SET. WorkflowExecutor can check if the state is desired. If not then set the required state back.

Solution 3: The scheduled task method is running as part of decide already. Use the acquire lock while updating the task to a completed state. But it's not a blocking call now hence the update to completion has to wait until it acquires a lock. But the lock is not enabled by default and a Zookeeper is needed for the distributed lock. Hence it will not work for everyone using the conductor. This will also increase the time to acknowledge the message once the task is marked as completed.

I am leaning towards Solution 2. Please let me know your thoughts.

Thanks
Guru

apanicker-nflx · 2021-07-01T00:53:58Z

apanicker-nflx
Jul 1, 2021
Maintainer

Moved to issue #2336

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Conductor HA race-condition for tasks completing within milliseconds #2333

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment

{{title}}

Select a reply

Conductor HA race-condition for tasks completing within milliseconds #2333

guru1306 Jun 24, 2021

Replies: 1 comment

apanicker-nflx Jul 1, 2021 Maintainer

guru1306
Jun 24, 2021

apanicker-nflx
Jul 1, 2021
Maintainer