This repository has been archived by the owner on Dec 13, 2023. It is now read-only.
Conductor HA race-condition for tasks completing within milliseconds #2333
Locked
guru1306
started this conversation in
Show and tell
Replies: 1 comment
-
Moved to issue #2336 |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
Hi All
We are using AMQP Message Queues to run our tasks with Dynomite as the backend. The queue is used to send tasks to the workers and receive completion notification for the conductor.
Here is the flow.
As part of scheduling , the Conductor is doing the below steps in one thread.
Thread 1:
Step 1: Workflow Executor's schedule is Publishing the task. The type of task is an event and async complete is set to true
Step 2: Update the task to in progress as the async complete is set to true
This is the code corresponding to the above steps.
conductor/core/src/main/java/com/netflix/conductor/core/execution/WorkflowExecutor.java
Line 1566 in b8ca41c
When it receives a task completion notification on the AMQP message queue, here are the steps.
Thread 2:
Step 1: SimpleEventProcessor class receives task completion notification
Step 2: Update the task to completion state.
The complete task method updates the task to complete state
conductor/core/src/main/java/com/netflix/conductor/core/events/SimpleActionProcessor.java
Line 77 in b8ca41c
Now we are facing a problem due to the above-mentioned logic. Some of the tasks get completed in milliseconds and the task completion notification comes within milliseconds.
Intermittently , the update to completion by T2 happens before the update to the in-progress state by T1 as there are no locks. The end state becomes in progress state. The decider finally incorrectly marks the task status as completed with errors due to time out. There are no common locks to prevent the incorrect update arising out of race conditions between two threads.
Proposed solutions:
Solution 1: Use the GET SET command in the Redis to get the task before the update. The conductor can check if the in-progress update has not happened after the status is set to complete state. If the status was already complete, the Redis DAO would set it back to a complete state. But this would lead to adding intelligence at the DAO layer so that the status cannot change from complete to in-progress.
Redis GET SET is atomic. For more details please refer to this link.
https://redis.io/commands/GETSET
Solution 2: Use the same strategy as Solution 1 but move the logic of retrying to WorkflowExecutor. Add another method in ExecutionDAO to return the value stored for the task before updating using GET SET. WorkflowExecutor can check if the state is desired. If not then set the required state back.
Solution 3: The scheduled task method is running as part of decide already. Use the acquire lock while updating the task to a completed state. But it's not a blocking call now hence the update to completion has to wait until it acquires a lock. But the lock is not enabled by default and a Zookeeper is needed for the distributed lock. Hence it will not work for everyone using the conductor. This will also increase the time to acknowledge the message once the task is marked as completed.
I am leaning towards Solution 2. Please let me know your thoughts.
Thanks
Guru
Beta Was this translation helpful? Give feedback.
All reactions