Can Conductor handle 100 million workflows per day? #2299

rickfish · 2021-06-09T17:42:37Z

Currently we are using Conductor with a Postgres backend and Conductor is deployed on OpenShift. We are processing about 300,000 workflows per day that create around 1.5 million tasks in total. Performance is good and there is room for additional growth.

We have recently been approached by a team in our organization that wants to use Conductor to start 100 million workflows per day consistently. I am pretty sure that our current setup will not handle that volume, at least as far as Postgres and our HTTP infrastructure go.

My question is: I believe Netflix has Conductor deployed on AWS and is using DynamoDB as the data store. Is this correct? If so, do you think it is possible to construct an environment that can handle this kind of volume and, if so, what do we need to consider?
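For a sense of scale, the daily figures in this post can be converted to per-second rates. The sketch below is back-of-the-envelope arithmetic only. It assumes load is spread evenly over 24 hours and that the new team's workflows would average the same number of tasks per workflow as the current ones, neither of which is stated above.

```python
# Back-of-the-envelope rates from the figures quoted in the post.
# Assumptions (not from the post): uniform load over 24 hours, and the
# same tasks-per-workflow ratio at the target volume as today.

SECONDS_PER_DAY = 24 * 60 * 60


def per_second(per_day: float) -> float:
    """Convert a daily count to an average per-second rate."""
    return per_day / SECONDS_PER_DAY


current_workflows_per_day = 300_000
current_tasks_per_day = 1_500_000
target_workflows_per_day = 100_000_000

# Average tasks created per workflow in the current deployment.
tasks_per_workflow = current_tasks_per_day / current_workflows_per_day

# Projected task volume if the ratio holds (an assumption).
target_tasks_per_day = target_workflows_per_day * tasks_per_workflow

growth_factor = target_workflows_per_day / current_workflows_per_day

print(f"current workflows/s: {per_second(current_workflows_per_day):.2f}")
print(f"target workflows/s:  {per_second(target_workflows_per_day):.0f}")
print(f"target tasks/s:      {per_second(target_tasks_per_day):.0f}")
print(f"growth factor:       {growth_factor:.0f}x")
```

Real traffic is rarely uniform, so peak rates would likely be higher than these averages.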

kishorebanala · 2021-06-15T00:08:45Z

Collaborator

Hey @rickfish, interesting question.

> I believe Netflix has Conductor deployed on AWS and is using DynamoDB as the data store. Is this correct?

Yes, we run Conductor on AWS. But no, we don't use DynamoDB: we use Cassandra for persistence and a queuing layer built on top of Redis.

> If so, do you think it is possible to construct an environment that can handle this kind of volume and, if so, what do we need to consider?

I guess running benchmarks is the only way to find the answer. But some of the theoretical limitations we can think of when scaling Conductor are:

- Persistence layer / queuing layer throughput
- Scaling the decider queue using partitions (which is something we're also considering)
- Writes to Elasticsearch go through an in-memory queue; at this volume, we might easily overrun the in-memory queues
- Handling task poll volume
- etc.

Please let us know your findings if you pursue this. We're very curious to learn more and to see if we can apply these learnings where possible.
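One way to picture the decider-queue partitioning mentioned in this post is hash-based sharding of workflow IDs across several queues. The following is a generic sketch of that technique, not Conductor's implementation. The function names, the `decider_queue` base name, and the partition count are all hypothetical.

```python
import hashlib
import uuid
from collections import Counter


def partition_for(workflow_id: str, num_partitions: int) -> int:
    """Map a workflow ID to a stable partition index in [0, num_partitions)."""
    # A cryptographic hash is used only for its even, stable distribution;
    # Python's built-in hash() is salted per process and would not be stable.
    digest = hashlib.sha256(workflow_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def decider_queue_name(workflow_id: str, num_partitions: int,
                       base: str = "decider_queue") -> str:
    """Build a hypothetical per-partition queue name for a workflow ID."""
    return f"{base}_{partition_for(workflow_id, num_partitions)}"


if __name__ == "__main__":
    # Workflow IDs are random UUIDs, so they should spread roughly evenly.
    ids = [str(uuid.uuid4()) for _ in range(10_000)]
    counts = Counter(partition_for(i, 8) for i in ids)
    for partition in sorted(counts):
        print(partition, counts[partition])
```

Because the mapping depends only on the ID, every server would route a given workflow to the same partition without coordination. Changing the partition count remaps most IDs, which is one of the trade-offs of plain modulo hashing.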

Jiehong · 2021-09-11T10:32:12Z

@rickfish: any news on this? I'm interested in the numbers.

We also run Conductor (with many things added on top) on OpenShift, but with an Oracle DB so far.
Decider queue scaling is the part we are a bit worried about at the moment.
That said, our volume is 10 times lower than yours so far.

v1r3n · 2021-09-15T17:57:46Z

@rickfish 100M per day is certainly possible with Conductor. The key things to consider here are going to be:

- Queuing backend: a Redis Cluster based solution will be the most suitable here, given that it will be able to shard your workload better
- Storage backend: given the volume, Postgres will be fine for workflow archival, but for active workflows you will want to either shard it or use Redis

Curious to know if you ran some benchmarks and would love to see the numbers if possible.

cc: @Jiehong

anbxtres · 2021-09-18T20:15:33Z

+1. With the right setup, 100M is easy on this platform.

2 replies

yanghaogn Mar 2, 2022

How is the number of workflows defined? If 3 workflowIds use one workflow name, is the number 1 or 3?

CherishSantoshi Mar 23, 2022

@anbxtres Curious to know if you've done, or are currently doing, 100M workflows/day.
If yes, what is your setup?

yanghaogn · 2022-03-02T08:59:10Z

How is the number of workflows defined? If 3 workflowIds use one workflow name, is the number 1 or 3?

4 replies

v1r3n Mar 2, 2022

Do you mean the workflowId? WorkflowIds are random UUIDs with no particular ordering associated with them (i.e., given two IDs, you cannot compare them for ordering). Each workflow does have a start and end time associated with it, though, showing when the workflow started and when its execution completed.

yanghaogn Mar 2, 2022

I'm wondering about the question "Can Conductor handle 100 million workflows per day?"

What does "100 million workflows" mean?

Is it 1 or 3 when 3 workflowIds use "add_netflix_identation"?

v1r3n Mar 2, 2022

Yes, 100M means 100M workflow executions. So for the example above, there could be 100M invocations of the "add_netflix_identation" workflow.

yanghaogn Mar 2, 2022

Thanks for your answer.

shahmanudilip · 2022-03-17T16:45:30Z

@rickfish Can you please also detail the configuration with which you are running 300,000 workflows per day? Insight into the following would be very helpful:

- Number of Conductor servers running, with memory and compute. Are all servers running system tasks? Is a load balancer sitting in front, if in HA?
- Postgres version, with memory and compute
- Number of worker processes polling Conductor

Thanks

4 replies

rickfish Mar 22, 2022
Author

@shahmanudilip We are running Conductor in OpenShift. We have split it into 2 applications based on configuration. We have 2 pods that just run the background threads to do the sweeping and 4 pods that process the REST service requests. The background pods are allocated 2GB of memory and 1 virtual CPU, and we configure 40 sweeper threads each. The REST pods are allocated 2.5GB of memory and 2 virtual CPUs.

We are on v12.7 of Postgres. The database takes up about 3TB. I'm not sure about the memory allocated, as our database group locks that down so I can't query it. We have a maximum of 750 connections available and typically use about 150 at any given time, but I have seen connections reach 500 at times.

I am not sure of the number of worker processes as Conductor is an enterprise-wide resource and teams in our organization can deploy workers on their own. I would estimate from 50 to 75 workers are polling.

jamesgunja Apr 15, 2022

@rickfish Thanks for posting this question and the insights on your setup. Curious: what are you using for the queuing layer here?

rickfish Apr 15, 2022
Author

@jamesgunja My discussion started with my last customer, who was using Postgres for persistence as well as queueing. I am no longer there.

coderrr22 Jun 13, 2022

Hi @rickfish, thanks for the detailed answer. I would like to know which configurations/settings you changed in order to have 4 pods that only handle the HTTP service requests. In our Conductor deployment we have a problem where HTTP tasks take far too long to get picked up and start executing, and we are looking for possible solutions.

rickfish · 2022-06-14T18:47:43Z

Author

@coderrr22 By HTTP service requests I don't mean HTTP tasks; I mean any of the Conductor REST service requests. We have two Conductor instances deployed on Kubernetes (OpenShift), each configured differently. One only runs the background threads and is allocated 2 pods (40 sweeper threads each); the other handles the REST service requests and has 4 pods. We use the following config (on version 2.31 of Conductor) to disable the background threads on the REST instance:
decider.sweep.disable=true
workflow.event.processor.thread.count=0
workflow.system.task.worker.thread.count=0
workflow.sweeper.thread.count=0

The background thread instance doesn't get forwarded any REST requests.

1 reply

mariomartucci Jul 8, 2022

Hi, can you give me more information on how to run Conductor with 2 pods? Thank you.

rickfish · 2022-07-08T13:58:50Z

Author

@mariomartucci The 2 pods are a setting in the Kubernetes deployment config file that we use when deploying to Red Hat OpenShift (the spec.replicas attribute, I think).

3 replies

mariomartucci Jul 11, 2022

Hi @rickfish, can you share your Conductor configuration for HA mode? Thank you.

rickfish Jul 11, 2022
Author

@mariomartucci I'm not exactly sure that this is 'HA mode', but this is how we set it up on OpenShift:

We have two Conductor 'applications' deployed in the same Kubernetes (OpenShift) namespace, each configured differently. One only runs the background threads and is allocated 2 pods (replicas=2) with 40 sweeper threads each; the other handles the REST service requests and has 4 pods. We use the following config (on version 2.31 of Conductor) to disable the background threads on the REST instance:
decider.sweep.disable=true
workflow.event.processor.thread.count=0
workflow.system.task.worker.thread.count=0
workflow.sweeper.thread.count=0

The background thread instance doesn't get forwarded any REST requests.

mark91m12 Jul 13, 2022

Hi @rickfish, can you tell me where I can find documentation about the configuration that you showed above?

rickfish · 2022-07-13T21:31:48Z

Author

@mark91m12 I got those properties while going through the code; I'm not sure if they are all documented somewhere. Keep in mind that this is version 2.31 of Conductor, not 3.x, so these property names have probably changed.

2 replies

mark91m12 Jul 18, 2022

Hi @rickfish, first of all, thank you for your reply. Yes, I saw all these configuration variables in the CHANGELOG file (with new names for version 3.x). I would really appreciate it if you could give me more information about your installation through a direct channel. Can we schedule a meeting or something like that?

mark91m12 Sep 2, 2022

Hi @rickfish, as I said in July, I would like to speak with you. I'm leaving my email here in case you want to discuss your configuration and installation (marco.amato@altiliagroup.com).

rickfish · 2022-10-11T07:25:54Z

Author

Hi Marco. We can meet if you like. Not sure what software to use for that. You decide. Keep in mind that the work I did in Conductor was for my former client so I will have to remember some things...

jamesgunja · 2022-10-11T14:56:58Z

How many of you would be interested in having Conductor as a k8s operator? I can spend some time getting it off the ground if there is enough interest.

1 reply

egandro Aug 13, 2023

@jamesgunja I'm willing to help and work with you on this.
