
Newskooler

10/29/2020, 9:32 AM
Hi! When I run a fairly simple Flow (get data -> check if it exists -> save to a couple of places), and I map a task so that it needs to run 15k+ times, why does it take over 60 min for the mapped tasks to start? I guess it's expected behaviour (I am running a single-worker Dask executor), but I want to understand why this delay happens, with the hope of optimizing it a bit. Thanks : )
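For context, a minimal sketch of the kind of flow being described, assuming Prefect 0.13-era APIs (the task names and item count are illustrative, not the actual code):

```python
from prefect import Flow, task
from prefect.engine.executors import DaskExecutor

@task
def get_ids():
    # stand-in for "get data": returns the 15k+ items to map over
    return list(range(15_000))

@task
def process(item):
    # stand-in for "check if it exists -> save to a couple of places"
    return item

with Flow("mapped-example") as flow:
    ids = get_ids()
    process.map(ids)

# single local Dask cluster by default
flow.run(executor=DaskExecutor())
```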

Dylan

10/29/2020, 1:49 PM
Hi Stelian! Just to clarify, it takes 60 minutes to start all of the mapped tasks?

Newskooler

10/29/2020, 1:51 PM
Hi @Dylan, so I have mapped and non-mapped tasks. The non-mapped ones (which usually come before the mapped ones) start straight away. Then there is a long delay before the first mapped task starts - 60 min in the case of 15k mapped tasks. The delay scales roughly with the number of mapped tasks.

Dylan

10/29/2020, 2:00 PM
What version of Prefect are you running on?

Newskooler

10/29/2020, 2:01 PM
prefect==0.13.9

Dylan

10/29/2020, 2:08 PM
I'd suggest either:
• Sizing up your Dask worker resources
• Increasing the number of Dask workers
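A hedged sketch of what that could look like with Prefect 0.13's DaskExecutor, assuming the default local cluster (the worker counts below are illustrative, not a recommendation):

```python
from prefect.engine.executors import DaskExecutor

# With no address given, DaskExecutor spins up a local distributed.LocalCluster;
# cluster_kwargs is forwarded to it, so you can ask for more workers/threads.
executor = DaskExecutor(
    cluster_kwargs={"n_workers": 4, "threads_per_worker": 2}  # illustrative values
)

flow.run(executor=executor)
```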

Newskooler

10/29/2020, 2:09 PM
So essentially move to a bigger server and then add more workers?

Dylan

10/29/2020, 2:10 PM
Correct haha

Newskooler

10/29/2020, 2:12 PM
Okay, but why is there a bottleneck here? Is this expected? Can it be improved significantly on my end (as the user) or on yours?

nicholas

10/29/2020, 3:09 PM
Hi @Newskooler - can you give us some idea of your setup?
• Are you running on Kubernetes or something else?
• How large are your tasks? What's the size of the output of cloudpickle.dumps(your_mapped_task) in bytes?
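One way to measure that (here your_mapped_task stands in for the actual task object):

```python
import cloudpickle

# Size of the serialized task in bytes - large values mean more data
# gets shipped to the Dask workers for every mapped child.
print(len(cloudpickle.dumps(your_mapped_task)))
```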

Newskooler

10/29/2020, 3:43 PM
Hi @nicholas:
• I am running a DaskExecutor on a simple server (no Kubernetes)
• I don't quite understand the second question - do you mean how large the data I move between the tasks is?

nicholas

10/29/2020, 3:58 PM
Got it @Newskooler - it's likely that your Dask cluster is starved for resources with that many tasks. Tasks that serialize lots of data and pass it between them, large task graphs, resource-starved schedulers/workers/clients, or even poorly structured code can all be bottlenecks. In your case it sounds like there's a combination of all of the above. For now I'd follow @Dylan's advice and look at the resources you're giving your scheduler/workers.
I'd also take a look at some of the Dask resources for configuring your Dask cluster; you can find really good docs for that here and here.
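For reference, a sketch of pointing the executor at a separately managed Dask cluster, whose resources you control when starting it (the address and CLI flags below are illustrative; see the Dask docs referenced above for the details):

```python
from prefect.engine.executors import DaskExecutor

# Start the cluster yourself, e.g. (illustrative, 2020-era Dask CLI):
#   dask-scheduler
#   dask-worker tcp://127.0.0.1:8786 --nprocs 4 --nthreads 2 --memory-limit 4GB
# then hand the scheduler address to the executor.
executor = DaskExecutor(address="tcp://127.0.0.1:8786")
flow.run(executor=executor)
```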

Newskooler

10/29/2020, 4:03 PM
Okay, thank you very much. I will follow through. The only additional question I have is: when you say "poorly structured code", what do you mean exactly? Is it the flow structure that can potentially be optimized, or the code inside the tasks? If the latter, that code doesn't even run until 60 min after the start 🤔

nicholas

10/29/2020, 4:14 PM
Either of those could be an issue (flow structure or intra-task code) and could be optimized for performance, but like I mentioned, I don't think that's your problem at the moment, since your tasks aren't starting in the first place.
👍 1
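As one hypothetical illustration of a flow-structure optimization: mapping over lightweight references (e.g. file paths) instead of large in-memory objects reduces how much data is serialized and shipped between Dask workers (the task names and paths below are made up):

```python
from prefect import Flow, task

@task
def list_paths():
    # return small strings to map over, not the data itself
    return [f"s3://my-bucket/data/{i}.parquet" for i in range(15_000)]

@task
def handle(path):
    # load, check, and save inside the task, so only the path
    # crosses the wire between the scheduler and workers
    ...

with Flow("reference-mapping") as flow:
    handle.map(list_paths())
```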

Newskooler

10/29/2020, 4:30 PM
Okay, thank you @nicholas
Hi @nicholas, I did try with a huge server and got the same results, so I guess it's some setting somewhere.
Can you please give me some guidance on where / what to look for to address this?

nicholas

11/04/2020, 2:51 PM
Hi @Newskooler, did you take a look at the links I posted above for configuring your Dask Cluster? This one and this one.

Newskooler

11/05/2020, 9:49 AM
I have not. I will check them first and then come back to this if necessary. Thanks!
👍 1