
Sylvain Hazard

10/18/2021, 9:20 AM
Hello there! I'd like to know about any limitation on how large a mapped task can be. This doesn't have to be a fixed limit; an order of magnitude would do. Given an input list of size, let's say, 1 million, would Prefect Server be able to map a task correctly? I am running said flow on a LocalDaskExecutor, which I believe is relevant since it goes for depth-first execution. Thanks a bunch!
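[For context, a minimal sketch of the kind of flow being discussed, using the Prefect 1.x API of the time; the flow name, task body, and input list are illustrative, not from the thread:]

```python
from prefect import task, Flow
from prefect.executors import LocalDaskExecutor

@task
def add_one(x):
    return x + 1

# map a trivial task over a large list; each element becomes a child task run
with Flow("big-map", executor=LocalDaskExecutor()) as flow:
    results = add_one.map(list(range(1_000_000)))

if __name__ == "__main__":
    flow.run()
```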

Anna Geller

10/18/2021, 9:51 AM
Hi @Sylvain Hazard, I'd say it depends to some extent on your database and infrastructure setup (e.g. a DaskExecutor on distributed Dask can handle way more mapped tasks than a non-distributed LocalDaskExecutor). I will ask the team if they can tell you something more concrete and get back to you. But you would probably get the most realistic number if you ran a test yourself on your DEV infrastructure.
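[A sketch of the distributed variant Anna contrasts with the LocalDaskExecutor, again in the Prefect 1.x API; the scheduler address is a placeholder, not a real endpoint:]

```python
from prefect import task, Flow
from prefect.executors import DaskExecutor

@task
def add_one(x):
    return x + 1

# same mapped flow, but child tasks run on an existing Dask distributed
# cluster instead of local threads/processes; address is hypothetical
with Flow("big-map-distributed",
          executor=DaskExecutor(address="tcp://dask-scheduler:8786")) as flow:
    results = add_one.map(list(range(1_000_000)))
```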

Sylvain Hazard

10/18/2021, 9:53 AM
Seems like the answer I was looking for: it entirely depends on the execution cluster, and there is no limit imposed by the Prefect Core API, is that right? Given unlimited scaling capabilities on a Dask cluster, I should be able to run an infinitely large map?

Anna Geller

10/18/2021, 9:55 AM
Correct, that's what I would expect. However, Prefect needs to track the state of all those mapped tasks, which means it needs to register those child tasks in the database. So the database itself could be more of a limit than Dask. I will ask the team, because I'm giving you a vague answer.

Sylvain Hazard

10/18/2021, 9:59 AM
The database is limiting only in terms of storage capacity and registration time though, right? I'd expect to only be in trouble if I registered a number of tasks that would fill my storage.

Anna Geller

10/18/2021, 10:00 AM
Correct. AFAIK there is some queueing mechanism to batch-register child tasks, but to be honest I don't know (yet) how it works exactly.

Sylvain Hazard

10/18/2021, 10:00 AM
Don't worry, my question was quite vague too 😅 It was mostly a question to quench my curiosity, plus I vaguely remember a mention of this in the DFE article on Medium.
๐Ÿ‘ 1

Anna Geller

10/18/2021, 10:03 AM
Your question was well-justified. Distributed computing is hard 🙂

Sylvain Hazard

10/18/2021, 10:03 AM
Definitely is, that's why I am starting easy with a LocalDaskExecutor on a single node 😄
upvote 1

Kevin Kho

10/18/2021, 1:55 PM
Yeah, this is mainly about the limits of your hardware and what the mapped task is doing. This was not on Server, but I tested mapping over a list of 400,000 items on Prefect Cloud with a Local Agent on my laptop. The task barely did anything, just returned `x + 1`, and it worked. From a registration standpoint, I think batching kicks in at 10,000 tasks (but a mapped task counts as 1 for registration). The runtime will generate the individual child tasks though, and yes, this can fill your storage. The thing to note is that a lot of users do this unnecessarily. If you have a very high mapped task count, make sure you actually need observability and retries for each of those individual elements. Otherwise, you might be able to do the operation with a DataFrame.
upvote 2
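[A sketch of the DataFrame alternative Kevin alludes to: one vectorized task over the whole dataset instead of one mapped child per element. The column name, flow name, and inline data are hypothetical:]

```python
import pandas as pd
from prefect import task, Flow

@task
def add_one_bulk(df: pd.DataFrame) -> pd.DataFrame:
    # a single vectorized operation over the whole frame: no per-element
    # retries or observability, but also no 400,000 task runs to register
    # and track in the database
    df["value"] = df["value"] + 1
    return df

with Flow("bulk-alternative") as flow:
    result = add_one_bulk(pd.DataFrame({"value": range(400_000)}))

if __name__ == "__main__":
    flow.run()
```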