I'm using Prefect 2 to build a data pipeline and have a couple of general design questions:
• I assume we should hold data in S3 and pass it between flows by reference – i.e. rather than as parameters
• If so, should I use S3 blocks to store intermediate results, or interact with S3 directly? What are the trade-offs?
• I'm planning to use Task.map to parallelise work (with dask); there's no equivalent for flows, so I guess parallelisation only ever happens within a flow, is that right?
08/15/2022, 5:38 PM
#1 It depends on the size of your data and your preference -- with Prefect, you can just pass data between tasks as long as your execution environment doesn't throw OOM errors. This is in contrast to many other tools that don't support passing data between tasks at all
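Independent of Prefect, the two options in #1 look roughly like this. A plain-Python sketch; `put_object`/`get_object` are hypothetical helpers standing in for an S3 client, and the in-memory dict stands in for the bucket:

```python
import json

# Stand-in for an object store such as S3; in a real pipeline these
# helpers would wrap boto3 put_object/get_object calls.
_STORE: dict = {}

def put_object(key: str, data: bytes) -> str:
    _STORE[key] = data
    return key  # the reference is all that gets passed around

def get_object(key: str) -> bytes:
    return _STORE[key]

# Option A: pass data by value (fine for small payloads)
def extract() -> list:
    return [1, 2, 3]

def transform(rows: list) -> list:
    return [r * 2 for r in rows]

# Option B: pass data by reference (large payloads; only the key moves
# between flows, the bytes stay in the object store)
def extract_by_ref() -> str:
    return put_object("raw/rows.json", json.dumps([1, 2, 3]).encode())

def transform_by_ref(key: str) -> str:
    rows = json.loads(get_object(key))
    return put_object("clean/rows.json", json.dumps([r * 2 for r in rows]).encode())

print(transform(extract()))             # [2, 4, 6]
ref = transform_by_ref(extract_by_ref())
print(json.loads(get_object(ref)))      # [2, 4, 6]
```

The point of option B is that the flow-to-flow interface is just a string key, so payload size never hits the orchestrator.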
#2 S3 blocks work and can actually give you more observability later on (a feature we are working on). But if that doesn't fit your workflow, e.g. you already rely on specific boto3/awswrangler functionality to persist data, then you can go with those directly -- up to you
#3 There is mapping in 2.0, so you can totally use that to process data in parallel, e.g. with the ConcurrentTaskRunner
08/15/2022, 6:06 PM
This is the same use case I wanted custom result types for (specifically spark partitioned datasets far too big to pass in memory)
08/15/2022, 8:03 PM
I should have clarified, sorry
parallelism happens with a task runner, so it only works for tasks
so @James Brady you are 100% correct that parallelism should be handled within a flow, and you can attach the same task runner type (dask, ray, concurrent) to multiple subflows when needed