I'm using Prefect 2 to build a data pipeline and have a couple of general design questions:
• I assume we should hold data in S3 and pass it between flows by reference – i.e. rather than as parameters
• If so, should I use S3 blocks to store intermediate results, or interact with S3 directly? What are the trade-offs?
• I'm planning to use Task.map to parallelise work (with dask); there's no equivalent for flows, so I guess parallelisation only ever happens within a flow, is that right?
08/15/2022, 5:38 PM
#1 It depends on the size of your data and your preference -- with Prefect, you can just pass data between tasks as long as your execution environment doesn't throw OOM errors. This is in contrast to many other tools that don't support passing data between tasks at all
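Independent of Prefect, the two options in #1 look roughly like this. A plain-Python sketch; `put_object`/`get_object` are hypothetical helpers standing in for an S3 client, and the in-memory dict stands in for the bucket:

```python
import json

# Stand-in for an object store such as S3; in a real pipeline these
# helpers would wrap boto3 put_object/get_object calls.
_STORE: dict = {}

def put_object(key: str, data: bytes) -> str:
    _STORE[key] = data
    return key  # the reference is all that gets passed around

def get_object(key: str) -> bytes:
    return _STORE[key]

# Option A: pass data by value (fine for small payloads)
def extract() -> list:
    return [1, 2, 3]

def transform(rows: list) -> list:
    return [r * 2 for r in rows]

# Option B: pass data by reference (large payloads; only the key moves
# between flows, the bytes stay in the object store)
def extract_by_ref() -> str:
    return put_object("raw/rows.json", json.dumps([1, 2, 3]).encode())

def transform_by_ref(key: str) -> str:
    rows = json.loads(get_object(key))
    return put_object("clean/rows.json", json.dumps([r * 2 for r in rows]).encode())

print(transform(extract()))             # [2, 4, 6]
ref = transform_by_ref(extract_by_ref())
print(json.loads(get_object(ref)))      # [2, 4, 6]
```

The point of option B is that the flow-to-flow interface is just a string key, so payload size never hits the orchestrator.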
#2 S3 blocks work and can actually give you more observability later on (a feature we are working on). But if that doesn't fit your workflow, e.g. you already rely on specific boto3/awswrangler functionality to persist data, then you can go with those directly -- up to you
#3 There is mapping in 2.0, so you can totally use that to process data in parallel, e.g. with the ConcurrentTaskRunner
08/15/2022, 6:06 PM
This is the same use case I wanted custom result types for (specifically spark partitioned datasets far too big to pass in memory)
08/15/2022, 8:03 PM
I should have clarified, sorry
parallelism happens with a task runner, so it only works for tasks
so @James Brady you are 100% correct that parallelism should be handled within a flow, and you can attach the same task runner type (dask, ray, concurrent) to multiple subflows when needed