# ask-community
Kha Nguyen
I mainly use Prefect for forecasting. An important feature that I am looking for is passing data (a large pandas DataFrame, about 1.5GB as CSV, or a NumPy array) between tasks. Does Prefect already support serialising large data between tasks? I saw something called PandasSerializer, but I'm not sure what it is for.
For reference, Metaflow and Flyte serialise pandas DataFrames to Parquet and store the file in S3, then load that file again in the next task.
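(For illustration, a minimal sketch of that Parquet-in-S3 handoff using plain pandas; the bucket/key and the s3fs/pyarrow dependencies are assumptions, not something from this thread.)

```python
import pandas as pd

# Hypothetical location for the intermediate data; assumes s3fs and pyarrow are installed.
S3_PATH = "s3://my-forecasting-bucket/intermediate/features.parquet"

def upstream_task() -> str:
    df = pd.DataFrame({"y": range(1_000_000)})  # stand-in for the ~1.5GB frame
    df.to_parquet(S3_PATH, index=False)          # pandas writes to S3 via s3fs
    return S3_PATH                               # pass only the lightweight path downstream

def downstream_task(path: str) -> pd.DataFrame:
    return pd.read_parquet(path)                 # reload the data in the next task
```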
Kevin
Hey @Kha Nguyen, for temporary resources, check this link. You can use Prefect to spin up the hardware and use that for the jobs. This way, the memory management is pushed to Dask; Prefect does not handle garbage collection well at the moment. We do have users that write out to S3 and load it back in for downstream tasks. Some users code it on their own with boto3, and some use our Results API, which has an S3Result that uses boto3 underneath. PandasSerializer is used in conjunction with the Results API to serialize before saving.
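(A rough sketch of the Results API approach described above, using Prefect 1.x-style imports; the bucket name, flow name, and toy tasks are placeholders.)

```python
import pandas as pd
from prefect import Flow, task
from prefect.engine.results import S3Result
from prefect.engine.serializers import PandasSerializer

# Persist each task's return value to S3 as Parquet instead of the default pickle.
s3_result = S3Result(
    bucket="my-forecasting-bucket",  # hypothetical bucket name
    serializer=PandasSerializer(file_type="parquet"),
)

@task(result=s3_result, checkpoint=True)
def extract() -> pd.DataFrame:
    return pd.DataFrame({"y": range(1_000_000)})  # stand-in for the real dataset

@task(result=s3_result, checkpoint=True)
def transform(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(y2=df["y"] * 2)

with Flow("forecasting-example") as flow:
    transform(extract())
```

Note that results are only written when checkpointing is enabled; it is on by default when running against a Prefect backend, or can be turned on locally with `PREFECT__FLOWS__CHECKPOINTING=true`.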
On the ECSAgent, I think it depends. Yes, you might be able to get it working with the LocalAgent, but the ECSAgent exposes things like the task definition YAML that might make it easier to interact with Fargate. Why are you thinking of moving away from the ECS agent?
Kha Nguyen
Hi Kevin, I wasn't planning to move away from the ECS agent; I'm just exploring all the options. The Dask cluster example provided only creates a local Dask cluster. I thought that if I could provision a real distributed Dask cluster (with many EC2 instances) on demand (as Fargate tasks), I could still push the workload to the Dask cluster instead of using an ECS cluster directly.
But it seems creating and destroying a Dask cluster is not as simple as I thought.
I think it is doable to provision a Dask cluster on demand: we can simply start a few ECS tasks as Dask workers and collect their IP addresses for the main flow run (which can run on another ECS task in the same cluster).
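(A minimal sketch of that idea, assuming a Prefect 1.x-style DaskExecutor pointed at dask-cloudprovider's FargateCluster; the image name, worker count, and flow contents are assumptions. With this setup, dask-cloudprovider provisions and tears down the Fargate-backed workers for the flow run and handles worker discovery, so there is no need to collect IP addresses by hand.)

```python
from prefect import Flow, task
from prefect.executors import DaskExecutor

# Spin up a Fargate-backed Dask cluster for the duration of the flow run, then tear it down.
executor = DaskExecutor(
    cluster_class="dask_cloudprovider.aws.FargateCluster",
    cluster_kwargs={
        "image": "prefecthq/prefect:latest",  # hypothetical image with your dependencies baked in
        "n_workers": 4,
    },
)

@task
def say_hello():
    print("hello from a Fargate-backed Dask worker")

with Flow("forecasting-example", executor=executor) as flow:
    say_hello()
```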