Hi everyone. I see lots of examples where people use pandas / SQLAlchemy to do an extract/load operation. What I haven't seen are examples of how to handle datasets larger than memory for these operations. Do you advocate running PySpark or Dask clusters, or is there a mechanism to do something with ACI / ECS Fargate so that a just-in-time, just-big-enough worker can be launched?
Anna Geller
11/17/2022, 1:15 PM
hard to give recommendations without knowing the source format of the data and the destination
when you want to use pandas, loading in chunks might be a good option
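for example, a minimal sketch of a chunked extract/load with pandas + SQLAlchemy — the connection strings, table names, and chunk size here are hypothetical, so adjust them to your actual source and destination:
```python
import pandas as pd
from sqlalchemy import create_engine

# hypothetical source and destination connections
source = create_engine("postgresql://user:pass@source-host/db")
destination = create_engine("postgresql://user:pass@dest-host/db")

# read_sql with chunksize returns an iterator of DataFrames,
# so only one chunk is held in memory at a time instead of the full result set
chunks = pd.read_sql("SELECT * FROM big_table", source, chunksize=50_000)

for chunk in chunks:
    # append each chunk to the destination table as it arrives
    chunk.to_sql("big_table_copy", destination, if_exists="append", index=False)
```
this keeps memory bounded by the chunk size rather than the table size, at the cost of more round trips — tune chunksize to your row width and available memory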