Hi everyone. I see lots of examples where people use pandas / SQLAlchemy to do an extract/load operation. What I haven't seen are examples of how to handle datasets larger than memory for these operations. Do you advocate running PySpark or Dask clusters, or is there a mechanism to do something with ACI / ECS Fargate so that a just-in-time, just-big-enough worker can be launched?
Anna Geller
11/17/2022, 1:15 PM
hard to give recommendations without knowing the source format of the data and the destination
when you want to use pandas, loading in chunks might be a good option
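for example, a minimal sketch of a chunked extract/load with pandas + SQLAlchemy — the connection strings, table names, and chunk size here are hypothetical, so adjust them to your actual source and destination:
```python
import pandas as pd
from sqlalchemy import create_engine

# hypothetical source and destination connections
source = create_engine("postgresql://user:pass@source-host/db")
destination = create_engine("postgresql://user:pass@dest-host/db")

# read_sql with chunksize returns an iterator of DataFrames,
# so only one chunk is held in memory at a time instead of the full result set
chunks = pd.read_sql("SELECT * FROM big_table", source, chunksize=50_000)

for chunk in chunks:
    # append each chunk to the destination table as it arrives
    chunk.to_sql("big_table_copy", destination, if_exists="append", index=False)
```
this keeps memory bounded by the chunk size rather than the table size, at the cost of more round trips — tune chunksize to your row width and available memory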