Tom Forbes
06/29/2021, 1:23 PM
import dask.dataframe as dd

dataframe = dd.read_parquet("s3://foo/bar")
dataframe = dataframe.apply(do_something_expensive, axis="columns")
dataframe.to_parquet("s3://foo/bar2")

The question is: do Prefect mapping tasks help here? As I understand it, a reduce task in Prefect cannot be run on each partition, so it needs the full set of inputs passed to it. If those inputs don't fit in memory, will it fail? I must be misunderstanding something, because wouldn't this limitation be quite… limiting?
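One common way to avoid a Prefect-level reduce entirely is to keep the partition-level work inside Dask and wrap the whole pipeline in a single Prefect task, so the full dataset never has to be gathered into one task's memory. A minimal sketch, assuming Prefect 1.x; the flow name and the do_something_expensive placeholder are illustrative, not from the thread:

import dask.dataframe as dd
from prefect import Flow, task

def do_something_expensive(row):
    # placeholder for the expensive per-row transformation
    return row

@task
def process_parquet(src, dest):
    # Dask reads, transforms, and writes partition by partition;
    # the full dataset is never collected into a single pandas DataFrame.
    df = dd.read_parquet(src)
    df = df.apply(do_something_expensive, axis="columns")  # dask may ask for an explicit meta= here
    df.to_parquet(dest)  # computes and writes immediately by default

with Flow("parquet-transform") as flow:
    process_parquet("s3://foo/bar", "s3://foo/bar2")

# flow.run()  # run locally, or register the flow with a Prefect backend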
Kevin Kho

Spencer
06/29/2021, 1:26 PM

Tom Forbes
06/29/2021, 1:27 PM

Tom Forbes
06/29/2021, 1:27 PM

Tom Forbes
06/29/2021, 1:28 PM
outputs = mapping_task.map(inputs=inputs)
save_to_s3_task.reduce(outputs, size=100)

to invoke save_to_s3 with 100 elements at a time, as mapping_task is completed. Or somesuch.
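The .reduce(outputs, size=100) call above is a hypothetical API (as the "or somesuch" suggests). In Prefect 1.x, one approximation is to reduce into a task that chunks the mapped results and then map the save task over those chunks. A rough sketch; transform, make_batches, and save_to_s3 are illustrative stand-ins for the tasks in the message:

from prefect import Flow, task

@task
def transform(item):
    # stand-in for mapping_task
    return item

@task
def make_batches(results, size=100):
    # gather all mapped results, then split them into chunks of `size`
    return [results[i:i + size] for i in range(0, len(results), size)]

@task
def save_to_s3(batch):
    # stand-in for save_to_s3_task: persist one chunk per mapped run
    pass

with Flow("batched-save") as flow:
    outputs = transform.map(list(range(1000)))
    batches = make_batches(outputs)
    save_to_s3.map(batches)

Note that make_batches still receives every mapped result at once (the limitation raised above); this only batches the writes, it does not start saving as individual mapped tasks complete.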
Kevin Kho

[[1,2],[3,4],[5,6]]. Did this not work for your use case?
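Kevin's suggestion appears to be pre-batching the inputs and mapping over the batches, so each mapped task handles one small chunk rather than a single reduce task receiving everything. A minimal sketch, assuming Prefect 1.x; process_batch is an illustrative placeholder:

from prefect import Flow, task

@task
def process_batch(batch):
    # each mapped run receives one batch, e.g. [1, 2]
    return [x * 2 for x in batch]

with Flow("map-over-batches") as flow:
    batched_inputs = [[1, 2], [3, 4], [5, 6]]  # the batched shape from Kevin's message
    process_batch.map(batched_inputs)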
Tom Forbes
06/29/2021, 1:33 PM

Kevin Kho

Kevin Kho

Tom Forbes
06/29/2021, 1:55 PM

Tom Forbes
06/29/2021, 1:57 PM