matta
01/27/2021, 10:03 PM

james.lamb
01/27/2021, 10:17 PM
.read_csv() supports an argument blocksize, which I think you can use for the purpose you're asking about: https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv. If it isn't straightforward to calculate a good blocksize, Dask DataFrame also supports a .repartition(). If you read that whole CSV into a Dask DataFrame, you could then repartition it into the sizes you want: https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.Series.repartition
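A rough sketch of what that could look like; the file name, the 64MB blocksize, and the partition count below are placeholders rather than values from this thread:

```python
import dask.dataframe as dd

# Each ~64MB (uncompressed) block of the CSV becomes one partition.
# "data.csv" and the 64MB blocksize are placeholder values.
ddf = dd.read_csv("data.csv", blocksize="64MB")
print(ddf.npartitions)

# If a good blocksize is hard to pick up front, repartition afterwards
# into however many pieces you want (20 here is arbitrary).
ddf = ddf.repartition(npartitions=20)
```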
matta
01/27/2021, 10:20 PM
Can I use to_csv to make sure that each chunk (after compression with gzip or bz2 or whatever) is, say, under 100mb?

james.lamb
01/27/2021, 10:38 PM
to_csv() will create one file per partition (https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.to_csv). So you could get fairly close. You could do an experiment where you write to some files with .to_csv() and your preferred compression type, figure out the compression ratio, and then use that to get a rough estimate of your partition size to use with .repartition().
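A sketch of that experiment, assuming a 64MB blocksize, gzip compression, and a 100MB target; note that repartition(partition_size=...) sizes partitions by in-memory bytes rather than compressed bytes on disk, so this only gives a rough estimate:

```python
import os
import dask.dataframe as dd

os.makedirs("chunks", exist_ok=True)
ddf = dd.read_csv("data.csv", blocksize="64MB")

# Write one gzip-compressed file per partition; '*' in the name
# is replaced by the partition number.
written = ddf.to_csv("chunks/part-*.csv.gz", compression="gzip")

# Average compressed size per ~64MB uncompressed block gives a rough
# compression ratio (compressed MB per uncompressed MB).
compressed_mb = [os.path.getsize(path) / 1e6 for path in written]
ratio = (sum(compressed_mb) / len(compressed_mb)) / 64
print(f"roughly {ratio:.2f} compressed MB per uncompressed MB")

# Pick an uncompressed partition size that should land under ~100MB after
# compression, then repartition and rewrite. Treat this as an estimate,
# not a guarantee.
os.makedirs("chunks_final", exist_ok=True)
target = f"{int(100 / ratio)}MB"
ddf.repartition(partition_size=target).to_csv(
    "chunks_final/part-*.csv.gz", compression="gzip"
)
```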
matta
01/27/2021, 10:40 PM