matta
01/27/2021, 10:03 PM

james.lamb
01/27/2021, 10:17 PM
.read_csv() supports an argument blocksize, which I think you can use for the purpose you're asking about: https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv
If it isn't straightforward to calculate a good blocksize, Dask DataFrame also supports .repartition(). If you read that whole CSV into a Dask DataFrame, you could then repartition it into the sizes you want: https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.Series.repartition
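A minimal sketch of the two options described above; the file path and the 100 MB target are assumptions:

```python
import dask.dataframe as dd

# Option 1: control partition size at read time. blocksize is the number of bytes
# of CSV text per partition and accepts strings like "100MB".
ddf = dd.read_csv("big_input.csv", blocksize="100MB")

# Option 2: read with the default blocksize, then repartition afterwards
# (partition_size is the max in-memory bytes per partition; npartitions also works).
ddf = dd.read_csv("big_input.csv")
ddf = ddf.repartition(partition_size="100MB")
```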
matta
01/27/2021, 10:20 PM
Is there a way to use to_csv to make sure that each chunk (after compression with gzip or bz2 or whatever) is, say, under 100mb?
james.lamb
01/27/2021, 10:38 PM
to_csv() will create one file per partition (https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.to_csv).
So you could get fairly close. You could do an experiment where you write to some files with .to_csv() and your preferred compression type, figure out the compression ratio, and then use that to get a rough estimate of your partition size to use with .repartition().
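A rough sketch of that experiment; the paths, gzip, the 5-partition sample, and the 100 MB target are all assumptions:

```python
import glob
import os

import dask.dataframe as dd

ddf = dd.read_csv("big_input.csv")

# 1. Write a small sample of partitions with the preferred compression.
sample = ddf.partitions[:5]
sample.to_csv("sample_out/part-*.csv.gz", compression="gzip")

# 2. Estimate the compression ratio: compressed bytes on disk vs. in-memory bytes
#    (repartition's partition_size also targets in-memory bytes, so the units match).
compressed_bytes = sum(os.path.getsize(p) for p in glob.glob("sample_out/part-*.csv.gz"))
uncompressed_bytes = int(sample.memory_usage(deep=True).sum().compute())
ratio = compressed_bytes / uncompressed_bytes

# 3. Pick a partition size so each compressed output file lands under ~100 MB.
target_compressed = 100 * 1024 ** 2
ddf = ddf.repartition(partition_size=int(target_compressed / ratio))

# 4. Write the full dataset; to_csv creates one file per partition.
ddf.to_csv("final_out/part-*.csv.gz", compression="gzip")
```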
matta
01/27/2021, 10:40 PM