# ask-community
m
Is there a particular "Dask-ic" way to split a big CSV and make sure the compressed chunks are a specific size?
j
Dask DataFrame's `.read_csv()` supports a `blocksize` argument, which I think you can use for the purpose you're asking about: https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv. If it isn't straightforward to calculate a good `blocksize`, Dask DataFrame also supports `.repartition()`. If you read that whole CSV into a Dask DataFrame, you could then repartition it into the sizes you want: https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.Series.repartition
m
Thanks! I'm looking to make sure that the chunks are a specific size after I save and compress them, though. So, like, some arguments to give `to_csv` to make sure that each chunk (after compression with gzip or bz2 or whatever) is, say, under 100 MB.
j
ah I see. I don't think Dask DataFrame exactly allows you to do that (especially peeking ahead at the after-compression size), but `to_csv()` will create one file per partition (https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.to_csv), so you could get fairly close. You could do an experiment where you write some files with `.to_csv()` and your preferred compression type, figure out the compression ratio, and then use that to get a rough estimate of the partition size to use with `.repartition()`.
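As a rough sketch of that experiment (untested; the paths and the 100 MB target are placeholders, and the ratio from one sample partition is only an estimate):
```python
import os
import dask.dataframe as dd

df = dd.read_csv("big.csv", blocksize="100MB")

# Write the first partition both uncompressed and gzip-compressed
# to measure the compression ratio.
sample = df.partitions[0]
sample.to_csv("sample-*.csv")                         # writes sample-0.csv
sample.to_csv("sample-*.csv.gz", compression="gzip")  # writes sample-0.csv.gz
ratio = os.path.getsize("sample-0.csv.gz") / os.path.getsize("sample-0.csv")

# Aim for ~100 MB per compressed file: each partition can hold
# roughly 100 MB / ratio of uncompressed CSV text.
target_uncompressed = int(100e6 / ratio)

# Re-read with that blocksize (or use .repartition(partition_size=...),
# keeping in mind partition_size counts in-memory bytes, so it's rougher).
df = dd.read_csv("big.csv", blocksize=target_uncompressed)
df.to_csv("chunks-*.csv.gz", compression="gzip")
```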
m
Cool, thanks!