# ask-community
m
Is there a particular "Dask-ic" way to split a big CSV and make sure the compressed chunks are a specific size?
j
Dask DataFrame's `.read_csv()` supports a `blocksize` argument, which I think you can use for the purpose you're asking about: https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv. If it isn't straightforward to calculate a good `blocksize`, Dask DataFrame also supports `.repartition()`. If you read that whole CSV into a Dask DataFrame, you could then repartition it into the sizes you want: https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.Series.repartition
m
Thanks! I'm looking to make sure that the chunks are a specific size after I save and compress them, though. So, like, some arguments to give `to_csv` to make sure that each chunk (after compression with gzip or bz2 or whatever) is, say, under 100 MB.
j
ah I see. I don't think Dask DataFrame exactly allows you to do that (especially peeking ahead at the after-compression size), but `to_csv()` will create one file per partition (https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.to_csv), so you could get fairly close. You could do an experiment where you write some files with `.to_csv()` and your preferred compression type, figure out the compression ratio, and then use that to get a rough estimate of the partition size to use with `.repartition()`.
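As a rough sketch of that experiment (untested; the paths and the 100 MB target are placeholders, and the ratio from one sample partition is only an estimate):
```python
import os
import dask.dataframe as dd

df = dd.read_csv("big.csv", blocksize="100MB")

# Write the first partition both uncompressed and gzip-compressed
# to measure the compression ratio.
sample = df.partitions[0]
sample.to_csv("sample-*.csv")                         # writes sample-0.csv
sample.to_csv("sample-*.csv.gz", compression="gzip")  # writes sample-0.csv.gz
ratio = os.path.getsize("sample-0.csv.gz") / os.path.getsize("sample-0.csv")

# Aim for ~100 MB per compressed file: each partition can hold
# roughly 100 MB / ratio of uncompressed CSV text.
target_uncompressed = int(100e6 / ratio)

# Re-read with that blocksize (or use .repartition(partition_size=...),
# keeping in mind partition_size counts in-memory bytes, so it's rougher).
df = dd.read_csv("big.csv", blocksize=target_uncompressed)
df.to_csv("chunks-*.csv.gz", compression="gzip")
```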
m
Cool, thanks!