
Sean Talia

02/16/2022, 4:12 PM
Is there a good heuristic for understanding when to actually expect a performance improvement from converting a task that, say, operates over a list in sequence into a mapped task? I have a task that loops over hundreds of CSV files, parses and transforms the data, and writes the parsed data to an output file (one output file for each input file). I was anticipating that converting this process to a mapped task and increasing my CPU count (I'm running this on Fargate via ECS, so I upped from 0.5 to 2 vCPUs in this specific case) would help a lot by parallelizing the file-parsing procedure, but it has actually caused that node in my DAG to take much longer than it used to. I'm wondering whether the overhead of the mapped task getting compiled and sent to the LocalDaskExecutor just significantly outweighs the gains from parallelization, because each mapped child is so small (the files being parsed are not large), and whether this would only make sense to do if the files were much, much larger?
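[For context, a minimal sketch of the sequential-loop-to-mapped-task conversion being described, against the Prefect 1.x API; the task names, paths, and parsing logic are placeholders, not the actual pipeline:]
```python
from prefect import Flow, task
from prefect.executors import LocalDaskExecutor


@task
def list_input_files():
    # placeholder: in the real flow this would list the CSV files to process
    return ["input/a.csv", "input/b.csv"]


@task
def parse_and_write(path):
    # placeholder: parse one CSV and write one output file per input file
    with open(path) as src:
        rows = [line.rstrip("\n").split(",") for line in src]
    with open(path.replace("input/", "output/"), "w") as dst:
        for row in rows:
            dst.write(",".join(row) + "\n")


with Flow("parse-csvs") as flow:
    files = list_input_files()
    parse_and_write.map(files)  # one mapped child task per file

flow.executor = LocalDaskExecutor()  # see the num_workers discussion below
```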

Kevin Kho

02/16/2022, 4:16 PM
I think Dask often does not infer CPU count correctly. Did you pass the number of workers to the LocalDaskExecutor? That might help

Sean Talia

02/16/2022, 4:23 PM
ah, I didn't explicitly pass the number of workers because of the documentation, where it says:
By default the number of threads or processes used is equal to the number of cores available.
so if Fargate is giving it access to 2 vCPUs, my assumption was that it would just infer that num_workers=2 is the appropriate thing to do

Kevin Kho

02/16/2022, 4:26 PM
Ehhh ideally, but I have seen it not work right for many people. We use a Dask utility under the hood for that 🤷. I just recommend explicitly setting the number of workers.
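[For reference, the explicit setting being suggested looks roughly like this; the scheduler choice is an assumption (pure-Python CSV parsing holds the GIL, so processes usually parallelize it better than threads):]
```python
from prefect import Flow
from prefect.executors import LocalDaskExecutor

with Flow("parse-csvs") as flow:
    ...  # same mapped tasks as in the sketch above

# Tell Dask explicitly how many workers to use instead of letting it infer
# the count from the container's reported CPUs.
flow.executor = LocalDaskExecutor(scheduler="processes", num_workers=2)
```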

Sean Talia

02/16/2022, 4:51 PM
okay, yeah, it seems like that's helping a lot. It's still slower than doing it sequentially, but I'm wondering if that's just because the files I'm working with are quite small; in my production workload they'd be quite a bit bigger

Kevin Kho

02/16/2022, 5:32 PM
From experience, there is just an overhead with reading/writing CSV files. Do you have something like year/month/day.csv and you need to read these one at a time?
Because I would just use DuckDB for that; it's so much faster than pandas, if you are using pandas.
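[For context, pushing the CSV parsing into DuckDB's Python client looks roughly like this; the path is made up, and read_csv_auto is DuckDB's schema-inferring CSV reader, so the parsing runs in its C++ engine rather than in a Python loop:]
```python
import duckdb

con = duckdb.connect()  # in-memory database

# hypothetical path following a year/month/day layout
rows = con.execute(
    "SELECT * FROM read_csv_auto('data/2022/02/16.csv')"
).fetchall()
```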

Sean Talia

02/16/2022, 5:43 PM
oh interesting, no, this is a very legacy pipeline that I recently moved to Prefect; it just uses vanilla Python, not even doing anything with pandas
but yeah, that's exactly right, the CSVs are stored in S3 with /year/month/day/hour/ prefixes

Kevin Kho

02/16/2022, 6:26 PM
Parquet would be so much better if you can, but I think DuckDB will be super fast for this operation compared to vanilla Python (because it's written in C++) too.
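[A one-off CSV-to-Parquet rewrite along these lines could look like the sketch below; the paths are placeholders, and COPY ... TO ... (FORMAT PARQUET) is DuckDB's export syntax:]
```python
import duckdb

con = duckdb.connect()

# hypothetical paths: rewrite one day's CSV as Parquet so later reads
# skip CSV parsing entirely
con.execute("""
    COPY (SELECT * FROM read_csv_auto('data/2022/02/16.csv'))
    TO 'data/2022/02/16.parquet' (FORMAT PARQUET)
""")
```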

Sean Talia

02/17/2022, 3:12 PM
would using DuckDB for this mean that all of the transformation/manipulation of the CSV values that I was doing with Python would now need to be converted to SQL? Or is the Python client's API such that the code changes would actually be pretty minimal?

Kevin Kho

02/17/2022, 3:29 PM
You can load in DuckDB and materialize to pandas, I think.
Actually, I think I changed my mind on this: DuckDB seems slower by 2-3x for a lot of small files.
I benchmarked and used DuckDB here over the weekend if you are interested.
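[For reference, the "load in DuckDB and materialize to pandas" route mentioned above is roughly the following; the path is hypothetical, and .df() converts the query result into a pandas DataFrame so existing transforms could stay in Python:]
```python
import duckdb

# hypothetical path; .df() hands the query result back as a pandas DataFrame,
# so transformations can stay in Python instead of being rewritten in SQL
df = duckdb.query("SELECT * FROM read_csv_auto('data/2022/02/16.csv')").df()
```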