I’m experimenting with a preprocessing pipeline that uses a DaskExecutor with an ephemeral Dask cluster inside a k8s cluster.
I’m seeing this warning:
//anaconda3/envs/sandbox/lib/python3.8/site-packages/distributed/worker.py:3862: UserWarning: Large object of size 1.65 MiB detected in task graph:
{'task': <Task: load_image_from_bucket>, 'state': ... _parent': True}
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers
Is it a bad idea to pass, e.g., images around between Prefect tasks?
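(For context, the `client.scatter` pattern the warning suggests looks like this in plain Dask, outside of Prefect; `process` and the dummy payload here are purely illustrative:)

```python
from distributed import Client

def process(img):
    # stand-in for real image processing
    return len(img)

client = Client(processes=False)  # in-process cluster, for illustration only

# scatter ships the large object to a worker once and returns a Future;
# tasks then reference the Future instead of embedding megabytes of
# bytes directly in every task's graph
large_image = b"\x00" * (2 * 1024 * 1024)  # ~2 MiB dummy payload
[img_future] = client.scatter([large_image])

print(client.submit(process, img_future).result())  # 2097152
client.close()
```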
Kevin Kho
09/06/2021, 5:46 PM
I don’t think scatter necessarily helps you here. Dask just sees a lot of data moving back and forth repeatedly and is throwing the warning, but I’m not 100% sure.
Kevin Kho
09/06/2021, 5:51 PM
You can decrease the data transfer between workers by serializing your images with pickle, and then unpickling them on the worker side.
Fabio Grätz
09/06/2021, 8:31 PM
Ah, this is interesting, I thought the payload/data is pickled automatically when being communicated between tasks. Is this not correct?
Kevin Kho
09/06/2021, 8:40 PM
Oh, you’re right. It does use cloudpickle to submit the work. In that case, I suppose the only thing you can do to reduce transfer is to persist things somewhere, but that doesn’t feel efficient either.
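(One way to sketch the “persist things somewhere” idea in plain Python — a bucket or shared volume would replace the temp directory in practice, and all names here are made up: tasks hand each other a short path string instead of the raw image data.)

```python
import os
import pickle
import tempfile

def save_image(img, directory):
    # upstream task: persist the image once and return only its path,
    # so downstream tasks move a short string, not the pixel data
    path = os.path.join(directory, "image.pkl")
    with open(path, "wb") as f:
        pickle.dump(img, f)
    return path

def load_image(path):
    # downstream task: re-load the image from shared storage
    with open(path, "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as d:
    p = save_image([[0] * 256 for _ in range(256)], d)  # dummy "image"
    img = load_image(p)
    print(len(img), len(img[0]))  # 256 256
```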
I think I’m trying to do the same thing you are showing in this talk here 🙂 Has this code been published somewhere?
Kevin Kho
09/06/2021, 9:10 PM
Hahaha, that’s the first time I’ve seen someone watch this talk. Check this. It took a bunch of effort since I don’t do image work. Glad someone is asking about it.
Fabio Grätz
09/07/2021, 10:44 AM
Thanks, will look at it!
Kevin Kho
09/07/2021, 1:44 PM
Saw the issue you opened. Will look later.
Kevin Kho
09/07/2021, 5:28 PM
How many workers are you using for the DaskExecutor, btw? Maybe lowering that number will force depth-first execution?
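(For what it’s worth, in Prefect 1.x the worker count for an ephemeral cluster can be pinned via `cluster_kwargs` — a config sketch only, shown with a local ephemeral cluster; the same idea applies when a Kubernetes cluster class is passed in:)

```python
from prefect.executors import DaskExecutor

# Fewer workers means fewer branches of the task graph run in parallel,
# which can nudge the scheduler toward finishing one branch (depth-first)
# before fanning out into the next.
executor = DaskExecutor(
    cluster_kwargs={"n_workers": 2, "threads_per_worker": 1},
)
```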
Kevin Kho
09/07/2021, 10:24 PM
Btw, I don’t have an answer for you but here is the link for Dependent Flows in case you haven’t seen it yet.