I’m experimenting with a preprocessing pipeline that uses a DaskExecutor with an ephemeral Dask cluster inside a k8s cluster.
I’m seeing this warning:
//anaconda3/envs/sandbox/lib/python3.8/site-packages/distributed/worker.py:3862: UserWarning: Large object of size 1.65 MiB detected in task graph:
{'task': <Task: load_image_from_bucket>, 'state': ... _parent': True}
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers
Is it a bad idea to pass, e.g., images around between Prefect tasks?
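(For context, the `client.scatter` pattern the warning suggests looks like this in plain Dask, outside of Prefect; `process` and the dummy payload here are purely illustrative:)

```python
from distributed import Client

def process(img):
    # stand-in for real image processing
    return len(img)

client = Client(processes=False)  # in-process cluster, for illustration only

# scatter ships the large object to a worker once and returns a Future;
# tasks then reference the Future instead of embedding megabytes of
# bytes directly in every task's graph
large_image = b"\x00" * (2 * 1024 * 1024)  # ~2 MiB dummy payload
[img_future] = client.scatter([large_image])

print(client.submit(process, img_future).result())  # 2097152
client.close()
```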
Kevin Kho
09/06/2021, 5:46 PM
I don’t think scatter necessarily helps you here. Dask just sees a lot of data moving back and forth repeatedly and is throwing the warning, but I’m not 100% sure.
Kevin Kho
09/06/2021, 5:51 PM
You can decrease the data transfer between workers by serializing your images with pickle, and then unpickling them on the worker side.
Fabio Grätz
09/06/2021, 8:31 PM
Ah, this is interesting, I thought the payload/data is pickled automatically when being communicated between tasks. Is this not correct?
Kevin Kho
09/06/2021, 8:40 PM
Oh, you’re right. It does use cloudpickle to submit the work. In that case, I suppose the only thing you can do to reduce transfer is to persist things somewhere, but that doesn’t feel efficient either.
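(One way to sketch the “persist things somewhere” idea in plain Python — a bucket or shared volume would replace the temp directory in practice, and all names here are made up: tasks hand each other a short path string instead of the raw image data.)

```python
import os
import pickle
import tempfile

def save_image(img, directory):
    # upstream task: persist the image once and return only its path,
    # so downstream tasks move a short string, not the pixel data
    path = os.path.join(directory, "image.pkl")
    with open(path, "wb") as f:
        pickle.dump(img, f)
    return path

def load_image(path):
    # downstream task: re-load the image from shared storage
    with open(path, "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as d:
    p = save_image([[0] * 256 for _ in range(256)], d)  # dummy "image"
    img = load_image(p)
    print(len(img), len(img[0]))  # 256 256
```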
I think I’m trying to do the same thing you are showing in this talk here 🙂 Has this code been published somewhere?
Kevin Kho
09/06/2021, 9:10 PM
Hahaha, that’s the first time I’ve seen someone watch this talk. Check this. It took a bunch of effort since I don’t do image work. Glad someone is asking about it.
Fabio Grätz
09/07/2021, 10:44 AM
Thanks, will look at it!
Kevin Kho
09/07/2021, 1:44 PM
Saw the issue you opened. Will look later.
Kevin Kho
09/07/2021, 5:28 PM
How many workers are you using for the DaskExecutor, btw? Maybe lowering that number will force depth-first execution?
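(For what it’s worth, in Prefect 1.x the worker count for an ephemeral cluster can be pinned via `cluster_kwargs` — a config sketch only, shown with a local ephemeral cluster; the same idea applies when a Kubernetes cluster class is passed in:)

```python
from prefect.executors import DaskExecutor

# Fewer workers means fewer branches of the task graph run in parallel,
# which can nudge the scheduler toward finishing one branch (depth-first)
# before fanning out into the next.
executor = DaskExecutor(
    cluster_kwargs={"n_workers": 2, "threads_per_worker": 1},
)
```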
Kevin Kho
09/07/2021, 10:24 PM
Btw, I don’t have an answer for you but here is the link for Dependent Flows in case you haven’t seen it yet.