# ask-community
f
I’m experimenting with a preprocessing pipeline using a DaskExecutor with an ephemeral Dask cluster inside a k8s cluster. I’m seeing this warning:
```
//anaconda3/envs/sandbox/lib/python3.8/site-packages/distributed/worker.py:3862: UserWarning: Large object of size 1.65 MiB detected in task graph:
  {'task': <Task: load_image_from_bucket>, 'state':  ... _parent': True}
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers
```
Is it a bad idea to pass e.g. images around between Prefect tasks?
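For reference, a minimal sketch of what the warning’s `client.scatter` suggestion looks like in plain dask.distributed, with a hypothetical `large_image` array standing in for the bucket read. Prefect’s DaskExecutor submits tasks for you, so this only illustrates the mechanism the warning is pointing at:
```python
import numpy as np
from distributed import Client

client = Client()  # in this setup it would point at the ephemeral k8s cluster

# Hypothetical large payload standing in for an image loaded from a bucket.
large_image = np.zeros((1024, 1024, 3), dtype=np.uint8)

# scatter() ships the bytes to the workers once and hands back a small Future;
# passing the Future around keeps the payload out of the task graph itself.
[image_future] = client.scatter([large_image], broadcast=True)

mean_value = client.submit(lambda img: float(img.mean()), image_future).result()
print(mean_value)
```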
k
I don’t think scatter necessarily helps you here. Dask just sees a lot of data moving back and forth repeatedly and throws the warning, but I’m not 100% sure.
You can decrease the data transfer between workers by serializing your images with pickle, and then unpickling them on the worker side.
f
Ah, this is interesting, I thought the payload/data was pickled automatically when communicated between tasks. Is this not correct?
k
Oh you’re right. It does use cloudpickle to submit the work. In that case, I suppose the only thing you can do to reduce transfer is to persist things somewhere, but it doesn’t feel efficient either.
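A sketch of that “persist things somewhere” idea, assuming hypothetical tasks and a local /tmp path standing in for a bucket or shared volume: each task writes the image to storage all workers can reach and passes only the path downstream, so the task graph carries short strings instead of pixel data.
```python
import numpy as np
from prefect import task, Flow

@task
def load_image(source: str) -> str:
    image = np.zeros((1024, 1024, 3), dtype=np.uint8)  # stand-in for a bucket read
    out_path = "/tmp/raw_0001.npy"  # would be a bucket URI / shared volume on k8s
    np.save(out_path, image)
    return out_path  # only a short string travels through the Dask task graph

@task
def preprocess(path: str) -> str:
    image = np.load(path)
    out_path = path.replace("raw", "processed")
    np.save(out_path, image / 255.0)
    return out_path

with Flow("image-preprocessing") as flow:
    raw_path = load_image("images/0001.png")
    preprocess(raw_path)

# flow.run(executor=DaskExecutor())  # DaskExecutor from prefect.executors
```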
f

https://www.youtube.com/watch?v=SIvLhp7aLBw&ab_channel=Heartex

I think I’m trying to do the same as you are showing in this talk here 🙂 Has this code been published somewhere?
k
Hahaha that’s the first time I’ve seen someone watch this talk. Check this. This was a bunch of effort since I don’t do image work. Glad someone is asking about it.
f
Thanks, will look at it!
k
Saw the issue you opened. Will look later.
How many workers are you using for the DaskExecutor btw? Maybe lowering it will force depth-first execution?
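A sketch of pinning the worker count for an ephemeral cluster through DaskExecutor (Prefect 1.x), assuming the cluster is created via dask_kubernetes.KubeCluster; the kwargs here are placeholders, not the setup from this thread:
```python
from prefect.executors import DaskExecutor

# Cap the ephemeral cluster at a couple of workers; match the cluster_class
# string and kwargs to however the KubeCluster is actually configured.
executor = DaskExecutor(
    cluster_class="dask_kubernetes.KubeCluster",
    cluster_kwargs={"n_workers": 2},  # fewer workers -> less breadth-first fan-out
)

# flow.run(executor=executor)
```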
Btw, I don’t have an answer for you but here is the link for Dependent Flows in case you haven’t seen it yet.