Thread
#prefect-community
    Fabio Grätz
    1 year ago
    I’m experimenting with a preprocessing pipeline that uses a DaskExecutor with an ephemeral Dask cluster inside a k8s cluster. I’m seeing this warning:
    //anaconda3/envs/sandbox/lib/python3.8/site-packages/distributed/worker.py:3862: UserWarning: Large object of size 1.65 MiB detected in task graph:
      {'task': <Task: load_image_from_bucket>, 'state':  ... _parent': True}
    Consider scattering large objects ahead of time
    with client.scatter to reduce scheduler burden and
    keep data on workers
    Is it a bad idea to pass e.g. images around between Prefect tasks?
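    For context, here’s roughly what the flow looks like (simplified sketch; the numpy stand-in for the bucket download and the KubeCluster configuration are placeholders, not my actual code):

    import numpy as np
    from prefect import Flow, task
    from prefect.executors import DaskExecutor

    @task
    def load_image_from_bucket(path):
        # stand-in for the real bucket download; each image is ~1-2 MiB
        return np.zeros((1024, 1024, 3), dtype=np.uint8)

    @task
    def preprocess(image):
        # toy preprocessing step; the real one does more work
        return image.astype(np.float32) / 255.0

    with Flow("preprocessing") as flow:
        images = load_image_from_bucket.map(["a.png", "b.png"])
        preprocess.map(images)

    # ephemeral dask cluster spun up inside the k8s cluster for each flow run;
    # the KubeCluster worker pod spec is assumed to come from the dask config
    flow.executor = DaskExecutor(cluster_class="dask_kubernetes.KubeCluster")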
    Kevin Kho
    1 year ago
    I don’t think scatter necessarily helps you here. Dask just sees a lot of data moving back and forth repetitively and is throwing the warning, but I’m not 100% sure.
    You can decrease the data transfer between workers by serializing your images with pickle, and then unpickling them on the worker side.
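    Rough idea of what I mean (untested sketch; the task names and the numpy stand-in are made up):

    import pickle

    import numpy as np
    from prefect import task

    @task
    def load_image_as_bytes(path):
        image = np.zeros((1024, 1024, 3), dtype=np.uint8)  # stand-in for the bucket read
        # hand downstream tasks a plain bytes payload instead of the array itself
        return pickle.dumps(image, protocol=pickle.HIGHEST_PROTOCOL)

    @task
    def preprocess(image_bytes):
        image = pickle.loads(image_bytes)  # unpickle on whichever worker runs this task
        return image.astype(np.float32) / 255.0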
    Fabio Grätz
    1 year ago
    Ah, this is interesting, I thought the payload/data was pickled automatically when being communicated between tasks. Is this not correct?
    Kevin Kho
    1 year ago
    Oh you’re right. It does use cloudpickle to submit the work. In that case, I suppose the only thing you can do to reduce transfer is to persist things somewhere, but it doesn’t feel efficient either.
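    If you went the persist route, it could look something like this (untested sketch; the shared directory and the numpy stand-in are assumptions, not something Prefect gives you out of the box):

    import os
    import pickle

    import numpy as np
    from prefect import Flow, task

    # hypothetical location that every dask worker can reach, e.g. an NFS mount
    SHARED_DIR = "/mnt/shared/preprocessed"

    @task
    def load_and_persist(name):
        image = np.zeros((1024, 1024, 3), dtype=np.uint8)  # stand-in for the bucket read
        out_path = os.path.join(SHARED_DIR, f"{name}.pkl")
        with open(out_path, "wb") as f:
            pickle.dump(image, f)
        return out_path  # only a short path string travels through the task graph

    @task
    def preprocess(image_path):
        with open(image_path, "rb") as f:
            image = pickle.load(f)
        return image.astype(np.float32) / 255.0

    with Flow("persist-instead-of-passing") as flow:
        paths = load_and_persist.map(["img_001", "img_002"])
        preprocess.map(paths)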
    Fabio Grätz
    1 year ago

    https://www.youtube.com/watch?v=SIvLhp7aLBw&ab_channel=Heartex

    I think I’m trying to do the same as you are showing in this talk here 🙂 Has this code been published somewhere?
    Kevin Kho
    1 year ago
    Hahaha that’s the first time I’ve seen someone watch this talk. Check this. It was a bunch of effort since I don’t do image work. Glad someone is asking about it.
    Fabio Grätz
    1 year ago
    Thanks, will look at it!
    Kevin Kho
    1 year ago
    Saw the issue you opened. Will look later.
    How many workers are you using for the DaskExecutor btw? Maybe lowering it will force depth-first execution?
    Btw, I don’t have an answer for you but here is the link for Dependent Flows in case you haven’t seen it yet.
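    If you want to try fewer workers, something along these lines should do it (sketch; assumes the ephemeral cluster is dask_kubernetes.KubeCluster and that its worker pod spec comes from your dask config):

    from prefect.executors import DaskExecutor

    # cluster_kwargs are forwarded to dask_kubernetes.KubeCluster when the
    # ephemeral cluster is created; attach with flow.executor = executor
    executor = DaskExecutor(
        cluster_class="dask_kubernetes.KubeCluster",
        cluster_kwargs={"n_workers": 2},  # try a small, fixed number of workers
    )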