Chris Hart

    2 years ago
I've got a task that outputs a trained ML model, which has already been serialized due to running on the DaskExecutor. Is there a way to load the already-cloudpickled object so a following persistence task can just save it directly rather than re-serializing? (Or I may be misunderstanding, and the only serialized thing floating around is the whole task itself, which would mean I need to serialize the model again, but that's ok.)
    I wonder if calling cloudpickle.loads(), grabbing the model object from inside the task, then re-serializing it is any better... probably not; cloudpickle would need to have some kind of fancy getter, but on second thought that seems impossible.
    (unless anyone reading happens to have a strong opinion about using existing pickles directly at runtime - feels like an antipattern, is it?)
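A quick way to see why re-using the in-flight bytes would not save work: you still pay for a full loads/dumps round trip before you can persist anything. A minimal sketch using the stdlib pickle module (cloudpickle exposes the same dumps/loads interface; the model dict is a stand-in):

```python
import pickle  # cloudpickle has the same dumps()/loads() API

# Stand-in for a trained model; any picklable object works here.
model = {"weights": [0.1, 0.2, 0.3]}

# What Dask ships between workers is (roughly) bytes like these:
in_flight = pickle.dumps(model)

# Re-using those bytes for persistence still means a full round trip:
restored = pickle.loads(in_flight)   # deserialize...
persisted = pickle.dumps(restored)   # ...then re-serialize to save

assert pickle.loads(persisted) == model
```

In other words, intercepting the in-flight pickle buys nothing over just serializing the live object inside the task.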
Kyle Moon-Wright

    2 years ago
Hmm, I’m no pickle farmer, but perhaps you’ll find some insight here: usage of cloudpickle.load() in various open-source projects.
Chris White

    2 years ago
Hi @Chris Hart, could you elaborate a little on the design you have? I don’t think I understand the situation you’re running into. There are two ways of exchanging data between Prefect tasks:
    - returning an object from the task to be consumed as an input downstream
    - explicitly storing the object somewhere else that downstream tasks can access (you could also persist things on the Dask cluster itself if you wanted)
    Storing any state on task objects will not work, unless that state was set at initialization of the task.
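The two exchange patterns described above can be sketched with plain functions standing in for Prefect tasks (the function names and the model dict are illustrative, not Prefect API):

```python
import os
import pickle
import tempfile

# Pattern 1: return the object; the orchestrator hands it downstream.
def train():
    return {"weights": [1.0, 2.0]}  # stand-in for a trained model

def count_params(model):
    return len(model["weights"])

# Pattern 2: store the object explicitly; pass only a reference (a path
# here, but it could be an S3 key, etc.) downstream.
def train_and_store(path):
    model = {"weights": [1.0, 2.0]}
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path

def count_params_from_store(path):
    with open(path, "rb") as f:
        return len(pickle.load(f)["weights"])

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
assert count_params(train()) == 2
assert count_params_from_store(train_and_store(path)) == 2
```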
Chris Hart

    2 years ago
@Chris White hey, thanks for the reply. Yeah, earlier I was wondering if, by virtue of running on Dask (which already serializes things to distribute them), I could somehow dump the already-pickled stuff in transit and, from that, pick out and dump to file an object (my model) that was already getting passed around between tasks, just out of curiosity about efficiency. But I'm just doing joblib.dump() while I have the Python runtime copy, and it's not a performance problem.
    The OCD was beckoning, but I decided not to investigate the rabbit hole of picking off Dask pickles in flight 🙂
Chris White

    2 years ago
haha, I hear ya. Dask does have some primitives for hooking into what’s happening under the hood (I believe worker plugins and scheduler plugins are at least two things worth checking out), but I can’t promise that it will help your exact use case; glad it’s not critical either way!
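For the curious, the worker-plugin hook mentioned here looks roughly like this. It is a sketch only: distributed's Client.register_worker_plugin accepts any object exposing these hooks (subclassing distributed's WorkerPlugin is the documented route), and the class name here is illustrative. Note the transition hook observes task state changes; it is not a supported place to grab the serialized bytes themselves.

```python
# Minimal duck-typed sketch of a Dask worker plugin.
class TaskLoggerPlugin:
    def __init__(self):
        self.seen = []

    def setup(self, worker):
        # Called once when the plugin is attached to a worker.
        self.worker = worker

    def transition(self, key, start, finish, **kwargs):
        # Called on every task state change on the worker.
        self.seen.append((key, start, finish))

# Usage (requires a running cluster; shown for shape only):
# from distributed import Client
# client = Client()
# client.register_worker_plugin(TaskLoggerPlugin())
```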
Jim Crist-Harif

    2 years ago
While Dask does send around objects, the serialized format is implementation-specific and wouldn't be something I'd recommend for persisting. We do sometimes use cloudpickle, but for many common objects (numpy arrays, pandas dataframes, etc.) we have a faster internal option. Either way, the bytes sent from worker->worker->client aren't really persistable in a robust way; I recommend you handle persistence yourself.
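"Handle persistence yourself" amounts to: once you hold the live object, write it out in a format you control rather than whatever bytes Dask happened to put on the wire. One illustrative scheme (not from the thread; pickle stands in for joblib/cloudpickle, and the header tag is made up) is to prefix the payload with a tag you own, so you can detect and version the format on read:

```python
import io
import pickle

def save_model(model, fileobj):
    # A format you control: a tiny header naming the serializer and
    # version, then the pickle payload. (Illustrative scheme.)
    fileobj.write(b"PKL1")  # format tag you own
    pickle.dump(model, fileobj, protocol=pickle.HIGHEST_PROTOCOL)

def load_model(fileobj):
    if fileobj.read(4) != b"PKL1":
        raise ValueError("unknown model file format")
    return pickle.load(fileobj)

buf = io.BytesIO()
save_model({"coef": [1, 2]}, buf)
buf.seek(0)
assert load_model(buf) == {"coef": [1, 2]}
```

The point of the header is robustness over time: unlike Dask's internal wire bytes, a file you tagged yourself can be validated and migrated when your serializer changes.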