    Chris

    3 years ago
    Hi everyone, has anyone had any issues with leaked semaphores when running a flow with the Dask Executor? It looks like at some point during the flow, one of the Dask workers fails (it might be a connection/timeout issue). Just before it fails I get a `there appear to be 1 leaked semaphores to clean up at shutdown` warning. Also, rather than killing the flow or restarting the failed task, it restarts upstream tasks which had already run successfully. Does anyone know what the cause of the failure might be, and whether it’s possible to restart only the failed task?
    Jeremiah

    3 years ago
    Hi @Chris, funny enough we have actually been debugging a related issue ourselves this week. I can’t speak to the `semaphore` warning directly, but the behavior — in which a Dask worker’s failure prompts Dask to rerun upstream functions — is one that we now have a reproducible example for, and we are considering raising an issue in the Distributed repo. It appears to stem from a moment of assumed idempotency in the Distributed scheduler — all other Dask retry logic can be disabled via config, but not this one. To be clear, I don’t think this is a Dask bug, per se, but I do think there needs to be a way to turn the behavior off.
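    For reference, the retry-related knobs mentioned above live in Dask’s YAML config. A minimal sketch of the settings we experimented with (key names are from the Distributed config schema; per this thread, they disable Dask’s explicit retries but do not stop the scheduler rerunning completed upstream tasks whose results were lost with a dead worker):

    ```yaml
    # ~/.config/dask/distributed.yaml (sketch, not a complete config)
    distributed:
      scheduler:
        allowed-failures: 0    # fail a task outright if its worker dies, rather than retrying
        work-stealing: false   # stop tasks migrating between workers
    ```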
    One thing we’re exploring is the remedy described in this issue, but I’m not yet sure what side effects that has.
    Chris

    3 years ago
    @Jeremiah Thanks for the quick response, I’m glad I’m not alone in dealing with this issue 😄 I did try tweaking some Dask config params to remove allowed failures (which resulted in the entire flow being killed), but I didn’t have any luck finding a setting specific to the task restarting that doesn’t affect other downstream tasks. I’ll be sure to follow this issue closely!
    Jeremiah

    3 years ago
    Yup, allowed failures and also retries don’t appear to affect this — @Chris White and I had a few late nights on it this week! 😅
    I should note for any Prefect Cloud users that this doesn’t affect execution, thanks to Cloud’s state-locking mechanisms, but it’s a general frustration for non-idempotent workflows in Dask!
    Chris

    3 years ago
    It’s given me a few headaches too over the past couple of days — fingers crossed one of us finds a solution! 😄