# prefect-community
m
Hi folks, we ran into an issue where one of our child tasks ran to Success and was then restarted for some reason, causing our flow to crash. Any idea why this happened, and is there a way to prevent it? We are running a Kubernetes agent and a Dask Kubernetes execution environment on EKS...
happy to share a link to the task run's logs if this would help debug things
(it happened to two child tasks out of all the tasks in our flow)
Inspecting the Kubernetes logs shows that the same child task was re-run from a different pod ID, which means the Dask worker pod was restarted, causing the task to restart
the pod that was stopped doesn't log any warnings or errors ...
Is there a way to make Dask worker restarts cause a flow failure? This would at least ensure that successful tasks don't get re-run. Restarting the flow afterwards wouldn't be a problem in our case, since the data that was already processed successfully wouldn't be processed again
c
Hi Marwan - this is almost always caused by a dask worker crashing and dask deciding to recompute the task; the most robust solution for preventing these reruns is using Cloud’s version locking feature. Another workaround might be to register a worker plugin that fails the flow whenever a worker crashes
m
Hi Chris - thank you for the quick response, so we are on the TEAM plan currently. Does that include cloud version locking ?
seems like we don't have access to version locking given we are still on the TEAM plan ... i.e. we'd have to upgrade to enterprise ... Is there an implementation of the worker plugin that you know of and can share ?
c
Yea, that’s correct at the moment although we do have some changes in mind that will make version locking available on that tier - for that discussion I’d recommend emailing us. For the worker plugin, you’d need to reference the dask docs: https://distributed.dask.org/en/latest/plugins.html and create a Prefect Client in the transition logic that fails the appropriate flow run ID
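A minimal sketch of what such a plugin might look like; it is not code shared in this thread. It relies on the `transition(key, start, finish, ...)` hook described on the linked plugins page and assumes scheduler-side state names (`"memory"` for a finished task, `"processing"` for a task handed to a worker); worker-side state names differ. The class name `FailOnRecompute`, the helper `fail_with_prefect`, and the way the flow run ID reaches the plugin are all hypothetical, and the Prefect calls assume a Prefect 1.x-era `Client`.

```python
from typing import Callable, Hashable, Set


class FailOnRecompute:
    """Hypothetical plugin body: flags a task key that starts again after finishing."""

    def __init__(self, flow_run_id: str, fail_flow_run: Callable[[str, str], None]):
        self.flow_run_id = flow_run_id      # assumed to be known when the plugin is built
        self.fail_flow_run = fail_flow_run  # injected so the logic can be tested offline
        self.finished: Set[Hashable] = set()

    def transition(self, key, start, finish, *args, **kwargs):
        if finish == "memory":
            # The task produced a result at least once.
            self.finished.add(key)
        elif finish == "processing" and key in self.finished:
            # A finished task is being scheduled again, e.g. after its worker was lost.
            self.fail_flow_run(
                self.flow_run_id, f"dask is recomputing finished task {key!r}"
            )


def fail_with_prefect(flow_run_id: str, message: str) -> None:
    """Hypothetical helper: marks the flow run as Failed (assumes Prefect 1.x API)."""
    # Imported lazily so the plugin logic above has no hard dependency on Prefect.
    from prefect import Client
    from prefect.engine.state import Failed

    Client().set_flow_run_state(flow_run_id=flow_run_id, state=Failed(message))
```

In this sketch the plugin would be constructed as `FailOnRecompute(flow_run_id, fail_with_prefect)` and registered with the cluster as described in the dask plugin docs; how the flow run ID is made available at that point depends on the execution environment and is left open here.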
m
Sorry for the late reply @Chris White - got sidetracked ... Thank you for shedding more light on this and sharing the dask scheduler plugin link. We'll probably send an email soon concerning the version locking feature