# prefect-community
j
Hi, I’m curious about how robust Prefect operation with the Kubernetes Agent, `KubernetesRun`, and `DaskExecutor` is in the face of outages of various components. E.g. what will Prefect be able to recover on its own, and which cases would require manual intervention, if one of the following dies during a flow execution: Prefect server (I suppose that’s OK as it’s stateless), the Kubernetes agent, the `KubernetesRun` job executing a flow, the Dask scheduler, or Dask workers?
m
Hey @Jenia Varavva, currently, if the infrastructure your flow is running on crashes or experiences some sort of catastrophic failure, there's not much we can do in Prefect 1.0 to recover the flow other than restarting it. However, here's a discussion around a similar question from another user. The short version is that crashed states and their handling are on our roadmap for Prefect 2.0 but are not currently implemented.
j
I’ve done a few tests, and the combination of Zombie Killer and Lazarus seems to recover most of the cases.
The one shortcoming is that when the flow’s k8s job is terminated, it doesn’t call the Dask cluster context manager’s `__exit__`. In my case, as the Dask cluster is “ephemeral”, i.e. allocated from Dask Gateway for a single flow, this results in an orphaned Dask cluster.
But the flows/tasks seem to eventually get retried unless I kill the Dask scheduler, in which case the flow goes into the “Failed” state, which is probably OK.
m
Brilliant 🙂 that's definitely good to hear.
m
@Mason Menges Now that Prefect 2.0 is out of beta I'm even more eager to hear if the same crash recovery features have now been implemented in the 2.0 codebase!?