Hey
@Prefect,
One of the persistent issues we were facing with Prefect v1 was the unreliability around heartbeats and Lazarus. This was promised to be fixed in v2, so I recently did a quick evaluation of release 2.6.7 (community). Unfortunately, my initial investigation doesn't give great results. For example, changes to the k8s pod running a KubernetesJob-type flow (such as pod restarts, or a pod terminated due to k8s operations) do not show up as appropriate status updates on the Prefect flow run. For instance, if a pod is terminated, the flow run status remains Running indefinitely (flow timeouts won't solve this, as our flow execution times follow a bell curve). In v1, the combination of heartbeats and Lazarus worked around this by marking flows that hadn't reported back for a long time as failed, after which Lazarus retried them. Similar functionality seems to be completely missing in v2. A GitHub issue for this also exists:
https://github.com/PrefectHQ/prefect/issues/7371#issuecomment-1295604573
Is this something that's already on the roadmap, which we can expect to be released in the near future? Are there any other flow execution patterns that work with k8s compute infrastructure and solve the above-mentioned problem?
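In case it helps the discussion: a stopgap we're considering is an external watchdog that sweeps for long-stuck Running flow runs and force-marks them Crashed via the Prefect 2 REST API. To be clear, this is only a sketch under assumptions, not a real fix: the endpoint paths and filter body shape are what I understand the 2.x API to expose (`POST /flow_runs/filter`, `POST /flow_runs/{id}/set_state`), the API URL and the staleness threshold are placeholders, and a start-time cutoff is only a crude stand-in for the heartbeat signal that's actually missing.

```python
"""Stopgap watchdog sketch: force flow runs stuck in Running to Crashed.

Assumptions (not confirmed): Prefect 2.x REST endpoints
POST /api/flow_runs/filter and POST /api/flow_runs/{id}/set_state,
a locally reachable API, and a workload-specific staleness threshold.
"""
import json
from datetime import datetime, timedelta, timezone
from urllib import request

PREFECT_API = "http://127.0.0.1:4200/api"  # assumption: local Orion API URL
STALE_AFTER = timedelta(hours=2)           # hypothetical threshold; tune per workload

def is_stale(started: datetime, now: datetime,
             threshold: timedelta = STALE_AFTER) -> bool:
    """Treat a run still Running this long after start as presumed dead.

    This is effectively a coarse timeout, so it shares the bell-curve
    weakness mentioned above; a real fix needs heartbeats."""
    return now - started > threshold

def _post(path: str, payload: dict):
    # Minimal JSON POST helper against the Prefect API (stdlib only).
    req = request.Request(
        f"{PREFECT_API}{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

def sweep() -> None:
    # Fetch runs currently in the RUNNING state (filter shape assumed).
    runs = _post("/flow_runs/filter",
                 {"flow_runs": {"state": {"type": {"any_": ["RUNNING"]}}}})
    now = datetime.now(timezone.utc)
    for run in runs:
        started = datetime.fromisoformat(run["start_time"])
        if is_stale(started, now):
            # Force a terminal state so the run stops showing as Running.
            _post(f"/flow_runs/{run['id']}/set_state",
                  {"state": {"type": "CRASHED",
                             "message": "watchdog: presumed dead"}})

if __name__ == "__main__":
    sweep()  # run this periodically, e.g. from a CronJob
```

Even if this works, it only cleans up state; it doesn't restore the Lazarus-style retry behaviour we had in v1.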
Here are some other observations from v2 that differ from v1:
1. It’s not possible to cancel a flow run. We can delete it, but that does not terminate the underlying k8s Job associated with the run; delete appears to only stop tracking the flow run’s status.
2. Flows are no longer organised into projects. Previously, we were able to deploy a single flow into multiple projects, which let us mimic multi-tenant capabilities and manage different versions of the same flow for different tenants. Similar functionality exists in v2 via multiple deployments per flow, but this is less intuitive from a multi-tenant perspective: it’s more convenient to view and manage all flows of a single project in one place than to manage different tenants’ deployments within the same flow.
3. Date-based filtering of flow run logs is not supported. We need to keep scrolling down to reach the last log line, which is not ideal for long-running flows with large log output.
4. UI issue - while working with the Kubernetes Job block, I observed that it’s not possible to unset any optional fields.
5. The elapsed time shown in the UI stays at 0 secs until the flow reaches a terminal state.
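On point 1 above, our current workaround is to tear down the Kubernetes Job ourselves when we delete a run. A minimal sketch, assuming the KubernetesJob infrastructure labels the Jobs it creates with the flow run id (the exact label name here is an assumption, so verify it against the Jobs in your cluster first):

```python
import subprocess

# Assumption: Prefect's KubernetesJob infrastructure attaches a label
# carrying the flow-run id to the Jobs it creates; adjust to match
# what `kubectl get jobs --show-labels` shows in your cluster.
FLOW_RUN_LABEL = "prefect.io/flow-run-id"

def delete_job_cmd(flow_run_id: str, namespace: str = "default") -> list:
    """Build the kubectl command that removes the Job backing a flow run."""
    return [
        "kubectl", "delete", "job",
        "-n", namespace,
        "-l", f"{FLOW_RUN_LABEL}={flow_run_id}",
    ]

def cancel_flow_run(flow_run_id: str, namespace: str = "default") -> None:
    # Deleting the flow run in Prefect only stops tracking it, so we also
    # delete the Kubernetes Job (which terminates its pods).
    subprocess.run(delete_job_cmd(flow_run_id, namespace), check=True)

if __name__ == "__main__":
    cancel_flow_run("your-flow-run-id", namespace="prefect")
```

This obviously belongs in Prefect itself as a first-class Cancel action, not in user scripts.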