Prefect Community #ask-community

Rajeshwar Agrawal

11/17/2022, 1:27 PM

Hey @Prefect, one of the persistent issues we faced with Prefect v1 was the unreliability of heartbeats and Lazarus. This was promised to be fixed in v2, so I recently did a quick evaluation of release 2.6.7 (community). Unfortunately, my initial investigation doesn’t give great results.

For example, changes to a k8s pod running a KubernetesJob-type flow (such as pod restarts, or pods terminated by k8s operations) do not show up as appropriate status updates on the Prefect flow run. For instance, if a pod is terminated, the flow run status remains Running indefinitely. Flow timeouts won’t solve this, as our flow execution times follow a bell curve. I believe the combination of heartbeats and Lazarus in v1 worked around this by marking flow runs that hadn’t reported back for a long time as crashed, and Lazarus then retried them. Similar functionality seems to be completely missing in v2. A GitHub issue for this also exists: https://github.com/PrefectHQ/prefect/issues/7371#issuecomment-1295604573

Is this already on the roadmap, and can we expect it to be released in the near future? Are there any other flow execution patterns that work with k8s compute infrastructure and solve the problem described above?

Here are some other observations from v2 that differ from v1:

1. It’s not possible to cancel a flow run. We can delete it, but that does not terminate the underlying k8s job associated with the run. It seems that delete only stops tracking the status of the flow run.
2. Flows are no longer organised in projects. Previously, we were able to deploy a single flow separately into multiple projects, which allowed us to mimic multi-tenant capabilities and manage different versions of the same flow for different tenants. Similar functionality exists in v2, which allows multiple deployments for each flow. However, this is less intuitive from a multi-tenant perspective, as it’s more convenient to view and manage all flows of a single project in one place than to manage the deployments of different tenants within the same flow.
3. Date-based filtering of flow run logs is not supported. We need to keep scrolling down to reach the last line of the log, which is not ideal for long-running flows with large log output.
4. UI issue: while working with the Kubernetes Job block, I observed that it’s not possible to unset any optional fields.
5. The elapsed time shown in the UI remains at 0 secs until the flow run reaches a terminal state.
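To make the request concrete, the v1 heartbeat-plus-Lazarus behaviour described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not Prefect's implementation and not its API: `FlowRun`, `HEARTBEAT_TIMEOUT`, `MAX_RETRIES`, and `mark_stale_runs` are hypothetical names, and the timeout and retry limit are assumed values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# Assumed values for illustration only; not taken from Prefect.
HEARTBEAT_TIMEOUT = timedelta(minutes=5)
MAX_RETRIES = 3


@dataclass
class FlowRun:
    """Hypothetical record of a flow run as tracked by an orchestrator."""
    run_id: str
    state: str
    last_heartbeat: datetime
    retries: int = 0


def mark_stale_runs(runs: List[FlowRun], now: datetime) -> List[str]:
    """Reschedule Running runs whose heartbeat is overdue.

    Runs that have used up their retries are marked Crashed instead.
    Returns the ids of the runs that were rescheduled.
    """
    rescheduled = []
    for run in runs:
        if run.state != "Running":
            continue  # only running flows are expected to send heartbeats
        if now - run.last_heartbeat <= HEARTBEAT_TIMEOUT:
            continue  # heartbeat is recent enough, run is presumed healthy
        if run.retries < MAX_RETRIES:
            run.state = "Scheduled"  # hand the run back for another attempt
            run.retries += 1
            rescheduled.append(run.run_id)
        else:
            run.state = "Crashed"  # give up once the retry budget is spent
    return rescheduled
```

A periodic service would call `mark_stale_runs` with the current time. Because the check depends only on the last heartbeat, it works regardless of how long a healthy flow run normally takes, which is why a fixed flow timeout is not a substitute.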

🎯 3

💅 1

Bianca Hoch

11/17/2022, 5:48 PM

Hey Rajeshwar, thank you for writing up this summary! I’m going to do my best to answer your concerns and questions in the order you’ve presented them.

We are actively planning how best to handle unexpected infrastructure failures, to functionally replace Lazarus and the Zombie Killer (ZK) from 1.0. This is a high-priority feature on our roadmap and will be addressed in 2.0.

For your additional observations:

1. Flow run cancellation is on our roadmap and is also being actively worked on. It is expected to be released around the same time as the ZK and Lazarus replacement (both taking priority over other features on the roadmap).
2. In regard to projects, in 2.0 we have all the same tools for project-type organization (tags and workspaces). The difference is that this method of organization is non-linear, versus projects, which are strict hierarchies.
3. Time-based sorting for logs is anticipated to be present in our next release. For log filtering as a feature, I’d suggest creating a feature request in GitHub. It certainly sounds like a good idea, and we’d be more than happy to review the request.
4. For the UI issue, we believe your comment reflects this open issue. It is currently being worked on, so expect a resolution shortly.
5. We can confirm this is a bug and we will be implementing a fix ASAP. Thanks for raising it here.

👍 1
