Hello, I have a problem with Prefect Cloud 2.0. We use the Kubernetes flow runner and a Dask task runner.
On Friday (8/7-2022), I had a flow run that I wanted to abort. I attempted to use the functionality in the UI, thinking it would delete all resources related to the flow_run, including the Kubernetes job. It did not remove the Kubernetes job, so I removed it manually.
The issue is concurrency limits: the tasks launched by this flow have a tag with a concurrency limit. It appears that the task data associated with the deleted flow run was not removed from Prefect storage.
For instance, if I try:
prefect concurrency-limit inspect my-tag
It shows a bunch of active task IDs, even though nothing is running in k8s. This causes an unfortunate issue: new flow runs for this flow never start their tasks, because Prefect thinks the concurrency limit has been hit due to these zombie tasks.
However, I cannot find a way to manually clean up these task IDs, which means this flow is dead.
Any help is appreciated!
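To make the failure mode above concrete, here is a minimal toy model of a tag-based concurrency limit. It is only a sketch of the behaviour described in this post, not Prefect's actual implementation; the class `TagConcurrencyLimit`, its methods, and the task run IDs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TagConcurrencyLimit:
    """Toy stand-in for a tag-based concurrency limit (hypothetical, not Prefect code)."""

    tag: str
    limit: int
    active_slots: set = field(default_factory=set)

    def try_acquire(self, task_run_id: str) -> bool:
        # A task run may start only while a slot is free.
        if len(self.active_slots) >= self.limit:
            return False
        self.active_slots.add(task_run_id)
        return True

    def release(self, task_run_id: str) -> None:
        # Normally called when a task run reaches a final state.
        self.active_slots.discard(task_run_id)


limit = TagConcurrencyLimit(tag="my-tag", limit=2)

# Two task runs take both slots, then their infrastructure is removed
# by hand, so release() is never called for them.
limit.try_acquire("zombie-1")
limit.try_acquire("zombie-2")

# A task run from a new flow run is now refused a slot.
blocked = limit.try_acquire("new-task")

# Only an explicit cleanup of the stale IDs frees the slots again.
for stale_id in ("zombie-1", "zombie-2"):
    limit.release(stale_id)
unblocked = limit.try_acquire("new-task")
```

In this model the new task run stays blocked until something explicitly releases the stale IDs, which is the manual cleanup step I am looking for.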
Deleting a flow run deletes only the flow run; it does not terminate any external resources.
Because of the hybrid model, Prefect doesn't have direct access to your infrastructure, which is why terminating resources this way is difficult.
Let me open an issue to investigate the best approach for such zombie tasks.
@Marvin open "Investigate the right approach for cleaning up zombie task runs caused by an infrastructure crash to free up concurrency limit slots"
to delete all related resources on the prefect-storage side, such as any task runs associated with the flow run.
I assume this is not the case at the moment.
Regarding external resources: we have our agent deployed in a k8s cluster, and the agent has access to the k8s API.
Would it not be possible to forward information from the agent to the prefect-storage, and thus have it reflected in the UI?
We often have problems with the information in the UI being out of sync with the actual state in k8s, such as flows which look like they "run-forever" even if the k8s pod is long gone.
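As a rough illustration of the kind of check an agent with k8s API access could make, the sketch below compares the run states recorded on the Prefect side with the set of runs that still have a live pod. Both inputs are hypothetical hard-coded snapshots and `find_orphaned_runs` is an illustrative helper, not an existing Prefect or Kubernetes API.

```python
def find_orphaned_runs(recorded_states: dict, live_pod_run_ids: set) -> list:
    """Return IDs of runs recorded as running that have no live pod."""
    return sorted(
        run_id
        for run_id, state in recorded_states.items()
        if state == "Running" and run_id not in live_pod_run_ids
    )


# Hypothetical snapshot of what the UI shows.
recorded_states = {
    "run-a": "Running",
    "run-b": "Running",
    "run-c": "Completed",
}
# Hypothetical snapshot of what the agent sees in the cluster.
live_pod_run_ids = {"run-b"}

orphans = find_orphaned_runs(recorded_states, live_pod_run_ids)
```

Here `run-a` is still shown as running although its pod is gone, so it is the one that would need its state corrected.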
2 months ago
Would you want to open a separate GitHub issue for that and explain there what exactly is out of sync between Kubernetes and Prefect? This is a separate issue from cleaning up zombie task runs, even if the two are related.
2 months ago
Yes sure, I will do that
2 months ago
flows which look like they "run-forever" even if the k8s pod is long gone
As mentioned before, handling infrastructure crashes is a hard problem in a hybrid model, and this is already on our radar. But if you mean something else, then creating a separate issue might be useful.