# prefect-community
d
Hello there, I don’t know exactly how to search for this particular strange problem that I have encountered lately: I have a flow that follows a simple pattern of generate N elements, map over them, then collect all results and send a Slack notification. I am executing this flow using a `DaskKubernetesEnvironment` and a Kubernetes agent on a GCP k8s cluster that has auto-scaling (I am not sure if this is relevant). The strange situation is that, sometimes, the flow stays in Running indefinitely because the last 1 to 3 tasks of the map are “still running”. I will try to collect some logs when this happens, but it does not look like the workers are stuck; it seems more like they were killed a bit too soon and could not inform the Prefect server (or the Dask scheduler?) that they finished… Has anyone encountered this strange situation?
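For context, here is a minimal sketch of that flow shape, assuming Prefect 0.10.x-style APIs; the task names and the `DaskKubernetesEnvironment` arguments are placeholders, not the actual flow:

```python
# Minimal sketch of a "generate N -> map -> collect -> notify" flow
# (hypothetical task names; environment arguments are placeholders).
from prefect import Flow, task
from prefect.environments import DaskKubernetesEnvironment


@task
def generate_items():
    # Placeholder: produce the N elements to map over.
    return list(range(399))


@task
def process_item(item):
    # Placeholder: per-item work that runs on a Dask worker.
    return item * 2


@task
def notify_slack(results):
    # Placeholder: collect all results and post a summary to Slack.
    print(f"Processed {len(results)} items")


with Flow(
    "map-and-notify",
    environment=DaskKubernetesEnvironment(min_workers=1, max_workers=10),
) as flow:
    items = generate_items()
    results = process_item.map(items)
    notify_slack(results)
```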
l
Interesting, yes I’m interested in the logs, and maybe we should also look at the Kubernetes master node logs in case there’s some kind of pod eviction going on? And in terms of environment, what version of Prefect are you on?
d
I am on Prefect 0.10.5. I’ll keep an eye on this… and I will disable the ResourceManager just in case
l
Ok gotcha, let me know. There was a Resource Manager bug fixed in 0.10.6, but I think it would cause the opposite of what you are seeing, i.e. that it wouldn’t clean up/evict.
d
Hi again @Laura Lorenz (she/her), I managed to capture the logs this time. I also upgraded to 0.10.7 and disabled the resource manager just in case. I am sharing the graphviz visualization of my graph for reference in this discussion. The problem appears on the second map. Normally, it should map over 399 elements generated by the FetchResponses task.
Here is an excerpt of the logs right before the flow fails.
Some pointers that I think may be interesting:
• The Resource Manager was not enabled, so it is no longer a suspect 🙂
• Dask is retiring workers and moving results. I am not sure why. Probably because it’s near the end of the execution and the workers don’t have anything else to do (there is never a problem with the previous map). Perhaps dask-kubernetes is too eager to clean them up?
• I don’t think there are communication issues with the Prefect server: the logs show that the CloudFlowRunner is aware that the tasks did not complete, so it leaves the status as RUNNING
• Maybe it’s related to the fact that I have `work_stealing=True` on my `DaskKubernetesEnvironment`? I will remove this and try again (see the sketch after this list)
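For reference, a minimal sketch of flipping that flag off, assuming the `work_stealing` keyword mentioned above; the other arguments are placeholders carried over from the earlier sketch:

```python
# Sketch only: disabling work stealing on the environment to rule it out.
# The work_stealing kwarg comes from the message above; min/max workers
# are placeholder values.
from prefect.environments import DaskKubernetesEnvironment

environment = DaskKubernetesEnvironment(
    min_workers=1,
    max_workers=10,
    work_stealing=False,  # was True; disabled while investigating
)
```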
So apparently what’s been happening is that I have Dask workers that die unexpectedly. They don’t have time to tell the scheduler, and the cloud runner tries to execute that task again; but the task state was never moved out of RUNNING, so when the task is eventually tried again, the work is not done because there is a state check (the log entry says `"Task 'xxx': task is already running"`).
Does that make sense?
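To make that sequence concrete, here is a purely conceptual illustration (not Prefect’s actual task-runner code) of why the retry is skipped when the server-side state is still RUNNING:

```python
# Conceptual illustration only, not Prefect internals: a retried task run is
# refused because the server still reports the task as Running, since the
# dead worker never got the chance to move it to a finished state.
def attempt_task_run(task_name: str, server_state: str) -> bool:
    if server_state == "Running":
        # Mirrors the observed log line.
        print(f"Task '{task_name}': task is already running")
        return False
    # Otherwise the task would be set to Running and executed.
    return True


# The worker died mid-run, so the state was never updated:
attempt_task_run("xxx", "Running")  # prints the "already running" message
```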
A frustrating bit is that the logs of the second attempt overwrite the logs (on the Prefect server) of the first attempt
l
Hi @David Ojeda sorry for the delay, I missed your ping there for a minute 🙂 I believe you are running your own Prefect Server rather than Prefect Cloud, based on your last message -- for the general class of tasks dying outside of Prefect’s control, Cloud has a polling service called the Zombie Killer (https://docs.prefect.io/orchestration/concepts/services.html#zombie-killer) that tries to clean them up. There is a PR in progress to port that over to Prefect Server, but that service doesn’t exist there yet. For the logs part, I’m a little surprised, because I thought the logs were append-only, but I can double check that. I know that, depending on a lot of things, logs sometimes get intermittently ‘lost’ with dask/multithreading, but it sounds like that’s not what you’re experiencing.
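For readers unfamiliar with that service, here is a conceptual sketch of what a heartbeat-based cleanup like that does; it is not the actual Prefect Cloud implementation, and both helper names are hypothetical:

```python
# Conceptual sketch of a "zombie killer"-style service: periodically fail
# task runs whose heartbeat has gone stale so they do not sit in Running
# forever. Not the actual Prefect Cloud implementation.
from datetime import datetime, timedelta

HEARTBEAT_TIMEOUT = timedelta(minutes=2)  # assumed threshold


def reap_zombie_task_runs(get_running_task_runs, mark_failed):
    """Both arguments are hypothetical client helpers, injected for clarity.

    get_running_task_runs() -> iterable of objects with .id and .last_heartbeat
    mark_failed(run_id, message=...) -> marks that task run as Failed
    """
    for run in get_running_task_runs():
        if datetime.utcnow() - run.last_heartbeat > HEARTBEAT_TIMEOUT:
            mark_failed(run.id, message="No heartbeat received; marking failed")
```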
d
Hi Laura, thanks for the response. I’ve read about the Zombie Killer on Cloud, which seems like a nice feature… In the meantime I still have to find the source of the problem: my workers should not be dying like that. As a side note, I noticed a couple of problems, like the logs issue I mentioned, and another one I did not mention before: the current k8s deployment for the Kubernetes agent has a small problem with its liveness probe. I think there is a configuration issue, since the HTTP server works but Kubernetes cannot connect to it. As a result, the agent is frequently killed or stuck in a crash loop.