# ask-community
Erik Amundson:
Has anyone run into this issue before? Running `DaskExecutor` on GKE with cluster class `dask_kubernetes.KubeCluster`, and it seems to be dropping 1-2 mapped children per run. It's like the scheduler doesn't realize they exist, or is losing track somehow - there is no error message in the logs. This prevents the flow from proceeding to the downstream tasks, so I end up having to cancel the flow. So far it's shown the same behavior on all four test runs. If it matters, we're running Prefect 0.14.16.
Kevin Kho:
Hey @Erik Amundson, I haven’t seen this before, is there anything you can gather from the Dask dashboard?
Erik Amundson:
@Kevin Kho The tasks are actually registered on the Dask dashboard, they're just stuck on "processing". Some of them get stuck very early on, like one worker that's been stuck on processing since its third mapped task (out of 1500, with 10 workers - the earlier tasks in the screenshot are all dependencies). Also, Dask doesn't seem to be releasing any of the old, unmanaged memory, but I don't know if that's related.
Those tasks don't show up in the Prefect logs, but if I kill the worker pod they're on, all tasks from that worker, including the "missing" one, will be re-run on other workers.
Some of those other tasks then get "stuck" in the same way though, so that's not really a working solution unfortunately.
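For anyone debugging something similar, this is roughly how the per-worker "processing" state can be inspected from a separate session (the scheduler address below is a placeholder for the actual GKE service address):
```python
# Sketch: ask the Dask scheduler what each worker currently has in "processing".
# The scheduler address is a placeholder; point it at the real KubeCluster scheduler.
from dask.distributed import Client

client = Client("tcp://dask-scheduler:8786")

# Dict of worker address -> task keys the scheduler thinks that worker is processing
for worker, tasks in client.processing().items():
    print(worker, len(tasks), list(tasks)[:3])

# Basic worker info (memory, nthreads, etc.) for cross-checking against the dashboard
print(list(client.scheduler_info()["workers"].keys()))
```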
Sorry about all the replies; this is the pod-level log message that I see:
```
2021-07-20T16:57:32.459764320Z distributed.core - INFO - Event loop was unresponsive in Worker for 10.01s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
```
Kevin Kho:
Oh no worries, sorry I haven't gotten back to you yet. These issues are harder 😅. This looks like a Dask issue more than Prefect to be honest, but I find it so weird that it's exactly 2 tasks that fail each time? I also assume the 1500 task runs were a subset of the 5000, so this is not a resource issue, right? Could you tell me more about what the tasks are doing?
Erik Amundson:
The 1500 was a second run of the flow - we have a parameter for how many to run. It's not exactly two every time; it's been between one and four. Each mapped child loads an Excel file and uses it to create six documents, which then get collected at the end of the flow and upserted.
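The shape of the flow is roughly this (simplified sketch; task names, the parameter, and the document-building logic are placeholders for our actual code):
```python
# Simplified sketch of the flow shape described above (Prefect 0.14.x API).
from prefect import Flow, Parameter, task


@task
def list_files(n):
    # placeholder: produce the list of Excel files to process
    return [f"file_{i}.xlsx" for i in range(n)]


@task
def build_documents(path):
    # placeholder: load the Excel file and turn it into six documents
    return [{"source": path, "doc": i} for i in range(6)]


@task
def upsert(all_docs):
    # runs downstream of the map: flatten the mapped results and upsert them
    flattened = [doc for docs in all_docs for doc in docs]
    print(f"would upsert {len(flattened)} documents")


with Flow("excel-to-documents") as flow:
    n = Parameter("n", default=1500)
    files = list_files(n)
    docs = build_documents.map(files)  # the mapped children that get "stuck"
    upsert(docs)
```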
Kevin Kho:
For those failed tasks, would you know if the documents were created? Or does it hang on the upsert?
Erik Amundson:
The upsert happens downstream of the map in the flow, so that's definitely not the issue. We wouldn't be able to tell if the documents are created unless we put in a webhook or something, since the failed tasks don't show up in the Prefect logs or the Kubernetes pod logs.
Kevin Kho:
I see. Based on this, I suppose it could be related to memory issues? There is a video that shows a setting you can try to lower the unmanaged memory usage. Might be worth a shot?
Also related: it looks like this happens when having many futures. It might just be a warning, but the recommendation is to break things up if possible.
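I'm not 100% sure which setting the video shows, but one knob people commonly try for unmanaged memory on glibc-based images is the `MALLOC_TRIM_THRESHOLD_` environment variable on the worker pods, so freed memory gets trimmed back to the OS. Rough sketch (image name and sizes are placeholders):
```python
# Hedged sketch: set MALLOC_TRIM_THRESHOLD_ on the Dask worker pods so glibc
# returns freed memory to the OS more aggressively. Image and sizes are placeholders.
from dask_kubernetes import KubeCluster, make_pod_spec

pod_spec = make_pod_spec(
    image="my-registry/flow-image:latest",    # placeholder worker image
    memory_limit="8G",
    memory_request="8G",
    env={"MALLOC_TRIM_THRESHOLD_": "65536"},  # trim threshold in bytes
)

cluster = KubeCluster(pod_template=pod_spec)
cluster.scale(10)
```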
Erik Amundson:
I'll check those out, thanks!