# ask-community
Erik Amundson:
Has anyone run into this issue before? Running `DaskExecutor` on GKE with cluster class `dask_kubernetes.KubeCluster`, and it seems to be dropping 1-2 mapped children per run. It's like the scheduler doesn't realize they exist, or is losing track somehow - there is no error message in the logs. This prevents the flow from proceeding to the downstream tasks, so I end up having to cancel the flow. So far it's shown the same behavior on all four test runs. If it matters, we're running Prefect 0.14.16.
Kevin Kho:
Hey @Erik Amundson, I haven’t seen this before, is there anything you can gather from the Dask dashboard?
Erik Amundson:
@Kevin Kho The tasks are actually registered on the Dask dashboard, they're just stuck on "processing". Some of them get stuck very early on, like one worker that's been stuck on processing since its third mapped task (out of 1500, with 10 workers - the earlier tasks in the screenshot are all dependencies). Also, Dask doesn't seem to be releasing any of the old, unmanaged memory, but I don't know if that's related.
Those tasks don't show up in the Prefect logs, but if I kill the worker pod they're on, all tasks from that worker, including the "missing" one, will be re-run on other workers.
Some of those other tasks then get "stuck" in the same way though, so that's not really a working solution unfortunately.
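For anyone debugging something similar, this is roughly how the per-worker "processing" state can be inspected from a separate session (the scheduler address below is a placeholder for the actual GKE service address):
```python
# Sketch: ask the Dask scheduler what each worker currently has in "processing".
# The scheduler address is a placeholder; point it at the real KubeCluster scheduler.
from dask.distributed import Client

client = Client("tcp://dask-scheduler:8786")

# Dict of worker address -> task keys the scheduler thinks that worker is processing
for worker, tasks in client.processing().items():
    print(worker, len(tasks), list(tasks)[:3])

# Basic worker info (memory, nthreads, etc.) for cross-checking against the dashboard
print(list(client.scheduler_info()["workers"].keys()))
```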
Sorry about all the replies; this is the pod-level log message that I see:
```
2021-07-20T16:57:32.459764320Z distributed.core - INFO - Event loop was unresponsive in Worker for 10.01s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
```
Kevin Kho:
Oh no worries, sorry I haven't gotten back to you yet. These issues are harder 😅. This looks like a Dask issue more than Prefect to be honest, but I find it so weird that it's exactly 2 tasks that fail each time? I also assume the 1500 task runs were a subset of the 5000, so this is not a resource issue, right? Could you tell me more about what the tasks are doing?
Erik Amundson:
The 1500 was a second run of the flow - we have a parameter for how many to run. It's not exactly two every time; it's been between one and four. Each mapped child loads an Excel file and uses it to create six documents, which then get collected at the end of the flow and upserted.
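The shape of the flow is roughly this (simplified sketch; task names, the parameter, and the document-building logic are placeholders for our actual code):
```python
# Simplified sketch of the flow shape described above (Prefect 0.14.x API).
from prefect import Flow, Parameter, task


@task
def list_files(n):
    # placeholder: produce the list of Excel files to process
    return [f"file_{i}.xlsx" for i in range(n)]


@task
def build_documents(path):
    # placeholder: load the Excel file and turn it into six documents
    return [{"source": path, "doc": i} for i in range(6)]


@task
def upsert(all_docs):
    # runs downstream of the map: flatten the mapped results and upsert them
    flattened = [doc for docs in all_docs for doc in docs]
    print(f"would upsert {len(flattened)} documents")


with Flow("excel-to-documents") as flow:
    n = Parameter("n", default=1500)
    files = list_files(n)
    docs = build_documents.map(files)  # the mapped children that get "stuck"
    upsert(docs)
```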
Kevin Kho:
For those failed tasks, would you know if the documents were created? Or does it hang on the upsert?
Erik Amundson:
The upsert happens downstream of the map in the flow, so that's definitely not the issue. We wouldn't be able to tell if the documents are created unless we put in a webhook or something, since the failed tasks don't show up in the Prefect logs or the Kubernetes pod logs.
Kevin Kho:
I see. Based on this, I suppose it could be related to memory issues? There is a video that shows a setting you can try to lower the unmanaged memory usage. Might be worth a shot?
Also related: it looks like this happens when having many futures. It might just be a warning, but the recommendation is to break things up if possible.
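I'm not 100% sure which setting the video shows, but one knob people commonly try for unmanaged memory on glibc-based images is the `MALLOC_TRIM_THRESHOLD_` environment variable on the worker pods, so freed memory gets trimmed back to the OS. Rough sketch (image name and sizes are placeholders):
```python
# Hedged sketch: set MALLOC_TRIM_THRESHOLD_ on the Dask worker pods so glibc
# returns freed memory to the OS more aggressively. Image and sizes are placeholders.
from dask_kubernetes import KubeCluster, make_pod_spec

pod_spec = make_pod_spec(
    image="my-registry/flow-image:latest",    # placeholder worker image
    memory_limit="8G",
    memory_request="8G",
    env={"MALLOC_TRIM_THRESHOLD_": "65536"},  # trim threshold in bytes
)

cluster = KubeCluster(pod_template=pod_spec)
cluster.scale(10)
```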
Erik Amundson:
I'll check those out, thanks!