Hello! I’m running a pretty extensive pipeline on a large EC2 machine (we’re still trying to get set up to port it to Prefect Cloud), and I have some bottleneck tasks in the middle that collect the results of a bunch of previous mapped tasks and do calculations across the whole dataset. The bottleneck tasks never manage to succeed, but they don’t explicitly “fail”, either: earlier tasks just re-run (collecting the checkpointed results) and then the bottleneck task starts again, and again doesn’t finish. I can’t figure out what the problem is. I can run through the task outside of Prefect if I load all the data myself, so I’m not sure if it’s a silent memory issue on the worker running the large task, some timeout issue with the scheduler, or something else I’m not thinking of. Has anyone run into similar behavior and have a suggestion for how to diagnose and/or fix the problem?
I’ll test what happens when I go down to a single worker; maybe it’ll get through the task then. But it would be convenient not to have to split these tasks out into separate pipelines that run with fewer workers.
03/31/2022, 9:11 PM
Sounds like a weird scenario; I can’t really tell from the description. Could you say more about “don’t explicitly fail”?
OOM errors are pretty explicit and Prefect doesn’t have any default timeout
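One way to rule out a silent memory issue is to log peak memory from inside the bottleneck task itself, at the start and just before the collection step. A minimal stdlib sketch (the function name is mine, not anything from Prefect; `ru_maxrss` is in KiB on Linux):

```python
import resource


def log_peak_memory(label: str) -> float:
    """Print and return the peak RSS (MiB) of the current process.

    On Linux, ru_maxrss is reported in KiB; on macOS it is in bytes,
    so adjust the divisor if running there.
    """
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    peak_mib = peak_kib / 1024
    print(f"{label}: peak RSS ~ {peak_mib:.1f} MiB")
    return peak_mib


# Call at a few points in the task; if the last line you see is mid-task
# and the number is near the worker's memory limit, OOM is the suspect:
log_peak_memory("bottleneck task: after loading mapped results")
```

If the task's log output simply stops between two of these calls, that points at the worker process being killed rather than the task raising.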
03/31/2022, 9:43 PM
That’s what I thought. I’m not sure what else to say, unfortunately, because Prefect doesn’t report anything: it just starts rescheduling the upstream tasks, which complete with the cached results, and then restarts the task it never gets past.
I’ll try to dig in a bit more on my end to figure out the details of what’s going on, but I thought I’d post here in case there’s an obvious known issue with the local Dask executor in these sorts of situations.
It’s also a somewhat non-standard scenario: I actually implemented a more elaborate version of the approach described in this post, where results are saved to disk during the task and a container of result paths is returned and ingested downstream, because I was having failures that seemed to be due to OOM events.
(So task checkpointing is turned off, with the data checkpointing/loading handled by my custom task wrappers.)
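For anyone following along, the pattern I mean looks roughly like this, sketched with only the stdlib (the directory, function names, and pickle format are illustrative, not my actual wrappers):

```python
import pickle
from pathlib import Path

# Illustrative location; the real pipeline uses its own result store.
RESULTS_DIR = Path("/tmp/pipeline_results")


def save_result(name: str, data) -> str:
    """Write a task's output to disk and return only the path.

    The heavy data never travels through the scheduler; downstream
    tasks receive a small string instead of the full object.
    """
    RESULTS_DIR.mkdir(parents=True, exist_ok=True)
    path = RESULTS_DIR / f"{name}.pkl"
    with open(path, "wb") as f:
        pickle.dump(data, f)
    return str(path)


def load_result(path: str):
    """Downstream task ingests the path and loads the data on demand."""
    with open(path, "rb") as f:
        return pickle.load(f)


# An upstream mapped task returns a path; the bottleneck task loads it:
chunk_path = save_result("mapped_chunk_0", [1, 2, 3])
chunk = load_result(chunk_path)
```

The tradeoff is that the bottleneck task still has to load everything into one process at collection time, which is exactly where a memory ceiling would bite.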
04/01/2022, 12:04 AM
I think it sounds like Dask workers are dying, causing Dask to spin up new workers and restart the work. Could Dask be retrying the work on top of any Prefect retries?
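If workers are being killed for memory, the Linux OOM killer usually leaves a trace in the kernel log on the EC2 host. A quick check (may need root depending on the distro; the fallback message is just so the command always prints something):

```shell
# Search the kernel ring buffer for OOM-killer activity
dmesg -T 2>/dev/null | grep -iE "killed process|out of memory" \
  || echo "no OOM events found"
```

If that turns up `Killed process <pid> (python)` entries around the times the bottleneck task restarts, the silent-OOM theory is confirmed.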