
    Andrew Pruchinski

    2 months ago
    Hey all. Our flow isn't able to spin up more than 180 jobs. We can't determine the exact root cause, but at a certain point there's a "Connection aborted" error, Prefect starts to remove jobs, and then it marks the entire flow as failed. We are using a DaskExecutor on Prefect Cloud. Looking into the Dask repos, we haven't been able to set debug-level logging as an attempt to retrieve some info on what's really causing the "Connection aborted" issue. Even if there are some failures, we do want the flow to continue, and the task triggers are set accordingly. Thank you
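    For reference, one stdlib-only way to turn up client-side Dask logging is to raise the level on the loggers the distributed package uses. This is a minimal sketch; whether it surfaces the cause of the "Connection aborted" error depends on where that error originates (client, scheduler, or worker).

```python
import logging

# Raise the log level on the standard logger names used by Dask's
# distributed package. This only affects the process it runs in, so it
# would need to be applied on the workers/scheduler as well to capture
# logs emitted there.
for name in ("distributed", "distributed.scheduler", "distributed.worker"):
    logging.getLogger(name).setLevel(logging.DEBUG)
```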
    Bianca Hoch

    2 months ago
    Hello Andrew, would you mind sharing the flow id or an ID of an example flow run that was affected by this?

    Andrew Pruchinski

    2 months ago
    Thank you @Bianca Hoch for the response and sorry for the delay
    Bianca Hoch

    2 months ago
    No worries, thanks Andrew!
    Hm. That's odd. I can't seem to find this flow run. Would you mind sending over another? Or perhaps the flow ID?
    Also, to clarify, could you describe the 180 jobs you are trying to run? (are they tasks, a series of subflows containing different tasks, etc?)
    Andrew Pruchinski

    2 months ago
    Hopefully this will work!
    Yes, these are all mapped tasks for a machine learning model. We are experimenting over a bunch of different combinations of hyperparameters and models to determine the best one. The flow should spin up around 2800 tasks, so we tried bumping our max jobs up to 1000. The resources are built to scale up as needed using Dask and Karpenter.
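    To illustrate where a count like 2800 comes from: each point in the hyperparameter grid becomes one mapped task. The grid below is purely hypothetical (the actual models and parameter values aren't shown in the thread); it just demonstrates how a modest grid multiplies out to 2800 combinations.

```python
from itertools import product

# Hypothetical search space -- names and values are illustrative only.
grid = {
    "model": ["xgboost", "lightgbm", "random_forest", "extra_trees"],   # 4
    "learning_rate": [0.3, 0.1, 0.03, 0.01, 0.003, 0.001, 0.0003],      # 7
    "max_depth": list(range(3, 13)),                                    # 10
    "n_estimators": [100, 200, 300, 400, 500,
                     600, 700, 800, 900, 1000],                         # 10
}

# Each combination would be one mapped task: 4 * 7 * 10 * 10 = 2800.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combos))  # 2800
```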
    Bianca Hoch

    2 months ago
    Alright, it looks like we have a few errors of interest in the logs:
    {"id":"1b8fe323-632b-4f49-966b-a2a82ba1de3a","name":"k8s-infra","timestamp":"2022-07-21T20:43:39.284953+00:00","message":"Pod prefect-job-a70dec73-mrhrw failed.\n\tContainer 'flow' state: terminated\n\t\tExit Code:: 137\n\t\tReason: OOMKilled"}
    {"id":"c5d8b50b-5eb6-486c-9266-07e5302a60ad","name":"prefect.CloudFlowRunner","timestamp":"2022-07-21T20:43:56.292591+00:00","message":"Flow run is no longer in a running state; the current state is: <Failed: \"Kubernetes Error: pods ['prefect-job-a70dec73-mrhrw'] failed for this job\">"}
    It looks like the pod was OOMKilled (exit code 137), i.e. an out-of-memory error due to a lack of k8s resources.
    The final Failed state of the flow was accompanied by the following message:
    "Kubernetes Error: pods ['prefect-job-a70dec73-mrhrw'] failed for this job"
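    Since exit code 137 / OOMKilled means the container exceeded its memory limit (or the node ran out of memory), the usual mitigation is to put explicit resource requests and limits on the Dask worker pods so the scheduler and the autoscaler can plan around them. Below is a hypothetical worker pod template expressed as a plain Python dict; the image, flags, and sizes are example values, not the poster's actual configuration.

```python
# Hypothetical Dask worker pod template (example values only).
worker_pod_template = {
    "spec": {
        "containers": [
            {
                "name": "dask-worker",
                "image": "daskdev/dask:latest",
                "args": [
                    "dask-worker",
                    # Tell the worker its own memory budget so it can spill
                    # to disk before the kernel OOM-kills the container.
                    "--memory-limit", "4GB",
                    "--nthreads", "1",
                ],
                "resources": {
                    "requests": {"cpu": "1", "memory": "4Gi"},
                    "limits": {"cpu": "1", "memory": "4Gi"},
                },
            }
        ]
    }
}

# Requests equal to limits give the pod the Guaranteed QoS class, so it is
# among the last to be evicted under node memory pressure.
```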

    Andrew Pruchinski

    2 months ago
    Perfect! Thank you for these resources. I'll be looking into them. Thank you for your help @Bianca Hoch
    Do you have any documentation or tips and tricks for handling OOM errors when using KubeCluster? This setup uses Coiled. Thank you @Bianca Hoch!
    Bianca Hoch

    1 month ago
    Hello Andrew, sorry for the late response. Currently we are a bit light on documentation regarding handling OOM errors when using KubeCluster. Expect more to come on that in the future!

    Andrew Pruchinski

    1 month ago
    Thank you!