# prefect-community
a
Hey all. Our flow isn't able to spin up more than 180 jobs. We can't determine the exact root cause, but at a certain point the connection is aborted, Prefect starts removing jobs, and then it marks the entire flow as failed. We are using a DaskExecutor on Prefect Cloud. Looking into the Dask repos, we haven't been able to enable any debug-level logging to get more information on what's really causing the "Connection aborted" issue. Even if there are some failures, we do want the flow to continue, and the task triggers are set accordingly. Thank you
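(For readers finding this thread later: a minimal sketch of the pattern being described, assuming Prefect 1.x since the thread mentions CloudFlowRunner and DaskExecutor. The task names and parameter grid are illustrative, not taken from the thread; the key pieces are the `all_finished` trigger, so downstream tasks still run when some mapped tasks fail, and the `PREFECT__LOGGING__LEVEL=DEBUG` environment variable for more verbose flow-run logs.)
```python
# Minimal Prefect 1.x sketch; task names and the parameter grid are placeholders.
from prefect import Flow, task
from prefect.executors import DaskExecutor
from prefect.triggers import all_finished

@task
def train_model(params):
    # Train one hyperparameter combination; may fail for some configs.
    ...

@task(trigger=all_finished)  # runs even if some upstream mapped tasks failed
def collect_results(results):
    # Aggregate whatever finished; failed siblings do not block this task.
    ...

with Flow("hyperparameter-sweep") as flow:
    grid = [{"lr": lr, "depth": d} for lr in (0.01, 0.1) for d in (3, 5)]
    collect_results(train_model.map(grid))

# Setting PREFECT__LOGGING__LEVEL=DEBUG on the flow-run environment may surface
# more detail around the "Connection aborted" errors.
flow.executor = DaskExecutor()  # replaced by a Kubernetes/Coiled-backed cluster in practice
```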
b
Hello Andrew, would you mind sharing the flow ID or the ID of an example flow run that was affected by this?
a
Thank you @Bianca Hoch for the response and sorry for the delay
b
No worries, thanks Andrew!
Hm. That's odd. I can't seem to find this flow run. Would you mind sending over another? Or perhaps the flow ID?
Also, to clarify, could you describe the 180 jobs you are trying to run? (are they tasks, a series of subflows containing different tasks, etc?)
a
Hopefully this will work!
Yes, so these are all mapped tasks for a machine learning model. We are experimenting with a bunch of different combinations of hyperparameters and models to determine the best one. The flow should spin up around 2800 tasks, so we tried bumping our max jobs up to 1000. The resources are built to scale up as needed using Dask and Karpenter.
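(A hedged sketch of what an adaptively scaled executor for a sweep like this can look like. The thread later mentions Coiled, so this `dask_kubernetes.KubeCluster` example may not match the actual setup; the image, resource values, and the 10..1000 worker bounds are placeholders.)
```python
# Sketch: DaskExecutor backed by an adaptively scaling KubeCluster (Prefect 1.x).
# Image, resources, and worker bounds are placeholders.
from dask_kubernetes import KubeCluster, make_pod_spec
from prefect.executors import DaskExecutor

pod_spec = make_pod_spec(
    image="daskdev/dask:latest",
    memory_limit="8G",
    memory_request="8G",
    cpu_limit=2,
    cpu_request=2,
)

executor = DaskExecutor(
    cluster_class=lambda: KubeCluster(pod_spec),
    adapt_kwargs={"minimum": 10, "maximum": 1000},  # Dask requests pods; Karpenter adds nodes
)
# attach with: flow.executor = executor
```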
👍 1
b
Alright, it looks like we have a few errors of interest in the logs:
{"id":"1b8fe323-632b-4f49-966b-a2a82ba1de3a","name":"k8s-infra","timestamp":"2022-07-21T20:43:39.284953+00:00","message":"Pod prefect-job-a70dec73-mrhrw failed.\n\tContainer 'flow' state: terminated\n\t\tExit Code:: 137\n\t\tReason: OOMKilled"}
{"id":"c5d8b50b-5eb6-486c-9266-07e5302a60ad","name":"prefect.CloudFlowRunner","timestamp":"2022-07-21T20:43:56.292591+00:00","message":"Flow run is no longer in a running state; the current state is: <Failed: \"Kubernetes Error: pods ['prefect-job-a70dec73-mrhrw'] failed for this job\">"}
It looks like an OOM error due to a lack of k8s resources.
The final Failed state of the flow was accompanied by the following message:
"Kubernetes Error: pods ['prefect-job-a70dec73-mrhrw'] failed for this job*"*
a
Perfect! Thank you for these resources. I'll be looking into them. Thanks for your help @Bianca Hoch
🎉 1
Do you have any documentation or tips and tricks for handling OOM errors when using KubeCluster? This uses Coiled. Thank you @Bianca Hoch!
b
Hello Andrew, sorry for the late response. We're currently a bit light on documentation regarding handling OOM errors when using KubeCluster. Expect more to come on that in the future!
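(In the absence of official docs, one common knob is simply giving the Dask workers enough memory up front. A hedged sketch assuming the `coiled.Cluster` backend mentioned above; the software environment name, worker sizing, and worker bounds are hypothetical placeholders.)
```python
# Sketch: DaskExecutor backed by Coiled, sizing workers to reduce OOM kills.
# The software environment name and worker sizing are placeholders.
from prefect.executors import DaskExecutor

executor = DaskExecutor(
    cluster_class="coiled.Cluster",
    cluster_kwargs={
        "software": "my-org/ml-sweep-env",  # hypothetical Coiled software environment
        "worker_memory": "16 GiB",
        "worker_cpu": 4,
    },
    adapt_kwargs={"minimum": 10, "maximum": 1000},
)
# attach with: flow.executor = executor
```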
a
Thank you!