
    Andrew Pruchinski

    2 months ago
    Hey all. Our flow isn't able to spin up more than 180 jobs. We can't determine the exact root cause, but at a certain point there's a "Connection aborted" error, Prefect starts to remove jobs, and then it marks the entire flow as failed. We are using a DaskExecutor on Prefect Cloud. Looking into the Dask repos, we haven't been able to set debug-level logging as an attempt to retrieve some info on what's really causing the "Connection aborted" issue. Even if there are some failures, we do want the flow to continue, and the task triggers are set accordingly. Thank you
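    For reference, one stdlib-only way to turn up client-side Dask logging is to raise the level on the loggers the distributed package uses. This is a minimal sketch; whether it surfaces the cause of the "Connection aborted" error depends on where that error originates (client, scheduler, or worker).

```python
import logging

# Raise the log level on the standard logger names used by Dask's
# distributed package. This only affects the process it runs in, so it
# would need to be applied on the workers/scheduler as well to capture
# logs emitted there.
for name in ("distributed", "distributed.scheduler", "distributed.worker"):
    logging.getLogger(name).setLevel(logging.DEBUG)
```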
    Bianca Hoch

    2 months ago
    Hello Andrew, would you mind sharing the flow id or an ID of an example flow run that was affected by this?

    Andrew Pruchinski

    2 months ago
    Thank you @Bianca Hoch for the response and sorry for the delay
    Bianca Hoch

    2 months ago
    No worries, thanks Andrew!
    Hm. That's odd. I can't seem to find this flow run. Would you mind sending over another? Or perhaps the flow ID?
    Also, to clarify, could you describe the 180 jobs you are trying to run? (are they tasks, a series of subflows containing different tasks, etc?)
    Andrew Pruchinski

    2 months ago
    Hopefully this will work!
    Yes, these are all mapped tasks for a machine learning model. We are experimenting over a bunch of different combinations of hyperparameters and models to determine the best one. The flow should spin up around 2800 tasks, so we tried bumping our max jobs up to 1000. The resources are built to scale up as needed using Dask and Karpenter.
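    To illustrate where a count like 2800 comes from: each point in the hyperparameter grid becomes one mapped task. The grid below is purely hypothetical (the actual models and parameter values aren't shown in the thread); it just demonstrates how a modest grid multiplies out to 2800 combinations.

```python
from itertools import product

# Hypothetical search space -- names and values are illustrative only.
grid = {
    "model": ["xgboost", "lightgbm", "random_forest", "extra_trees"],   # 4
    "learning_rate": [0.3, 0.1, 0.03, 0.01, 0.003, 0.001, 0.0003],      # 7
    "max_depth": list(range(3, 13)),                                    # 10
    "n_estimators": [100, 200, 300, 400, 500,
                     600, 700, 800, 900, 1000],                         # 10
}

# Each combination would be one mapped task: 4 * 7 * 10 * 10 = 2800.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combos))  # 2800
```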
    Bianca Hoch

    2 months ago
    Alright, it looks like we have a few errors of interest in the logs:
    {"id":"1b8fe323-632b-4f49-966b-a2a82ba1de3a","name":"k8s-infra","timestamp":"2022-07-21T20:43:39.284953+00:00","message":"Pod prefect-job-a70dec73-mrhrw failed.\n\tContainer 'flow' state: terminated\n\t\tExit Code:: 137\n\t\tReason: OOMKilled"}
    {"id":"c5d8b50b-5eb6-486c-9266-07e5302a60ad","name":"prefect.CloudFlowRunner","timestamp":"2022-07-21T20:43:56.292591+00:00","message":"Flow run is no longer in a running state; the current state is: <Failed: \"Kubernetes Error: pods ['prefect-job-a70dec73-mrhrw'] failed for this job\">"}
    It looks like the pod was OOMKilled (exit code 137), i.e. an out-of-memory error due to a lack of k8s resources.
    The final Failed state of the flow was accompanied by the following message:
    "Kubernetes Error: pods ['prefect-job-a70dec73-mrhrw'] failed for this job"
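    Since exit code 137 / OOMKilled means the container exceeded its memory limit (or the node ran out of memory), the usual mitigation is to put explicit resource requests and limits on the Dask worker pods so the scheduler and the autoscaler can plan around them. Below is a hypothetical worker pod template expressed as a plain Python dict; the image, flags, and sizes are example values, not the poster's actual configuration.

```python
# Hypothetical Dask worker pod template (example values only).
worker_pod_template = {
    "spec": {
        "containers": [
            {
                "name": "dask-worker",
                "image": "daskdev/dask:latest",
                "args": [
                    "dask-worker",
                    # Tell the worker its own memory budget so it can spill
                    # to disk before the kernel OOM-kills the container.
                    "--memory-limit", "4GB",
                    "--nthreads", "1",
                ],
                "resources": {
                    "requests": {"cpu": "1", "memory": "4Gi"},
                    "limits": {"cpu": "1", "memory": "4Gi"},
                },
            }
        ]
    }
}

# Requests equal to limits give the pod the Guaranteed QoS class, so it is
# among the last to be evicted under node memory pressure.
```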

    Andrew Pruchinski

    2 months ago
    Perfect! Thank you for these resources. I'll be looking into them. Thank you for your help @Bianca Hoch
    Do you have any documentation or tips and tricks for handling OOM errors when using KubeCluster? This setup uses Coiled. Thank you @Bianca Hoch!
    Bianca Hoch

    1 month ago
    Hello Andrew, sorry for the late response. Currently we are a bit light on documentation regarding handling OOM errors when using KubeCluster. Expect more to come on that in the future!

    Andrew Pruchinski

    1 month ago
    Thank you!