Hi - does anyone know how to diagnose a k8s job failure that produces no error messages? After I submit a job, the 300 worker pods successfully connect to the scheduler and the runs are set to Running in the UI. However, after 8-12 minutes the pod running the scheduler/flow-runner exits (and is resubmitted because the job is still active). The Cloud UI still shows the job as running. Thanks for any advice!
I think I found out why my jobs/pods were exiting. In my custom job_spec I changed the CPU request but did not specify a memory request. That likely left the pod in a lower QoS class, allowing the k8s cluster to evict it under memory pressure. After setting the memory request to 8Gi, the scheduler/flow-runner persists past 10 minutes.
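For anyone hitting the same thing, the relevant part of the job_spec looks roughly like this (a sketch - container name and CPU value here are illustrative, not copied from my actual spec):

```yaml
# Hypothetical excerpt of a custom Kubernetes Job spec.
# Setting both a CPU and a memory request keeps the pod out of the
# BestEffort QoS class, so it is not the first candidate for eviction.
apiVersion: batch/v1
kind: Job
spec:
  template:
    spec:
      containers:
        - name: flow-runner        # illustrative container name
          resources:
            requests:
              cpu: "2"             # illustrative CPU request
              memory: 8Gi          # the missing piece in my original spec
            limits:
              memory: 8Gi          # matching limit gives Guaranteed QoS for memory
      restartPolicy: Never
```

Note that if you set requests without limits (or vice versa) the pod ends up Burstable rather than Guaranteed, which still ranks above BestEffort for eviction.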
01/13/2021, 10:33 AM
Out of interest, how do you get the flow to run in 300 worker pods with a Kubernetes Job? Is this without Dask?