Hi - does anyone know how to diagnose a k8s job failure that produces no error messages? After I submit a job, the 300 worker pods successfully connect to the scheduler and the runs are set to Running in the UI. However, after 8-12 minutes the pod running the scheduler/flow-runner exits (and is resubmitted because the job is still active). The Cloud UI still shows the job as running. Thanks for any advice!
I think I found out why my jobs/pods were exiting. In my custom job_spec I changed the CPU request but did not specify a memory request. That likely left the pod in a lower QoS class, allowing the k8s cluster to evict it under memory pressure. After setting the memory request to 8Gi, the scheduler/flow-runner persists past 10 minutes.
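For anyone hitting the same thing, the relevant part of the job_spec looks roughly like this (a sketch - container name and CPU value here are illustrative, not copied from my actual spec):

```yaml
# Hypothetical excerpt of a custom Kubernetes Job spec.
# Setting both a CPU and a memory request keeps the pod out of the
# BestEffort QoS class, so it is not the first candidate for eviction.
apiVersion: batch/v1
kind: Job
spec:
  template:
    spec:
      containers:
        - name: flow-runner        # illustrative container name
          resources:
            requests:
              cpu: "2"             # illustrative CPU request
              memory: 8Gi          # the missing piece in my original spec
            limits:
              memory: 8Gi          # matching limit gives Guaranteed QoS for memory
      restartPolicy: Never
```

Note that if you set requests without limits (or vice versa) the pod ends up Burstable rather than Guaranteed, which still ranks above BestEffort for eviction.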
01/13/2021, 10:33 AM
Out of interest, how do you get the flow to run in 300 worker pods with a Kubernetes Job? Is this without Dask?