Blake Stefansen 06/27/2023, 4:03 PM
https://docs.prefect.io/2.10.17/api-ref/prefect/infrastructure/?h=pod+watch+timeout#prefect.infrastructure.KubernetesJob The attribute is described as
"Number of seconds to watch for pod creation before timing out (default 60)."
My team has a concurrency limit of 10 jobs on our queue, and most of these jobs take around 30 seconds. Notice in the image below how job number 11 is flagged with "Pod never started", which eventually causes the agent to report the flow run as crashed. However, the crashed job will eventually start running anyway, even though the agent stops logging the job.
QUESTIONS
1. What triggers the timer countdown? Does the 60-second timer start counting down once the job leaves the queue and is picked up by the agent?
2. What happens in a scenario where 1000 jobs are added to the queue? Will I get a bunch of crashes? (I'm assuming not, because the agent wouldn't pick up more than 10 jobs due to the concurrency limit.)
3. If "job 11" is picked up by the agent, that means it took the place of a previously completed job, so I would expect the pod to be created almost immediately (at least within 60 seconds). I'm not sure why the job's pod is not getting created within 60 seconds if the agent is picking it up.
15:25:20.062 | ERROR | prefect.infrastructure.kubernetes-job - Job 'file5-sf-fx-locations-foobar-maxdown-csv-rs4bz': Pod never started.
15:25:20.213 | INFO | prefect.agent - Reported flow run '18f20756-0731-4f2a-8395-61e9ab755dfd' as crashed: Flow run infrastructure exited with non-zero status code -1.
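To make the behaviour described above concrete, here is a rough toy model in plain Python. It is only a sketch and does not call Prefect or Kubernetes. It assumes (unconfirmed against the Prefect source) that the watch timer starts when the agent submits the job to the cluster, and that giving up on the watch does not delete the job, so a slow pod can still start later. The names `WatchResult` and `watch_for_pod` are made up for illustration; the linked docs page describes the real setting, which I believe is the `pod_watch_timeout_seconds` field on `KubernetesJob`.

```python
from dataclasses import dataclass


@dataclass
class WatchResult:
    agent_report: str       # what the agent records for the flow run
    pod_started_at: float   # seconds after job submission when the pod starts


def watch_for_pod(pod_start_delay: float, pod_watch_timeout: float = 60.0) -> WatchResult:
    """Toy model of a pod watch with a deadline.

    Assumption: the clock starts at job submission, not when the
    flow run enters the queue.
    """
    if pod_start_delay <= pod_watch_timeout:
        # Pod appeared before the deadline, so the agent keeps tracking it.
        return WatchResult("running", pod_start_delay)
    # Deadline passed first: the agent reports a crash and stops logging,
    # but in this model the job is left in the cluster and the pod still starts.
    return WatchResult("crashed", pod_start_delay)


if __name__ == "__main__":
    print(watch_for_pod(20))        # pod starts well inside the default window
    print(watch_for_pod(90))        # pod starts too late for the default window
    print(watch_for_pod(90, 120))   # same delay with a longer timeout
```

If this model matches what is really happening, raising the timeout above the slowest pod start time would stop the false "crashed" reports, at the cost of waiting longer on pods that truly never start.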
Kevin Grismore 06/27/2023, 6:25 PM
Christopher Boyd 06/27/2023, 6:47 PM
Kevin Grismore 06/27/2023, 6:49 PM
Tom Klein 08/29/2023, 10:31 AM