Blake Stefansen (06/27/2023, 4:03 PM):

`pod_watch_timeout_seconds`
https://docs.prefect.io/2.10.17/api-ref/prefect/infrastructure/?h=pod+watch+timeout#prefect.infrastructure.KubernetesJob
The attribute is described as "Number of seconds to watch for pod creation before timing out (default 60)."
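As a rough mental model (a sketch only, not Prefect's actual implementation), a pod-watch timeout behaves like a polling loop with a deadline: keep checking the pod's phase, and give up once the budget is spent.

```python
import time

def watch_for_pod_start(get_pod_phase, timeout_seconds=60, poll_interval=0.05):
    """Poll the pod's phase until it leaves 'Pending' or the deadline passes.

    Conceptual model only: `get_pod_phase` is a stand-in for a Kubernetes
    API call, and this is NOT Prefect's real watch logic.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if get_pod_phase() != "Pending":
            return True   # pod started within the watch window
        time.sleep(poll_interval)
    return False          # corresponds to a "Pod never started." error

# Fake pod that becomes Running on the fourth poll:
phases = iter(["Pending", "Pending", "Pending", "Running"])
print(watch_for_pod_start(lambda: next(phases), timeout_seconds=1))  # True
```

In this model the key question is exactly the one below: when does the deadline clock start ticking?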
My team has a concurrency limit of 10 jobs on our queue, and most of these jobs take around 30 seconds. Notice in the image below how job number 11 is flagged `Late`, which eventually causes the agent to flag it as `Crashed`. However, the crashed job will eventually start running and become `Completed`, even though the agent stops logging the job.
15:25:20.062 | ERROR | prefect.infrastructure.kubernetes-job - Job 'file5-sf-fx-locations-foobar-maxdown-csv-rs4bz': Pod never started.
15:25:20.213 | INFO | prefect.agent - Reported flow run '18f20756-0731-4f2a-8395-61e9ab755dfd' as crashed: Flow run infrastructure exited with non-zero status code -1.
QUESTIONS
1. What triggers the timer countdown? Does the 60-second timer start counting down once the job leaves the queue and is picked up by the agent?
2. What happens in a scenario where 1000 jobs are added to the queue? Will I get a bunch of crashes? (I'm assuming not, because the agent wouldn't pick up more than 10 jobs at a time due to the concurrency limit.)
3. If "job 11" is picked up by the agent, that means it took the place of a previously completed job, so I would think the pod would get created almost immediately (at least within 60 sec). I'm just not sure why the job's pod is not getting created within 60 seconds if the agent is picking it up. Instead of 60 sec, should I use 600 sec?

Kevin Grismore (06/27/2023, 6:25 PM):

Christopher Boyd (06/27/2023, 6:47 PM):

Kevin Grismore (06/27/2023, 6:49 PM):

Tom Klein (08/29/2023, 10:31 AM):
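Blake's second question (1000 queued jobs) can be explored with a toy model that contrasts the two interpretations in question 1. It is a reasoning aid only, not Prefect's scheduler: it assumes jobs run in fixed-size batches, pods start instantly once a job is picked up, and nothing else delays scheduling.

```python
def timed_out_counts(num_jobs=1000, concurrency=10, job_seconds=30, watch_timeout=60):
    """Count hypothetical pod-watch timeouts under two clock-start rules.

    Toy model (NOT Prefect's actual behavior): jobs run in batches of
    `concurrency`, each lasting `job_seconds`, and a pod starts the
    instant its job is picked up by the agent.
    """
    timeouts_if_clock_starts_at_enqueue = 0
    timeouts_if_clock_starts_at_pickup = 0
    for i in range(num_jobs):
        pickup_time = (i // concurrency) * job_seconds  # when a slot frees up
        if pickup_time > watch_timeout:
            # job waited in the queue longer than the watch window
            timeouts_if_clock_starts_at_enqueue += 1
        # after pickup the pod starts instantly in this model, so the
        # pickup-start clock never expires
    return timeouts_if_clock_starts_at_enqueue, timeouts_if_clock_starts_at_pickup

print(timed_out_counts())  # (970, 0)
```

Under these assumptions, if the clock started at enqueue time, almost every job past the first couple of batches would be flagged; if it starts at pickup, none would. The observed behavior (only job 11 flagged) fits neither extreme cleanly, which is what makes the question interesting.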