
Blake Stefansen

06/27/2023, 4:03 PM
Hi Everyone,

CONTEXT

Regarding the KubernetesJob infrastructure and its use of pod_watch_timeout_seconds (https://docs.prefect.io/2.10.17/api-ref/prefect/infrastructure/?h=pod+watch+timeout#prefect.infrastructure.KubernetesJob), the attribute is described as "Number of seconds to watch for pod creation before timing out (default 60)."

My team has a concurrency limit of 10 on our work queue, and most of these jobs take around 30 seconds. Notice in the image below how job number 11 is flagged late, which eventually causes the agent to flag it as crashed. However, the crashed job will eventually start running and become complete, even though the agent stops logging it.
15:25:20.062 | ERROR   | prefect.infrastructure.kubernetes-job - Job 'file5-sf-fx-locations-foobar-maxdown-csv-rs4bz': Pod never started.
15:25:20.213 | INFO    | prefect.agent - Reported flow run '18f20756-0731-4f2a-8395-61e9ab755dfd' as crashed: Flow run infrastructure exited with non-zero status code -1.
QUESTIONS

1. What triggers the timer countdown? Does the 60-second timer start counting down once the job leaves the queue and is picked up by the agent?
2. What happens in a scenario where 1000 jobs are added to the queue? Will I get a bunch of crashes? (I'm assuming not, because the agent wouldn't pick up more than 10 jobs due to the concurrency limit.)
3. If "job 11" is picked up by the agent, that means it took the place of a previously completed job, so I would expect its pod to get created almost immediately (at least within 60 seconds). I'm not sure why the pod isn't getting created within 60 seconds if the agent is picking the job up.
I've looked at our k8s cluster events, and it looks like our 3 worker nodes are at max capacity, so I'm thinking that's probably the issue. I don't think I can scale our worker nodes for cost reasons, so I guess the best thing to do is use a less conservative timeout? Instead of 60 seconds, use 600?
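For reference, raising the timeout is a one-attribute change on the infrastructure block. A minimal sketch against Prefect 2.10.x, where the block name, image, and namespace below are placeholders:

from prefect.infrastructure import KubernetesJob

k8s_job = KubernetesJob(
    image="my-registry/my-flow:latest",  # placeholder image
    namespace="prefect",                 # placeholder namespace
    pod_watch_timeout_seconds=600,       # wait up to 10 minutes for the pod to start
)
k8s_job.save("my-k8s-job", overwrite=True)  # reference this block name from the deployment

If the block is shared across deployments, the same attribute can also be overridden per deployment via infra_overrides, so only the jobs that need it get the longer watch window.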

Kevin Grismore

06/27/2023, 6:25 PM
I think it may be a combination of factors. I have definitely upped the pod creation time limit before. If the pod TTL is 60 seconds and your node pool is at capacity, pod creation may not happen until a little while after the last completed job is cleaned up, so make sure the creation time limit is longer than the TTL. You can also modify the prefetch time so your agent/worker waits longer than the default before submitting the flow run, but I would try to work this out purely with k8s job settings first: https://docs.prefect.io/2.10.17/concepts/work-pools/?h=prefetch#configuring-prefetch
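To make that concrete: assuming the TTL being referred to corresponds to the block's finished_job_ttl attribute, a sketch of keeping the pod watch timeout comfortably above the cleanup window:

from prefect.infrastructure import KubernetesJob

k8s_job = KubernetesJob(
    finished_job_ttl=60,            # finished Jobs (and their pods) are cleaned up after ~60s
    pod_watch_timeout_seconds=300,  # keep watching well past the cleanup window
)

The prefetch knob lives on the agent side rather than on the block; if memory serves it's the PREFECT_AGENT_PREFETCH_SECONDS setting, set in the agent's environment (see the linked docs).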

Christopher Boyd

06/27/2023, 6:47 PM
If a job is in a pending state, it's waiting for resources so it can be scheduled onto a node, and the timer is still ticking.
So if you are at max resources and the pod can't be scheduled onto a node in time, then yes, it's marked as crashed.
60 seconds is a bit low, I'd say, but YMMV - I've set these to around 300 seconds for my own jobs, to account for image pull + node autoscaling.
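A quick way to confirm the scheduling pressure is to look at Pending pods and their events while runs are queued; a sketch with the kubernetes Python client, assuming the flow-run pods live in a namespace called "prefect":

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

# Pods that exist but have not been scheduled/started yet
pending = v1.list_namespaced_pod("prefect", field_selector="status.phase=Pending")
for pod in pending.items:
    print(pod.metadata.name)
    # The events explain why the pod is stuck, e.g. FailedScheduling: "0/3 nodes are available ..."
    events = v1.list_namespaced_event(
        "prefect", field_selector=f"involvedObject.name={pod.metadata.name}"
    )
    for event in events.items:
        print("  ", event.reason, "-", event.message)

If those events show FailedScheduling until an earlier job's pod is cleaned up, a longer watch timeout (or node autoscaling) is likely the fix rather than anything on the Prefect side.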

Kevin Grismore

06/27/2023, 6:49 PM
that's the same number I chose when this was happening to me

Tom Klein

08/29/2023, 10:31 AM
We encountered exactly the same thing - we see crashes and then "magical" runs and completions. We also happened to increase the timeout for other jobs (60 -> 600), but this specific job (which requests 64 GB of RAM) did not enjoy that treatment.

So I think it's pretty clear WHY this is happening. What I don't understand, but would like to, is HOW or WHY Prefect detects that the job didn't actually fail at all (nor did the pod fail to be created) but simply took a long time. (We see "1" on the run count as opposed to "0", meaning Prefect at least recognizes that this run was "attempted", or at least "inspected", more than once.)

Also, why does the agent stop logging anything about it from that point on? What I don't like is this silent, unpredictable voodoo 🙂