I am running a flow via a Kubernetes job. Sometimes, when there are not enough nodes available to run the job pods, it takes a minute for a new node to scale up. During this window, Prefect seems to mark the flow run as crashed (since the pod was not scheduled within some timeout), but the new node does eventually come up and the flow is able to run fine. However, Prefect then refuses to run the flow because the run has already been marked as terminated:
aborted by orchestrator: This run has already terminated.
Is there some way I can configure the internal timeout for waiting for the pod to be scheduled? Configuring retries (roughly as sketched below) does not seem to make a difference. Thanks!
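For reference, here is a minimal sketch of how the flow is defined; the flow name, body, and retry values are placeholders rather than my real code.

```python
from prefect import flow

# Minimal sketch of the flow definition (names and values are placeholders).
# Retries are configured on the flow, but once the orchestrator marks the run
# as crashed/terminated it is not retried, which matches what I'm seeing.
@flow(retries=2, retry_delay_seconds=60)
def my_k8s_flow():
    print("Flow body runs fine once the new node comes up.")

if __name__ == "__main__":
    my_k8s_flow()
```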