
Sylvain Hazard

12/09/2021, 8:41 AM
Hi! I have searched the docs for an answer but could not find much, so I thought I would ask here. How does the Prefect engine deal with submitted KubernetesRun-based flows that remain pending for some reason? For example, what happens if I try to submit a flow but there aren't enough resources available on my cluster at that moment? From my experience, those flows get re-submitted and another pod is created after some time, but what happens then? Will both pods run if given the resources? Is there a limit after which the engine kills the flow run because it cannot run it properly?

Anna Geller

12/09/2021, 9:58 AM
@Sylvain Hazard my understanding is that Lazarus is responsible for such use cases. It gracefully retries failures caused by factors outside of Prefect’s control - Kubernetes pods not spinning up due to resource constraints on a node is a great example of that. Once every 10 min, Lazarus searches for distressed flow runs and reschedules them (you could see this in the flow run’s logs). Scheduled flow runs without submitted or running task runs will be rescheduled up to 10 times - the 11th time the flow run is marked as failed.
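The rescheduling rule described above can be sketched as a small simulation. This is only an illustration of the counting logic, not Prefect's actual implementation: the `FlowRun` class, the `lazarus_pass` function, and the state labels are hypothetical, and the limit of 10 is taken from this message (the configured value may differ).

```python
from dataclasses import dataclass

# Assumption: limit of 10 as stated in the message; the real value comes from configuration.
RESURRECTION_ATTEMPT_LIMIT = 10


@dataclass
class FlowRun:
    """Hypothetical stand-in for a flow run record."""
    name: str
    state: str = "Scheduled"
    has_active_task_runs: bool = False
    resurrection_attempts: int = 0


def lazarus_pass(run: FlowRun, limit: int = RESURRECTION_ATTEMPT_LIMIT) -> str:
    """One periodic pass over a single flow run; returns the resulting state."""
    # Finished runs and runs with submitted or running task runs are left alone.
    if run.state in ("Failed", "Success") or run.has_active_task_runs:
        return run.state
    if run.resurrection_attempts >= limit:
        # Limit exhausted: give up and mark the run as failed.
        run.state = "Failed"
    else:
        # Otherwise count the attempt and reschedule the run.
        run.resurrection_attempts += 1
        run.state = "Scheduled"
    return run.state


# A run whose pod never starts: rescheduled on each pass until the limit is hit.
run = FlowRun("stuck-flow")
states = [lazarus_pass(run) for _ in range(11)]
```

With a limit of 10, the first ten passes reschedule the run and the eleventh marks it as failed.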

Sylvain Hazard

12/09/2021, 10:00 AM
Thanks! That is a pretty clean process, I like it!
Weirdly enough, it looks like my Lazarus kills flow runs after only 3 retry attempts. Is that something that's configurable?
I got something like this.

Anna Geller

12/09/2021, 10:02 AM
I see, let me check that

Sylvain Hazard

12/09/2021, 10:03 AM
Thanks

Anna Geller

12/09/2021, 10:04 AM
Yup, it looks like for Server, 3 is actually the default. This is from config.toml:
[services.lazarus]
    resurrection_attempt_limit = 3
the whole section in config.toml is called services:
[services]

    host = "0.0.0.0"

    [services.apollo]
    host = "${services.host}"
    port = 4200

    [services.graphql]
    host = "${services.host}"
    port = 4201
    debug = false
    path = "/graphql/"
    disable_access_logs = false
    timeout_keep_alive = 5

    [services.lazarus]
    resurrection_attempt_limit = 3

    [services.towel]
    max_scheduled_runs_per_flow = 10
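If editing config.toml directly is inconvenient, one possible way to override the limit is an environment variable. The sketch below only builds the variable name. It assumes Prefect Server follows the same double-underscore convention as Prefect Core, with a `PREFECT_SERVER` prefix. That prefix and the `server_env_var` helper are assumptions not confirmed in this thread, so check them against your Server version.

```python
def server_env_var(*keys: str, prefix: str = "PREFECT_SERVER") -> str:
    """Map a nested config.toml key path to an environment variable name.

    Assumption: nested keys are joined with double underscores and
    upper-cased, and Server settings use the PREFECT_SERVER prefix.
    """
    return "__".join([prefix, *(key.upper() for key in keys)])


# Key path taken from the [services.lazarus] section shown above.
name = server_env_var("services", "lazarus", "resurrection_attempt_limit")
print(f"{name}=10")
```

The variable would need to be set in the environment of the Lazarus service itself before it starts, not in the environment of the agent or the flow.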

Sylvain Hazard

12/09/2021, 10:05 AM
Nice, thanks a lot!