Prefect Community #prefect-server

Sylvain Hazard

12/09/2021, 8:41 AM

Hi! I have searched the docs for an answer but could not find much, so I thought I would ask here. How does the Prefect engine deal with submitted KubernetesRun-based flows that remain pending for some reason? For example, what happens if I try to submit a flow but there aren't enough resources available on my cluster at that moment? From my experience, those flows get re-submitted and another pod is created after some time, but what happens then? Will both pods run if given the resources? Is there a limit after which the engine kills the flow run because it cannot run it properly?

Anna Geller

12/09/2021, 9:58 AM

@Sylvain Hazard my understanding is that Lazarus is responsible for such use cases. It gracefully retries failures caused by factors outside of Prefect’s control - Kubernetes pods not spinning up due to resource constraints on a node is a great example of that. Once every 10 min, Lazarus searches for distressed flow runs and reschedules them (you could see this in the flow run’s logs). Scheduled flow runs without submitted or running task runs will be rescheduled up to 10 times - the 11th time the flow run is marked as failed.
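For readers who want to see the counting spelled out, here is a minimal sketch of the behaviour described above: each periodic check either reschedules a distressed run or, once the attempt limit is used up, marks it as failed. This is not Prefect's actual implementation. `FlowRun`, `lazarus_pass`, and `passes_until_failed` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FlowRun:
    # Hypothetical stand-in for a flow run record; not a Prefect class.
    state: str = "Scheduled"
    resurrection_attempts: int = 0


def lazarus_pass(run: FlowRun, attempt_limit: int = 10) -> FlowRun:
    """Simulate one periodic check of a distressed flow run."""
    if run.state == "Failed":
        return run
    if run.resurrection_attempts >= attempt_limit:
        # All allowed reschedules are used up: give up on the run.
        run.state = "Failed"
    else:
        # Reschedule the run and count the attempt.
        run.resurrection_attempts += 1
        run.state = "Scheduled"
    return run


def passes_until_failed(attempt_limit: int) -> int:
    """Count checks until a run that never starts is marked as failed."""
    run = FlowRun()
    passes = 0
    while run.state != "Failed":
        lazarus_pass(run, attempt_limit)
        passes += 1
    return passes
```

With a limit of 10, `passes_until_failed(10)` counts ten reschedules and fails the run on the 11th check, which matches the description above.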

Sylvain Hazard

12/09/2021, 10:00 AM

Thanks ! That is a pretty clean process, I like it !

Sylvain Hazard

12/09/2021, 10:01 AM

Weirdly enough, it looks like my Lazarus kills flow runs after only 3 retry attempts. Is that something that's configurable?

Sylvain Hazard

12/09/2021, 10:02 AM

I got something like this.

Anna Geller

12/09/2021, 10:02 AM

I see, let me check that

Sylvain Hazard

12/09/2021, 10:03 AM

Thanks

Anna Geller

12/09/2021, 10:04 AM

Yup, it looks like for Prefect Server the default is actually 3. This is from config.toml:

[services.lazarus]
    resurrection_attempt_limit = 3

Anna Geller

12/09/2021, 10:04 AM

the whole section in config.toml is called services:

[services]

    host = "0.0.0.0"

    [services.apollo]
    host = "${services.host}"
    port = 4200

    [services.graphql]
    host = "${services.host}"
    port = 4201
    debug = false
    path = "/graphql/"
    disable_access_logs = false
    timeout_keep_alive = 5

    [services.lazarus]
    resurrection_attempt_limit = 3

    [services.towel]
    max_scheduled_runs_per_flow = 10
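The `${services.host}` values in that section are references to another key in the same file. As a rough sketch of how such a reference could be resolved, the snippet below copies part of the section by hand into nested dicts and substitutes the placeholder with a dotted-key lookup. The helper names `lookup` and `resolve` are hypothetical and this is not Prefect's own config loader. It does a single substitution pass and does not follow chains of references.

```python
import re

# Hand-copied subset of the [services] section above, as nested dicts.
config = {
    "services": {
        "host": "0.0.0.0",
        "apollo": {"host": "${services.host}", "port": 4200},
        "graphql": {"host": "${services.host}", "port": 4201},
        "lazarus": {"resurrection_attempt_limit": 3},
    }
}

# Matches placeholders of the form ${dotted.key}.
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def lookup(cfg, dotted_key):
    """Walk nested dicts using a dotted key such as 'services.host'."""
    node = cfg
    for part in dotted_key.split("."):
        node = node[part]
    return node


def resolve(cfg, value):
    """Replace ${dotted.key} placeholders in string values; pass others through."""
    if not isinstance(value, str):
        return value
    return PLACEHOLDER.sub(lambda m: str(lookup(cfg, m.group(1))), value)


apollo_host = resolve(config, lookup(config, "services.apollo.host"))
lazarus_limit = resolve(
    config, lookup(config, "services.lazarus.resurrection_attempt_limit")
)
```

Here `apollo_host` ends up as the value of `services.host`, and `lazarus_limit` is the integer from the `[services.lazarus]` table, which is the setting discussed in this thread.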

Sylvain Hazard

12/09/2021, 10:05 AM

Nice, thanks a lot !
