
    John Ramirez

    2 years ago
    hey everyone - I am using EKS with auto-scaling and DaskKubernetesEnvironment to run my workflows. There will be times when I need to submit a high volume of flows to the cluster. My expectation is that the auto-scaler will spin up the additional EC2 instances required to handle the temporary volume increase, but I am concerned about the latency (that is, the time it takes to make the additional resources available). Is there a setting to retry the entire flow if it fails? That would reduce the gap.
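    (For context: in Prefect 1.x, retries are typically configured per task rather than per flow. A minimal sketch, assuming the task-level max_retries and retry_delay keywords; the values are illustrative:)

    from datetime import timedelta

    from prefect import task

    # Illustrative retry settings: re-run this task up to 3 times,
    # waiting 1 minute between attempts.
    @task(max_retries=3, retry_delay=timedelta(minutes=1))
    def submit_work():
        ...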

    Jenny

    2 years ago
    Hi @John Ramirez - Thanks for the question. Let me look into that for you.
    Hi John, thinking this through - is there a specific reason behind allowing the flow-run to fail? Could you use a task to check resources etc. and allow that to retry until they're available? Or is there something I'm missing there?
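    (A minimal sketch of that pattern, assuming a Prefect 1.x task with illustrative retry settings and a hypothetical MIN_WORKERS threshold: the task raises until the Dask cluster reports enough workers, and Prefect retries it until it succeeds:)

    from datetime import timedelta

    from dask.distributed import get_client
    from prefect import task

    MIN_WORKERS = 10  # hypothetical threshold

    @task(max_retries=10, retry_delay=timedelta(seconds=30))
    def check_resources():
        # Raise (and let Prefect retry) until enough Dask workers are up
        client = get_client()
        n_workers = len(client.scheduler_info()["workers"])
        if n_workers < MIN_WORKERS:
            raise RuntimeError(
                f"only {n_workers}/{MIN_WORKERS} workers available"
            )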

    John Ramirez

    2 years ago
    well, I want to prevent a flow from failing. Say I have two flows submitted 5 seconds apart, there are only enough resources in the EKS cluster to support one, and it takes 3 minutes for the auto-scaler to spin up another server. What I don't know is whether the second flow will start without the resources it needs to run, and fail.

    Jenny

    2 years ago
    Hi @John Ramirez, a few suggestions for you, depending on how you want to design your flow and where you are concerned about failures.
    • If you're concerned about K8s failing the job because it doesn't have the correct resources, you should be ok, as the job will sit in the queue until a new node spins up.
    • If you're concerned that the dask cluster will be too small to successfully handle the job, you could set your environment to scale to a set number of workers and then block until it has the necessary resources with something like:
    from dask.distributed import get_client

    from prefect import task

    @task
    def wait_for_resources():
        client = get_client()
        # Block until at least 10 workers have joined the dask cluster
        client.wait_for_workers(n_workers=10)
    • Or if you're concerned only about stopping the flow run from failing, you could use a root task that waits for resources to become available before finishing - if you set your other tasks as dependent, e.g. by using a trigger, they won't run until it's finished (see the sketch below).
    Also worth noting that flow-run concurrency limiting is a roadmap item, so you should be able to configure this natively at some point in the future.
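    (A minimal sketch of that root-task pattern, assuming the Prefect 1.x Flow API; all_successful is actually the default trigger, shown explicitly here for clarity, and the task bodies are placeholders:)

    from prefect import Flow, task
    from prefect.triggers import all_successful

    @task
    def wait_for_resources():
        # e.g. get_client().wait_for_workers(...) as in the snippet above
        ...

    @task(trigger=all_successful)
    def do_work():
        ...

    with Flow("gated-flow") as flow:
        ready = wait_for_resources()
        # do_work won't start until wait_for_resources finishes successfully
        do_work(upstream_tasks=[ready])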