# prefect-community
hey everyone - I am using EKS with auto-scaling and DaskKubernetesEnvironment to run my workflows. There will be times when I need to submit a high volume of flows to the cluster. My expectation is that the auto-scaler will spin up the additional EC2 instances required to handle the temporary volume increase, but I am concerned about the latency (that is, the time it takes to make the additional resources available). Is there a setting to retry the entire flow if it fails? That way it reduces the gap.
Hi @John Ramirez - Thanks for the question. Let me look into that for you.
Hi John, thinking this through - is there a specific reason to allow the flow run to fail? Could you use a task to check resources and have it retry until they're available? Or is there something I'm missing there?
Well, I want to prevent a flow from failing. Say I have two flows submitted 5 seconds apart, there are only enough resources in the EKS cluster to support one, and it takes 3 minutes for the auto-scaler to spin up another server. What I don't know is whether the second flow will start without the resources it needs and fail.
Hi @John Ramirez, a few suggestions depending on how you want to design your flow and where you are concerned about failures:
• If you're concerned about K8s failing the job because it doesn't have the correct resources, you should be ok - the job will sit in the queue until a new node spins up.
• If you're concerned that the Dask cluster will be too small to successfully handle the job, you could set your environment to scale to a set number of workers and then block until it has the necessary resources, with something like:
from dask.distributed import get_client
from prefect import task

@task
def wait_for_resources():
    client = get_client()
    # Block until at least 10 workers have joined the cluster
    client.wait_for_workers(n_workers=10)
• Or, if you're concerned only about stopping the flow run from failing, you could use a root task that waits for resources to become available before finishing - if you set your other tasks as dependent on it, e.g. by using a trigger, they won't run until it has finished.
Also worth noting that flow-run concurrency limiting is a roadmap item, so you should be able to configure this natively at some point in the future.
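The blocking root-task pattern above can be sketched outside of Prefect as a plain polling loop. This is a minimal sketch, not Prefect API: `get_worker_count` is a hypothetical callable you supply for however you query live cluster capacity (for a Dask client that might be something like counting entries in `client.scheduler_info()["workers"]`):

```python
import time

def wait_for_capacity(get_worker_count, n_workers, timeout=300, poll_interval=5):
    """Block until get_worker_count() reports at least n_workers.

    get_worker_count: hypothetical callable returning the current worker count.
    Polls every poll_interval seconds; raises TimeoutError after timeout seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_worker_count() >= n_workers:
            return  # capacity is available; downstream work can proceed
        time.sleep(poll_interval)
    raise TimeoutError(f"fewer than {n_workers} workers after {timeout}s")
```

A root task built this way fails loudly on timeout rather than letting downstream tasks start against an undersized cluster, which is the behavior the trigger-based suggestion relies on.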