
    John Ramirez

    2 years ago
    hey everyone - I am using EKS with auto-scaling and DaskKubernetesEnvironment to run my workflows. There will be times when I need to submit a high volume of flows to the cluster. My expectation is that the auto-scaler will spin up the additional EC2 instances required to handle the temporary volume increase, but I am concerned about the latency (that is, the time it takes to make the additional resources available). Is there a setting to retry the entire flow if it fails? That would reduce the gap.
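    (For context: in Prefect 1.x, retries are typically configured per task rather than per flow. A minimal sketch, assuming the task-level max_retries and retry_delay keywords; the values are illustrative:)

    from datetime import timedelta

    from prefect import task

    # Illustrative retry settings: re-run this task up to 3 times,
    # waiting 1 minute between attempts.
    @task(max_retries=3, retry_delay=timedelta(minutes=1))
    def submit_work():
        ...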

    Jenny

    2 years ago
    Hi @John Ramirez - Thanks for the question. Let me look into that for you.
    Hi John, thinking this through - is there a specific reason behind allowing the flow-run to fail? Could you use a task to check resources etc. and allow that to retry until they're available? Or is there something I'm missing there?
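    (A minimal sketch of that pattern, assuming a Prefect 1.x task with illustrative retry settings and a hypothetical MIN_WORKERS threshold: the task raises until the Dask cluster reports enough workers, and Prefect retries it until it succeeds:)

    from datetime import timedelta

    from dask.distributed import get_client
    from prefect import task

    MIN_WORKERS = 10  # hypothetical threshold

    @task(max_retries=10, retry_delay=timedelta(seconds=30))
    def check_resources():
        # Raise (and let Prefect retry) until enough Dask workers are up
        client = get_client()
        n_workers = len(client.scheduler_info()["workers"])
        if n_workers < MIN_WORKERS:
            raise RuntimeError(
                f"only {n_workers}/{MIN_WORKERS} workers available"
            )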

    John Ramirez

    2 years ago
    well, I want to prevent a flow from failing. Say I have two flows submitted 5 seconds apart, there are only enough resources in the EKS cluster to support one, and it takes 3 minutes for the auto-scaler to spin up another server. What I don't know is whether the second flow will start without the resources it needs to run, and fail.

    Jenny

    2 years ago
    Hi @John Ramirez, a few suggestions for you, depending on how you want to design your flow and where you are concerned about failures.
    • If you're concerned about K8s failing the job because it doesn't have the correct resources, you should be ok, as the job will sit in the queue until a new node spins up.
    • If you're concerned that the dask cluster will be too small to successfully handle the job, you could set your environment to scale to a set number of workers and then block until it has the necessary resources with something like:
    from dask.distributed import get_client

    from prefect import task

    @task
    def wait_for_resources():
        client = get_client()
        # Block until at least 10 workers have joined the dask cluster
        client.wait_for_workers(n_workers=10)
    • Or if you're concerned only about stopping the flow run from failing, you could use a root task that waits for resources to become available before finishing - if you set your other tasks as dependent, e.g. by using a trigger, they won't run until it's finished (see the sketch below).
    Also worth noting that flow-run concurrency limiting is a roadmap item, so you should be able to configure this natively at some point in the future.
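    (A minimal sketch of that root-task pattern, assuming the Prefect 1.x Flow API; all_successful is actually the default trigger, shown explicitly here for clarity, and the task bodies are placeholders:)

    from prefect import Flow, task
    from prefect.triggers import all_successful

    @task
    def wait_for_resources():
        # e.g. get_client().wait_for_workers(...) as in the snippet above
        ...

    @task(trigger=all_successful)
    def do_work():
        ...

    with Flow("gated-flow") as flow:
        ready = wait_for_resources()
        # do_work won't start until wait_for_resources finishes successfully
        do_work(upstream_tasks=[ready])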