Prefect Community #prefect-aws

Hi, I set up an ECS Push Work Pool on Prefect Cloud...

Luis Arias

08/07/2023, 3:37 PM

Hi, I set up an ECS Push Work Pool on Prefect Cloud using an Auto Scaling group capacity provider with a MinSize of zero. When I tried to run a simple hello-world flow, I saw the following message, which leads me to believe that I shouldn't expect the flow to wait for auto scaling to kick in before bailing out. Is that the case, or am I missing something?

Flow run could not be submitted to infrastructure: Failed to run ECS task, cluster 'arn:aws:ecs:******' does not appear to have any container instances associated with it. Confirm that you have EC2 container instances available.

Jamie Zieziula

08/08/2023, 7:08 PM

I’m wondering if the cold start is too slow for Prefect to handle. Can you try a min size of 1 and confirm that things work as expected?

Luis Arias

08/10/2023, 12:19 PM

Hi @Jamie Zieziula, I did try with a min size of 1, but then I ran into an error message similar to the one in this issue, which is now closed, so I will need to test again, either by installing from main or by waiting for the next release. https://github.com/PrefectHQ/prefect-aws/issues/301

Luis Arias

08/10/2023, 12:21 PM

I really need min size zero to work, though. Since ECS on Fargate doesn't support GPU instances, I can only use ECS on EC2, but I am just running some daily batch ML jobs, which don't require keeping the instance around 24/7. I could potentially script a workaround to start an instance and stop it afterwards, but I was hoping this would work directly.
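
A minimal sketch of the scripted workaround mentioned above: scale the Auto Scaling group up before the job and back to zero afterwards. It assumes the real client would come from boto3.client("autoscaling"), whose set_desired_capacity call takes the keyword arguments used here. The group name "my-gpu-asg", the helper temporary_capacity, and the RecordingClient stand-in are hypothetical and exist only so the sketch runs without AWS credentials. Waiting for the new instance to register with the ECS cluster is not covered.

```python
from contextlib import contextmanager


@contextmanager
def temporary_capacity(autoscaling_client, group_name, desired=1):
    """Scale a group up for the duration of a block, then back to zero."""
    autoscaling_client.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=desired,
        HonorCooldown=False,
    )
    try:
        yield
    finally:
        # Scale back down even if the job fails, to avoid paying for idle GPUs.
        autoscaling_client.set_desired_capacity(
            AutoScalingGroupName=group_name,
            DesiredCapacity=0,
            HonorCooldown=False,
        )


class RecordingClient:
    """Hypothetical stand-in for boto3.client("autoscaling")."""

    def __init__(self):
        self.calls = []

    def set_desired_capacity(self, **kwargs):
        self.calls.append(kwargs)


client = RecordingClient()
with temporary_capacity(client, "my-gpu-asg"):
    pass  # trigger the flow run here, once the instance has joined the cluster
capacities = [call["DesiredCapacity"] for call in client.calls]
```

The context manager keeps the scale-down in a finally block so a failed flow run does not leave an expensive instance running.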

Geoffrey Keating

08/25/2023, 4:03 PM

Luis, I am seeing a similar problem when setting up a push work pool with GPU instances - do you have any more information or have you resolved the issue?

Geoffrey Keating

08/26/2023, 7:27 PM

@Jamie Zieziula Setting desired capacity to 1 does in fact work and I get the same problem mentioned in issue #301. Flows take quite some time to submit, likely because of my image size, so I increased the "Task Start Timeout" to 900 seconds instead of 300. This doesn't seem to help as the flow crashes well before 10+ minutes of waiting. Like Luis, I am also not able to have excess capacity of expensive EC2 instances running idle.

Luis Arias

08/31/2023, 1:41 PM

Hi @Geoffrey Keating and @Jamie Zieziula, I have stumbled on a way to autoscale a push pool on ECS after an exchange with AWS support. Unfortunately, the configuration is not supported in the AWS Push Pool itself. I was pointed to this article, which describes the right approach: https://repost.aws/questions/QUnKdakxvQROuUEjq2UWpa9g/should-ecs-ec2-asgprovider-capacity-provider-be-able-to-scale-up-from-zero-0-1 To use this, you need to replace the "launchStrategy" part of the template with a "capacityProviderStrategy" part similar to the following:

"capacityProviderStrategy": [
  {
    "base": 0,
    "weight": 1,
    "capacityProvider": "OurAutoScalingGroup"
  }
]

For instance, in my case I put it right after the "taskDefinition" part under "task_run_request". I haven't been able to test it, though, because if I remove the "launchStrategy" part completely, then I get the following error message:

Flow run could not be submitted to infrastructure: An error occurred (ClientException) when calling the RegisterTaskDefinition operation: Fargate requires task definition to have execution role ARN to support ECR images.

So it seems that Fargate is hard coded somewhere in the pool code. If I specify a Launch Strategy of None and keep the "launchStrategy" key in the template then I get the following error:

Flow run could not be submitted to infrastructure: An error occurred (ClientException) when calling the RegisterTaskDefinition operation: Fargate requires task definition to have execution role ARN to support ECR images.

This might have gotten a bit further because I am using an image in our ECR registry. If I let the pool use the default image, then I get the following error, as if it's defaulting to Fargate again with the None setting:

Flow run could not be submitted to infrastructure: An error occurred (ClientException) when calling the RegisterTaskDefinition operation: Tasks using the Fargate launch type do not support GPU resource requirements.
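
The template change described in the message above (dropping the launch setting under "task_run_request" in favor of a "capacityProviderStrategy") can be sketched roughly as follows. This is an illustration only: the key name "launchType" is assumed from the ECS RunTask parameter, the placeholder values in the sample template are illustrative, and use_capacity_provider is a hypothetical helper. As the errors above show, this change alone did not get the flow running at the time of the thread.

```python
import copy


def use_capacity_provider(job_configuration, provider_name):
    """Return a copy of a job configuration that targets a capacity provider."""
    updated = copy.deepcopy(job_configuration)
    request = updated.setdefault("task_run_request", {})
    # A launch type and a capacity provider strategy are mutually exclusive
    # in a RunTask request, so the launch setting is removed.
    request.pop("launchType", None)
    request["capacityProviderStrategy"] = [
        {"base": 0, "weight": 1, "capacityProvider": provider_name}
    ]
    return updated


# Illustrative fragment of a job configuration, not a real pool's template.
original = {
    "task_run_request": {
        "launchType": "{{ launch_type }}",
        "cluster": "{{ cluster }}",
        "taskDefinition": "{{ task_definition_arn }}",
    }
}
updated = use_capacity_provider(original, "OurAutoScalingGroup")
```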

Luis Arias

08/31/2023, 2:00 PM

I see a closed PR related to this but not for the push pools. https://github.com/PrefectHQ/prefect/pull/5411

Geoffrey Keating

08/31/2023, 2:16 PM

Thank you Luis for the follow up! Perhaps our tasking is best executed via AWS Batch, but unfortunately it seems like using Batch will limit flow observability

Luis Arias

08/31/2023, 3:02 PM

I tried AWS Batch some time back and hit the same scaling to zero issue but I didn't know about the "capacityProviderStrategy" settings so I should try again.

Geoffrey Keating

08/31/2023, 3:09 PM

I think batch abstracts all of this away? I think I have my autoscaling group configured as posted in the links above

Luis Arias

09/01/2023, 7:00 AM

Hmm ok, could you please post back here if that works for you? Crossing fingers!

Geoffrey Keating

09/01/2023, 2:58 PM

Not getting anywhere on this one, no. While I don't think this is necessarily prefect related, I would like to know if this is an expected pattern for the ECS Push Worker. I feel like having no capacity at rest for heavy GPU instances is very desirable...

😞 1

Geoffrey Keating

09/01/2023, 7:17 PM

@Jamie Zieziula Do you have any ideas on this issue?

Taylor Curran

09/01/2023, 7:53 PM

Hi @Luis Arias have you checked out the advanced tab in the work pools configuration? — more info coming, just want to ask before i send over info you already have

Taylor Curran

09/01/2023, 7:55 PM

We’ve come a long way since this PR. The work pool is designed to handle very custom configuration through the advanced tab.

Taylor Curran

09/01/2023, 8:04 PM

In the advanced tab of the Work Pool, there are two sections, the variables section and the job_configuration section, that allow you to customize how the worker interacts with the infrastructure API, in this case ECS’s API. Find the RunTask request syntax here; anything in this RunTask syntax that is not included in the job_configuration section by default (in this case, a specific capacityProviderStrategy) can be added in. You’ll also want to add new fields such as weight or potentially capacityProvider (depending on how much they change from job to job) to the variables section of the advanced tab. These values can be passed to the job_configuration through Jinja templating.
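
To illustrate how values from the variables section could flow into the job_configuration section, here is a small stand-in for the templating step. It is not Prefect's templating code: render is a hypothetical helper that only handles strings made up entirely of one {{ name }} placeholder, and the variables dictionary is a simplified set of already-resolved values rather than the schema the advanced tab actually stores.

```python
import re

# Matches a string that is exactly one placeholder, e.g. "{{ weight }}".
PLACEHOLDER = re.compile(r"^\{\{\s*(\w+)\s*\}\}$")


def render(node, values):
    """Recursively replace whole-string {{ name }} placeholders with values."""
    if isinstance(node, dict):
        return {key: render(child, values) for key, child in node.items()}
    if isinstance(node, list):
        return [render(child, values) for child in node]
    if isinstance(node, str):
        match = PLACEHOLDER.match(node)
        if match:
            return values[match.group(1)]
    return node


# Simplified, hypothetical variable values for one flow run.
variables = {"capacity_provider": "OurAutoScalingGroup", "weight": 1}

job_configuration = {
    "task_run_request": {
        "capacityProviderStrategy": [
            {
                "base": 0,
                "weight": "{{ weight }}",
                "capacityProvider": "{{ capacity_provider }}",
            }
        ]
    }
}

rendered = render(job_configuration, variables)
```

Replacing whole-string placeholders with the raw value keeps types intact, so weight stays an integer in the rendered request.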

Taylor Curran

09/01/2023, 8:04 PM

I wonder if you can inject

"capacityProviderStrategy": [
  {
    "base": 0,
    "weight": 1,
    "capacityProvider": "OurAutoScalingGroup"
  }
]

into the job configuration section of the advanced tab?

Taylor Curran

09/01/2023, 8:05 PM

Watch the last 8 minutes of this video for a quick demo of what I'm talking about: https://www.youtube.com/live/1tv6w22o7mI?feature=share&t=2002

🙌 1

Luis Arias

09/02/2023, 8:26 AM

Hi @Taylor Curran! How are you? Thanks for your help! 🙌 I have tried to add the relevant JSON in the advanced tab previously, but I get the above errors. I will check out the video, I'm probably doing something wrong. If I'm still stuck I'll post the JSON here.

Luis Arias

09/06/2023, 4:31 PM

So @Taylor Curran sorry to be so late in getting back to this. We are gearing up for demo day at our accelerator tomorrow and it has been difficult to get a break from the programming. I followed the instructions in the video and removed the "launch_type" variable and associated value in the "task_run_request" and added a new "capacity_provider" variable and "capacityProviderStrategy" section in the "task_run_request". Unfortunately this results in the following error:

Flow run could not be submitted to infrastructure: An error occurred (ClientException) when calling the RegisterTaskDefinition operation: Fargate requires task definition to have execution role ARN to support ECR images.

Without thoroughly understanding where exactly in the codebase the task run request takes place, it seems to indicate that there might be a default "launchType" of FARGATE that is being set. This seems ok but should only take place if the "capacityProviderStrategy" key is not present since they are mutually exclusive. I created a gist with the template for reference: https://gist.github.com/kaaloo/de723421fb6fda6965fda3d3af5b6dc2

Luis Arias

09/06/2023, 4:49 PM

Perhaps it has something to do with this line of code. I'm not quite sure if this is the same code that is being used by the push work pool, but mutual exclusion of capacityProviderStrategy with launchStrategy is not taken into account. https://github.com/PrefectHQ/prefect-aws/blob/main/prefect_aws/workers/ecs_worker.py#L854C1-L854C1
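
The behavior being asked for here, a default launch type applied only when no capacity provider strategy is present, might look something like the sketch below. This is not the prefect-aws code: resolve_launch_settings is a hypothetical function, the "FARGATE" default is taken from the guess earlier in the thread, and "example-cluster" is a made-up value.

```python
def resolve_launch_settings(task_run_request, default_launch_type="FARGATE"):
    """Apply a default launch type only when no capacity provider strategy is set."""
    request = dict(task_run_request)
    if request.get("capacityProviderStrategy"):
        # The two settings are mutually exclusive, so the strategy wins.
        request.pop("launchType", None)
    else:
        request.setdefault("launchType", default_launch_type)
    return request


with_strategy = resolve_launch_settings(
    {
        "capacityProviderStrategy": [
            {"base": 0, "weight": 1, "capacityProvider": "OurAutoScalingGroup"}
        ]
    }
)
without_strategy = resolve_launch_settings({"cluster": "example-cluster"})
```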

Taylor Curran

09/06/2023, 4:55 PM

Thank you for the update Luis, could you create a github issue with this info in the prefect-aws repo? I’ll make sure someone from our integration team takes a look.

Luis Arias

09/11/2023, 11:49 AM

Absolutely. I'll do that now and post the link back here. Thanks for your help!

Luis Arias

09/11/2023, 12:03 PM

Here you are @Taylor Curran! https://github.com/PrefectHQ/prefect-aws/issues/310

🙌 1

James Gatter

09/12/2023, 6:59 PM

Hi all, I just tried to get an ECS push work pool working and I encountered the same issue. Setting the minimum and desired capacity to 1 on the auto scaling group worked around the issue, but this isn't great for resource-demanding instance types, as noted. Just hoping to understand: does Prefect not use my auto scaling group by default, and will @Luis Arias' proposal to support configuration for capacityProvider allow it to be plugged in and used?

Taylor Curran

09/12/2023, 10:39 PM

Hi @James Gatter could you comment on the issue that Luis made explaining that you are also getting this issue? https://github.com/PrefectHQ/prefect-aws/issues/310

Luis Arias

09/13/2023, 8:47 AM

Hi @James Gatter, exactly. The current version doesn't support setting the capacityProviderStrategy because of some logic requiring a launchType when making the task run request. I'm actually working on contributing a PR for this but had limited capacity to work on it yesterday. Giving it another shot today.

❤️ 1
