Hi, I setup an ECS Push Work Pool on Prefect Cloud...
# prefect-aws
l
Hi, I set up an ECS Push Work Pool on Prefect Cloud using an Auto Scaling group capacity provider with a MinSize of zero. When I tried to run a simple hello world flow, I saw the following message, which leads me to believe that I shouldn't expect the flow to wait for auto scaling to kick in before bailing out. Is that the case, or am I missing something?
Flow run could not be submitted to infrastructure: Failed to run ECS task, cluster 'arn:aws:ecs:******' does not appear to have any container instances associated with it. Confirm that you have EC2 container instances available.
j
I’m wondering if the cold start is too slow for Prefect to handle. Can you try a min size of 1 and confirm that things work as expected?
l
Hi @Jamie Zieziula, I did try with a min size of 1, but then I ran into an error message similar to the one in this issue, which is now closed, so I will need to test again either by installing from main or by waiting for the next release. https://github.com/PrefectHQ/prefect-aws/issues/301
I really need min size zero to work, though. Since ECS on Fargate doesn't support GPU instances, I can only use ECS on EC2, but I am just running some daily batch ML jobs, which don't require keeping the instance around 24/7. I could potentially script a workaround to start an instance and stop it afterwards, but I was hoping this would work directly.
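The start-and-stop workaround mentioned above could look roughly like the sketch below. It is only an illustration, not something anyone in this thread tested: the helper name `temporary_capacity` is hypothetical, and the client passed in is assumed to be a boto3 Auto Scaling client (`boto3.client("autoscaling")`), whose `set_desired_capacity` call is the only AWS API used.

```python
from contextlib import contextmanager


@contextmanager
def temporary_capacity(autoscaling_client, group_name, capacity=1):
    """Scale the group up for the duration of the block, then back to zero.

    `autoscaling_client` is assumed to behave like boto3.client("autoscaling").
    """
    autoscaling_client.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=capacity,
    )
    try:
        yield
    finally:
        # Always scale back down, even if the job inside the block fails.
        autoscaling_client.set_desired_capacity(
            AutoScalingGroupName=group_name,
            DesiredCapacity=0,
        )


# Usage (hypothetical group name; needs AWS credentials and boto3 installed):
#
#   import boto3
#   with temporary_capacity(boto3.client("autoscaling"), "OurAutoScalingGroup"):
#       ...  # trigger the flow run and wait for it to finish
```

Note that this sketch does not wait for the new instance to register with the ECS cluster, so a real script would still need to poll before submitting the flow run.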
g
Luis, I am seeing a similar problem when setting up a push work pool with GPU instances. Do you have any more information, or have you resolved the issue?
@Jamie Zieziula Setting desired capacity to 1 does in fact get past this error, but I then hit the same problem mentioned in issue #301. Flows take quite some time to submit, likely because of my image size, so I increased the "Task Start Timeout" from 300 to 900 seconds. This doesn't seem to help, as the flow crashes well before 10+ minutes of waiting. Like Luis, I also can't keep excess capacity of expensive EC2 instances running idle.
l
Hi @Geoffrey Keating and @Jamie Zieziula, I have stumbled on a way to autoscale a push pool on ECS after an exchange with AWS support. Unfortunately, the configuration is not supported in the AWS Push Pool itself. I was pointed to this article, which describes the right approach: https://repost.aws/questions/QUnKdakxvQROuUEjq2UWpa9g/should-ecs-ec2-asgprovider-capacity-provider-be-able-to-scale-up-from-zero-0-1 To use this, you need to replace the "launchType" part of the template with a "capacityProviderStrategy" part similar to the following:
"capacityProviderStrategy": [
  {
    "base": 0,
    "weight": 1,
    "capacityProvider": "OurAutoScalingGroup"
  }
]
For instance, in my case I put it right after the "taskDefinition" part under "task_run_request". I haven't been able to test it, though, because if I remove the "launchType" part completely, I get the following error message:
Flow run could not be submitted to infrastructure: An error occurred (ClientException) when calling the RegisterTaskDefinition operation: Fargate requires task definition to have execution role ARN to support ECR images.
So it seems that Fargate is hard-coded somewhere in the pool code. If I specify a launch type of None and keep the "launchType" key in the template, then I get the following error:
Flow run could not be submitted to infrastructure: An error occurred (ClientException) when calling the RegisterTaskDefinition operation: Fargate requires task definition to have execution role ARN to support ECR images.
This might have gotten a bit further because I am using an image in our ECR registry. If I let the pool use the default image, then I get the following error, as if it's defaulting to Fargate again with the None setting.
Flow run could not be submitted to infrastructure: An error occurred (ClientException) when calling the RegisterTaskDefinition operation: Tasks using the Fargate launch type do not support GPU resource requirements.
I see a closed PR related to this, but it is not for push pools. https://github.com/PrefectHQ/prefect/pull/5411
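For reference, here is a rough sketch of how the capacityProviderStrategy fragment from this message maps onto a plain ECS RunTask request made outside of Prefect. The function name `build_run_task_request` and the example cluster and task definition values are made up for illustration. The one firm point is that the ECS API does not accept "launchType" together with "capacityProviderStrategy", so the request leaves "launchType" out.

```python
def build_run_task_request(cluster, task_definition, capacity_provider):
    """Build keyword arguments for an ECS RunTask call that relies on a
    capacity provider instead of an explicit launch type."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        # No "launchType" key: ECS rejects requests that set both fields.
        "capacityProviderStrategy": [
            {
                "capacityProvider": capacity_provider,
                "base": 0,
                "weight": 1,
            }
        ],
    }


# Usage (hypothetical names; needs AWS credentials and boto3 installed):
#
#   import boto3
#   request = build_run_task_request(
#       "my-cluster", "my-task-def:1", "OurAutoScalingGroup"
#   )
#   boto3.client("ecs").run_task(**request)
```

Running a task this way can be a quick check that the Auto Scaling group scales up from zero at all, independently of the work pool.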
g
Thank you, Luis, for the follow-up! Perhaps our tasking is best executed via AWS Batch, but unfortunately it seems like using Batch will limit flow observability.
l
I tried AWS Batch some time back and hit the same scaling-to-zero issue, but I didn't know about the "capacityProviderStrategy" settings, so I should try again.
g
I think Batch abstracts all of this away? I think I have my autoscaling group configured as described in the links above.
l
Hmm ok, could you please post back here if that works for you? Crossing fingers!
g
Not getting anywhere on this one, no. While I don't think this is necessarily Prefect-related, I would like to know if this is an expected pattern for the ECS Push Worker. I feel like having no capacity at rest for heavy GPU instances is very desirable...
@Jamie Zieziula Do you have any ideas on this issue?
t
Hi @Luis Arias, have you checked out the advanced tab in the work pool's configuration? More info is coming; I just want to ask before I send over info you already have.
We’ve come a long way since this PR. The work pool is designed to handle very custom configuration through the advanced tab.
In the advanced tab of the Work Pool, there are two sections, the variables section and the job_configuration section, that allow you to customize how the worker interacts with the infrastructure API, in this case, ECS’s API. See the RunTask request syntax in the ECS API reference: anything in that syntax that is not included in the job_configuration section by default (in this case, a specific capacityProviderStrategy) can be added in. You’ll also want to add new fields such as weight or potentially capacityProvider (depending on how much they change from job to job) to the variables section of the advanced tab. These values can be passed to the job_configuration through Jinja templating.
I wonder if you can inject
"capacityProviderStrategy": [
  {
    "base": 0,
    "weight": 1,
    "capacityProvider": "OurAutoScalingGroup"
  }
]
into the job configuration section of the advanced tab?
Watch the last 8 mins of this video for a quick demo of what I’m talking about: https://www.youtube.com/live/1tv6w22o7mI?feature=share&t=2002
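As a rough illustration of the suggestion above, the sketch below edits a base job template in Python: it adds a capacity_provider entry to the variables section and swaps "launchType" for a templated "capacityProviderStrategy" under "task_run_request". The helper name `use_capacity_provider` and the trimmed-down example template are assumptions made for this sketch; a real work pool template has many more fields. Later messages in this thread report that this change alone still produced a Fargate error in the push pool at the time.

```python
import copy
import json


def use_capacity_provider(base_job_template):
    """Return a copy of the template that uses a capacity provider strategy."""
    template = copy.deepcopy(base_job_template)

    # Expose the capacity provider name as a work pool variable.
    variables = template.setdefault("variables", {})
    properties = variables.setdefault("properties", {})
    properties.pop("launch_type", None)
    properties["capacity_provider"] = {
        "title": "Capacity Provider",
        "type": "string",
        "description": "Name of the ECS capacity provider to run tasks on.",
    }

    # Reference the variable from the RunTask request via Jinja templating.
    job_configuration = template.setdefault("job_configuration", {})
    request = job_configuration.setdefault("task_run_request", {})
    request.pop("launchType", None)
    request["capacityProviderStrategy"] = [
        {"base": 0, "weight": 1, "capacityProvider": "{{ capacity_provider }}"}
    ]
    return template


# Trimmed-down, illustrative template (not the full default one).
example_template = {
    "variables": {
        "type": "object",
        "properties": {"launch_type": {"type": "string"}},
    },
    "job_configuration": {
        "task_run_request": {
            "launchType": "{{ launch_type }}",
            "taskDefinition": "{{ task_definition_arn }}",
        }
    },
}

print(json.dumps(use_capacity_provider(example_template), indent=2))
```

The printed JSON can be compared against what is pasted into the advanced tab.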
l
Hi @Taylor Curran! How are you? Thanks for your help! 🙌 I have tried to add the relevant JSON in the advanced tab previously, but I get the above errors. I will check out the video, I'm probably doing something wrong. If I'm still stuck I'll post the JSON here.
So @Taylor Curran, sorry to be so late in getting back to this. We are gearing up for demo day at our accelerator tomorrow and it has been difficult to get a break from the programming. I followed the instructions in the video: I removed the "launch_type" variable and its associated value in the "task_run_request", and added a new "capacity_provider" variable and a "capacityProviderStrategy" section in the "task_run_request". Unfortunately this results in the following error:
Flow run could not be submitted to infrastructure: An error occurred (ClientException) when calling the RegisterTaskDefinition operation: Fargate requires task definition to have execution role ARN to support ECR images.
I don't thoroughly understand where exactly in the codebase the task run request takes place, but the error seems to indicate that a default "launchType" of FARGATE is being set. That default seems OK, but it should only apply if the "capacityProviderStrategy" key is not present, since the two are mutually exclusive. I created a gist with the template for reference: https://gist.github.com/kaaloo/de723421fb6fda6965fda3d3af5b6dc2
Perhaps it has something to do with this line of code. I'm not quite sure whether this is the same code that the push work pool uses, but the mutual exclusion of capacityProviderStrategy and launchType is not taken into account. https://github.com/PrefectHQ/prefect-aws/blob/main/prefect_aws/workers/ecs_worker.py#L854C1-L854C1
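To make the point about mutual exclusion concrete, here is a minimal sketch of the behavior being asked for: default to FARGATE only when no capacity provider strategy is given. It is not the prefect-aws implementation, and the function name `apply_launch_defaults` is invented for this example.

```python
def apply_launch_defaults(task_run_request, default_launch_type="FARGATE"):
    """Return a copy of the request with a launch type default applied only
    when no capacity provider strategy is present."""
    request = dict(task_run_request)

    if request.get("capacityProviderStrategy"):
        if request.get("launchType"):
            # ECS does not accept both fields in the same RunTask request.
            raise ValueError(
                "launchType and capacityProviderStrategy are mutually exclusive"
            )
        # Drop an empty or None launchType so it is not sent at all.
        request.pop("launchType", None)
    elif not request.get("launchType"):
        request["launchType"] = default_launch_type

    return request


# With a strategy, no launch type is injected.
print(apply_launch_defaults(
    {"capacityProviderStrategy": [
        {"base": 0, "weight": 1, "capacityProvider": "OurAutoScalingGroup"}
    ]}
))

# Without a strategy, the default launch type is filled in.
print(apply_launch_defaults({"taskDefinition": "example"}))
```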
t
Thank you for the update, Luis. Could you create a GitHub issue with this info in the prefect-aws repo? I’ll make sure someone from our integration team takes a look.
l
Absolutely. I'll do that now and post the link back here. Thanks for your help!
j
Hi all, I just tried to get an ECS push work pool working and I encountered the same issue. Setting the minimum and desired capacity to 1 on the auto scaling group worked around the issue, but this isn't great for resource-demanding instance types, as noted. Just hoping to understand: is it the case that Prefect does not use my auto scaling group by default, and that @Luis Arias' proposal to support configuration for capacityProvider will allow it to be plugged in and used?
t
Hi @James Gatter, could you comment on the issue that Luis created, explaining that you are also hitting this problem? https://github.com/PrefectHQ/prefect-aws/issues/310
l
Hi @James Gatter, exactly. The current version doesn't support setting the capacityProviderStrategy because of some logic requiring a launchType when making the task run request. I'm actually working on contributing a PR for this but had limited capacity to work on it yesterday. Giving it another shot today.