# prefect-community
m
Hello all, in our development environment we are currently using an ECS Agent and have a flow of flows that runs a large-ish batch of small pipelines concurrently. However, we are getting a mix of throttling errors (RegisterTaskDefinition, DeregisterTaskDefinition). My agent has the AWS_RETRY_MODE and AWS_MAX_ATTEMPTS env variables set to 'adaptive' and '25' respectively, but I'm still getting this error. I'm open to splitting this up into smaller calls, but I'm curious whether this is a situation anyone has come across and how it was solved. How did you implement the wait/backoff logic on the flow, and is there a way to iterate with waits so fewer calls hit the AWS API?
To elaborate on the large-ish batch of small pipelines running concurrently: depending on the day and source availability, we are running 40-110 pipelines from the flow of flows.
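For reference, AWS_RETRY_MODE and AWS_MAX_ATTEMPTS are standard boto3/botocore retry settings that the agent picks up from its environment. A minimal sketch of the equivalent retry configuration for any direct boto3 calls a flow might make to ECS (illustrative only, not the agent's internal code):

```python
# Sketch only: the Prefect ECS agent reads AWS_RETRY_MODE / AWS_MAX_ATTEMPTS from
# its environment; this shows the same retry settings applied to a direct boto3 client.
import boto3
from botocore.config import Config

retry_config = Config(retries={"mode": "adaptive", "max_attempts": 25})
ecs = boto3.client("ecs", config=retry_config)

# Any call made through this client now uses adaptive client-side retries, e.g.:
# ecs.describe_task_definition(taskDefinition="my-task-def:1")
```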
k
Hi @Michael Moscater, looks like you followed this already. If you can have wait logic on the Flow, you could explore flow run concurrency limiting to spread out the API calls. I have seen some people use AWS Batch instead for a big flow triggering many small jobs. Of course, the Batch jobs are not Prefect Flows anymore, so you don't have visibility in the UI. But for something like 300+ subflows, I am not sure there is a more efficient way in ECS.
m
Yeah, AWS Batch is an option, but like you said, it eliminates monitoring and visibility in a single place. Concurrency limits look helpful, but unfortunately we are not 'allowed' to use Prefect Cloud, to my chagrin.
k
Ah ok. Maybe you can try splitting the big flow into 2-3 less big flows if possible like you suggested? These would count against the API limits differently I believe
m
Thanks Kevin, I'll give it a try and let you know what I come up with for future ref
a
The throttling you see comes from Prefect registering a new ECS task definition for each flow run, but you can force Prefect to use a preregistered task definition by providing a specific task definition ARN on your ECSRun. This way, Prefect will not attempt to register a new task definition. The downside is that you need to handle new flow versions somewhat manually, by reregistering the ECS task definition before registering a new flow version.
m
@Anna Geller - Currently I have a task definition stored in S3 that my flows are referencing. Am I misunderstanding what you are saying?
I was able to create a task that just waits and then creates the next set of runs, and it works. But if there is a better way with what you are referring to, Anna, I am more than willing to try it.
I think I see what you are saying: within ECS, create a task definition to use (under ECS > Task Definitions), as opposed to Prefect generating it at runtime from the definition template?
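A rough sketch of that wait-and-batch pattern, assuming Prefect 1.x and already-registered child flows (flow names, project name, and batch sizes below are placeholders):

```python
# Sketch: submit child flow runs in batches with a wait in between, to spread out
# the ECS API calls made by the agent. Names and sizes are illustrative only.
import time

import prefect
from prefect import Flow, task
from prefect.tasks.prefect import create_flow_run

CHILD_FLOW_NAMES = [f"pipeline-{i}" for i in range(40)]  # hypothetical child flows
BATCH_SIZE = 10
WAIT_SECONDS = 60


@task
def run_in_batches(flow_names):
    logger = prefect.context.get("logger")
    for i in range(0, len(flow_names), BATCH_SIZE):
        batch = flow_names[i : i + BATCH_SIZE]
        for name in batch:
            # create_flow_run is a Prefect task; .run() calls it imperatively here
            create_flow_run.run(flow_name=name, project_name="my-project")
        logger.info("Submitted batch starting at %s, waiting %s seconds", i, WAIT_SECONDS)
        time.sleep(WAIT_SECONDS)


with Flow("flow-of-flows") as flow:
    run_in_batches(CHILD_FLOW_NAMES)
```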
a
@Michael Moscater you can see here that if you don't provide task_definition_arn, Prefect will register a new task definition automatically for you (which may cause throttling issues). But if you set it explicitly, Prefect will only describe this task definition and run the task (i.e. it will trigger a new ECS task from this existing ECS task definition).
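A minimal sketch of what that looks like on the flow side, assuming Prefect 1.x (the ARN and labels below are placeholders):

```python
# Sketch: point ECSRun at a preregistered ECS task definition so the agent only
# describes and runs it, and never calls RegisterTaskDefinition for this flow.
from prefect import Flow
from prefect.run_configs import ECSRun

run_config = ECSRun(
    task_definition_arn="arn:aws:ecs:us-east-1:123456789012:task-definition/my-flow:3",
    labels=["ecs-dev"],  # placeholder agent labels
)

with Flow("my-flow", run_config=run_config) as flow:
    ...  # flow tasks here
```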
m
Ah ok. So I'm using the task definition path, and it's using that path to register the task definition at runtime for each flow. If I use the task_definition_arn kwarg, it will use the preregistered definition and shouldn't throttle registration, so in theory 40 tasks should spawn using that same definition without throttling. Am I tracking?
a
That's exactly the case! Basically, using a preregistered task definition avoids the throttling issues that happen during the registration of new ECS task definitions.
m
Ok, I'll give this a try. I'm using a custom registration script that registers my flows with different run configs based on their directory, driven by our CI/CD tool. It looks like I might be able to handle the registration of new definitions because the CLI response gives the ARN (according to the AWS docs), so I can add this to my build when using different definitions and insert the new ARN into the ECSRun args. As always, thank you so much for the assistance.
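One way such a CI/CD step might look, sketched with boto3 rather than the AWS CLI (the file path, project name, and flow wiring are placeholders for whatever the custom registration script does):

```python
# Sketch: register the ECS task definition during the build, capture the new ARN,
# and feed it into the flow's ECSRun before registering the new flow version.
import json

import boto3
from prefect.run_configs import ECSRun

ecs = boto3.client("ecs")

with open("task-definition.json") as f:  # placeholder path to the definition template
    task_def = json.load(f)

response = ecs.register_task_definition(**task_def)
task_def_arn = response["taskDefinition"]["taskDefinitionArn"]

# Then, in the custom registration script (illustrative):
# flow.run_config = ECSRun(task_definition_arn=task_def_arn, labels=["ecs-dev"])
# flow.register(project_name="my-project")
```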
🙌 1
@Anna Geller - Thank you, this is working beautifully. Tested it with 3 different task definitions and had 96 flows run concurrently with no issues (other than a few stuck in Submitted, but Lazarus picked those up).
a
Great work! And thanks for updating us on that! 🙌