Brian Phillips

03/10/2022, 2:35 PM
Has anyone encountered the following error when using the ECS agent? Is there an easy way to increase the number of retries or backoff interval?
An error occurred (ThrottlingException) when calling the RegisterTaskDefinition operation (reached max retries: 2): Rate exceeded

Kevin Kho

03/10/2022, 2:36 PM
I think this will help you
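For reference, one generic way to add retries with exponential backoff around a throttled AWS call is a small wrapper like the sketch below. This is not Prefect's agent code or boto3's built-in mechanism (boto3 can also do this natively via `botocore.config.Config(retries={"max_attempts": ..., "mode": "adaptive"})`); the helper name and parameters are illustrative assumptions:

```python
import time

def with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep,
                 retryable=(Exception,)):
    """Call fn(), retrying with exponential backoff on retryable errors.
    Generic sketch -- not Prefect's or boto3's actual retry logic."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_retries:
                raise  # exhausted all retries; surface the error
            # Wait 1s, 2s, 4s, ... before the next attempt
            sleep(base_delay * (2 ** attempt))
```

You would wrap the throttled call, e.g. `with_backoff(lambda: ecs.register_task_definition(...))`, so a transient `ThrottlingException` is absorbed rather than failing the run.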

Brian Phillips

03/10/2022, 2:39 PM
Perfect, thanks!
@Kevin Kho follow-up question: we encountered this error in one subflow in a large flow of flows (output is from
aws ecs describe-tasks
). In Prefect Cloud, this flow run has been stuck in a Submitted state for 40+ minutes. Do you know of a way to handle this better in the agent?
"stopCode": "TaskFailedToStart",
            "stoppedReason": "Timeout waiting for network interface provisioning to complete.",

Kevin Kho

03/10/2022, 4:27 PM
So did this fail on the ECS side already but Prefect Cloud did not update the state?

Brian Phillips

03/10/2022, 4:28 PM
Yeah, it looks like it - the ECS task is stopped, but it's still in a Submitted state in Prefect Cloud. I was hoping it would be rescheduled pretty quickly, but that doesn't appear to have happened.

Kevin Kho

03/10/2022, 4:33 PM
I think Lazarus should kick in and retry, but the more reliable way to handle this, if you can, is to use an Automation to fail the run on Prefect Cloud if it does not start within a certain amount of time. I will ask the team about Lazarus though.

Anna Geller

03/10/2022, 4:41 PM
I agree with Kevin on both:
1. Lazarus should resurrect this flow run
2. You can use SLA Automations to cancel flow runs if they failed to start after X time - but note that you have to create one separately for each flow, so it might be a bit tedious to set up for all flows
You mentioned it happened only in one subflow, correct? In that case, you could create this automation for this "problematic" child flow only. Any chance you could share a short snippet of how you call this specific subflow in your parent flow and how you define your ECS run config?
AWS suggests manually retrying the task 😄 not too helpful. So I believe that Lazarus should resurrect this flow run after some time - if you are on Prefect Cloud, you can send us the flow run ID of this flow run so that we can have a look

Brian Phillips

03/10/2022, 5:16 PM
Here is the parent flow run id
495f390e-8a08-4ae9-8586-d4688b5f5ca7
and child flow run id
e1e56e25-e922-49a5-979f-7357da3bc339
. I ended up manually cancelling the child flow run so the parent flow failed. This is the code I'm using to kick off the child flows
child_ids_result = create_flow_run.map(
    flow_name=unmapped(...),
    project_name=unmapped(...),
    parameters=parameters_result,
    run_name=run_names_result,
)
wait_for_flow_run_result = wait_for_flow_run.map(
    flow_run_id=child_ids_result,
    stream_states=unmapped(True),
    stream_logs=unmapped(True),
    raise_final_state=unmapped(True),
    upstream_tasks=[unmapped(child_ids_result)],
)
Configs are all quite standard ECSRun:
ECSRun(
            image=BASE_DOCKER_IMAGE,
            cpu=self.__cpu,
            memory=self.__memory,
            labels=self.__labels,
            env=...,
        )
Task Kwargs
Body: !Sub
        - |
          networkConfiguration:
            awsvpcConfiguration:
              subnets: [${SubnetIds}]
              securityGroups: [${SecurityGroupIds}]
              assignPublicIp: ENABLED
        - SubnetIds: !Join [",", !Ref SubnetIds]
          SecurityGroupIds: !Join [",", !Ref SecurityGroupIds]
Flow Definition
Body: !Sub |
        containerDefinitions:
          - essential: true
            image: ${DockerImage}
            name: flow
            repositoryCredentials:
              credentialsParameter: ...
        cpu: 1024
        memory: 2048
        networkMode: awsvpc
        requiresCompatibilities: [FARGATE]
It does seem to be an uncommon but expected error with Fargate networking. Hopefully there is a way to get Lazarus to reschedule.

Anna Geller

03/10/2022, 5:21 PM
Thanks so much for sharing, this is super helpful! You don't need this line, since you pass the IDs as data dependencies, thereby setting the dependencies implicitly:
upstream_tasks=[unmapped(child_ids_result)]
the self on ECSRun worries me a bit - how do you call it?
The ECS arguments are fine as well. I looked at the logs and it seems to be a transient issue caused by AWS infrastructure - AWS failed to provision a network interface for the Fargate micro VM, retried the operation several times, and after exhausting those attempts declared the operation failed. Normally, a flow run stuck in a Submitted state should get resurrected by Lazarus; for some reason that didn't happen here - I can't tell why. This shouldn't be something that happens frequently; AWS is usually quite good at handling such infrastructure failures and retries with Fargate - if it starts happening frequently, let us know. Until then, you may create the automation to cancel such runs after they fail to start within X minutes.

Brian Phillips

03/10/2022, 5:34 PM
That line ensures all "create flow" tasks run before any "wait for flow" tasks; previously I was seeing create_flow -> wait_for_flow -> create_flow -> wait_for_flow, which was not the behavior I wanted. In our flow subclass:
class LocalFlow(Flow):
    def __init__(self, cpu, memory, labels, ...):
        self.__cpu = cpu
        self.__memory = memory
        self.__labels = labels

        super().__init__(
            ...,
            run_config=Proxy(self._get_run_config),
        )

    def _get_run_config(self) -> RunConfig:
        return ECSRun(...)
Thanks for the analysis, that's helpful 🙂
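For readers unfamiliar with the `Proxy` used above: Brian's `Proxy` is his own helper, but a minimal lazy proxy of this shape can be sketched in plain Python (this sketch is an assumption, not the actual implementation) - it defers building the wrapped object until an attribute is first accessed:

```python
class Proxy:
    """Minimal lazy proxy sketch: builds the wrapped object on first
    attribute access (hypothetical -- not Brian's actual helper)."""

    def __init__(self, factory):
        self._factory = factory
        self._obj = None

    def _resolve(self):
        # Build the real object once, on demand
        if self._obj is None:
            self._obj = self._factory()
        return self._obj

    def __getattr__(self, name):
        # Only called for attributes not found on the proxy itself,
        # so _factory/_obj lookups above do not recurse
        return getattr(self._resolve(), name)


# Usage: the factory runs only when the config is first needed
calls = []

def make_config():
    calls.append(1)
    return type("RunConfig", (), {"cpu": 1024})()

cfg = Proxy(make_config)
assert calls == []       # nothing built yet
assert cfg.cpu == 1024   # first access triggers the factory
assert len(calls) == 1
```

The design point is that the run config is constructed when the flow machinery reads it, not at class construction time.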
👍 1
Happened again with
2fa5dc27-6975-4739-abe0-d5662427363e
. No attempt to reschedule. I think I may need to implement a custom task to babysit and retry these flow runs

Kevin Kho

03/10/2022, 6:18 PM
Client.set_flow_run_state
might be helpful for you, since you will have the ID from the create_flow_run call. Just making sure you know about it.
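The "babysitter" task Brian describes could look roughly like the sketch below. It is written against a generic client object so it can run without Prefect installed; only the `set_flow_run_state` call mirrors the real Prefect 1.x `Client` method, while the helper itself and its parameters (`get_state`, `timeout_s`, `poll_s`) are assumptions:

```python
import time

def retry_stuck_run(client, flow_run_id, scheduled_state,
                    get_state, timeout_s=600, poll_s=30, sleep=time.sleep):
    """Poll a flow run; if it is still 'Submitted' after timeout_s,
    push it back to a Scheduled state so an agent picks it up again.
    (Hypothetical helper; only set_flow_run_state mirrors Prefect 1.x.)"""
    waited = 0
    while waited < timeout_s:
        if get_state(flow_run_id) != "Submitted":
            return False  # the run started (or finished) on its own
        sleep(poll_s)
        waited += poll_s
    # Still stuck in Submitted after the timeout: reschedule it
    client.set_flow_run_state(flow_run_id=flow_run_id, state=scheduled_state)
    return True
```

With a real Prefect 1.x client you would pass `prefect.Client()` and a `Scheduled()` state; the polling function would query the flow run's state via the API.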
:upvote: 1

Brian Phillips

03/10/2022, 6:20 PM
Thanks

Anna Geller

03/10/2022, 6:46 PM
Also, to cancel you can use 🙂
client = Client()
client.cancel_flow_run(flow_run_id)