# prefect-community
c
We are on the latest prefect & prefect_aws, and seeing scheduled jobs terminate without being run (intermittently). The UI for the flow run shows no logs, no task runs, and no subflow runs. In cloudwatch logs for the flow run, I found the below. Any thoughts?
Copy code
23:02:20.896 | INFO    | prefect.engine - Engine execution of flow run '682567a2-e507-4aa1-b5c7-25bb2cc2573f' aborted by orchestrator: This run has already terminated.
I should say, close to the latest:
Copy code
prefect==2.6.4
prefect_aws==0.1.6
In the agent logs, it seems like it timed out submitting the task
Does anyone know of any known issues or settings we need to tune?
c
It looks like it’s failing to submit to the infrastructure - as in, the agent tries to submit, hangs out for 120 seconds and dies because it wasn’t successful in submitting
seems like something going on with the infra side of ECS
are there enough resources?
c
We are running on Fargate, I don't know of limits on task submission there.
c
What do your agents have for logs? The only error I see is that the agent failed to submit to the infrastructure itself because it took too long
you can probably turn up your log level to debug as well
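If it helps, the setting is PREFECT_LOGGING_LEVEL - a minimal sketch below, assuming the agent is started from an environment where you control its variables (same effect as exporting PREFECT_LOGGING_LEVEL=DEBUG before prefect agent start):
Copy code
# Sketch: raise Prefect's log level to DEBUG for anything launched from this
# environment; the agent process has to see this variable to emit debug logs.
import os

os.environ["PREFECT_LOGGING_LEVEL"] = "DEBUG"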
c
ok, let me try that
appreciate your help
what I posted for logs was all I currently have
c
I understand, it’s just hard to say for sure - from those, it looks like it tries to submit, hangs around before actually posting a job, and dies. Then from CloudWatch (I’m not sure if maybe the job actually made it through, then tried to communicate with the engine only to find out that it was terminated?). My suspicion has to do with that initial submission - is this intermittent, or all jobs? Some jobs? Certain times of day? I guess I just wonder if it’s an issue with Fargate and networking, based on the error
c
Your intuition feels right to me. It's intermittent, but no consistency around timing. It could be AWS networking, but we have been running Prefect 1 for almost 2 years w/o facing this. I will bump up the logs and come back w/ findings
c
It’s definitely odd behavior, especially if you’ve been running for that long
c
The debug logs provided something concrete to research
I will try and post back
🙌 1
c
good find, awesome
c
So I realized that we had AWS_MAX_ATTEMPTS set at 10, so I bumped it to 100.
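(For context, AWS_MAX_ATTEMPTS is the standard AWS SDK retry environment variable read by botocore, so bumping it is roughly equivalent to building the ECS client with a retry config like the sketch below - this isn't the agent's actual code path, just an illustration of what the setting controls.)
Copy code
# Sketch of what AWS_MAX_ATTEMPTS governs: botocore's retry behavior on the ECS
# client. Assumes default credentials/region; not taken from prefect_aws itself.
import boto3
from botocore.config import Config

ecs_client = boto3.client(
    "ecs",
    config=Config(retries={"max_attempts": 100, "mode": "standard"}),
)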
With that said, it looks like this PR was merged to help w/ this: https://github.com/PrefectHQ/prefect/pull/5059/files. I believe it includes the flow run id in the family, which I'm not sure is happening now w/ the EcsTask block. Perhaps this is why it surfaced in V2.
Unfortunately, bumping AWS_MAX_ATTEMPTS didn't help. I think prefect_aws should include the flow run id in the family.
We hot-fixed our agent to override this line https://github.com/PrefectHQ/prefect-aws/blob/main/prefect_aws/ecs.py#L949 w/ the below. I'll come back w/ findings
Copy code
# Hotfix: make the ECS task definition family unique per flow run by appending the flow run ID label
task_definition.setdefault("family", f"prefect-{self.labels['prefect.io/flow-run-id']}")
task_definition['family'] = f"{task_definition['family']}-{self.labels['prefect.io/flow-run-id']}"
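(To sanity-check the override, something like the sketch below should show a new task definition family per flow run - this uses boto3 directly, assumes default credentials/region, and is not part of the hotfix itself.)
Copy code
# Sketch: list the ECS task definition families the agent has registered. With
# the flow run ID appended, each run should appear as its own "prefect-..." family.
import boto3

ecs_client = boto3.client("ecs")
response = ecs_client.list_task_definition_families(
    familyPrefix="prefect", status="ACTIVE"
)
print(response["families"])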
Hey @Christopher Boyd, any thoughts on getting this change pulled into the project?
c
Hi Carlo, I can review the changes and open an issue this week, thank you for reporting
c
Thank you, I'll set a reminder to check back middle of next week
c
Hey @Carlo - https://github.com/PrefectHQ/prefect-aws/issues/148 is reported, if you would like to add any additional commentary or insight from your experience
c
Thanks, that describes it. I will follow the issue and comment if I can provide insight