# prefect-community
c
We are on the latest prefect & prefect_aws, and seeing scheduled jobs terminate without being run (intermittently). The UI for the flow run shows no logs, no task runs, and no subflow runs. In cloudwatch logs for the flow run, I found the below. Any thoughts?
Copy code
23:02:20.896 | INFO    | prefect.engine - Engine execution of flow run '682567a2-e507-4aa1-b5c7-25bb2cc2573f' aborted by orchestrator: This run has already terminated.
I should say, close to the latest:
Copy code
prefect==2.6.4
prefect_aws==0.1.6
In the agent logs, it seems like it timed out submitting the task
Does anyone know of any known issues or settings we need to tune?
c
It looks like it’s failing to submit to the infrastructure - as in, the agent tries to submit, hangs out for 120 seconds and dies because it wasn’t successful in submitting
seems like something going on with the infra side of ECS
are there enough resources?
c
We are running on Fargate, I don't know of limits on task submission there.
c
What do your agents have for logs? The only error I see is that the agent failed to submit to the infrastructure itself because it took too long
you can probably turn up your log level to debug as well
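If it helps, the setting is PREFECT_LOGGING_LEVEL - a minimal sketch below, assuming the agent is started from an environment where you control its variables (same effect as exporting PREFECT_LOGGING_LEVEL=DEBUG before prefect agent start):
Copy code
# Sketch: raise Prefect's log level to DEBUG for anything launched from this
# environment; the agent process has to see this variable to emit debug logs.
import os

os.environ["PREFECT_LOGGING_LEVEL"] = "DEBUG"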
c
ok, let me try that
appreciate your help
what I posted for logs was all I currently have
c
I understand, it’s just hard to say for sure - from those, it looks like it tries to submit, hangs around before actually posting a job, and dies. Then from CloudWatch (I’m not sure if maybe the job actually made it through, then tried to communicate with the engine only to find out that it was terminated?). My suspicion has to do with that initial submission - is this intermittent, or all jobs? Some jobs? Certain times of day? I guess I just wonder if it’s an issue with Fargate and networking, based on the error
c
Your intuition feels right to me. It's intermittent, but no consistency around timing. It could be AWS networking, but we have been running Prefect 1 for almost 2 years w/o facing this. I will bump up the logs and come back w/ findings
c
It’s definitely odd behavior, especially if you’ve been running for that long
c
The debug logs provided something concrete to research
I will try and post back
🙌 1
c
good find, awesome
c
So I realized that we had AWS_MAX_ATTEMPTS set at 10, so I bumped it to 100.
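(For context, AWS_MAX_ATTEMPTS is the standard AWS SDK retry environment variable read by botocore, so bumping it is roughly equivalent to building the ECS client with a retry config like the sketch below - this isn't the agent's actual code path, just an illustration of what the setting controls.)
Copy code
# Sketch of what AWS_MAX_ATTEMPTS governs: botocore's retry behavior on the ECS
# client. Assumes default credentials/region; not taken from prefect_aws itself.
import boto3
from botocore.config import Config

ecs_client = boto3.client(
    "ecs",
    config=Config(retries={"max_attempts": 100, "mode": "standard"}),
)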
With that said, it looks like this PR was merged to help w/ this: https://github.com/PrefectHQ/prefect/pull/5059/files. I believe it includes the flow run id in the family, which I'm not sure is happening now w/ the EcsTask block. Perhaps this is why it surfaced in V2.
Unfortunately, bumping AWS_MAX_ATTEMPTS didn't help. I think prefect_aws should include the flow run id in the family.
We hot-fixed our agent to override this line https://github.com/PrefectHQ/prefect-aws/blob/main/prefect_aws/ecs.py#L949 w/ the below. I'll come back w/ findings
Copy code
# Hotfix: make the ECS task definition family unique per flow run by appending the flow run ID label
task_definition.setdefault("family", f"prefect-{self.labels['prefect.io/flow-run-id']}")
task_definition['family'] = f"{task_definition['family']}-{self.labels['prefect.io/flow-run-id']}"
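(To sanity-check the override, something like the sketch below should show a new task definition family per flow run - this uses boto3 directly, assumes default credentials/region, and is not part of the hotfix itself.)
Copy code
# Sketch: list the ECS task definition families the agent has registered. With
# the flow run ID appended, each run should appear as its own "prefect-..." family.
import boto3

ecs_client = boto3.client("ecs")
response = ecs_client.list_task_definition_families(
    familyPrefix="prefect", status="ACTIVE"
)
print(response["families"])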
Hey @Christopher Boyd, any thoughts on getting this change pulled into the project?
c
Hi Carlo, I can review the changes and open an issue this week, thank you for reporting
c
Thank you, I'll set a reminder to check back middle of next week
c
Hey @Carlo - https://github.com/PrefectHQ/prefect-aws/issues/148 is reported, if you would like to add any additional commentary or insight from your experience
c
Thanks, that describes it. I will follow the issue and comment if I can provide insight