
Dylan

04/13/2023, 10:42 AM
Hi all, I'm hoping someone can help with an issue I'm seeing when running a Prefect Agent on ECS. The ECS Task works fine when it's started by the ECS Service and picks up any work that has been scheduled or triggered. However, after the ECS Task has been waiting for work for a couple of hours, it throws an error (see attached screenshots of CloudWatch Logs). After this error is thrown, the ECS Task container stays active but does not pick up any work. This causes scheduled flows to run late until I manually stop the container and let the ECS Service provision a new ECS Task container. This is not ideal, as we have some flows scheduled for graveyard hours. Two questions:
1. Any idea what is causing this error?
2. Is there a configuration I am missing that will stop the container when an error occurs, so that the ECS Service can provision a fresh container?
I suppose a workaround is to create a CloudWatch Logs subscription filter that watches for errors and triggers a Lambda function to stop the task.
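A rough sketch of what that Lambda could look like, in Python. It assumes the subscription filter pattern already matches only the error lines, that the task logs through the `awslogs` driver with a stream prefix (so the log stream name ends in the task ID), and that the cluster is called `prefect-agent-cluster`, which is a made-up name. The `ecs_client` and `cluster` parameters are hypothetical additions for testing; Lambda itself only passes `event` and `context`.

```python
import base64
import gzip
import json


def decode_log_payload(event):
    """Decode the base64-encoded, gzip-compressed payload from CloudWatch Logs."""
    raw = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(raw))


def task_id_from_stream(log_stream):
    # Assumption: awslogs stream names look like "<prefix>/<container>/<task-id>".
    return log_stream.rsplit("/", 1)[-1]


def handler(event, context, ecs_client=None, cluster="prefect-agent-cluster"):
    """Stop the ECS task whose log stream produced the matched error line."""
    payload = decode_log_payload(event)
    if payload.get("messageType") != "DATA_MESSAGE":
        return None  # control messages carry no real log events

    if ecs_client is None:
        import boto3  # imported lazily so the helpers can be tested without AWS

        ecs_client = boto3.client("ecs")

    task_id = task_id_from_stream(payload["logStream"])
    # Stopping the task lets the ECS Service start a replacement.
    ecs_client.stop_task(
        cluster=cluster,
        task=task_id,
        reason="Prefect agent logged an error and stopped polling",
    )
    return task_id
```

The Lambda's execution role would need permission for `ecs:StopTask`, and the filter pattern should be narrow enough that ordinary flow-run errors don't restart the agent.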

David Hlavaty

04/14/2023, 10:48 AM
Looks like https://github.com/PrefectHQ/prefect/issues/7442. Disabling HTTP/2, as suggested in the issue, worked for us:
prefect config set PREFECT_API_ENABLE_HTTP2=false
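On ECS the container is recreated on every deploy, so it may be more convenient to pass the same setting as an environment variable in the task definition; Prefect settings can be supplied as environment variables of the same name. A minimal Python sketch, assuming the container definition is handled as a plain dict before being registered (the helper name `with_http2_disabled` is made up for illustration):

```python
def with_http2_disabled(container_definition):
    """Return a copy of an ECS container definition with HTTP/2 turned off for Prefect."""
    name = "PREFECT_API_ENABLE_HTTP2"
    # Drop any existing entry so the variable is not defined twice.
    env = [e for e in container_definition.get("environment", []) if e["name"] != name]
    env.append({"name": name, "value": "false"})
    return {**container_definition, "environment": env}


# Hypothetical container definition, for illustration only.
agent = {"name": "prefect-agent", "environment": [{"name": "EXAMPLE_VAR", "value": "1"}]}
updated = with_http2_disabled(agent)
```

The original dict is left untouched, so the helper can be applied safely to a definition that is reused elsewhere.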

Dylan

04/14/2023, 1:33 PM
Hi David, thanks for your comment and for linking the GitHub issue. I'll give that a go 🤞
That seems to have worked, thanks again @David Hlavaty

Yaron Levi

04/30/2023, 11:16 AM
@Dylan Have you tried running the agent as an ECS Service? A Service will make sure that even when the Task fails or crashes, a new one will start immediately.

Dylan

05/02/2023, 10:52 AM
Hi @Yaron Levi, I assumed I was already running the agent as an ECS Service, but perhaps I have it misconfigured. I followed this Prefect recipe: https://github.com/PrefectHQ/prefect-recipes/tree/main/devops/infrastructure-as-code/aws/tf-prefect2-ecs-agent

Yaron Levi

05/02/2023, 10:54 AM
CleanShot 2023-05-02 at 13.54.03@2x.jpg
and a regular, default Task Definition with those changes:
CleanShot 2023-05-02 at 13.55.42@2x.jpg
(and also PREFECT_API_URL and PREFECT_API_KEY in the env vars)

David Hlavaty

05/02/2023, 11:06 AM
Running as a service won't help here, as the Prefect agent does not die on those errors. As far as the service is concerned, the task is healthy, so it has no need to spin up a new one.
Looks like the above issue of the agent not crashing was fixed 2 weeks ago: https://github.com/PrefectHQ/prefect/pull/9267