Prefect Community #ask-community

I've recently transitioned my stack from using an ...

Robert Banick

05/01/2024, 3:51 PM

I've recently transitioned my stack from using an Agent to trigger flows over Amazon ECS to an ECS push work pool and am encountering some unexplained issues. During development work I was (eventually) able to trigger flow runs on demand over Fargate infrastructure without issue. I'm now finding that runs set up on a schedule mostly (~75% of the time) fail to provision the requisite infrastructure, failing with a message saying:

Flow run infrastructure exited with non-zero status code:
 Exited with non 0 code. (Error Code: 1)
This may be caused by attempting to run an image with a misspecified platform or architecture.

Oddly, if I manually trigger the same flow runs, they work without issue. Because push work pools don't provide logs, and logs only go to CloudWatch after infrastructure is provisioned, it's proving hard to troubleshoot this issue. I remember @Will Raphaelson mentioning that the `exited with non 0 code` error was a particular bugbear and that you all at @Prefect have experience troubleshooting it. Any ideas what could be causing my issues? I'm using Prefect 2.18.0 over AWS ECS (Fargate) infrastructure.
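
Since flow logs only appear once the container is running, one other place to look is the stop reason that ECS records on the task itself. The following is a minimal sketch, not something proposed in the thread: it assumes `boto3` is installed, AWS credentials are configured, and you already know the cluster name and the ARN of a recently stopped task (ECS only keeps stopped tasks visible for a limited time). The function names are hypothetical; the response fields (`stopCode`, `stoppedReason`, `containers[].exitCode`, `containers[].reason`) are the ones returned by the ECS `describe_tasks` call.

```python
from typing import Any


def summarize_stopped_tasks(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Pull stop reasons and container exit codes out of a describe_tasks response."""
    summaries = []
    for task in response.get("tasks", []):
        summaries.append(
            {
                "task_arn": task.get("taskArn"),
                "last_status": task.get("lastStatus"),
                "stop_code": task.get("stopCode"),
                "stopped_reason": task.get("stoppedReason"),
                "containers": [
                    {
                        "name": container.get("name"),
                        "exit_code": container.get("exitCode"),
                        "reason": container.get("reason"),
                    }
                    for container in task.get("containers", [])
                ],
            }
        )
    return summaries


def fetch_and_summarize(cluster: str, task_arns: list[str]) -> list[dict[str, Any]]:
    """Call ECS for the given tasks and summarize why they stopped."""
    import boto3  # imported lazily; requires AWS credentials at call time

    ecs = boto3.client("ecs")
    return summarize_stopped_tasks(ecs.describe_tasks(cluster=cluster, tasks=task_arns))
```

Comparing the `stopped_reason` of a failed scheduled run with that of a successful manual run may show whether the task failed to start (for example, an image pull or capacity problem) or started and then exited.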

Kevin Grismore

05/01/2024, 4:00 PM

are you deploying with `prefect deploy` in the cli with a `prefect.yaml` file or `flow.deploy()` in python? and in either case, are you using those strategies to build docker images?

Kevin Grismore

05/01/2024, 4:10 PM

if you're building your images on an ARM machine like an M1-M3 mac you'll need to add `platform=linux/amd64`. I can show you where to add that depending on how you're deploying/building images

Robert Banick

05/01/2024, 4:40 PM

We're using `flow.deploy()` in Python. We pull existing AWS Task Definitions which reference Docker images we've already prepared and pushed to AWS ECR. We build on Linux boxes as part of a GitHub CI/CD chain, not on an M1-M3 Mac.

Kevin Grismore

05/01/2024, 4:52 PM

I wonder then if the wrong image is being used? I think you can check the image name on the ECS task itself, then maybe you can track it down and verify it's built for the right platform?
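
One way to verify the platform of an image, sketched here under the assumption that the image has been pulled locally and the `docker` CLI is on the path: `docker image inspect` prints a JSON array whose entries carry `Os` and `Architecture` fields. The helper names below are hypothetical, and the image reference is whatever the ECS task reports.

```python
import json
import subprocess


def platform_from_inspect(inspect_json: str) -> str:
    """Return "os/arch" from the JSON printed by `docker image inspect`."""
    entries = json.loads(inspect_json)
    image = entries[0]  # the command prints a JSON array, one entry per image
    return f'{image["Os"]}/{image["Architecture"]}'


def local_image_platform(image_ref: str) -> str:
    """Inspect a locally available image; requires the docker CLI."""
    result = subprocess.run(
        ["docker", "image", "inspect", image_ref],
        check=True,
        capture_output=True,
        text=True,
    )
    return platform_from_inspect(result.stdout)
```

For a Fargate task that does not request ARM in its task definition, the value to expect here would be `linux/amd64`.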

Robert Banick

05/01/2024, 5:31 PM

This doesn't really make any sense -- why would the wrong image be used for scheduled runs but not manually triggered runs? More pertinently we only really use one image for our stack, and it's not built on an ARM machine. If I look at the Task Definition in question for safety's sake it's referencing the image I want.

Kevin Grismore

05/01/2024, 5:33 PM

Yeah, you're right. I have another suspicion.

Kevin Grismore

05/01/2024, 5:34 PM

Is your agent still online? Does it have a work queue named `default`? And does your ECS push work pool also have a work queue named `default`?

Robert Banick

05/01/2024, 5:50 PM

hmmm

Robert Banick

05/01/2024, 5:51 PM

how would I check the work pools associated with the agent?

Robert Banick

05/01/2024, 5:51 PM

Sorry, the work queues

Robert Banick

05/01/2024, 5:52 PM

The old work pool I was using for an agent and the new push work pool both have `default` work queues. Actually they both have `default` work queues and work queues named for the flow in question -- e.g. `my-work-pool`

Robert Banick

05/01/2024, 5:53 PM

should I delete those duplicate work queues? I can't delete `default`

Kevin Grismore

05/01/2024, 5:53 PM

you should be able to create a new work queue in your push pool with a different name, and move your deployments over to it

Kevin Grismore

05/01/2024, 5:54 PM

I'm not 100% sure this will work, but I have seen it come up a number of times recently, manifesting in a few different types of strange behavior

Robert Banick

05/01/2024, 5:54 PM

Interesting. I'll set this up for a few test runs and see if they pass. I'll report back here

Robert Banick

05/01/2024, 7:13 PM

2 / 4 scheduled flows went through -- not really a satisfactory improvement

Robert Banick

05/02/2024, 4:22 PM

No progress on this. I have noticed that the crash happens after some time -- the work pool appears to try to get the requisite infrastructure for quite a while before failing. However, when a run succeeds, it succeeds quite quickly. If a run hasn't succeeded after 2-3 minutes it never will, even though it keeps trying for ~45 minutes.

Robert Banick

05/02/2024, 4:22 PM

I'm quite stumped @Kevin Grismore, any help you all can give would be great

Kevin Grismore

05/02/2024, 4:22 PM

going to follow up in dms

Robert Banick

05/02/2024, 4:22 PM

thanks!

Will L

05/30/2024, 9:08 PM

@Kevin Grismore Any resolution? I'm attempting to deploy an ECS work pool and keep running into a similar error -- each time I deploy from my server my infrastructure provisions and runs but I'm unable to get past "Flow run infrastructure exited with non-zero status code 1." I'm pushing to ECR from an M3 Mac, but I'm using act and am running it with the flag `--container-architecture linux/amd64` and have `ENV DOCKER_DEFAULT_PLATFORM=linux/amd64` in my Dockerfile
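
A note on the setup above: `DOCKER_DEFAULT_PLATFORM` is an environment variable read by the Docker CLI when it builds or pulls, so an `ENV` line inside the Dockerfile sets it only inside the image and would not be expected to change the build target. A rough sketch of requesting the platform explicitly at build time follows; the function names and the image tag are hypothetical, while `--platform` and `--tag` are standard `docker build` flags.

```python
import subprocess


def docker_build_command(
    tag: str, platform: str = "linux/amd64", context: str = "."
) -> list[str]:
    """Assemble a docker build invocation that pins the target platform."""
    return ["docker", "build", "--platform", platform, "--tag", tag, context]


def build_image(tag: str, platform: str = "linux/amd64", context: str = ".") -> None:
    """Run the build; requires the docker CLI and a Dockerfile in `context`."""
    subprocess.run(docker_build_command(tag, platform, context), check=True)
```

For example, `docker_build_command("my-image:latest")` yields the argument list for `docker build --platform linux/amd64 --tag my-image:latest .`.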

Will L

05/31/2024, 11:58 AM

adding this under `build` in `prefect.yaml` rectified my issue:

- prefect_docker.deployments.steps.build_docker_image:
      platform: linux/amd64
