# ask-community
r
I've recently transitioned my stack from using an Agent to trigger flows over Amazon ECS to an ECS push work pool and am encountering some unexplained issues. During development work I was (eventually) able to trigger flow runs on demand over Fargate infrastructure without issue. I'm now finding that runs set up on a schedule mostly (~75% of the time) fail to provision the requisite infrastructure, failing with a message saying:
Flow run infrastructure exited with non-zero status code:
 Exited with non 0 code. (Error Code: 1)
This may be caused by attempting to run an image with a misspecified platform or architecture.
Oddly, if I manually trigger the same flow runs they work without issue. Because push work pools don't provide logs, and logs only go to CloudWatch after infrastructure is provisioned, it's proving hard to troubleshoot this issue. I remember @Will Raphaelson mentioning that the
exited with non 0 code
error was a particular bugbear and you all at @Prefect have experience troubleshooting it. Any ideas what could be causing my issues? I'm using Prefect 2.18.0 over AWS ECS (Fargate) infrastructure.
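Since the push work pool itself surfaces little beyond the exit code, one place to look is the stopped ECS task. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured. `summarize_stopped_task` and `fetch_stopped_task_summaries` are illustrative helper names (not part of Prefect or boto3), and the cluster name in the usage note is hypothetical. It reads the `stopCode`, `stoppedReason`, and per-container `exitCode` / `reason` fields that the ECS `describe_tasks` response carries.

```python
def summarize_stopped_task(task):
    """Condense one entry of an ECS describe_tasks response into the
    fields most useful for diagnosing a non-zero exit."""
    return {
        "task_arn": task.get("taskArn"),
        "stop_code": task.get("stopCode"),
        "stopped_reason": task.get("stoppedReason"),
        "containers": [
            {
                "name": c.get("name"),
                "image": c.get("image"),
                "exit_code": c.get("exitCode"),
                "reason": c.get("reason"),
            }
            for c in task.get("containers", [])
        ],
    }


def fetch_stopped_task_summaries(cluster):
    """Query ECS for stopped tasks in `cluster` and summarize each one.

    Requires boto3 and AWS credentials. boto3 is imported lazily so that
    summarize_stopped_task can be used on saved responses without it.
    """
    import boto3

    ecs = boto3.client("ecs")
    arns = ecs.list_tasks(cluster=cluster, desiredStatus="STOPPED")["taskArns"]
    if not arns:
        return []
    tasks = ecs.describe_tasks(cluster=cluster, tasks=arns)["tasks"]
    return [summarize_stopped_task(t) for t in tasks]
```

Calling `fetch_stopped_task_summaries("my-cluster")` shortly after a failed scheduled run may show whether the container ever started (an `exit_code` is present) or whether the task stopped before the container ran (only a `stopped_reason` is set). ECS keeps stopped tasks visible only for a limited time, so this needs to be run soon after the failure.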
k
are you deploying with
prefect deploy
in the cli with a
prefect.yaml
file or
flow.deploy()
in python? and in either case, are you using those strategies to build docker images?
if you're building your images on an ARM machine like an M1-M3 Mac you'll need to add
platform=linux/amd64
. I can show you where to add that depending on how you're deploying/building images
r
We're using
flow.deploy()
in python. We pull existing AWS Task Definitions which reference Docker images we've already prepared and pushed to AWS ECR. We are building on Linux boxes as part of a GitHub CI/CD chain -- not an M1-M3 Mac.
k
I wonder then if the wrong image is being used? I think you can check the image name on the ECS task itself, then maybe you can track it down and verify it's built for the right platform?
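One way to run that platform check: `docker image inspect` prints a JSON array whose entries include `Os` and `Architecture` fields. This is a minimal sketch, assuming the Docker CLI is available and the image has already been pulled locally. `platform_of` and `inspect_image` are illustrative helper names, and any image reference passed in is up to the reader.

```python
import json
import subprocess


def platform_of(inspect_json):
    """Return 'os/arch' for the first image in `docker image inspect` output."""
    images = json.loads(inspect_json)
    if not images:
        raise ValueError("inspect output contained no images")
    first = images[0]
    return f"{first['Os']}/{first['Architecture']}"


def inspect_image(image_ref):
    """Run `docker image inspect` on a locally available image and
    return its platform string, e.g. 'linux/amd64'."""
    result = subprocess.run(
        ["docker", "image", "inspect", image_ref],
        check=True,
        capture_output=True,
        text=True,
    )
    return platform_of(result.stdout)
```

If `inspect_image(...)` returns something other than `linux/amd64` for an image that runs on an x86 Fargate task, that would be consistent with the platform warning in the error message. A matching platform suggests the cause lies elsewhere.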
r
This doesn't really make any sense -- why would the wrong image be used for scheduled runs but not manually triggered runs? More pertinently we only really use one image for our stack, and it's not built on an ARM machine. If I look at the Task Definition in question for safety's sake it's referencing the image I want.
k
Yeah, you're right. I have another suspicion.
Is your agent still online? Does it have a work queue named
default
? And does your ECS push work pool also have a work queue named
default
?
r
hmmm
how would I check the work pools associated with the agent?
Sorry, the work queues
The old work pool I was using for an agent and the new push work pool both have
default
work queues. Actually they both have
default
work queues and work queues named for the flow in question -- e.g.
my-work-pool
should I delete those duplicate work queues? I can't delete
default
k
you should be able to create a new work queue in your push pool with a different name, and move your deployments over to it
I'm not 100% sure this will work, but I have seen it come up a number of times recently, manifesting in a few different types of strange behavior
r
Interesting. I'll set this up for a few test runs and see if they pass. I'll report back here
2 / 4 scheduled flows went through -- not really a satisfactory improvement
No progress on this. I have noticed that the crash happens after some time -- the work pool appears to try to get the requisite infrastructure for quite a while before failing. However, on the runs where it succeeds, it succeeds quite quickly. If a run hasn't started within 2-3 minutes it never succeeds, even though it will keep trying for ~45 minutes.
I'm quite stumped @Kevin Grismore, any help you all can give would be great
k
going to follow up in dms
r
thanks!
w
@Kevin Grismore Any resolution? I'm attempting to deploy an ECS work pool and keep running into a similar error -- each time I deploy from my server my infrastructure provisions and runs, but I'm unable to get past "Flow run infrastructure exited with non-zero status code 1." I'm pushing to ECR from an M3 Mac, but I'm using act and am running it with the flag
--container-architecture linux/amd64
and have
ENV DOCKER_DEFAULT_PLATFORM=linux/amd64
in my Dockerfile
adding this under
build
in
prefect.yaml
rectified my issue
- prefect_docker.deployments.steps.build_docker_image:
      platform: linux/amd64