# ask-marvin
n
Hi @Marvin, we are getting an error from the prefect agent with the following text: "reported flow run as crashed: Flow run infrastructure exited with non-zero status code 2" What can we do about it?
m
When an agent reports that a flow run has crashed with a non-zero status code like `2`, this often means that there was an issue with the flow run execution infrastructure. Here are some steps you can take to troubleshoot the problem:
1. Check the agent logs to see if anything suspicious stands out there.
2. Verify that your execution layer is able to pull your flow run's image. For instance, if the image needs to be pulled from a container registry, ensure your execution environment can reach the internet and has the appropriate permissions to pull the image.
3. Ensure that your execution layer has the necessary permissions to spin up the required resources. This could involve things like IAM roles or a valid Prefect API key.
4. Confirm that your execution layer has enough capacity on the cluster to deploy your flow run. If there's insufficient capacity or resources, the run might not start as expected.
If these checks don't reveal any issues, you may want to delve into more detailed logs or system metrics to see if there are any other clues as to why the flow run failed to execute properly. Additional information can be found on Prefect's Discourse.
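As an illustration of step 1: if the agent itself runs as a container (for example on ECS), its logs are only easy to inspect if the container definition ships them somewhere readable, such as CloudWatch via the `awslogs` driver. A minimal sketch of that piece of a container definition; the log group, region, and stream prefix below are placeholders:
```json
{
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-create-group": "true",
      "awslogs-group": "/ecs/prefect-agent",
      "awslogs-region": "us-east-1",
      "awslogs-stream-prefix": "ecs"
    }
  }
}
```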
n
This did not solve the issue. I have 2 identical environments (production and staging); in staging it runs without a problem, but in production I am getting this error with no additional logs
n
what do the agent logs show? (not the flow run logs)
n
Flow logs on prefect web are empty. Agent logs only show the "exited with non-zero exit code 2" message
Prefect version of agent and task is 2.10.3
n
in staging do you have a newer version? this might be another case of this
n
Same version 😕
n
hmm, i'm not sure off the top of my head then w/o more info. was there something that changed recently when this started happening? can you show how you're creating deployments?
n
It's funny, we were using the prefect-2 latest in another project and we hit the execute error 2 weeks ago when an agent crashed and pulled the latest dependency; pinning the version to an older one solved the issue for that. Here we thought it would be the same, so when the flows started crashing last week in staging we pinned the agent's version to 2.14.3 and it started working normally. We did the same on prod on Monday when it failed, but it didn't work for us here.
Both configs have the exact same flows and AWS resources and use the same image for the executor
Any help is greatly appreciated. Any idea how we could get better logging or more clues on this issue?
For creating deploys we have a python file where we deploy flows for both prod and staging, and this is run by GitHub Actions
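On the question of getting better logging above: one low-effort option, assuming the agent is configured through environment variables as in a typical container setup, is to raise the Prefect log level from INFO to DEBUG. A sketch of just that environment entry in an ECS-style container definition:
```json
{
  "environment": [
    {
      "name": "PREFECT_LOGGING_LEVEL",
      "value": "DEBUG"
    }
  ]
}
```
Whether this surfaces anything depends on how early the container exits, but it costs little to try.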
n
that deploy script looks normal to me. if I were debugging this I'd want to test by running my ecs agent locally on newest prefect, then maybe the older version you're on, and see if it's some credentials issue or perhaps a dependency issue
n
Got it. They run fine locally with the old version
I'll try with the newest one
It still runs fine locally. I updated prod and staging to run the latest version (2.14.10) on the executor and task, and now it crashes in both places with code 1. Still no logs
n
when you say it runs fine locally, you mean that you're running your agent locally, and it's submitting jobs to ECS, right? that's what I meant by
> running my ecs agent locally
because if this is true, it seems almost certain to be some credentials problem with the ECS service that's running your agent
like it can't pull code from s3 or something (although I would expect a better error for that specific problem)
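For reference on where that kind of credentials problem usually lives: an ECS task definition separates the role the running container assumes (`taskRoleArn`, which the agent typically uses for AWS API calls such as submitting flow-run tasks) from the role ECS itself uses to start the container (`executionRoleArn`, used to pull the image and write logs). A sketch with a placeholder account ID and role names:
```json
{
  "family": "prefect-agent",
  "taskRoleArn": "arn:aws:iam::123456789012:role/PrefectECSRole",
  "executionRoleArn": "arn:aws:iam::123456789012:role/prefect-ecs-task-execution-role"
}
```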
n
Locally i don't submit jobs to ECS, i run them in a local container
n
gotcha, the reason I suggested this
> if I were debugging this I'd want to test by running my ecs agent locally on newest prefect, then maybe the older version you're on, and see if it's some credentials issue or perhaps a dependency issue
was to suggest you try to narrow the potential CoE by seeing if there's some permissions problem with the ECS service that runs your agent (not your flow run containers). at least, given
> i run them in a local container
you know it's probably not a dependency issue, since I assume your local docker build was probably similar enough to your ECS container builds for flow runs
n
Yep. That's how I feel. I mean, there were no credential changes in any of the deployments and the staging env was executing jobs without issue, so I don't feel that's the actual issue
I am checking every credential nevertheless
n
how exactly are you running your ECS agent? based on the info i have, that's what i'm most suspicious of right now
n
I'll respond to that in a moment; just one question that came up while trying to boot things up with an older version of prefect. The agent automatically updated itself on loading. By any chance do you know how can we prevent an agent from upgrading itself when running from the prefect docker image?
(BTW, i really appreciate all the help you have provided me today, thanks!)
n
no problem!
> how can we prevent an agent from upgrading itself
i think again the most helpful piece of info on this question is to know how exactly you're running the agent. i assume it's something like step 4 here? but only for agents, of course
n
In regards to how we are running our ECS agent, we are booting it up from a JSON definition.
n
why is `"cpu": 0`?
n
Everything else in that file is simply environment variables
I do not know the answer to that, to be honest
n
hmm that is odd, I would expect 0 to be an invalid input, which I would think would error out before the container definition can even be registered with ECS. I usually expect to see 512 or 1024 there
n
elsewhere in the file we are defining cpu: 512
I guess if we can 2.14.3 agent, which was what was happening before, there is a great chance things are back to normal for us
n
can I see the full `command` array you're passing? or do you have to redact it?
im pretty sure status code 2 usually means some incorrect command
n
```json
{
  "taskDefinitionArn": "arn:aws:ecs:us-east-1:XXXXXX:task-definition/prefect-agent:26",
  "containerDefinitions": [
    {
      "name": "prefect-agent",
      "image": "prefecthq/prefect:2.14.3-python3.10",
      "cpu": 0,
      "portMappings": [],
      "essential": true,
      "command": [
        "prefect",
        "agent",
        "start",
        "XXXXX"
      ],
      "environment": [
        {
          "name": "EXTRA_PIP_PACKAGES",
          "value": "s3fs prefect-aws"
        },
        {
          "name": "ENV_NAME",
          "value": "PROD"
        },
        {
          "name": "PREFECT_API_KEY",
          "value": "XXXX"
        },
        {
          "name": "PREFECT_API_URL",
          "value": "XXXXX"
        },
        {
          "name": "PREFECT_API_ENABLE_HTTP2",
          "value": "False"
        },
        {
          "name": "PREFECT_LOGGING_LEVEL",
          "value": "INFO"
        }
      ],
      "mountPoints": [],
      "volumesFrom": [],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-create-group": "true",
          "awslogs-group": "/ecs/prefect-agent",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ],
  "family": "prefect-agent",
  "taskRoleArn": "arn:aws:iam::XXXXX:role/PrefectECSRole",
  "executionRoleArn": "arn:aws:iam::XXXXX:role/prefect-ecs-task-execution-role",
  "networkMode": "awsvpc",
  "revision": 26,
  "volumes": [],
  "status": "ACTIVE",
  "requiresAttributes": [
    {
      "name": "com.amazonaws.ecs.capability.logging-driver.awslogs"
    },
    {
      "name": "ecs.capability.execution-role-awslogs"
    },
    {
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
    },
    {
      "name": "com.amazonaws.ecs.capability.task-iam-role"
    },
    {
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
    },
    {
      "name": "ecs.capability.task-eni"
    },
    {
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.29"
    }
  ],
  "placementConstraints": [],
  "compatibilities": [
    "EC2",
    "FARGATE"
  ],
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "cpu": "256",
  "memory": "512",
  "registeredAt": "2023-12-13T21:47:04.357Z",
  "registeredBy": "arn:aws:iam::XXXXX:user/XXXXX",
  "tags": []
}
```
Updating the version to the newest one put us back on error code 1
n
sorry, if you can what the agent?
> I guess if we can 2.14.3 agent, which was what was happening before, there is a great chance things are back to normal for us
n
sorry, if we can go back and use the 2.14.3 agent
Since atm whenever we launch it, it auto-updates to the most recent version
n
could you try this?
```json
"environment": [
                {
                    "name": "EXTRA_PIP_PACKAGES",
                    "value": "s3fs prefect-aws prefect==2.14.3"
                },
```
which might have to be in single quotes (can't remember) but basically im thinking prefect-aws is installing a new prefect on top of your image's prefect
n
testing it now
n
aha hold on
scratch that
instead, try `prefect-aws<0.4.6`
n
```json
"environment": [
                {
                    "name": "EXTRA_PIP_PACKAGES",
                    "value": "s3fs prefect-aws<0.4.6"
                },
```
?
n
yea i can't remember if you need single quotes around the version spec like `"s3fs 'prefect-aws<0.4.6'"` but yeah that's what I meant
n
Got it! testing it now
I think that did the trick!!!! And it also explains why last week it was working but the deploy this week on the other environment crashed (prefect-aws was updated 2 days ago)
thanks thanks thanks
Cannot say thank you enough Nate. I'm so grateful for your time and help! The most I can do is this: here is a fantastic gif as a token of appreciation.
n
catjam glad you got it resolved 🎉
going forward, i would recommend writing your own `Dockerfile` and building an image you push up to ECR that you can reference in your agent container definition, so you can be sure you have the deps you expect. what we experienced today is one of the downsides of relying on `EXTRA_PIP_PACKAGES` to install stuff at runtime 🙂
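A sketch of what that could look like, assuming the dependencies currently installed at runtime (`s3fs`, `prefect-aws`) get baked into an image pushed to ECR; the registry address and tag below are placeholders, and the redacted values mirror the existing definition:
```json
{
  "name": "prefect-agent",
  "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/prefect-agent:2.14.3",
  "essential": true,
  "command": ["prefect", "agent", "start", "XXXXX"],
  "environment": [
    { "name": "PREFECT_API_URL", "value": "XXXXX" },
    { "name": "PREFECT_API_KEY", "value": "XXXX" },
    { "name": "PREFECT_LOGGING_LEVEL", "value": "INFO" }
  ]
}
```
Note there is no `EXTRA_PIP_PACKAGES` entry: `s3fs` and a pinned `prefect-aws` would be installed in the Dockerfile instead, so the agent's dependencies can't drift between restarts.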
n
You are right. It was convenient, but we definitely paid the price for that convenience