# ask-marvin
n
Hi @Marvin, we are getting an error from the prefect agent with the following text: "reported flow run as crashed: Flow run infrastructure exited with non-zero status code 2" What can we do about it?
m
When an agent reports that a flow run has crashed with a non-zero status code like `2`, this often means that there was an issue with the flow run execution infrastructure. Here are some steps you can take to troubleshoot the problem:
1. Check the agent logs to see if anything suspicious stands out there.
2. Verify that your execution layer is able to pull your flow run's image. For instance, if the image needs to be pulled from a container registry, ensure your execution environment can reach the internet and has the appropriate permissions to pull the image.
3. Ensure that your execution layer has the necessary permissions to spin up the required resources. This could involve things like IAM roles or a valid Prefect API key.
4. Confirm that your execution layer has enough capacity on the cluster to deploy your flow run. If there's insufficient capacity or resources, the run might not start as expected.
If these checks don't reveal any issues, you may want to delve into more detailed logs or system metrics to see if there are any other clues as to why the flow run failed to execute properly. Additional information can be found on Prefect's Discourse.
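As an illustration of step 1: if the agent itself runs as a container (for example on ECS), its logs are only easy to inspect if the container definition ships them somewhere readable, such as CloudWatch via the `awslogs` driver. A minimal sketch of that piece of a container definition; the log group, region, and stream prefix below are placeholders:
```json
{
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-create-group": "true",
      "awslogs-group": "/ecs/prefect-agent",
      "awslogs-region": "us-east-1",
      "awslogs-stream-prefix": "ecs"
    }
  }
}
```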
n
This did not solve the issue. I have 2 identical environments (production and staging); in staging it runs without a problem, but in production I am getting this error with no additional logs
n
what do the agent logs show? (not the flow run logs)
n
Flow logs on prefect web are empty. Agent logs only show the "exited with non-zero exit code 2" message
Prefect version of agent and task is 2.10.3
n
in staging do you have a newer version? this might be another case of this
n
Same version 😕
n
hmm, i'm not sure off the top of my head then w/o more info. was there something that changed recently when this started happening? can you show how you're creating deployments?
n
It's funny, we were using the prefect-2 latest in another project and we hit the execute error 2 weeks ago when an agent crashed and pulled the latest dependency; pinning the version to an older one solved the issue for that. Here we thought it would be the same, so when the flows started crashing last week in staging we pinned the agent's version to 2.14.3 and it started working normally. We did the same on prod on Monday when it failed, but it didn't work for us here.
Both configs have the exact same flows and AWS resources and use the same image for the executor
Any help is greatly appreciated. Any idea how we could get better logging or more clues on this issue?
For creating deploys we have a python file where we deploy flows for both prod and staging, and this is run by GitHub Actions
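On the question of getting better logging above: one low-effort option, assuming the agent is configured through environment variables as in a typical container setup, is to raise the Prefect log level from INFO to DEBUG. A sketch of just that environment entry in an ECS-style container definition:
```json
{
  "environment": [
    {
      "name": "PREFECT_LOGGING_LEVEL",
      "value": "DEBUG"
    }
  ]
}
```
Whether this surfaces anything depends on how early the container exits, but it costs little to try.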
n
that deploy script looks normal to me. if I were debugging this I'd want to test by running my ecs agent locally on newest prefect, then maybe the older version you're on, and see if it's some credentials issue or perhaps a dependency issue
n
Got it. They run fine locally with the old version
I'll try with the newest one
It still runs fine locally. I updated prod and staging to run the latest version (2.14.10) on the executor and task, and now it crashes in both places with code 1. Still no logs
n
when you say it runs fine locally, you mean that you're running your agent locally, and it's submitting jobs to ECS, right? that's what I meant by
> running my ecs agent locally
because if this is true, it seems almost certain to be some credentials problem with the ECS service that's running your agent
like it can't pull code from s3 or something (although I would expect a better error for that specific problem)
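For reference on where that kind of credentials problem usually lives: an ECS task definition separates the role the running container assumes (`taskRoleArn`, which the agent typically uses for AWS API calls such as submitting flow-run tasks) from the role ECS itself uses to start the container (`executionRoleArn`, used to pull the image and write logs). A sketch with a placeholder account ID and role names:
```json
{
  "family": "prefect-agent",
  "taskRoleArn": "arn:aws:iam::123456789012:role/PrefectECSRole",
  "executionRoleArn": "arn:aws:iam::123456789012:role/prefect-ecs-task-execution-role"
}
```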
n
Locally i don't submit jobs to ECS, i run them in a local container
n
gotcha, the reason I suggested this
> if I were debugging this I'd want to test by running my ecs agent locally on newest prefect, then maybe the older version you're on, and see if it's some credentials issue or perhaps a dependency issue
was to suggest you try to narrow the potential CoE by seeing if there's some permissions problem with the ECS service that runs your agent (not your flow run containers). at least, given
> i run them in a local container
you know it's probably not a dependency issue, since I assume your local docker build was probably similar enough to your ECS container builds for flow runs
n
Yep. That's how I feel. I mean, there were no credential changes in any of the deployments and the staging env was executing jobs without issue, so I don't feel that's the actual issue
I am checking every credential nevertheless
n
how exactly are you running your ECS agent? based on the info i have, that's what i'm most suspicious of right now
n
I'll respond to that in a moment; just one question that came up while trying to boot things up with an older version of prefect. The agent automatically updated itself on loading. By any chance do you know how can we prevent an agent from upgrading itself when running from the prefect docker image?
(BTW, i really appreciate all the help you have provided me today, thanks!)
n
no problem!
> how can we prevent an agent from upgrading itself
i think again the most helpful piece of info on this question is to know how exactly you're running the agent. i assume it's something like step 4 here? but only for agents, of course
n
In regards to how we are running our ECS agent, we are booting it up from a JSON definition.
n
why is `"cpu": 0`?
n
Everything else in that file is simply environment variables
I do not know the answer to that, to be honest
n
hmm that is odd, I would expect 0 to be an invalid input, which I would think would error out before the container definition can even be registered with ECS. I usually expect to see 512 or 1024 there
n
elsewhere in the file we are defining cpu: 512
I guess if we can 2.14.3 agent, which was what was happening before, there is a great chance things are back to normal for us
n
can I see the full `command` array you're passing? or do you have to redact it?
im pretty sure status code 2 usually means some incorrect command
n
```json
{
  "taskDefinitionArn": "arn:aws:ecs:us-east-1:XXXXXX:task-definition/prefect-agent:26",
  "containerDefinitions": [
    {
      "name": "prefect-agent",
      "image": "prefecthq/prefect:2.14.3-python3.10",
      "cpu": 0,
      "portMappings": [],
      "essential": true,
      "command": [
        "prefect",
        "agent",
        "start",
        "XXXXX"
      ],
      "environment": [
        {
          "name": "EXTRA_PIP_PACKAGES",
          "value": "s3fs prefect-aws"
        },
        {
          "name": "ENV_NAME",
          "value": "PROD"
        },
        {
          "name": "PREFECT_API_KEY",
          "value": "XXXX"
        },
        {
          "name": "PREFECT_API_URL",
          "value": "XXXXX"
        },
        {
          "name": "PREFECT_API_ENABLE_HTTP2",
          "value": "False"
        },
        {
          "name": "PREFECT_LOGGING_LEVEL",
          "value": "INFO"
        }
      ],
      "mountPoints": [],
      "volumesFrom": [],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-create-group": "true",
          "awslogs-group": "/ecs/prefect-agent",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ],
  "family": "prefect-agent",
  "taskRoleArn": "arn:aws:iam::XXXXX:role/PrefectECSRole",
  "executionRoleArn": "arn:aws:iam::XXXXX:role/prefect-ecs-task-execution-role",
  "networkMode": "awsvpc",
  "revision": 26,
  "volumes": [],
  "status": "ACTIVE",
  "requiresAttributes": [
    {
      "name": "com.amazonaws.ecs.capability.logging-driver.awslogs"
    },
    {
      "name": "ecs.capability.execution-role-awslogs"
    },
    {
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
    },
    {
      "name": "com.amazonaws.ecs.capability.task-iam-role"
    },
    {
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
    },
    {
      "name": "ecs.capability.task-eni"
    },
    {
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.29"
    }
  ],
  "placementConstraints": [],
  "compatibilities": [
    "EC2",
    "FARGATE"
  ],
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "cpu": "256",
  "memory": "512",
  "registeredAt": "2023-12-13T21:47:04.357Z",
  "registeredBy": "arn:aws:iam::XXXXX:user/XXXXX",
  "tags": []
}
```
Updating the version to the newest one put us back on error code 1
n
sorry, if you can what the agent?
> I guess if we can 2.14.3 agent, which was what was happening before, there is a great chance things are back to normal for us
n
sorry, if we can go back and use the 2.14.3 agent
Since atm whenever we launch it, it auto-updates to the most recent version
n
could you try this?
```json
"environment": [
                {
                    "name": "EXTRA_PIP_PACKAGES",
                    "value": "s3fs prefect-aws prefect==2.14.3"
                },
```
which might have to be in single quotes (can't remember) but basically im thinking prefect-aws is installing a new prefect on top of your image's prefect
n
testing it now
n
aha hold on
scratch that
instead, try `prefect-aws<0.4.6`
n
```json
"environment": [
                {
                    "name": "EXTRA_PIP_PACKAGES",
                    "value": "s3fs prefect-aws<0.4.6"
                },
```
?
n
yea i can't remember if you need single quotes around the version spec like `"s3fs 'prefect-aws<0.4.6'"` but yeah that's what I meant
n
Got it! testing it now
I think that did the trick!!!! And it also explains why last week it was working but the deploy this week on the other environment crashed (prefect-aws was updated 2 days ago)
thanks thanks thanks
Cannot say thank you enough Nate. I'm so grateful for your time and help! The most I can do is this: here is a fantastic gif as a token of appreciation.
n
catjam glad you got it resolved 🎉
going forward, i would recommend writing your own `Dockerfile` and building an image you push up to ECR that you can reference in your agent container definition, so you can be sure you have the deps you expect. what we experienced today is one of the downsides of relying on `EXTRA_PIP_PACKAGES` to install stuff at runtime 🙂
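A sketch of what that could look like, assuming the dependencies currently installed at runtime (`s3fs`, `prefect-aws`) get baked into an image pushed to ECR; the registry address and tag below are placeholders, and the redacted values mirror the existing definition:
```json
{
  "name": "prefect-agent",
  "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/prefect-agent:2.14.3",
  "essential": true,
  "command": ["prefect", "agent", "start", "XXXXX"],
  "environment": [
    { "name": "PREFECT_API_URL", "value": "XXXXX" },
    { "name": "PREFECT_API_KEY", "value": "XXXX" },
    { "name": "PREFECT_LOGGING_LEVEL", "value": "INFO" }
  ]
}
```
Note there is no `EXTRA_PIP_PACKAGES` entry: `s3fs` and a pinned `prefect-aws` would be installed in the Dockerfile instead, so the agent's dependencies can't drift between restarts.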
n
You are right. It was convenient, but we definitely paid the price for that convenience