# ask-community
k
Hello, I am trying to set up an ECS agent running on an ECS cluster. However, when I run a test flow, the ECS task definition is created but appears to be set to [INACTIVE] immediately, so the task doesn't actually get started (it shows as running in the AWS console, but with the [INACTIVE] flag) and my flow doesn't get executed. I don't see any logs other than
Submitted for execution
and eventually, after three retries, it marks the flow as failed:
A Lazarus process attempted to reschedule this run 3 times without success. Marking as failed.
On the agent side, the only log I see is
Deploying flow run
It appears that the agent is creating the task definition, running the task, and then immediately de-registering it (as seen here), which could be the problem, as I don't see how the task can run if the definition is de-registered before the task run is complete. I have looked through this issue, but I don't think it's the same problem because I am using Prefect Cloud. Any help is much appreciated.
n
Hi @Kathryn Klarich - is there anything in your CloudWatch logs that could be helpful?
k
@nicholas the CloudWatch logs are the ones noted above, along with the ECS cluster performance logs - I dug through those a little but didn't see anything of interest
The only logs showing up in the ECS performance group are for the agent, not the tasks, so I don't think there is anything relevant there
The CloudWatch logs for the agent are:
Copy code
2021-07-15T11:27:13.912-04:00	[2021-07-15 15:27:13,912] INFO - agent | Found 1 flow run(s) to submit for execution.

2021-07-15T11:27:14.140-04:00	[2021-07-15 15:27:14,139] INFO - agent | Deploying flow run '0baca4bd-ff7d-408e-9952-4f12044a83f9'
z
Not sure what's going on with your ECS setup, but I can confirm that de-registration is normal and should not affect the run. Can you turn on DEBUG level logs for your agent? There should be a bit more information that way:
Copy code
        resp = self.ecs_client.run_task(taskDefinition=taskdef_arn, **kwargs)

        # Always deregister the task definition if a new one was registered
        if new_taskdef_arn:
            self.logger.debug("Deregistering task definition %s", taskdef_arn)
            self.ecs_client.deregister_task_definition(taskDefinition=taskdef_arn)

        if resp.get("tasks"):
            task_arn = resp["tasks"][0]["taskArn"]
            self.logger.debug("Started task %r for flow run %r", task_arn, flow_run.id)
            return f"Task {task_arn}"

        raise ValueError(
            "Failed to start task for flow run {0}. Failures: {1}".format(
                flow_run.id, resp.get("failures")
            )
        )
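For reference (not from this thread): since the agent here runs as its own ECS task, one way to get DEBUG logs is to set Prefect's logging env var on the agent's container definition. A minimal sketch, assuming a Prefect <= 1.x agent; the container name, image, and command below are placeholders.
Copy code
# Sketch only: fragment of the agent's own ECS task definition, expressed as a
# Python dict. Only the "environment" entry is the point; names are placeholders.
agent_container_definition = {
    "name": "prefect-agent",
    "image": "prefecthq/prefect:latest",
    "command": ["prefect", "agent", "ecs", "start"],
    "environment": [
        # Prefect <= 1.x reads its log level from this variable
        {"name": "PREFECT__LOGGING__LEVEL", "value": "DEBUG"},
    ],
}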
k
@Zanie is there any way to add the logging level env var to an agent running in ECS? or do we have to tear the whole thing down and start over?
k
I think you have to restart the process. Before you do that, though, was the flow working before you set up the ECS agent? Or has the flow never run successfully? I think we can try spinning up the agent locally with
prefect agent ecs start
and then add the log level there with
--log-level=DEBUG
and maybe we can debug from this agent?
You may need to add labels on this agent and on the flow to get the flow picked up here.
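(For reference, a rough programmatic equivalent of that suggestion, assuming Prefect <= 1.x; the region and label below are placeholders and the label must also be set on the flow's run config.)
Copy code
# Sketch: roughly equivalent to `prefect agent ecs start --log-level=DEBUG --label ecs-debug`
import logging

from prefect.agent.ecs.agent import ECSAgent

agent = ECSAgent(
    region_name="us-east-2",   # assumption: the region your cluster lives in
    labels=["ecs-debug"],      # the flow needs the same label to be picked up
)
agent.logger.setLevel(logging.DEBUG)  # same effect as --log-level=DEBUG
agent.start()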
k
@Kevin Kho yes, the flow works when I deploy with a Docker agent running on my laptop
I like your idea of the local ECS agent
k
Yeah, let's try the ECS agent running on your laptop
It’s hard to know what is going on, but just a couple of questions:
1. Are you using the default cluster or did you make one?
2. Are you sure you have the necessary permissions?
3. Are you using ECS/EC2/Fargate?
4. Are you trying any networking stuff? VPCs?
5. Where is the image being stored? Do you have credentials for that?
6. Are there potential quotas blocking the spin-up?
k
@Kevin Kho we did get the debugging set up; submitting a task to EC2 appears to be successful. What we're pretty sure the issue is (based on logs retrieved by SSHing into the EC2 host and looking at the Docker logs) is that the container being run inside the task does not have permission/credentials to talk to the Prefect Cloud API, so the requests are timing out. Is there documentation about how to provide the Prefect Cloud token/credentials to the task started by the agent?
Copy code
File "/opt/conda/lib/python3.7/site-packages/requests/adapters.py", line 504, in send
    raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='api.prefect.io', port=443): Max retries exceeded with url: /graphql (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f6268f5de10>, 'Connection to api.prefect.io timed out. (connect timeout=15)'))
z
The agent automatically passes auth through to the container; if auth is missing you'll get an unauthorized error rather than a timeout. This looks like a networking issue
k
@Zanie - we ran some tests and agree that it is a networking issue. We had been creating a task definition like the one below and specifying the networkConfiguration through run_task_kwargs, but found that the container had no ability to do outbound networking. So we tried switching to "bridge" mode in the task definition and removed the networkConfiguration (as it is not supported unless you use networkMode = awsvpc), but now we get an error:
An error occurred (InvalidParameterException) when calling the RunTask operation: Network Configuration is not valid for the given networkMode of this task definition.
It is strange that this is the error we're getting, because we shouldn't have to specify a networkConfiguration for networkMode = bridge.
Copy code
task_definition = yaml.safe_load(
      """
      networkMode: awsvpc     
      cpu: 1024
      memory: 1900
      containerDefinitions:
      - name: flow
      requiresCompatibilities:
      - EC2
      """
  )
We are wondering if something is out of date, because we found this documentation about networkConfiguration from AWS:
A networkConfiguration is now only specified for network mode awsvpc.
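(For clarity, the bridge-mode attempt described above presumably looked something like the sketch below: networkMode switched to bridge and the networkConfiguration dropped from run_task_kwargs, since boto3's run_task only accepts a networkConfiguration with the awsvpc mode. Cluster name and sizes are placeholders; this is the combination that still returned the InvalidParameterException.)
Copy code
# Sketch of the attempted bridge-mode setup (placeholder values)
import yaml
from prefect.run_configs import ECSRun

task_definition = yaml.safe_load(
    """
    networkMode: bridge
    cpu: 1024
    memory: 1900
    containerDefinitions:
    - name: flow
    requiresCompatibilities:
    - EC2
    """
)

RUN_CONFIG = ECSRun(
    # no networkConfiguration here: it is only valid with networkMode = awsvpc
    run_task_kwargs={"launchType": "EC2", "cluster": "my-cluster"},
    task_definition=task_definition,
)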
k
There is an issue open for this. Check the code snippet in this link for how to use the bridge network without having to create the whole task definition.
Wait. Seems like that’s exactly what you were doing. I think the issue here is that “awsvpc” might be the default of ECSRun if you don’t specify anything. (Not 100% sure)
k
Yeah, I have looked at this issue before
and that is what we were trying
right, but if we specify bridge mode in the task def, we get the error about
Network Configuration is not valid for the given networkMode of this task definition.
but we shouldn't need to use a networkConfiguration for bridge mode (because it's only available for awsvpc)
k
My bad. I see exactly what you’re saying now. You are right that it shouldn’t need to specify the network configuration. That’s weird. Let me check a previous thread I had with someone one sec.
Are you still using the agent as a service? This person had to restart their service for the changes to be applied (though admittedly a different error). Are you using EC2 for the launch type?
k
What do you mean by using the agent as a service?
We are just running the agent as a task on an ECS cluster, then the agent submits flows as ECS tasks on the cluster
I think it's just the standard method for a production deployment of the ECS agent
k
I meant like this. I think we’re talking about the same thing. You might have to change the networkMode in the JSON and restart?
k
Yes, that is what I mean. We have a couple of questions - does the networkMode of the agent need to match the networkMode of the task definition? And does the 'requiresCompatibilities' of the agent need to match the 'requiresCompatibilities' of the task definition?
k
Let me look at the source one sec.
Looks like the launchType is taken from the agent and defaults to Fargate. So if you specify it on the RunConfig, it won’t override. So yes, I think they need to match.
The launchType follows the agent, but requiresCompatibilities is overridden by the RunConfig; otherwise it takes the launch type. I guess I don’t have a clear answer to whether they need to match. I’ll look a bit more.
k
Ok, we tried switching to launchType = Fargate and networkMode = awsvpc, as our agent is also running on Fargate, but we seem to have the same problem where the task is started but immediately fails (possibly due to the same networking issue, though it's hard to tell as we can't SSH into the Fargate host)
k
Could you give me your task definition and run config with sensitive stuff removed so I can test this?
k
Copy code
task_definition = yaml.safe_load(
    """
    networkMode: awsvpc
    cpu: 1024
    memory: 2048
    containerDefinitions:
    - name: flow
    requiresCompatibilities:
    - EC2
    """
)


RUN_CONFIG = ECSRun(
    run_task_kwargs={
        "launchType": "EC2",
        "cluster": "jupiter-eos-swift-prefect-dev",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "assignPublicIp": "DISABLED",
                "subnets": ["subnet-1", "subnet-2"],
                "securityGroups": ["sg-1"],
            }
        },
    },
    task_role_arn="arn:aws:iam::my-task-role",
    execution_role_arn="arn:aws:iam::my-task-execution_role",
    task_definition=task_definition,
)
Our agent is running as a Fargate agent, also with awsvpc configuration
ok, we got it to work!
I copied all the network configurations from the agent to the task def
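(For anyone following along, "copying the network configuration from the agent" amounts to something like the sketch below: the flow's run_task_kwargs reuses the exact subnets and security groups of the agent's own service, so the flow container gets the same outbound route to the Prefect Cloud API as the agent. All IDs are placeholders.)
Copy code
# Sketch only; placeholder IDs. The task definition itself stays as posted above.
from prefect.run_configs import ECSRun

task_definition = {}  # placeholder: the awsvpc task definition posted above

RUN_CONFIG = ECSRun(
    run_task_kwargs={
        "cluster": "my-cluster",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                # copied verbatim from the agent's own service networking
                "assignPublicIp": "DISABLED",
                "subnets": ["subnet-used-by-the-agent"],
                "securityGroups": ["sg-used-by-the-agent"],
            }
        },
    },
    task_role_arn="arn:aws:iam::<account>:role/<task-role>",
    execution_role_arn="arn:aws:iam::<account>:role/<execution-role>",
    task_definition=task_definition,
)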
k
Wow! I am so relieved you got this to work. Thanks for all the patience! I think this was a 2-week thing? Thanks for circling back; I was just setting stuff up to test.
z
If someone can get me a summary of what the issue finally was and some action items I can try to get some improvements into the `ECSAgent`/`ECSRun`
k
Yeah, not completely out of the woods yet, because now we are getting an error from ECS when we try to start a task:
CannotPullContainerError: inspect image has been retried 1 time(s): failed to resolve ref
even though the flow and container haven't changed at all. We did add one permission to the task role but didn't remove anything, so right now I have no idea why it worked before but doesn't now.
@Zanie I'll get you a summary once we get it working again
k
Is your image hosted on ECR or some other registry?
k
it's on ECR
k
When you got it working, did you log in to ECR?
k
in the AWS console?
k
I'm not sure about the agent as a task, but when I use ECS to pull from ECR, I authenticate with something like
aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin XXXXXXXX.dkr.ecr.us-east-2.amazonaws.com
k
No, I don't think we did that; we just gave the task permission to pull an image from ECR
I think the permission possibly got removed, because I don't see it in the policies right now, but I need to wait for my colleagues in the west to wake up to change that
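(For reference, the ECR-pull permissions an ECS task execution role typically needs are the statement below - these are the same ECR actions AWS bundles into its managed AmazonECSTaskExecutionRolePolicy. Shown as a Python dict; scope the resource down to your repository ARN if you prefer.)
Copy code
# Sketch: ECR-pull portion of a task execution role policy
ecr_pull_statement = {
    "Effect": "Allow",
    "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
    ],
    # GetAuthorizationToken requires "*"; the image-pull actions can be scoped
    # to a specific repository ARN
    "Resource": "*",
}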
@Zanie @Kevin Kho Here is a summary of the issues I encountered: We are running the Prefect ECSAgent as a Fargate task on an ECS cluster and use the agent to submit flows as ECS tasks (so pretty much a standard production setup). I started by using default params for the ECSRun config, but got an error saying that I couldn't create Fargate tasks, so I changed the launchType to EC2 and then ran into this issue. However, when trying the fix that Kevin suggested (i.e. setting networkMode to bridge), I would get the error:
Network Configuration is not valid for the given networkMode of this task definition.
This error doesn't make sense, as boto will only allow you to pass a networkConfiguration with the awsvpc network mode, so we had to switch to awsvpc. But then we were unable to connect to the Prefect API, as the container being run inside the task did not have any networking ability, so the requests timed out. Since I knew the Prefect agent was able to connect to the Prefect API, I decided to copy all of the agent configs to the task definition, and magically it worked. Which leaves us with a couple of questions: why do we get an error about network configuration when using networkMode bridge? And do the configurations and launchType need to be identical between the agent and the task it's trying to launch?
@Zanie @Kevin Kho I was wondering if there was any update on getting the ECSAgent to be able to launch EC2 tasks (rather than Fargate)? We would still like to be able to launch EC2 tasks but I'm not sure how this is possible (re: the message above about requiring network configuration). Does the agent task also need to be running as an EC2 task if we want the agent to launch EC2 tasks?
k
Hey @Kathryn Klarich, I’ll ping a community member who I know got this working on EC2. They are in a different timezone so they may take a while to respond.
k
ok thanks @Kevin Kho
b
Hey @Kathryn Klarich, yes, you are correct. If you want to launch EC2 tasks (as flows), you need to have your agent running as an EC2-type service. The networking setup in the GH issue should suffice. One thing I would note is to also specify the EC2 launch type argument on your agent. If you run into any issues, can you share what errors you're getting and what your run config looks like so that we can help out better?
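(To make that concrete, a minimal sketch of the matching pair, assuming Prefect <= 1.x: an agent configured for the EC2 launch type and a flow run config that also targets EC2. Cluster name and label are placeholders.)
Copy code
# Sketch only: keep the launch type consistent between agent and flow
from prefect.agent.ecs.agent import ECSAgent
from prefect.run_configs import ECSRun

# Agent side (deployed as an EC2-type ECS service, per the advice above)
agent = ECSAgent(
    cluster="my-cluster",
    launch_type="EC2",
    labels=["ecs-ec2"],
)
# agent.start()

# Flow side: same cluster, same launch type, matching label
RUN_CONFIG = ECSRun(
    labels=["ecs-ec2"],
    run_task_kwargs={"launchType": "EC2", "cluster": "my-cluster"},
)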
k
Hey @Ben Muller thanks for the info and sorry for the slow reply - I have been on vacation. We are currently running the agent as a Fargate task so it sounds like we will have to switch that. I'll try what you suggested and see if I run into any issues. Thanks again!