# ask-community
k
Hello, I am trying to set up an ECS agent running on an ECS cluster. However, when I run a test flow, the ECS task definition is created but appears to be set to [INACTIVE] immediately, so the task doesn't actually get started (it shows as running in the AWS console, but with the [INACTIVE] flag) and my flow doesn't get executed. I don't see any logs other than
Submitted for execution
and eventually, after three retries, it marks the flow as failed:
A Lazarus process attempted to reschedule this run 3 times without success. Marking as failed.
On the agent side, the only log I see is
Deploying flow run
It appears that the agent is creating the task definition, running the task, and then immediately de-registering it (as seen here), which could be the problem, as I don't see how the task can run if the definition is de-registered before the task run is complete. I have looked through this issue, but I don't think it's the same problem because I am using Prefect Cloud. Any help is much appreciated.
n
Hi @Kathryn Klarich - is there anything in your CloudWatch logs that could be helpful?
k
@nicholas the CloudWatch logs are the ones noted above, along with the ECS cluster performance logs - I dug through those a little but didn't see anything of interest
The only logs showing up in the ECS performance group are for the agent, not the tasks, so I don't think there is anything relevant there
The CloudWatch logs for the agent are:
Copy code
2021-07-15T11:27:13.912-04:00	[2021-07-15 15:27:13,912] INFO - agent | Found 1 flow run(s) to submit for execution.

2021-07-15T11:27:14.140-04:00	[2021-07-15 15:27:14,139] INFO - agent | Deploying flow run '0baca4bd-ff7d-408e-9952-4f12044a83f9'
z
Not sure what's going on with your ECS setup, but I can confirm that de-registration is normal and should not affect the run. Can you turn on DEBUG level logs for your agent? There should be a bit more information that way:
Copy code
        resp = self.ecs_client.run_task(taskDefinition=taskdef_arn, **kwargs)

        # Always deregister the task definition if a new one was registered
        if new_taskdef_arn:
            self.logger.debug("Deregistering task definition %s", taskdef_arn)
            self.ecs_client.deregister_task_definition(taskDefinition=taskdef_arn)

        if resp.get("tasks"):
            task_arn = resp["tasks"][0]["taskArn"]
            self.logger.debug("Started task %r for flow run %r", task_arn, flow_run.id)
            return f"Task {task_arn}"

        raise ValueError(
            "Failed to start task for flow run {0}. Failures: {1}".format(
                flow_run.id, resp.get("failures")
            )
        )
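For reference (not from this thread): since the agent here runs as its own ECS task, one way to get DEBUG logs is to set Prefect's logging env var on the agent's container definition. A minimal sketch, assuming a Prefect <= 1.x agent; the container name, image, and command below are placeholders.
Copy code
# Sketch only: fragment of the agent's own ECS task definition, expressed as a
# Python dict. Only the "environment" entry is the point; names are placeholders.
agent_container_definition = {
    "name": "prefect-agent",
    "image": "prefecthq/prefect:latest",
    "command": ["prefect", "agent", "ecs", "start"],
    "environment": [
        # Prefect <= 1.x reads its log level from this variable
        {"name": "PREFECT__LOGGING__LEVEL", "value": "DEBUG"},
    ],
}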
k
@Zanie is there any way to add the logging level env var to an agent running in ECS? or do we have to tear the whole thing down and start over?
k
I think you have to restart the process. Before you do that, though, was the flow working before you set up the ECS agent? Or has the flow never run successfully? I think we can try spinning up the agent locally with
prefect agent ecs start
and then add the log level there with
--log-level=DEBUG
and maybe we can debug from this agent?
You may need to add labels on this agent and on the flow to get the flow picked up here.
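(For reference, a rough programmatic equivalent of that suggestion, assuming Prefect <= 1.x; the region and label below are placeholders and the label must also be set on the flow's run config.)
Copy code
# Sketch: roughly equivalent to `prefect agent ecs start --log-level=DEBUG --label ecs-debug`
import logging

from prefect.agent.ecs.agent import ECSAgent

agent = ECSAgent(
    region_name="us-east-2",   # assumption: the region your cluster lives in
    labels=["ecs-debug"],      # the flow needs the same label to be picked up
)
agent.logger.setLevel(logging.DEBUG)  # same effect as --log-level=DEBUG
agent.start()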
k
@Kevin Kho yes, the flow works when I deploy with a Docker agent running on my laptop
I like your idea of the local ECS agent
k
Yeah, let's try the ECS agent running on your laptop
It’s hard to know what is going on, but just a couple of questions:
1. Are you using the default cluster or did you make one?
2. Are you sure you have the necessary permissions?
3. Are you using ECS/EC2/Fargate?
4. Are you trying any networking stuff? VPCs?
5. Where is the image being stored? Do you have credentials for that?
6. Are there potential quotas blocking the spin-up?
k
@Kevin Kho we did get the debugging set up; submitting a task to EC2 appears to be successful. What we're pretty sure the issue is (based on logs retrieved by SSHing into the EC2 host and looking at the Docker logs) is that the container being run inside the task does not have permission/credentials to talk to the Prefect Cloud API, so the requests are timing out. Is there documentation about how to provide the Prefect Cloud token/credentials to the task started by the agent?
Copy code
File "/opt/conda/lib/python3.7/site-packages/requests/adapters.py", line 504, in send
    raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='api.prefect.io', port=443): Max retries exceeded with url: /graphql (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f6268f5de10>, 'Connection to api.prefect.io timed out. (connect timeout=15)'))
z
The agent automatically passes auth through to the container; if auth is missing you'll get an unauthorized error rather than a timeout. This looks like a networking issue
k
@Zanie - we ran some tests and agree that it is a networking issue. We had been creating a task definition like the one below and specifying the networkConfiguration through run_task_kwargs, but found that the container had no ability to do outbound networking. So we tried switching to "bridge" mode in the task definition and removed the networkConfiguration (as it is not supported unless you use networkMode = awsvpc), but now we get an error:
An error occurred (InvalidParameterException) when calling the RunTask operation: Network Configuration is not valid for the given networkMode of this task definition.
It is strange that this is the error we're getting, because we shouldn't have to specify a networkConfiguration for networkMode = bridge.
Copy code
task_definition = yaml.safe_load(
      """
      networkMode: awsvpc     
      cpu: 1024
      memory: 1900
      containerDefinitions:
      - name: flow
      requiresCompatibilities:
      - EC2
      """
  )
We are wondering if something is out of date, because we found this documentation about networkConfiguration from AWS:
A networkConfiguration is now only specified for network mode awsvpc.
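(For clarity, the bridge-mode attempt described above presumably looked something like the sketch below: networkMode switched to bridge and the networkConfiguration dropped from run_task_kwargs, since boto3's run_task only accepts a networkConfiguration with the awsvpc mode. Cluster name and sizes are placeholders; this is the combination that still returned the InvalidParameterException.)
Copy code
# Sketch of the attempted bridge-mode setup (placeholder values)
import yaml
from prefect.run_configs import ECSRun

task_definition = yaml.safe_load(
    """
    networkMode: bridge
    cpu: 1024
    memory: 1900
    containerDefinitions:
    - name: flow
    requiresCompatibilities:
    - EC2
    """
)

RUN_CONFIG = ECSRun(
    # no networkConfiguration here: it is only valid with networkMode = awsvpc
    run_task_kwargs={"launchType": "EC2", "cluster": "my-cluster"},
    task_definition=task_definition,
)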
k
There is an issue open for this. Check the code snippet in this link for how to use the bridge network without having to create the whole task definition.
Wait. Seems like that’s exactly what you were doing. I think the issue here is that “awsvpc” might be the default of ECSRun if you don’t specify anything. (Not 100% sure)
k
Yeah, I have looked at this issue before
and that is what we were trying
right, but if we specify bridge mode in the task def, we get the error about
Network Configuration is not valid for the given networkMode of this task definition.
but we shouldn't need to use a networkConfiguration for bridge mode (because it's only available for awsvpc)
k
My bad. I see exactly what you’re saying now. You are right that it shouldn’t need to specify the network configuration. That’s weird. Let me check a previous thread I had with someone one sec.
Are you still using the agent as a service? This person had to restart their service for the changes to be applied (though admittedly a different error). Are you using EC2 for the launch type?
k
What do you mean by using the agent as a service?
We are just running the agent as a task on an ECS cluster, then the agent submits flows as ECS tasks on the cluster
I think it's just the standard method for a production deployment of the ECS agent
k
I meant like this. I think we’re talking about the same thing. You might have to change the networkMode in the JSON and restart?
k
Yes, that is what I mean. We have a couple of questions - does the networkMode of the agent need to match the networkMode of the task definition? And does the 'requiresCompatibilities' of the agent need to match the 'requiresCompatibilities' of the task definition?
k
Let me look at the source one sec.
Looks like the launchType is taken from the agent and defaults to Fargate. So if you specify it on the RunConfig, it won’t override. So yes, I think they need to match.
The launchType follows the agent, but requiresCompatibilities is overridden by the RunConfig; otherwise it takes the launch type. I guess I don’t have a clear answer to whether they need to match. I’ll look a bit more.
k
Ok, we tried switching to launchType = Fargate and networkMode = awsvpc, as our agent is also running on Fargate, but we seem to have the same problem where the task is started but immediately fails (possibly due to the same networking issue, though it's hard to tell as we can't SSH into the Fargate host)
k
Could you give me your task definition and run config with sensitive stuff removed so I can test this?
k
Copy code
task_definition = yaml.safe_load(
    """
    networkMode: awsvpc
    cpu: 1024
    memory: 2048
    containerDefinitions:
    - name: flow
    requiresCompatibilities:
    - EC2
    """
)


RUN_CONFIG = ECSRun(
    run_task_kwargs={
        "launchType": "EC2",
        "cluster": "jupiter-eos-swift-prefect-dev",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "assignPublicIp": "DISABLED",
                "subnets": ["subnet-1", "subnet-2"],
                "securityGroups": ["sg-1"],
            }
        },
    },
    task_role_arn="arn:aws:iam::my-task-role",
    execution_role_arn="arn:aws:iam::my-task-execution_role",
    task_definition=task_definition,
)
Our agent is running as a Fargate agent, also with awsvpc configuration
ok, we got it to work!
I copied all the network configurations from the agent to the task def
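(For anyone following along, "copying the network configuration from the agent" amounts to something like the sketch below: the flow's run_task_kwargs reuses the exact subnets and security groups of the agent's own service, so the flow container gets the same outbound route to the Prefect Cloud API as the agent. All IDs are placeholders.)
Copy code
# Sketch only; placeholder IDs. The task definition itself stays as posted above.
from prefect.run_configs import ECSRun

task_definition = {}  # placeholder: the awsvpc task definition posted above

RUN_CONFIG = ECSRun(
    run_task_kwargs={
        "cluster": "my-cluster",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                # copied verbatim from the agent's own service networking
                "assignPublicIp": "DISABLED",
                "subnets": ["subnet-used-by-the-agent"],
                "securityGroups": ["sg-used-by-the-agent"],
            }
        },
    },
    task_role_arn="arn:aws:iam::<account>:role/<task-role>",
    execution_role_arn="arn:aws:iam::<account>:role/<execution-role>",
    task_definition=task_definition,
)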
k
Wow! I am so relieved you got this to work. Thanks for all the patience! I think this was a 2-week thing? Thanks for circling back; I was just setting stuff up to test.
z
If someone can get me a summary of what the issue finally was and some action items I can try to get some improvements into the `ECSAgent`/`ECSRun`
k
Yeah, not completely out of the woods yet, because now we are getting an error from ECS when we try to start a task:
CannotPullContainerError: inspect image has been retried 1 time(s): failed to resolve ref
even though the flow and container haven't changed at all. We did add one permission to the task role but didn't remove anything, so right now I have no idea why it worked before but doesn't now.
@Zanie I'll get you a summary once we get it working again
k
Is your image hosted on ECR or some other registry?
k
it's on ECR
k
When you got it working, did you log in to ECR?
k
in the AWS console?
k
I'm not sure about the agent as a task, but when I use ECS to pull from ECR, I authenticate with something like
aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin XXXXXXXX.dkr.ecr.us-east-2.amazonaws.com
k
No, I don't think we did that; we just gave the task permission to pull an image from ECR
I think the permission possibly got removed, because I don't see it in the policies right now, but I need to wait for my colleagues in the west to wake up to change that
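(For reference, the ECR-pull permissions an ECS task execution role typically needs are the statement below - these are the same ECR actions AWS bundles into its managed AmazonECSTaskExecutionRolePolicy. Shown as a Python dict; scope the resource down to your repository ARN if you prefer.)
Copy code
# Sketch: ECR-pull portion of a task execution role policy
ecr_pull_statement = {
    "Effect": "Allow",
    "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
    ],
    # GetAuthorizationToken requires "*"; the image-pull actions can be scoped
    # to a specific repository ARN
    "Resource": "*",
}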
@Zanie @Kevin Kho Here is a summary of the issues I encountered: We are running the Prefect ECSAgent as a Fargate task on an ECS cluster and use the agent to submit flows as ECS tasks (so pretty much a standard production setup). I started by using default params for the ECSRun config, but got an error saying that I couldn't create Fargate tasks, so I changed the launchType to EC2 and then ran into this issue. However, when trying the fix that Kevin suggested (i.e. setting networkMode to bridge), I would get the error:
Network Configuration is not valid for the given networkMode of this task definition.
This error doesn't make sense, as boto will only allow you to pass a networkConfiguration with the awsvpc network mode, so we had to switch to awsvpc. But then we were unable to connect to the Prefect API, as the container being run inside the task did not have any networking ability, so the requests timed out. Since I knew the Prefect agent was able to connect to the Prefect API, I decided to copy all of the agent configs to the task definition, and magically it worked. Which leaves us with a couple of questions: why do we get an error about network configuration when using networkMode bridge? And do the configurations and launchType need to be identical between the agent and the task it's trying to launch?
@Zanie @Kevin Kho I was wondering if there was any update on getting the ECSAgent to be able to launch EC2 tasks (rather than Fargate)? We would still like to be able to launch EC2 tasks but I'm not sure how this is possible (re: the message above about requiring network configuration). Does the agent task also need to be running as an EC2 task if we want the agent to launch EC2 tasks?
k
Hey @Kathryn Klarich, I’ll ping a community member who I know got this working on EC2. They are in a different timezone so they may take a while to respond.
k
ok thanks @Kevin Kho
b
Hey @Kathryn Klarich, yes, you are correct. If you want to launch EC2 tasks (as flows), you need to have your agent running as an EC2-type service. The networking setup in the GH issue should suffice. One thing I would note is to also specify the EC2 launch type argument on your agent. If you run into any issues, can you share what errors you're getting and what your run config looks like so that we can help out better?
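(To make that concrete, a minimal sketch of the matching pair, assuming Prefect <= 1.x: an agent configured for the EC2 launch type and a flow run config that also targets EC2. Cluster name and label are placeholders.)
Copy code
# Sketch only: keep the launch type consistent between agent and flow
from prefect.agent.ecs.agent import ECSAgent
from prefect.run_configs import ECSRun

# Agent side (deployed as an EC2-type ECS service, per the advice above)
agent = ECSAgent(
    cluster="my-cluster",
    launch_type="EC2",
    labels=["ecs-ec2"],
)
# agent.start()

# Flow side: same cluster, same launch type, matching label
RUN_CONFIG = ECSRun(
    labels=["ecs-ec2"],
    run_task_kwargs={"launchType": "EC2", "cluster": "my-cluster"},
)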
k
Hey @Ben Muller thanks for the info and sorry for the slow reply - I have been on vacation. We are currently running the agent as a Fargate task so it sounds like we will have to switch that. I'll try what you suggested and see if I run into any issues. Thanks again!