# prefect-server
b
Hi all! We’re deploying the Prefect ECS agent to Fargate. It works fine locally, but on Fargate we’re getting neither a successful health check nor any logs appearing in CloudWatch. Bear in mind that we can deliberately bork the build and see logs. So I’m wondering if anyone else has had a similar experience. Would it be better to start with the Docker version of the agent and try to get that working first?
c
Hi Ben, we've set up a full CDK'd ECS/Fargate Prefect Agent deployment - maybe this might help? https://github.com/pangeo-forge/pangeo-forge-aws-bakery/blob/main/cdk/bakery_stack.py#L170-L183
b
Thank you, much appreciated
c
No problem, let me know how it goes!
We're actively developing that project so we're keen on making it a useful reference for folks
b
I notice you set up quite a high memory limit on the task.
c
That could very well be dropped. I wouldn't take the memory settings we've got as gospel
😅
b
So it all looks fairly standard - were there any particular hurdles you had to jump over to get it working? I’m clutching at straws here as we’re very confident that the AWS environment is configured correctly but can’t seem to get the agent to start correctly or even send anything to CloudWatch.
c
Are you able to check the `Stopped` tasks on your service? You might be experiencing errors pulling images
Usually if you can't see anything obvious, the Task is the failure point
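If it helps, one way to pull those stop reasons out with the AWS CLI (the cluster name here is a placeholder, substitute your own):

```shell
CLUSTER=prefect-agent-cluster   # placeholder name

# Grab the ARNs of recently stopped tasks on the cluster...
STOPPED=$(aws ecs list-tasks --cluster "$CLUSTER" \
  --desired-status STOPPED --query 'taskArns[]' --output text)

# ...and read their stop reasons. Image-pull failures show up here as
# "CannotPullContainerError: ..." rather than anywhere in CloudWatch.
aws ecs describe-tasks --cluster "$CLUSTER" --tasks $STOPPED \
  --query 'tasks[].{reason:stoppedReason,containers:containers[].reason}'
```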
Unfortunately, without being able to click around, I can't offer much help - though I'm more than happy to jump on a call if you're able to screenshare
b
Thanks Ciaran, that’s very generous of you.
Let’s see if we have much luck - otherwise I might take you up on that offer.
c
Cool no worries! But yeah, my first go-to would be seeing if your service has a graveyard of `Stopped` tasks
It may very well be stuck in a loop of trying to stand it up
b
I think we’re likely to have serious problems with importing images due to the closed environment.
Do you know if the plain Docker agent will run alright on Fargate?
c
Hmmm I don't know. Where are the images located? Could you use ECR?
That's what we're doing
b
Yeah, we could put the images on ECR.
I’m slightly nonplussed by the lack of log output though. I’d expect to see at least an error.
c
If it can't start the task, there won't be logs
The Service itself won't log
b
Hang on a second, I think we’re getting the conversation back to front.
c
So if it's a pull error, it's not even getting to where CloudWatch would have access to logs
b
Haaaang on
Sorry, no I misunderstood what you were getting at. We pull the image onto ECR and then launch it with a task on ECS and Fargate.
That’s fine. It stands up and then starts additional tasks on the cluster.
As that’s taking place, we see no logs, and are getting failed health checks.
Ordinarily that would point to a problem with security groups on the load balancer, but we’re certain those are configured correctly.
What the agent can’t do is call out to the internet in any way.
There’s no egress at the moment.
c
How are you deploying this?
b
So obviously there’ll be issues, but we were expecting to see logs.
It’s Terraform and Gitlab. With a docker push to ECR.
c
Does your Fargate Service have a public IP?
b
No
c
Won't that be why there's no Egress then?
b
There’s no Egress on purpose at the moment.
Like I said, the environment is very locked down. We expect there to be a failure when the agent tries to call out to Prefect Cloud, but what we’re not currently seeing are any logs, and the agent is failing its health check (which is fine, but again - no logs)
c
Without this, I didn't get logs.
b
Yes, the environment for the container is identical to that which we’re using for other services, and they all log out without any problems.
c
It's on the container definition though, not the environment
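For reference, the relevant bit is the `logConfiguration` block on the container definition. A rough Terraform sketch of what that might look like - all names, sizes, and the region here are made up for illustration:

```hcl
# Sketch only -- resource names, log group, region, and image are placeholders.
resource "aws_ecs_task_definition" "agent" {
  family                   = "prefect-agent"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 512
  memory                   = 1024
  # Execution role must carry CloudWatch Logs permissions (defined elsewhere).
  execution_role_arn       = aws_iam_role.agent_execution.arn

  container_definitions = jsonencode([{
    name  = "agent"
    image = "<account>.dkr.ecr.<region>.amazonaws.com/prefect-agent:latest"
    # Without this block, nothing the container writes reaches CloudWatch.
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        awslogs-group         = "/ecs/prefect-agent"
        awslogs-region        = "eu-west-2"   # example region
        awslogs-stream-prefix = "agent"
      }
    }
  }])
}
```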
b
c
Hmmm. I forgot it's Terraform, so you'll need more upfront declarations. Maybe the permissions around logging to CloudWatch are incorrect?
My guess is CDK does some magic for me to give it rights to log out
b
They’re all default and the same as the others that work.
c
Hmmm.
At this point, I'm not sure then
b
Thanks for your help. It’s a bit of a head-scratcher.
We may try deploying the basic Docker version of the agent and seeing if that behaves differently.
It would be good to know if that’s a viable approach.
c
Not sure if it helps, but this is the ENTRYPOINT for our Agent image:
```
ENTRYPOINT ["prefect", "agent", "ecs", "start", "--agent-address", "http://:8080"]
```
b
We have something very similar.
c
We are then also passing in the ECS cluster ARN and ECS Task Role ARN
b
Likewise
Although in our case we’re starting it with a CMD
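To make the ENTRYPOINT/CMD split concrete, here's a hedged Dockerfile sketch - the base image, flags, and placeholders are illustrative assumptions, not a verified config:

```dockerfile
# Sketch only -- base image and flag values are illustrative.
FROM python:3.9-slim
RUN pip install --no-cache-dir "prefect[aws]"

# The fixed part of the invocation lives in ENTRYPOINT...
ENTRYPOINT ["prefect", "agent", "ecs", "start", "--agent-address", "http://:8080"]

# ...and the variable part (cluster / task role ARNs) in CMD. ECS's
# containerDefinitions[].command overrides CMD, so these can be supplied
# from the task definition instead of being baked into the image.
CMD ["--cluster", "<cluster-arn>", "--task-role-arn", "<task-role-arn>"]
```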
c
What image are you using for the Agent? Can you try just using a 'vanilla' prefect one? This is all we're using at the moment: https://github.com/pangeo-forge/pangeo-forge-aws-bakery/blob/main/images/agent/Dockerfile
b
We just do a `pip install prefect[aws]` and run that.
c
Try out that image and see what it does
b
Hi Ciaran - when you say you’re then also passing in those values, how are you achieving that?
Via env vars?
c
@Ben Collier The ARNs?
So that when the container starts up it does `<ENTRYPOINT> <COMMANDS>`
Ben, I think that ECR actually requires you to have internet access
I seem to remember having this where my instances without public IPs couldn't pull images.
b
We can load up images from ECR on all our other instances.
And those have got identical configurations.
Except that they’re running apps.
Django, Node etc.
c
Hmmm okay. I know that you can add VPC Endpoints to allow your locked down networks to access ECR...
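For anyone following along, a rough Terraform sketch of the endpoints a no-egress VPC typically needs for ECR pulls and CloudWatch logging - every variable and local here is a placeholder:

```hcl
# Illustrative only -- vpc_id, subnets, SG, and route tables are placeholders.
locals {
  region = "eu-west-2"   # example region
}

# Interface endpoints: ECR's API, the Docker registry, and CloudWatch Logs.
resource "aws_vpc_endpoint" "interface" {
  for_each            = toset(["ecr.api", "ecr.dkr", "logs"])
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${local.region}.${each.value}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [var.endpoint_sg_id]
  private_dns_enabled = true
}

# ECR stores image layers in S3, which needs a Gateway endpoint.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${local.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}
```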
b
We’re in a tightly controlled environment. I think the cloud team may have put in endpoints for that purpose.
That would explain why we can do it.
Of course, that doesn’t explain why the task loads one version of the agent and then tries to load lots more and fails.
Is that a behaviour - does the agent load additional agents?
I can see it launching additional tasks.
c
No, that isn't the usual behaviour, ours only runs one task
b
Hmm
c
b
And that’s how ours is configured.
But then it looks to me like I’m seeing additional tasks firing off.
Which I assumed came from the agent.
k
Hey, joining in late but just to summarize thus far:
1. The tasks aren’t working - no successful health checks and no CloudWatch logs.
2. The tasks are being duplicated.
3. You’re using Prefect Cloud, but hitting it from a tightly controlled environment?
b
Basically yes. We’ve proven the environment works: if we bork the docker CMD in the Dockerfile on purpose, we see error messages in CloudWatch when the image is run by the task.
We appear to see multiple tasks being launched in ECS.
We’re using a pattern with Prefect Cloud and agents running on our own infrastructure. There is currently no egress, but I would expect the agent to at least log something to CloudWatch when it launches, even if that’s just the Prefect ASCII graphics that appear when it starts up.
k
What do you think of adding an idempotency key to the flow and seeing if that prevents multiple ECS task creations?
b
I don’t think we’re getting to the point where that would be a thing. We’re literally just starting a basic agent up at this point and it has no flow associated with it. Unless I’m misunderstanding you.
k
Oh I see what you mean
Will grab someone on the team with more ECS experience
b
Thanks Kevin
m
Hello! 👋 You can take a look at this PR with an example of how to set up the agent running as an AWS service. But because you are using Prefect Server, I believe you need to update the `PREFECT__CLOUD__API` environment variable (line 317) and point it to your API, something like `http://<your_ip>:4200/graphql`.
b
Hi Mariia. We’re using Prefect Cloud, is that the same thing? Sorry, I’m a bit new to the world of Prefect.
m
oops, I misunderstood. So as I understand it, for Cloud the agent needs to talk to `https://api.prefect.io`, and that's not possible to achieve under your network policies, right?
b
That’s correct, for now - but I would expect the agent to at least log out errors to Cloudwatch, right?
m
hmm, good question, I think you should see at least some error logs in CloudWatch, but let me test it to be sure
b
Thanks Mariia
I doubt the agent has any of those roles available to it. Like I said, we’re in a very restricted environment. I’ll try giving it all those roles tomorrow and see what happens! An error, I expect.
m
oh, yeah, for using CloudWatch you definitely need to provide an execution role with CloudWatch permissions, otherwise the agent will be blocked from using the CloudWatch API
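A minimal Terraform sketch of such an execution role, assuming the AWS-managed `AmazonECSTaskExecutionRolePolicy` is acceptable (resource names are illustrative):

```hcl
# Sketch: execution role assumable by ECS tasks; names are placeholders.
resource "aws_iam_role" "agent_execution" {
  name = "prefect-agent-execution"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
    }]
  })
}

# The AWS-managed policy covers ECR pulls plus logs:CreateLogStream /
# logs:PutLogEvents, which is what CloudWatch logging needs.
resource "aws_iam_role_policy_attachment" "agent_execution" {
  role       = aws_iam_role.agent_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
```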
b
Mariia, can you confirm for me - does the agent kick off additional tasks in the cluster once it’s running?
m
yes, you are correct, that is the purpose of the agent - it kicks off tasks for your flow runs
b
Okay, thanks. Then I guess the question is whether we would expect it to just bomb out if it didn’t have the right permissions.
m
So, after you register the task definition for the agent, if you don’t have the right permissions, in Events you’ll see something like this:
b
We just tested it with no permissions at all and the agent bombed out with a stack trace.
m
you mean you didn’t provide a task role or execution role for the agent, and it worked in ECS (not locally on your machine) and you can see logs in CloudWatch?