b

Ben Collier

04/28/2021, 11:11 AM
Hi all! We’re deploying the prefect ecs agent to Fargate. Works fine locally, but on Fargate we’re getting neither a successful health check nor any logs appearing in Cloudwatch. Bear in mind that we can deliberately bork the build and see logs. So I’m wondering if anyone else has had a similar experience. Would it be better to start with the docker version of the agent and try to get that working first?
c

ciaran

04/28/2021, 11:21 AM
Hi Ben, we've set up a full CDK'd ECS/Fargate Prefect Agent deployment, maybe this might help? https://github.com/pangeo-forge/pangeo-forge-aws-bakery/blob/main/cdk/bakery_stack.py#L170-L183
b

Ben Collier

04/28/2021, 11:22 AM
Thank you, much appreciated
c

ciaran

04/28/2021, 11:23 AM
No problem, let me know how it goes!
We're actively developing that project so we're keen on making it a useful reference for folks
b

Ben Collier

04/28/2021, 11:24 AM
I notice you set up quite a high memory limit on the task.
c

ciaran

04/28/2021, 11:25 AM
That could very well be dropped. I wouldn't take the memory settings we've got as gospel
😅
b

Ben Collier

04/28/2021, 11:30 AM
So it all looks fairly standard - were there any particular hurdles you had to jump over to get it working? I’m clutching at straws here as we’re very confident that the AWS environment is configured correctly but can’t seem to get the agent to start correctly or even send anything to CloudWatch.
c

ciaran

04/28/2021, 11:32 AM
Are you able to check the Stopped tasks on your service? You might be experiencing errors pulling images
Usually if you can't see anything obvious, the Task is the failure point
Unfortunately, without being able to click around, I can't offer much help. If you're able to screenshare, though, I'm more than happy to jump on a call
b

Ben Collier

04/28/2021, 11:34 AM
Thanks Ciaran, that’s very generous of you.
Let’s see if we have much luck - otherwise I might take you up on that offer.
c

ciaran

04/28/2021, 11:35 AM
Cool no worries! But yeah, my first go-to would be seeing if your service has a graveyard of Stopped tasks
It may very well be stuck in a loop of trying to stand it up
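One way to inspect that graveyard without clicking through the console is sketched below. It assumes boto3 is installed and AWS credentials are configured; the cluster and service names passed to `fetch_stopped_tasks` are placeholders, and the summarizer is a hypothetical helper rather than anything from Prefect or the thread.

```python
from typing import Any, Dict, List


def summarize_stopped_tasks(describe_response: Dict[str, Any]) -> List[str]:
    """Turn an ECS DescribeTasks response into one readable line per task."""
    lines = []
    for task in describe_response.get("tasks", []):
        # The task ID is the last path segment of the task ARN.
        task_id = task.get("taskArn", "unknown").split("/")[-1]
        reason = task.get("stoppedReason", "no stoppedReason recorded")
        lines.append(f"{task_id}: {reason}")
        for container in task.get("containers", []):
            name = container.get("name", "unknown")
            exit_code = container.get("exitCode")
            detail = container.get("reason", "")
            lines.append(f"  {name}: exit={exit_code} {detail}".rstrip())
    return lines


def fetch_stopped_tasks(cluster: str, service: str) -> Dict[str, Any]:
    """Query ECS for recently stopped tasks of a service (needs AWS credentials)."""
    import boto3  # imported lazily so the summarizer works without boto3

    ecs = boto3.client("ecs")
    arns = ecs.list_tasks(
        cluster=cluster, serviceName=service, desiredStatus="STOPPED"
    )["taskArns"]
    if not arns:
        return {"tasks": []}
    return ecs.describe_tasks(cluster=cluster, tasks=arns)
```

Calling `summarize_stopped_tasks(fetch_stopped_tasks("my-cluster", "prefect-agent"))` would then print the stop reason for each task. ECS only keeps stopped tasks visible for a short while, so this is most useful right after a failed deployment.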
b

Ben Collier

04/28/2021, 11:36 AM
I think we’re likely to have serious problems with importing images due to the closed environment.
Do you know if the plain Docker agent will run alright on Fargate?
c

ciaran

04/28/2021, 11:39 AM
Hmmm I don't know. Where are the images located? Could you use ECR?
That's what we're doing
b

Ben Collier

04/28/2021, 11:39 AM
Yeah, we could put the images on ECR.
I’m slightly nonplussed by the lack of log output though. I’d expect to see at least an error.
c

ciaran

04/28/2021, 11:41 AM
If it can't start the task, there won't be logs
The Service itself won't log
b

Ben Collier

04/28/2021, 11:41 AM
Hang on a second, I think we’re getting the conversation back to front.
c

ciaran

04/28/2021, 11:41 AM
So if it's a pull error, it's not even getting to where Cloudwatch would have access to logs
b

Ben Collier

04/28/2021, 11:41 AM
Haaaang on
Sorry, no I misunderstood what you were getting at. We pull the image onto ECR and then launch it with a task on ECS and Fargate.
That’s fine. It stands up and then starts additional tasks on the cluster.
As that’s taking place, we see no logs, and are getting failed health checks.
Ordinarily that would point to a problem with security groups on the load balancer, but we’re certain those are configured correctly.
What the agent can’t do is call out to the internet in any way.
There’s no egress at the moment.
c

ciaran

04/28/2021, 11:44 AM
How are you deploying this?
b

Ben Collier

04/28/2021, 11:44 AM
So obviously there’ll be issues, but we were expecting to see logs.
It’s Terraform and Gitlab. With a docker push to ECR.
c

ciaran

04/28/2021, 11:45 AM
Does your Fargate Service have a public IP?
b

Ben Collier

04/28/2021, 11:45 AM
No
c

ciaran

04/28/2021, 11:46 AM
Won't that be why there's no egress, then?
b

Ben Collier

04/28/2021, 11:46 AM
There’s no Egress on purpose at the moment.
Like I said, the environment is very locked down, we expect there to be a failure when the agent tries to call out to Prefect Cloud, but what we’re not currently seeing are any logs, and the agent is failing a health check (which is fine, but again - no logs)
c

ciaran

04/28/2021, 11:48 AM
Without this, I didn't get logs.
b

Ben Collier

04/28/2021, 11:48 AM
Yes, the environment for the container is identical to that which we’re using for other services, and they all log out without any problems.
c

ciaran

04/28/2021, 11:50 AM
It's on the container definition though, not the environment
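A minimal sketch of what a container-level logging setting looks like, assuming the setting in question is the `logConfiguration` block of the ECS container definition with the `awslogs` driver. The container name, log group, region, and stream prefix are placeholders, not values from this thread.

```python
from typing import Any, Dict


def agent_container_definition(image: str, log_group: str, region: str) -> Dict[str, Any]:
    """Build an ECS container definition that ships stdout/stderr to CloudWatch Logs."""
    return {
        "name": "prefect-agent",  # placeholder name
        "image": image,
        "essential": True,
        # Port used by the agent's health-check address (http://:8080).
        "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
        # Logging lives here, on the container definition, not in the env vars.
        "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
                "awslogs-group": log_group,
                "awslogs-region": region,
                "awslogs-stream-prefix": "agent",
            },
        },
    }


def has_awslogs(container_definition: Dict[str, Any]) -> bool:
    """Return True if the container definition is configured for the awslogs driver."""
    log_config = container_definition.get("logConfiguration") or {}
    return log_config.get("logDriver") == "awslogs"
```

In Terraform the same structure would typically be passed as JSON to the task definition's container definitions, so a quick `has_awslogs` check against the rendered JSON can confirm the block actually made it into the deployed definition.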
b

Ben Collier

04/28/2021, 11:51 AM
c

ciaran

04/28/2021, 11:52 AM
Hmmm. I forget it's Terraform so you'll need more upfront declarations. Maybe the permissions around logging to Cloudwatch are incorrect?
My guess is CDK does some magic for me to give it rights to log out
b

Ben Collier

04/28/2021, 11:52 AM
They’re all default and the same as the others that work..
c

ciaran

04/28/2021, 11:54 AM
Hmmm.
At this point, I'm not sure then
b

Ben Collier

04/28/2021, 11:54 AM
Thanks for your help. It’s a bit of a head-scratcher.
We may try deploying the basic Docker version of the agent and seeing if that behaves differently.
It would be good to know if that’s a viable approach.
c

ciaran

04/28/2021, 11:56 AM
Not sure if it helps, but this is the ENTRYPOINT for our Agent image:
ENTRYPOINT [ "prefect", "agent", "ecs", "start", "--agent-address", "http://:8080"]
b

Ben Collier

04/28/2021, 11:59 AM
We have something very similar.
c

ciaran

04/28/2021, 12:00 PM
We are then also passing in the ECS cluster arn and ECS Task Role arn
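To illustrate how the ENTRYPOINT and the extra arguments fit together: Docker appends the container's command to the ENTRYPOINT to form the final argument list. The sketch below assumes the ARNs are passed with `--cluster` and `--task-role-arn` flags on the ECS agent CLI; the flag names and the ARNs shown are assumptions for illustration, not taken from the thread.

```python
from typing import List

# ENTRYPOINT quoted earlier in the thread.
ENTRYPOINT = ["prefect", "agent", "ecs", "start", "--agent-address", "http://:8080"]


def agent_command(cluster_arn: str, task_role_arn: str) -> List[str]:
    """Arguments supplied as the container command (flag names are assumed)."""
    return ["--cluster", cluster_arn, "--task-role-arn", task_role_arn]


def container_argv(entrypoint: List[str], command: List[str]) -> List[str]:
    """Docker runs ENTRYPOINT followed by CMD as one argument vector."""
    return list(entrypoint) + list(command)


if __name__ == "__main__":
    argv = container_argv(
        ENTRYPOINT,
        agent_command(
            "arn:aws:ecs:eu-west-2:123456789012:cluster/example",  # placeholder
            "arn:aws:iam::123456789012:role/example-task-role",  # placeholder
        ),
    )
    print(" ".join(argv))
```

If the whole invocation is placed in CMD instead, with no ENTRYPOINT, the resulting argument list is the same as long as CMD contains the full command.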
b

Ben Collier

04/28/2021, 12:00 PM
Likewise
Although in our case we’re starting it with a CMD
c

ciaran

04/28/2021, 12:03 PM
What image are you using for the Agent? Can you try just using a 'vanilla' prefect one? This is all we're using at the moment: https://github.com/pangeo-forge/pangeo-forge-aws-bakery/blob/main/images/agent/Dockerfile
b

Ben Collier

04/28/2021, 12:05 PM
We just do a pip install prefect[aws] and run that.
c

ciaran

04/28/2021, 12:10 PM
Try out that image and see what it does
👍 1
b

Ben Collier

04/28/2021, 2:51 PM
Hi Ciaran - when you say you’re then also passing in those values, how are you achieving that?
Via env vars?
c

ciaran

04/28/2021, 2:52 PM
@Ben Collier The arns?
So that when the container starts up it does
<ENTRYPOINT> <COMMANDS>
Ben, I think that ECR actually requires you to have internet access
I seem to remember having this where my instances without public IPs couldn't pull images.
b

Ben Collier

04/28/2021, 4:01 PM
We can load up images from ECR on all our other instances.
And those have got identical configurations.
Except that they’re running apps.
Django, Node etc.
c

ciaran

04/28/2021, 4:03 PM
Hmmm okay. I know that you can add VPC Endpoints to allow your locked down networks to access ECR...
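As a rough checklist, these are the VPC endpoint service names a Fargate task in a subnet without egress generally needs in order to pull from ECR and write to CloudWatch Logs. The helper is a hypothetical illustration; whether a given environment already has these endpoints is something only the cloud team can confirm.

```python
from typing import Dict


def ecr_pull_endpoints(region: str, include_logs: bool = True) -> Dict[str, str]:
    """Map VPC endpoint service names to their endpoint type for a region."""
    services = {
        # ECR API calls (auth tokens, image metadata).
        f"com.amazonaws.{region}.ecr.api": "Interface",
        # Docker registry traffic (push/pull).
        f"com.amazonaws.{region}.ecr.dkr": "Interface",
        # Image layers are stored in S3, so S3 must be reachable too.
        f"com.amazonaws.{region}.s3": "Gateway",
    }
    if include_logs:
        # Needed for the awslogs driver to reach CloudWatch Logs.
        services[f"com.amazonaws.{region}.logs"] = "Interface"
    return services


if __name__ == "__main__":
    for name, kind in ecr_pull_endpoints("eu-west-2").items():
        print(f"{kind:9} {name}")
```

Note that none of these gives the agent a route to Prefect Cloud; that still requires some form of outbound internet access.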
b

Ben Collier

04/28/2021, 4:04 PM
We’re in a tightly controlled environment. I think the cloud team may have put in endpoints for that purpose.
That would explain why we can do it.
Of course, that doesn’t explain why the task loads one version of the agent and then tries to load lots more and fails.
Is that a behaviour - does the agent load additional agents?
I can see it launching additional tasks.
c

ciaran

04/28/2021, 4:06 PM
No, that isn't the usual behaviour, ours only runs one task
b

Ben Collier

04/28/2021, 4:07 PM
Hmm
c

ciaran

04/28/2021, 4:07 PM
b

Ben Collier

04/28/2021, 4:08 PM
And that’s how ours is configured.
But then it looks to me like I’m seeing additional tasks firing off.
Which I assumed came from the agent.
k

Kevin Kho

04/28/2021, 4:13 PM
Hey, joining in late, but just to summarize thus far:
1. The tasks aren’t working: there are no successful health checks and no CloudWatch logs.
2. The tasks are being duplicated.
3. You’re using Prefect Cloud, but reaching it from a tightly controlled environment?
b

Ben Collier

04/28/2021, 4:14 PM
Basically yes. We’ve proven the environment works: if we bork the docker CMD in the Dockerfile on purpose, we see error messages in CloudWatch when the image is run by the task.
We appear to see multiple tasks being launched in ECS.
We’re using a pattern with Prefect Cloud and agents running on our own infrastructure. There is currently no egress, but I would expect the agent to at least log something to Cloudwatch when it launches, even if that’s just the Prefect ascii graphics that appear when it starts up.
k

Kevin Kho

04/28/2021, 4:18 PM
What do you think of adding an idempotency key to the flow and see if that prevents multiple ECS task creations?
b

Ben Collier

04/28/2021, 4:20 PM
I don’t think we’re getting to the point where that would be a thing. We’re literally just starting a basic agent up at this point and it has no flow associated with it. Unless I’m misunderstanding you.
k

Kevin Kho

04/28/2021, 4:21 PM
Oh I see what you mean
Will grab someone on the team with more ECS experience
b

Ben Collier

04/28/2021, 4:24 PM
Thanks Kevin
m

Mariia Kerimova

04/28/2021, 4:46 PM
Hello! 👋 You can take a look at this PR with an example of how to set up an agent running as an AWS Service. But because you are using Prefect Server, I believe you need to update the PREFECT__CLOUD__API environment variable (line 317) and point it to your API, something like http://<your_ip>:4200/graphql.
b

Ben Collier

04/28/2021, 4:47 PM
Hi Mariia. We’re using Prefect Cloud, is that the same thing? Sorry, I’m a bit new to the world of Prefect.
m

Mariia Kerimova

04/28/2021, 4:50 PM
Oops, I misunderstood. So as I understand it, for Cloud the agent needs to talk to https://api.prefect.io, and that's not possible under your network policies, right?
b

Ben Collier

04/28/2021, 4:50 PM
That’s correct, for now - but I would expect the agent to at least log out errors to Cloudwatch, right?
m

Mariia Kerimova

04/28/2021, 4:55 PM
hmm, good question, I think you should see at least some error logs in CloudWatch, but let me test it to be sure
b

Ben Collier

04/28/2021, 4:58 PM
Thanks Mariia
I doubt the agent has any of those roles available to it. Like I said, we’re in a very restricted environment. I’ll try giving it all those roles tomorrow and see what happens! An error, I expect.
m

Mariia Kerimova

04/28/2021, 5:01 PM
Oh, yeah, for using CloudWatch you definitely need to provide an execution role with CloudWatch permissions, otherwise the agent will be prohibited from using the CloudWatch API
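A minimal sketch of the logging permissions an ECS execution role needs for the `awslogs` driver, written as an IAM policy document. The log group ARN is a placeholder, and the helper is hypothetical; AWS's managed `AmazonECSTaskExecutionRolePolicy` grants these same two log actions alongside the ECR pull permissions.

```python
import json
from typing import Any, Dict


def execution_role_logging_policy(log_group_arn: str) -> Dict[str, Any]:
    """IAM policy letting the ECS execution role write container logs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
                # The ":*" suffix covers the log streams inside the group.
                "Resource": f"{log_group_arn}:*",
            }
        ],
    }


if __name__ == "__main__":
    policy = execution_role_logging_policy(
        "arn:aws:logs:eu-west-2:123456789012:log-group:/ecs/prefect-agent"  # placeholder
    )
    print(json.dumps(policy, indent=2))
```

The log group itself must already exist unless the role is also allowed `logs:CreateLogGroup` and the container's log options request auto-creation.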
b

Ben Collier

04/28/2021, 5:02 PM
Mariia, can you confirm for me - does the agent kick off additional tasks in the cluster once it’s running?
m

Mariia Kerimova

04/28/2021, 5:03 PM
Yes, you are correct, this is the purpose of the agent: it kicks off tasks for your flow runs
b

Ben Collier

04/28/2021, 5:07 PM
Okay, thanks. Then I guess the question is whether we would expect it to just bomb out if it didn’t have the right permissions.
m

Mariia Kerimova

04/28/2021, 5:14 PM
So, after you register the task definition for the agent, if you don’t have the right permissions, in Events you’ll see something like this:
b

Ben Collier

04/28/2021, 5:21 PM
We just tested it with no permissions at all and the agent bombed out with a stack trace.
m

Mariia Kerimova

04/28/2021, 5:33 PM
You mean you didn’t provide a task role or execution role for the agent, and it worked in ECS (not locally on your machine), and you can see logs in CloudWatch?