# prefect-server
b
Hi all! We’re deploying the Prefect ECS agent to Fargate. It works fine locally, but on Fargate we’re getting neither a successful health check nor any logs appearing in CloudWatch. Bear in mind that we can deliberately bork the build and see logs. So I’m wondering if anyone else has had a similar experience. Would it be better to start with the Docker version of the agent and try to get that working first?
c
Hi Ben, we've set up a full CDK'd ECS/Fargate Prefect Agent deployment - maybe this might help? https://github.com/pangeo-forge/pangeo-forge-aws-bakery/blob/main/cdk/bakery_stack.py#L170-L183
b
Thank you, much appreciated
c
No problem, let me know how it goes!
We're actively developing that project so we're keen on making it a useful reference for folks
b
I notice you set up quite a high memory limit on the task.
c
That could very well be dropped. I wouldn't take the memory settings we've got as gospel
😅
b
So it all looks fairly standard - were there any particular hurdles you had to jump over to get it working? I’m clutching at straws here as we’re very confident that the AWS environment is configured correctly but can’t seem to get the agent to start correctly or even send anything to CloudWatch.
c
Are you able to check the `Stopped` tasks on your service? You might be experiencing errors pulling images
Usually if you can't see anything obvious, the Task is the failure point
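If it helps, one way to pull those stop reasons out with the AWS CLI (the cluster name here is a placeholder, substitute your own):

```shell
CLUSTER=prefect-agent-cluster   # placeholder name

# Grab the ARNs of recently stopped tasks on the cluster...
STOPPED=$(aws ecs list-tasks --cluster "$CLUSTER" \
  --desired-status STOPPED --query 'taskArns[]' --output text)

# ...and read their stop reasons. Image-pull failures show up here as
# "CannotPullContainerError: ..." rather than anywhere in CloudWatch.
aws ecs describe-tasks --cluster "$CLUSTER" --tasks $STOPPED \
  --query 'tasks[].{reason:stoppedReason,containers:containers[].reason}'
```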
Unfortunately, without being able to click around, I can't offer much help - though I'm more than happy to jump on a call if you're able to screenshare
b
Thanks Ciaran, that’s very generous of you.
Let’s see if we have much luck - otherwise I might take you up on that offer.
c
Cool no worries! But yeah, my first go-to would be seeing if your service has a graveyard of `Stopped` tasks
It may very well be stuck in a loop of trying to stand it up
b
I think we’re likely to have serious problems with importing images due to the closed environment.
Do you know if the plain Docker agent will run alright on Fargate?
c
Hmmm I don't know. Where are the images located? Could you use ECR?
That's what we're doing
b
Yeah, we could put the images on ECR.
I’m slightly nonplussed by the lack of log output though. I’d expect to see at least an error.
c
If it can't start the task, there won't be logs
The Service itself won't log
b
Hang on a second, I think we’re getting the conversation back to front.
c
So if it's a pull error, it's not even getting to where CloudWatch would have access to logs
b
Haaaang on
Sorry, no I misunderstood what you were getting at. We pull the image onto ECR and then launch it with a task on ECS and Fargate.
That’s fine. It stands up and then starts additional tasks on the cluster.
As that’s taking place, we see no logs, and are getting failed health checks.
Ordinarily that would point to a problem with security groups on the load balancer, but we’re certain those are configured correctly.
What the agent can’t do is call out to the internet in any way.
There’s no egress at the moment.
c
How are you deploying this?
b
So obviously there’ll be issues, but we were expecting to see logs.
It’s Terraform and Gitlab. With a docker push to ECR.
c
Does your Fargate Service have a public IP?
b
No
c
Won't that be why there's no Egress then?
b
There’s no Egress on purpose at the moment.
Like I said, the environment is very locked down. We expect there to be a failure when the agent tries to call out to Prefect Cloud, but what we’re not currently seeing are any logs, and the agent is failing its health check (which is fine, but again - no logs)
c
Without this, I didn't get logs.
b
Yes, the environment for the container is identical to that which we’re using for other services, and they all log out without any problems.
c
It's on the container definition though, not the environment
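For reference, the relevant bit is the `logConfiguration` block on the container definition. A rough Terraform sketch of what that might look like - all names, sizes, and the region here are made up for illustration:

```hcl
# Sketch only -- resource names, log group, region, and image are placeholders.
resource "aws_ecs_task_definition" "agent" {
  family                   = "prefect-agent"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 512
  memory                   = 1024
  # Execution role must carry CloudWatch Logs permissions (defined elsewhere).
  execution_role_arn       = aws_iam_role.agent_execution.arn

  container_definitions = jsonencode([{
    name  = "agent"
    image = "<account>.dkr.ecr.<region>.amazonaws.com/prefect-agent:latest"
    # Without this block, nothing the container writes reaches CloudWatch.
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        awslogs-group         = "/ecs/prefect-agent"
        awslogs-region        = "eu-west-2"   # example region
        awslogs-stream-prefix = "agent"
      }
    }
  }])
}
```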
b
c
Hmmm. I forgot it's Terraform, so you'll need more upfront declarations. Maybe the permissions around logging to CloudWatch are incorrect?
My guess is CDK does some magic for me to give it rights to log out
b
They’re all default and the same as the others that work.
c
Hmmm.
At this point, I'm not sure then
b
Thanks for your help. It’s a bit of a head-scratcher.
We may try deploying the basic Docker version of the agent and seeing if that behaves differently.
It would be good to know if that’s a viable approach.
c
Not sure if it helps, but this is the ENTRYPOINT for our Agent image:
```
ENTRYPOINT ["prefect", "agent", "ecs", "start", "--agent-address", "http://:8080"]
```
b
We have something very similar.
c
We are then also passing in the ECS cluster ARN and ECS Task Role ARN
b
Likewise
Although in our case we’re starting it with a CMD
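To make the ENTRYPOINT/CMD split concrete, here's a hedged Dockerfile sketch - the base image, flags, and placeholders are illustrative assumptions, not a verified config:

```dockerfile
# Sketch only -- base image and flag values are illustrative.
FROM python:3.9-slim
RUN pip install --no-cache-dir "prefect[aws]"

# The fixed part of the invocation lives in ENTRYPOINT...
ENTRYPOINT ["prefect", "agent", "ecs", "start", "--agent-address", "http://:8080"]

# ...and the variable part (cluster / task role ARNs) in CMD. ECS's
# containerDefinitions[].command overrides CMD, so these can be supplied
# from the task definition instead of being baked into the image.
CMD ["--cluster", "<cluster-arn>", "--task-role-arn", "<task-role-arn>"]
```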
c
What image are you using for the Agent? Can you try just using a 'vanilla' prefect one? This is all we're using at the moment: https://github.com/pangeo-forge/pangeo-forge-aws-bakery/blob/main/images/agent/Dockerfile
b
We just do a `pip install prefect[aws]` and run that.
c
Try out that image and see what it does
b
Hi Ciaran - when you say you’re then also passing in those values, how are you achieving that?
Via env vars?
c
@Ben Collier The ARNs?
So that when the container starts up it does `<ENTRYPOINT> <COMMANDS>`
Ben, I think that ECR actually requires you to have internet access
I seem to remember having this where my instances without public IPs couldn't pull images.
b
We can load up images from ECR on all our other instances.
And those have got identical configurations.
Except that they’re running apps.
Django, Node etc.
c
Hmmm okay. I know that you can add VPC Endpoints to allow your locked down networks to access ECR...
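For anyone following along, a rough Terraform sketch of the endpoints a no-egress VPC typically needs for ECR pulls and CloudWatch logging - every variable and local here is a placeholder:

```hcl
# Illustrative only -- vpc_id, subnets, SG, and route tables are placeholders.
locals {
  region = "eu-west-2"   # example region
}

# Interface endpoints: ECR's API, the Docker registry, and CloudWatch Logs.
resource "aws_vpc_endpoint" "interface" {
  for_each            = toset(["ecr.api", "ecr.dkr", "logs"])
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${local.region}.${each.value}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [var.endpoint_sg_id]
  private_dns_enabled = true
}

# ECR stores image layers in S3, which needs a Gateway endpoint.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${local.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}
```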
b
We’re in a tightly controlled environment. I think the cloud team may have put in endpoints for that purpose.
That would explain why we can do it.
Of course, that doesn’t explain why the task loads one version of the agent and then tries to load lots more and fails.
Is that a behaviour - does the agent load additional agents?
I can see it launching additional tasks.
c
No, that isn't the usual behaviour, ours only runs one task
b
Hmm
c
b
And that’s how ours is configured.
But then it looks to me like I’m seeing additional tasks firing off.
Which I assumed came from the agent.
k
Hey, joining in late but just to summarize thus far:
1. The tasks aren’t working - no successful health checks and no CloudWatch logs.
2. The tasks are being duplicated.
3. You’re using Prefect Cloud, but hitting it from a tightly controlled environment?
b
Basically yes. We’ve proven the environment works: if we bork the docker CMD in the Dockerfile on purpose, we see error messages in CloudWatch when the image is run by the task.
We appear to see multiple tasks being launched in ECS.
We’re using a pattern with Prefect Cloud and agents running on our own infrastructure. There is currently no egress, but I would expect the agent to at least log something to CloudWatch when it launches, even if that’s just the Prefect ASCII graphics that appear when it starts up.
k
What do you think of adding an idempotency key to the flow and seeing if that prevents multiple ECS task creations?
b
I don’t think we’re getting to the point where that would be a thing. We’re literally just starting a basic agent up at this point and it has no flow associated with it. Unless I’m misunderstanding you.
k
Oh I see what you mean
Will grab someone on the team with more ECS experience
b
Thanks Kevin
m
Hello! 👋 You can take a look at this PR with an example of how to set up the agent running as an AWS service. But because you are using Prefect Server, I believe you need to update the `PREFECT__CLOUD__API` environment variable (line 317) and point it to your API, something like `http://<your_ip>:4200/graphql`.
b
Hi Mariia. We’re using Prefect Cloud, is that the same thing? Sorry, I’m a bit new to the world of Prefect.
m
oops, I misunderstood. So as I understand it, for Cloud the agent needs to talk to `https://api.prefect.io`, and that's not possible to achieve under your network policies, right?
b
That’s correct, for now - but I would expect the agent to at least log out errors to Cloudwatch, right?
m
hmm, good question, I think you should see at least some error logs in CloudWatch, but let me test it to be sure
b
Thanks Mariia
I doubt the agent has any of those roles available to it. Like I said, we’re in a very restricted environment. I’ll try giving it all those roles tomorrow and see what happens! An error, I expect.
m
oh, yeah, for using CloudWatch you definitely need to provide an execution role with CloudWatch permissions, otherwise the agent will be blocked from using the CloudWatch API
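A minimal Terraform sketch of such an execution role, assuming the AWS-managed `AmazonECSTaskExecutionRolePolicy` is acceptable (resource names are illustrative):

```hcl
# Sketch: execution role assumable by ECS tasks; names are placeholders.
resource "aws_iam_role" "agent_execution" {
  name = "prefect-agent-execution"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
    }]
  })
}

# The AWS-managed policy covers ECR pulls plus logs:CreateLogStream /
# logs:PutLogEvents, which is what CloudWatch logging needs.
resource "aws_iam_role_policy_attachment" "agent_execution" {
  role       = aws_iam_role.agent_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
```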
b
Mariia, can you confirm for me - does the agent kick off additional tasks in the cluster once it’s running?
m
yes, you are correct, that is the purpose of the agent - it kicks off tasks for your flow runs
b
Okay, thanks. Then I guess the question is whether we would expect it to just bomb out if it didn’t have the right permissions.
m
So, after you register the task definition for the agent, if you don’t have the right permissions, in Events you’ll see something like this:
b
We just tested it with no permissions at all and the agent bombed out with a stack trace.
m
you mean you didn’t provide a task role or execution role for the agent, and it worked in ECS (not locally on your machine) and you can see logs in CloudWatch?