# prefect-server
g
Hi all, I'm trying to run an ECS agent over EC2 with Prefect Server. I can see Fargate tasks getting spawned in ECS, but they stop immediately, and my Prefect flow is stuck in a Scheduled state.
a
When the flows are stuck in a Scheduled state, it's usually due to a label mismatch. You need to assign a label to your ECS agent and add the same label to your ECSRun. If you're looking for an example, have a look at one of these two blog posts:
• https://towardsdatascience.com/how-to-cut-your-aws-ecs-costs-with-fargate-spot-and-prefect-1a1ba5d2e2df
• https://towardsdatascience.com/deploying-prefect-server-with-aws-ecs-fargate-and-docker-storage-36f633226c5f
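For illustration, a minimal sketch of a flow whose label matches the agent's (the flow name and the "prod" label are placeholders):
```python
from prefect import Flow, task
from prefect.run_configs import ECSRun

@task
def say_hello():
    print("hello")

# The label on the run config must match a label the ECS agent was
# started with, e.g. `prefect agent ecs start --label prod`.
with Flow("example-flow", run_config=ECSRun(labels=["prod"])) as flow:
    say_hello()
```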
g
Hi, I have done that; the label is the same for the agent and the flow.
The second pic is the ECSRun function.
a
I see. The problem is that labels on ECSRun must be a list, so it should be:
```python
ECSRun(labels=["prod"])
```
m
Hi, I'm having this same issue, and my labels are matching. However, I'm seeing in the CloudWatch logs that I'm getting a timeout when the container starts up. When using Prefect Server, is there additional configuration needed on the task definition?
a
Yes, for Server, you need to set your Server endpoint as an env variable here:
```json
"environment": [
    {
        "name": "PREFECT__CLOUD__AGENT__LABELS",
        "value": "['your_label']"
    },
    {
        "name": "PREFECT__CLOUD__AGENT__LEVEL",
        "value": "INFO"
    },
    {
        "name": "PREFECT__CLOUD__API",
        "value": "paste_your_Server_IP_here"
    }
]
```
this is part of the ECS task definition
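As a sanity check, you can print the resolved configuration from inside the flow container to confirm which backend and endpoint it will use (a minimal sketch for Prefect 1.x; the values come from the env variables above):
```python
import prefect

# For Server, backend should resolve to "server" and cloud.api should
# point at your Server instance, e.g. http://x.x.x.x:4200.
print(prefect.config.backend)
print(prefect.config.cloud.api)
```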
m
Ah, I did not include CLOUD__AGENT__LABELS; let me try that.
Still getting the same issue:
requests.exceptions.ConnectTimeout: HTTPConnectionPool(host='x.x.x.x', port=4200): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f66a952e430>, 'Connection to x.x.x.x timed out. (connect timeout=15)'))
I'm trying to run a test flow through this: I used a Docker container in ECS that only installs pandas, and I also tried prefecthq/prefect:latest-python3.8, with no success.
The agent is still stuck submitting the flow run, and the ECS container has spun up and closed due to the error above. I've gone through https://towardsdatascience.com/how-to-cut-your-aws-ecs-costs-with-fargate-spot-and-prefect-1a1ba5d2e2df and https://towardsdatascience.com/deploying-prefect-server-with-aws-ecs-fargate-and-docker-storage-36f633226c5f a few times, and I'm not sure where the disconnect is happening. I'm running the ECS agent on an EC2 micro so that our CI/CD can build and register flows nightly in lower environments.
Thank you for the help/articles 🙂
a
Did you enable a public IP? The timeout issue suggests that the request doesn't get past the firewall rules on your Server. Can you give us a short summary of all the steps you took so far and at what point exactly you got stuck?
1. Can you confirm that the agent has been deployed successfully and that it's polling your Server API for flow runs?
2. What label did you assign to it?
3. What is your run configuration? Can you share it?
m
Good morning. Public IP enabled on the EC2 instance hosting the Server? No; the infra team originally provisioned the instance, until I nagged them to death and they gave me admin in our dev VPC. Steps taken:
1. After the EC2 instances were provisioned, installed packages/docker/docker-compose and built config.toml (the agent config points to the Server address). On the Server, ufw allows 4200/8080/8000/22.
2. Started the Server, then the agent, and confirmed the agent is polling.
3. Created an IAM user for the agent, and IAM roles for task and execution with the S3 full-access policy / the custom policy provided by you in the article "How to cut your AWS ECS costs...".
4. Exported the variables mentioned in the article and created the cluster following the CLI steps (I did not create a task definition and push it to ECS, as I'm not going to be deploying a long-running service via ECS/Fargate).
5. Wrote the flow task definition YAML and pushed it to an S3 bucket; built the Dockerfile (FROM prefecthq/prefect:latest-python3.8) and pushed it to ECR.
6. Created a 'testing' project and registered the flow from my local machine using S3 storage; run_config points to the task definition YAML in S3, and run_task_kwargs uses the proper cluster/launch type.
7. On 'quick run', the flow stays in 'Scheduled' (submitted for execution), the task is created in ECS, and the deployed container is unable to connect to the Server; the task definition uses the same endpoint as the agent that is polling.
Your questions:
1. Yes, it's polling, and healthy per the Server UI.
2. ['dev']
3. Run config:
```python
RUN_CONFIG = ECSRun(
    labels=["dev"],
    task_definition_path="s3://bucket/folder/flow_task_definition.yaml",
    run_task_kwargs=dict(cluster="prefectEcsCluster", launchType="FARGATE"),
)
```
Task definition YAML:
```yaml
family: prefectFlow
requiresCompatibilities:
  - FARGATE
networkMode: awsvpc
cpu: 1024
memory: 2048
taskRoleArn: arn:aws:iam::XXXX:role/ECSTaskS3ECRRole
executionRoleArn: arn:aws:iam::XXXX:role/ecsTaskExecutionRole
containerDefinitions:
  - name: flow
    image: "XXXX.dkr.ecr.us-east-2.amazonaws.com/ecsflows:latest"
    essential: true
    environment:
      - name: AWS_RETRY_MODE
        value: "adaptive"
      - name: AWS_MAX_ATTEMPTS
        value: "10"
      - name: PREFECT__BACKEND
        value: "server"
      - name: PREFECT__CLOUD__API
        value: "http://x.x.x.x:4200"
      - name: PREFECT__CLOUD__AGENT__LABELS
        value: "['dev']"
      - name: PREFECT__CLOUD__AGENT__LEVEL
        value: "INFO"
    logConfiguration:
      logDriver: awslogs
      options:
        awslogs-group: "/ecs/prefectEcsAgent"
        awslogs-region: "us-east-2"
        awslogs-stream-prefix: "ecs"
        awslogs-create-group: "true"
```
a
@Michael Moscater thank you for this excellent description of all the steps you've taken so far. It really helps. Your process looks good; the problem is allowing communication between the flow run containers spun up by ECS and your Server running on EC2. The flow run containers (ECS tasks) must be able to access this endpoint:
```
http://x.x.x.x:4200
```
Perhaps you can test it by spinning up a single ECS task with a container that tries to ping this endpoint. You could e.g. send a “hello” GraphQL query and check if you get an answer:
```graphql
query {
  hello
}
```
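If you'd rather script that check, here is a sketch using requests (the IP is a placeholder; the root URL matches the endpoint from your timeout traceback):
```python
import requests

# Run this from a container in the same cluster/subnets as your flow
# runs; a ConnectTimeout here reproduces the networking problem.
resp = requests.post(
    "http://x.x.x.x:4200",  # your Server endpoint
    json={"query": "query { hello }"},
    timeout=15,
)
print(resp.status_code, resp.json())
```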
I suspect that you either have to:
• enable a public IP on both the ECS tasks and your Server instance so that the flow run containers can reach the Server via public IP,
• or ensure that the ECS tasks with the flow run containers get deployed to the same subnet as your EC2 Server instance, e.g. by specifying a networking configuration on your ECSRun:
```python
from prefect.run_configs import ECSRun

RUN_CONFIG = ECSRun(
    labels=["dev"],
    run_task_kwargs=dict(
        cluster="prefectEcsCluster",
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet_xxx"],
                "securityGroups": ["xxxx"],
                "assignPublicIp": "ENABLED",  # or "DISABLED"
            }
        },
    ),
)
```
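If you don't know the subnet and security group of your Server instance offhand, boto3 can look them up (a sketch; the instance ID and region are placeholders):
```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")
resp = ec2.describe_instances(InstanceIds=["i-0123456789abcdef0"])
instance = resp["Reservations"][0]["Instances"][0]
print("subnet:", instance["SubnetId"])
print("security groups:", [sg["GroupId"] for sg in instance["SecurityGroups"]])
```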
m
@Anna Geller - Thank you for your help, I'll go through what you mentioned and let you know, good or bad.
👍 1
@Anna Geller - Thank you for the help. The issue was that, for some reason, the two EC2 instances were in different regions, and therefore in different VPCs. Once I fixed that, the ECS agent would fail to start because, for some reason, our default region has no default VPC. So I created a YAML file with the configs needed and included it via the --run-task-kwargs option on agent start. The task was able to run and update its flow's state. Hooray. But the task errored and failed. I think this was because I was using S3 flow storage, and the flow has an import from a .conf file.
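For reference, the Python equivalent of that agent setup might look like this (a sketch assuming Prefect 1.x's ECSAgent and its run_task_kwargs_path parameter; the file name is illustrative):
```python
from prefect.agent.ecs import ECSAgent

# networking.yaml holds the networkConfiguration (subnets, security
# groups) to use when the region has no default VPC; equivalent to
# `prefect agent ecs start --run-task-kwargs networking.yaml`.
ECSAgent(labels=["dev"], run_task_kwargs_path="networking.yaml").start()
```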
a
I see - in that case, you would need to include this .conf file on the EC2 instance that is hosting your agent. Alternatively (and this would be even better), you could include this file and any other dependencies within your custom Docker image that you could push to ECR. This way, the import errors should go away.
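For the second option, a sketch using Prefect 1.x Docker storage, which can bake extra files into the flow image (the registry, image name, and paths are placeholders):
```python
from prefect.storage import Docker

# Copies the local .conf file into the image so the flow's import works
# no matter which container runs it.
storage = Docker(
    registry_url="XXXX.dkr.ecr.us-east-2.amazonaws.com",
    image_name="ecsflows",
    files={"/local/path/app.conf": "/opt/prefect/app.conf"},
    env_vars={"PYTHONPATH": "/opt/prefect"},
)
```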
m
Excellent, yes, I discussed both options with our DevOps team. Again, thank you for your patience and help.
👍 1