# prefect-aws
c
Hey! One more issue I'm running into with infra deployment on an agent as a service, details in 🧵
I set up agents as a service following this template, making a few adjustments to deploy with terraform and use our networking infra, etc. Agent task and service def here:
resource "aws_ecs_task_definition" "prefectAgent" {
  family = "prefectAgent_${var.env}"
  requires_compatibilities =  ["FARGATE"]
  network_mode = "awsvpc"
  cpu = "512"
  memory = "1024"
  task_role_arn = aws_iam_role.prefect_agent_ecs_task_role.arn
  execution_role_arn = aws_iam_role.prefect_agent_ecs_execution_role.arn
  container_definitions = <<TASK_DEFINITION
  [
    {
        "name": "prefectAgent_${var.env}",
        "image": "${data.aws_caller_identity.current.account_id}.dkr.ecr.${data.aws_region.current.name}.amazonaws.com/integrations_${var.env}:latest",
        "entryPoint": ["bash", "-c"],
        "stopTimeout": 120,
        "environment": [
                {"name": "PREFECT_LOGGING_LEVEL", "value": "INFO"}
        ],
        "command": ["prefect agent start -q ${var.env}"],
        "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
                "awslogs-group": "${aws_cloudwatch_log_group.prefect_agent.name}",
                "awslogs-region": "us-east-2",
                "awslogs-stream-prefix": "prefect_agent"
            }
        },
        "secrets": [
            {"Name": "PREFECT_API_KEY", "ValueFrom": "${data.aws_secretsmanager_secret_version.prefect_api.arn}:PREFECT_API_KEY::"},
            {"Name": "PREFECT_API_URL", "ValueFrom": "${data.aws_secretsmanager_secret_version.prefect_api.arn}:PREFECT_API_URL::"}
        ]
    }
  ]
  TASK_DEFINITION
}

# Agent Service
resource "aws_ecs_service" "prefectAgentService" {
    name = "prefectAgentService_${var.env}"
    cluster = data.aws_ecs_cluster.internet_cluster.id
    task_definition = aws_ecs_task_definition.prefectAgent.arn
    desired_count = 1
    launch_type = "FARGATE"
    network_configuration {
        subnets = data.aws_subnets.app_subnets.ids
        security_groups = data.aws_security_groups.app_security_groups.ids
    }
}
However, when I try a flow run I'm getting "Failed to get infrastructure for flow". I'm not seeing any other detail in the logs produced by the agent or the flow run itself. I develop on an EC2 instance; when I spin up an agent there and deploy to it, the flow runs and succeeds. I've got a couple of potential culprits as to why it fails on the agent as a service, but I haven't been able to nail down the issue yet:
• IAM permissions for one of the roles applied to the agent, perhaps the task role? I added "ec2:DescribeSecurityGroups" since SGs are required for us
• Updating the agent image to one that has the prefect-ecs package loaded (saw that on another thread and that change is made here, did not resolve the issue)
• The agent and the flow tasks currently run in different ECS clusters; it didn't seem to resolve the issue when I deployed them in the same one
Other thoughts/suggestions here?
m
Hey @Claire Herdeman, would you be able to share the full traceback for the agent/flow when this occurred? Permissions issues are a common reason we see this come up, but it's hard to say for certain.
c
Yeah, unfortunately there's no additional traceback on the agent! It truly only said "failed to get infra", and no logs are being written to the flow run
m
Can you try setting PREFECT_LOGGING_LEVEL='DEBUG' on the agent's environment and see if that turns up anything extra when this runs? It might help provide some more context
👍 1
c
Hm same result, and you can see that the log level changed
m
Hmm, yeah that's not much to go on, let me ask around a bit and see if I can dig something up, I have seen that error come up before but it's not always the same issue.
👍 1
c
Perfect, thank you! Here's the task role def if that ends up helping, it's the same as the template + "DescribeSecurityGroups" and "DescribeClusters". Those changes didn't resolve the issue though
resource "aws_iam_role" "prefect_agent_ecs_task_role" {
    name = "prefect_agent_task_role_${var.env}"
    force_detach_policies = true
    assume_role_policy = jsonencode({
        Version = "2012-10-17"
        Statement = [
            {
                Action = "sts:AssumeRole"
                Effect = "Allow"
                Sid    = ""
                Principal = {
                    Service = "ecs-tasks.amazonaws.com"
                }
            },
        ]
    })
    inline_policy {
        name = "PrefectS3Storage"
        policy = jsonencode({
            Version = "2012-10-17"
            Statement = [
                {
                    Action   = [
                        "s3:ListAllMyBuckets"
                    ]
                    Effect   = "Allow"
                    Resource = "arn:aws:s3:::*"
                },{
                    Action   = [
                        "s3:ListBucket",
                        "s3:GetBucketLocation"
                    ]
                    Effect   = "Allow"
                    Resource = "arn:aws:s3:::${aws_s3_bucket.augintel-prefect-flows.bucket}"
                },{
                    Action   = [
                        "s3:PutObject",
                        "s3:PutObjectAcl",
                        "s3:GetObject",
                        "s3:GetObjectAcl",
                        "s3:DeleteObject",
                        "kms:Decrypt",
                        "kms:Encrypt",
                        "kms:GenerateDataKey"
                    ]
                    Effect   = "Allow"
                    Resource = [
                        "arn:aws:s3:::${aws_s3_bucket.augintel-prefect-flows.bucket}/*",
                        aws_kms_key.prefect-flow-key.arn
                    ] 
                },{
                    Action   = [
                        "ecs:RegisterTaskDefinition",
                        "ecs:DeregisterTaskDefinition",
                        "ecs:DescribeTasks",
                        "ecs:RunTask"
                    ]
                    Effect   = "Allow"
                    Resource = "*"
                },{
                    Action   = [
                        "logs:GetLogEvents"
                    ]
                    Effect   = "Allow"
                    Resource = "*"
                },{
                    Action   = [
                        "ec2:DescribeSubnets",
                        "ec2:DescribeVpcs",
                        "ec2:DescribeSecurityGroups",
                        "ecs:DescribeClusters"
                    ]
                    Effect   = "Allow"
                    Resource = "*"
                },
                {
                    Action   = [
                        "ssmmessages:CreateControlChannel",
                        "ssmmessages:CreateDataChannel",
                        "ssmmessages:OpenControlChannel",
                        "ssmmessages:OpenDataChannel"
                    ]
                    Effect   = "Allow"
                    Resource = "*"
                }
            ]
        })
    }
}
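For anyone double-checking a role like the one above, here's a minimal sketch, assuming you have the rendered policy JSON on hand (for example copied out of the IAM console), that lists which expected actions aren't explicitly allowed. EXPECTED_ACTIONS is just the ECS/EC2/logs actions from the role in this thread, not an official list of what the agent needs, and the helper names and SAMPLE_POLICY are made up for illustration. It only matches action names literally, so it does not expand wildcards or account for Deny statements or resource scoping.

```python
# Hypothetical helpers for illustration; not part of Prefect or boto3.

def allowed_actions(policy):
    """Collect every action named in an Allow statement of an IAM policy dict."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM permits a single statement object
        statements = [statements]
    actions = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        action = statement.get("Action", [])
        if isinstance(action, str):  # a single action may be a bare string
            action = [action]
        actions.update(action)
    return actions


def missing_actions(policy, expected):
    """Return the expected actions that no Allow statement names explicitly."""
    return set(expected) - allowed_actions(policy)


# Assumption: the ECS/EC2/logs actions from the task role posted in this thread.
EXPECTED_ACTIONS = {
    "ecs:RegisterTaskDefinition",
    "ecs:DeregisterTaskDefinition",
    "ecs:DescribeTasks",
    "ecs:RunTask",
    "ecs:DescribeClusters",
    "logs:GetLogEvents",
    "ec2:DescribeSubnets",
    "ec2:DescribeVpcs",
    "ec2:DescribeSecurityGroups",
}

# Illustrative policy that lacks the two actions added in this thread.
SAMPLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Resource": "*", "Action": [
            "ecs:RegisterTaskDefinition", "ecs:DeregisterTaskDefinition",
            "ecs:DescribeTasks", "ecs:RunTask"]},
        {"Effect": "Allow", "Resource": "*", "Action": "logs:GetLogEvents"},
        {"Effect": "Allow", "Resource": "*", "Action": [
            "ec2:DescribeSubnets", "ec2:DescribeVpcs"]},
    ],
}

if __name__ == "__main__":
    print(sorted(missing_actions(SAMPLE_POLICY, EXPECTED_ACTIONS)))
```

A clean result here only says the actions are named in the policy; it can't tell you whether the role is actually the one the task ends up using.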
z
This is being raised at https://github.com/PrefectHQ/prefect/blob/main/src/prefect/agent.py#L237-L239 and should dump the traceback too. Is it in the dropdown?
👍 1
c
But no, still no more to the traceback
z
Is this your only interface for the logs? From the code, the traceback should be displayed.
c
Hmmm, yes these are all of the logs being produced by the agent
when i was running an agent locally on my ec2 i was getting more traceback detail when i ran into errors, I keep coming back to IAM permissions on the agent setup and wondering if they're blocking info coming through somewhere
Ah, taking a look it looks like nothing is invoking the agent task role, just the execution role. i wonder if most of the permissions on the task role need to move to execution?
Haha well I can confirm: adding the agent task permissions to the execution role does not result in a more detailed traceback OR the infra pulling successfully
Any thoughts on this? I'm out of ideas. A couple additional details:
• I tried a couple permutations of adding more permissions to the execution or task roles on the agent, didn't seem to resolve but still totally possible there's something I'm missing
• My company uses bastion architecture (for HIPAA compliance) which sometimes makes networking issues complicated. However the agent is successfully pinging the app so it doesn't seem like that should be causing an issue. I'm not seeing any traffic to the agent getting rejected. Would you anticipate something networking related causing an issue in pulling infra?
• I tried digging through some cloudtrail logs, those are a bit of a pain but I didn't find any errors
z
Can you run this somewhere where you have access to the actual stdout using the permissions that are available in ECS? We’re in a tough spot here since the traceback isn’t being captured by CloudTrail.
c
Hm, I just tried attaching a role with permissions similar to the execution role to my ec2 instance and it still deployed with no problem (though my creds would still be available)
z
Ah I’ve reproduced the lack of traceback locally!
🎉 1
There’s a bug in 2.6.7 where the traceback is not displayed.
If you run 2.6.6 you should get the actual error. I’ll look into fixing that asap.
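If it helps to confirm which release an agent image is actually running, here's a minimal sketch using only the standard library. It assumes, based solely on this thread, that 2.6.7 is the release that hides the traceback; the function names are hypothetical and not part of Prefect.

```python
from importlib import metadata

# Assumption taken from this thread: 2.6.7 is the release that hides the traceback.
VERSIONS_HIDING_TRACEBACK = {"2.6.7"}


def hides_agent_traceback(version):
    """Return True if the given Prefect version is one flagged in this thread."""
    return version.strip() in VERSIONS_HIDING_TRACEBACK


def installed_prefect_version():
    """Return the installed prefect version string, or None if it is not installed."""
    try:
        return metadata.version("prefect")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_prefect_version()
    if version is None:
        print("prefect is not installed in this environment")
    elif hides_agent_traceback(version):
        print(f"prefect {version}: agent errors may be logged without a traceback")
    else:
        print(f"prefect {version}: not the release flagged in this thread")
```

Run it inside the same image the agent uses, since the version on a dev box can differ from the one baked into the container.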
🙌 2
c
Ok that seems to have done it! My test flow succeeded
Thanks for your help today @Zanie and @Mason Menges!
z
Great! Here’s the PR for the logging fix: https://github.com/PrefectHQ/prefect/pull/7558 — should be out tomorrow 🙂
🙌 2