Hi all, I am setting up a new ECS Push work pool ...
# prefect-cloud
b
Hi all, I am setting up a new ECS Push work pool against a new AWS Account that is using a Shared VPC from another one of our org's accounts. There were no security groups set up in this new account when I first started. I am testing a job run, and it appears the ECS Task Definition is getting created in ECS, but the run then fails with the following error in the Prefect Flow log:
Flow run could not be submitted to infrastructure: An error occurred (InvalidParameterException) when calling the RunTask operation: At least one security group must be supplied when specifying subnets that are owned by a different account.
Based on the error, I assumed that I needed to create a VPC Security group, which I did. I also added information to the "Network Configuration" section in the Work Pool, which looks like this:
{
  "Subnets": [
    "subnet-xxxx",
    "subnet-yyyy"
  ],
  "SecurityGroups": [
    "sg-xxxyyy"
  ]
}
I am still getting the same error. I am at a loss for what is actually needed. Does anyone have thoughts on possible fixes?
k
I'm looking into this now
🙏 1
I'm not entirely sure if the capitalization matters here, but other references I've found have examples like this:
{
  "awsvpcConfiguration": {
    "subnets": ["string", ...],
    "securityGroups": ["string", ...],
    "assignPublicIp": "ENABLED"|"DISABLED"
  }
}
b
That is similar to another attempt I made which was this:
{
  "networkConfiguration": {
    "awsvpcConfiguration": {
      "subnets": [
        "subnet-XXXX",
        "subnet-YYYY"
      ],
      "securityGroups": [
        "sg-XXXYYY"
      ]
    }
  }
}
I will give your example a try.
Same error:
k
huh must be something about vpc peering rules that I'm not familiar with
I've been trying to read up on it to understand what the requirements are
b
Yeah, I am trying to google the error, but the phrase "At least one security group must be supplied when specifying subnets that are owned by a different account." doesn't come up with any results. I am assuming the error isn't coming from AWS but is rather built by Prefect based on a response from an AWS API call.
k
yeah, I'm trying to track down the exact error text
👍 1
it looks like it's a response from the AWS API that we're just surfacing up a few layers
this
Flow run could not be submitted to infrastructure:
is a string in the push worker that wraps the error we get back from submitting the ECS task run via this method
what's the exact json in your work pool's network configuration field?
omitting the ids and such
b
Hmmm. Here is the json directly from the Network Config box in the work pool:
{
  "awsvpcConfiguration": {
    "subnets": [
      "subnet-XXX",
      "subnet-YYY"
    ],
    "assignPublicIp": "DISABLED",
    "securityGroups": [
      "sg-XXXYYY"
    ]
  }
}
k
so based on the example request syntax from aws:
networkConfiguration={
        'awsvpcConfiguration': {
            'subnets': [
                'string',
            ],
            'securityGroups': [
                'string',
            ],
            'assignPublicIp': 'ENABLED'|'DISABLED'
        }
    },
and the fact that our code does this:
{"awsvpcConfiguration": network_configuration}
your network config should look like this:
{
  "subnets": [
    "subnet-XXX",
    "subnet-YYY"
  ],
  "assignPublicIp": "DISABLED",
  "securityGroups": [
    "sg-XXXYYY"
  ]
}
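A minimal sketch of a check for the two shapes tried so far in this thread (capitalized keys, and an extra `awsvpcConfiguration` / `networkConfiguration` wrapper), assuming the field should hold only the inner keys because the code wraps it as shown above. `check_network_config` and `EXPECTED_KEYS` are hypothetical names for illustration, not part of Prefect:

```python
# Hypothetical helper, not part of Prefect: flags the shapes of the
# network config value that were tried earlier in this thread.
EXPECTED_KEYS = {"subnets", "securityGroups", "assignPublicIp"}


def check_network_config(value: dict) -> list[str]:
    """Return a list of likely problems in a work pool network config value."""
    problems = []

    # The code already adds the "awsvpcConfiguration" wrapper, so the field
    # itself should not contain it (or the outer "networkConfiguration").
    if "awsvpcConfiguration" in value or "networkConfiguration" in value:
        problems.append("remove the outer wrapper; only the inner keys are expected")

    # The AWS request syntax uses camelCase keys; flag case mismatches.
    by_lowercase = {key.lower(): key for key in EXPECTED_KEYS}
    for key in value:
        expected = by_lowercase.get(key.lower())
        if expected is not None and key != expected:
            problems.append(f"rename {key!r} to {expected!r}")

    return problems


first_attempt = {"Subnets": ["subnet-xxxx"], "SecurityGroups": ["sg-xxxyyy"]}
wrapped_attempt = {"awsvpcConfiguration": {"subnets": ["subnet-XXX"]}}
flat_config = {
    "subnets": ["subnet-XXX"],
    "assignPublicIp": "DISABLED",
    "securityGroups": ["sg-XXXYYY"],
}

for candidate in (first_attempt, wrapped_attempt, flat_config):
    print(check_network_config(candidate))
```

An empty list only means the shape matches the flat form; it says nothing about whether the value actually reaches AWS.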
b
I am looking at the Task Definition in AWS that Prefect is creating... There are 2 things that stand out to me. The first is that there is no TaskRole defined even though it's set. The execution role is set though. The second is that there is no "networkConfiguration" section defined.
k
here's where I'm referencing our ECS worker code, the push pool does the same thing
that's because there are two main components to getting things running in ECS, the task definition and the task run request
"task_definition": {
      "cpu": "{{ cpu }}",
      "family": "{{ family }}",
      "memory": "{{ memory }}",
      "executionRoleArn": "{{ execution_role_arn }}",
      "containerDefinitions": [
        {
          "name": "{{ container_name }}",
          "image": "{{ image }}"
        }
      ]
    },
    "task_run_request": {
      "tags": "{{ labels }}",
      "cluster": "{{ cluster }}",
      "overrides": {
        "cpu": "{{ cpu }}",
        "memory": "{{ memory }}",
        "taskRoleArn": "{{ task_role_arn }}",
        "containerOverrides": [
          {
            "cpu": "{{ cpu }}",
            "name": "{{ container_name }}",
            "memory": "{{ memory }}",
            "command": "{{ command }}",
            "environment": "{{ env }}"
          }
        ]
      },
      "launchType": "{{ launch_type }}",
      "taskDefinition": "{{ task_definition_arn }}"
    },
b
Gotcha
k
the network stuff gets attached to the run request like so to match the request spec
👍 1
b
I just changed the network config json to this:
{
  "subnets": [
    "subnet-XXX",
    "subnet-YYY"
  ],
  "assignPublicIp": "DISABLED",
  "securityGroups": [
    "sg-XXXYYY"
  ]
}
I get the same error.
k
maybe you need to link your SG to an SG in the peered VPC?
b
I just asked an AWS infra person on our side and he said that we are not VPC peering. We are doing VPC sharing. I am not certain of the difference, but he seems to think there is one.
k
I'm out of my depth here but trying to learn
b
HA! You and me both! I appreciate you flailing with me.
🙇 1
I have some more information. I was able to find the task failure in AWS CloudTrail and it has the error shown in Prefect.
It appears to me that none of the networkConfiguration information is being sent or it's being sent incorrectly.
🤔 1
k
and you have a vpc id configured on your work pool too I assume
b
This is the redacted event record from CloudTrail:
{
  "eventVersion": "1.09",
  "userIdentity": {
    "type": "IAMUser",
    "principalId": "XXX",
    "arn": "arn:aws:iam::XXX:user/XXX",
    "accountId": "XXX",
    "accessKeyId": "XXX",
    "userName": "XXX"
  },
  "eventTime": "2024-05-31T22:38:22Z",
  "eventSource": "ecs.amazonaws.com",
  "eventName": "RunTask",
  "awsRegion": "us-west-2",
  "sourceIPAddress": "35.227.XXX.XXX",
  "userAgent": "Boto3/1.28.2 md/Botocore#1.31.2 ua/2.0 os/linux#6.1.75+ md/arch#x86_64 lang/python#3.11.7 md/pyimpl#CPython cfg/retry-mode#legacy Botocore/1.31.2",
  "errorCode": "InvalidParameterException",
  "errorMessage": "At least one security group must be supplied when specifying subnets that are owned by a different account.",
  "requestParameters": {
    "cluster": "arn:aws:ecs:us-west-2:XXX:cluster/prefect-ecs-cluster",
    "enableECSManagedTags": false,
    "enableExecuteCommand": false,
    "launchType": "FARGATE",
    "networkConfiguration": {
      "awsvpcConfiguration": {
        "assignPublicIp": "ENABLED",
        "securityGroups": [],
        "subnets": [
          "subnet-XXX",
          "subnet-XXXX",
          "subnet-XXXXX",
          "subnet-XXXXXX"
        ]
      }
    },
    "overrides": {
      "containerOverrides": [
        {
          "name": "prefect",
          "command": [
            "python",
            "-m",
            "prefect.engine"
          ],
          "environment": "HIDDEN_DUE_TO_SECURITY_REASONS"
        }
      ],
      "taskRoleArn": "arn:aws:iam::XXX:role/prefect_task_role"
    },
    "tags": [
      {
        "key": "prefect.io/flow-run-id",
        "value": "976fa3af-249a-4f3e-9312-c2928b6daXXX"
      },
      {
        "key": "prefect.io/flow-run-name",
        "value": "hungry-kittiwake"
      },
      {
        "key": "prefect.io/deployment-id",
        "value": "3b2a0441-5d3a-4127-afe2-d46cbc59XXXX"
      },
      {
        "key": "prefect.io/deployment-name",
        "value": "dev-prefect_testing-someflow"
      },
      {
        "key": "prefect.io/deployment-updated",
        "value": "2024-05-30T20:29:25.791701Z"
      },
      {
        "key": "prefect.io/flow-id",
        "value": "58f7206e-ed29-4f75-8187-662f6ee8XXXX"
      },
      {
        "key": "prefect.io/flow-name",
        "value": "someflow"
      }
    ],
    "taskDefinition": "arn:aws:ecs:us-west-2:XXX:task-definition/prefect__3b2a0441-5d3a-4127-afe2-d46cbc59c5ce__41dd128a-1674-4611-b95b-a9a39f4aXXXX:3"
  },
  "responseElements": null,
  "requestID": "dc521906-6ccc-4e6c-b52d-97ae0cb2XXXX",
  "eventID": "16c4075a-0038-4185-83e7-3ddef8bbXXXX",
  "readOnly": false,
  "eventType": "AwsApiCall",
  "managementEvent": true,
  "recipientAccountId": "XXXX",
  "eventCategory": "Management",
  "tlsDetails": {
    "tlsVersion": "TLSv1.3",
    "cipherSuite": "TLS_AES_128_GCM_SHA256",
    "clientProvidedHostHeader": "ecs.us-west-2.amazonaws.com"
  }
}
Yes, I have a VPC ID configured on the Work Pool as well as the network config
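For anyone repeating this check, here is a rough sketch of pulling the network settings out of a CloudTrail `RunTask` record, to see what was actually sent. The function name `summarize_run_task_network` is made up for illustration, and the `record` dict is a trimmed copy of the redacted event above, not a full CloudTrail record:

```python
def summarize_run_task_network(event: dict) -> dict:
    """Extract the awsvpc settings from a CloudTrail RunTask event record."""
    params = event.get("requestParameters") or {}
    network = params.get("networkConfiguration") or {}
    awsvpc = network.get("awsvpcConfiguration") or {}
    return {
        "subnets": awsvpc.get("subnets", []),
        "security_groups": awsvpc.get("securityGroups", []),
        "assign_public_ip": awsvpc.get("assignPublicIp"),
    }


# Trimmed from the redacted record above.
record = {
    "eventName": "RunTask",
    "errorCode": "InvalidParameterException",
    "requestParameters": {
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "assignPublicIp": "ENABLED",
                "securityGroups": [],
                "subnets": ["subnet-XXX", "subnet-XXXX", "subnet-XXXXX", "subnet-XXXXXX"],
            }
        },
    },
}

summary = summarize_run_task_network(record)
if not summary["security_groups"]:
    print("No security groups were sent with this RunTask call")
print(summary)
```

In this record the security group list is empty and there are four subnets rather than the two configured, which is why it looks like the custom settings never made it into the request.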
k
okay this is helpful, guessing those subnets are all the subnets in the VPC rather than the ones you specified, which would be the default behavior if we didn't get any of your customizations
b
Correct
k
okay let me take a direct look at your work pool
👍 1
b
It's the one labeled: aws_development_account
k
thanks
👍 1
b
Interestingly, if I switch that work pool to advanced, I see the property called network_configuration but I don't see it used anywhere in the job_configuration or task_run_request. I would expect to see a reference to network_configuration somewhere within the task_run_request.
k
it's not in the template because it's handled entirely in code. let me grab a link to show how we check for and then handle it
b
Oh gotcha
k
basically we check if you've supplied both a network config and vpc id, and if you have, the network config you provide is validated and then attached to the request
though I wonder if you'd get a different outcome if you hardcoded it in the advanced tab and left it out of the field in defaults
eh that might make weird things happen
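Roughly, the check described above could be sketched like this. This is an illustration only, not the actual worker code: `build_network_configuration`, its parameters, and the specific validation are assumptions, and the fallback values simply mirror what the CloudTrail record showed (every subnet in the VPC, no security groups, public IP enabled):

```python
def build_network_configuration(vpc_id, network_configuration, vpc_subnets):
    """Hypothetical sketch: choose the custom or default awsvpc settings.

    vpc_subnets stands in for however the subnets of the VPC are looked up.
    """
    if vpc_id and network_configuration:
        # Assumed validation: a custom config needs at least one subnet.
        if not network_configuration.get("subnets"):
            raise ValueError("network configuration must list at least one subnet")
        awsvpc = dict(network_configuration)
    else:
        # Fallback matching the request seen in CloudTrail.
        awsvpc = {
            "subnets": list(vpc_subnets),
            "securityGroups": [],
            "assignPublicIp": "ENABLED",
        }
    # The run request expects the settings under "awsvpcConfiguration".
    return {"awsvpcConfiguration": awsvpc}


custom = build_network_configuration(
    "vpc-XXX",
    {"subnets": ["subnet-XXX"], "securityGroups": ["sg-XXXYYY"], "assignPublicIp": "DISABLED"},
    ["subnet-XXX", "subnet-YYY", "subnet-ZZZ"],
)
fallback = build_network_configuration("vpc-XXX", None, ["subnet-XXX", "subnet-YYY"])
print(custom)
print(fallback)
```

If the custom value never reaches this step, the fallback branch is what produces a request with an empty security group list.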
b
I mean, the fact that it's not included as part of the task run, I don't know how it could hurt. I can always switch it back if it doesn't work.
Hey Kevin, I am just getting around to circling back on this. I am curious if someone on the Prefect side can do some testing on their end. I have changed the network configuration in as many ways as I can think of and still don't see that the network configuration is actually making it to AWS. CloudTrail shows that the ECS Task is coming in with the default network configuration. One more observation I had was that an existing ECS:Push Worker pool that we have had for several months doesn't even have the Network Configuration section, yet the new one does. I am inclined to assume that the network configuration option is new. Is that correct?
k
I can check to see when it was added, and test the network config on an ecs push pool I have set up myself
👍 1
b
Not that knowing when it was added helps all that much, but it might indicate that it hasn't been used by anyone yet, that we are the first to try it, and that we are running into some bug with it.
k
yep, understood
@Bryan figured it out
🙏 1
there's a field missing on the base job template on the push work pool. You can fix it right now by adding
},
      "launchType": "{{ launch_type }}",
      "taskDefinition": "{{ task_definition_arn }}"
    },
    "network_configuration": "{{ network_configuration }}",  <----- this line
    "cloudwatch_logs_options": "{{ cloudwatch_logs_options }}",
    "configure_cloudwatch_logs": "{{ configure_cloudwatch_logs }}",
    "task_start_timeout_seconds": "{{ task_start_timeout_seconds }}",
    "auto_deregister_task_definition": "{{ auto_deregister_task_definition }}"
  }
}
to the json on the advanced tab. we'll get this fixed asap so newly created push work pools have it on there
🙏 1
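If you'd rather script the edit than hand-type it in the advanced tab, here is a minimal sketch of the same change applied to a template loaded as a Python dict. It assumes the line belongs under a top-level `job_configuration` key, next to `task_run_request`; `add_network_configuration` is a made-up helper name, and how you fetch and save the template is left out:

```python
import copy


def add_network_configuration(base_job_template: dict) -> dict:
    """Return a copy of the template with the missing placeholder added."""
    patched = copy.deepcopy(base_job_template)
    job_configuration = patched.setdefault("job_configuration", {})
    # Only add the placeholder if the template does not already define it.
    job_configuration.setdefault(
        "network_configuration", "{{ network_configuration }}"
    )
    return patched


# Trimmed stand-in for the real template on the advanced tab.
template = {
    "job_configuration": {
        "task_run_request": {
            "launchType": "{{ launch_type }}",
            "taskDefinition": "{{ task_definition_arn }}",
        },
        "cloudwatch_logs_options": "{{ cloudwatch_logs_options }}",
    }
}

patched = add_network_configuration(template)
print(patched["job_configuration"]["network_configuration"])
```

Working on a parsed dict and dumping it back to JSON also avoids the kind of hand-editing typo that is easy to make in the template.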
b
That was my observation last week and I almost added it but wasn't 100% sure on the name of it. I will give this a try. I appreciate your help and the diligence!
I added the network_configuration to the template by editing the work pool, switching it to Advanced, then adding the network_configuration line you posted. I ran my test job again and still ended up with the same issue. I didn't redeploy because I wouldn't think that would have anything to do with the running of the pipeline. I can see in AWS CloudTrail the same thing I was getting before where it's just supplying the default networkConfiguration object.
Ignore the previous message Kevin. I just noticed that I have a typo in my template
👍 1
It looks good and my test job ran fine. Thanks for the help on this one!
k
np! working on fixing it permanently too. thanks for helping me investigate!
👍 1
b
You bet! I am glad we could help each other out.