# ask-community
Robert Bastian
The upgrade from 0.14.6 to 0.14.12 broke my ECS/Fargate implementation. From what I can tell, `requiresCompatibilities` isn't getting set correctly in the task definition that the agent registers with ECS via boto3. My flows worked on 0.14.6, but with the ECS revamp in 0.14.12 they all get this error:
Copy code
[2021-03-15 17:48:45,887] ERROR - rai-fargate | Error while deploying flow
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/prefect/agent/agent.py", line 414, in deploy_and_update_flow_run
    deployment_info = self.deploy_flow(flow_run)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/prefect/agent/ecs/agent.py", line 322, in deploy_flow
    resp = self.ecs_client.run_task(taskDefinition=taskdef_arn, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/botocore/client.py", line 676, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.InvalidParameterException: An error occurred (InvalidParameterException) when calling the RunTask operation: Task definition does not support launch_type FARGATE.
I checked the registered task definition and I can see:
Copy code
"compatibilities": [
    "EC2"
  ],
When I do the same with 0.14.6 I see:
Copy code
"compatibilities": [
    "EC2",
    "FARGATE"
  ],
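For anyone checking the same thing, a minimal boto3 sketch (the family name is a placeholder for whatever task definition the agent registered):
Copy code
import boto3

# Placeholder family name -- substitute the task definition the agent registered.
ecs = boto3.client("ecs", region_name="us-east-1")
resp = ecs.describe_task_definition(taskDefinition="prefect-my-flow")

# ECS computes "compatibilities" from the definition's properties; FARGATE only
# appears when the definition satisfies Fargate's requirements.
print(resp["taskDefinition"]["compatibilities"])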
Thx!
Zanie
Hi @Robert Bastian -- there's a tracking issue for this at https://github.com/PrefectHQ/prefect/issues/4243
Would you mind explaining a bit more about your setup? We don't manually set the compatibilities.
Robert Bastian
@Zanie Please let me know what else I can provide: Here is my ECSRun:
Copy code
RUN_CONFIG = ECSRun(
    labels=["s3-flow-storage", "rai-fargate-local"],
    image="{redacted}.<http://dkr.ecr.us-east-1.amazonaws.com/prefect-aws:latest|dkr.ecr.us-east-1.amazonaws.com/prefect-aws:latest>",
    memory="512",
    cpu="256",
)
Storage:
Copy code
STORAGE = S3(bucket="prefect-rai-dev")
Agent:
Copy code
prefect agent ecs start --token {token} --cluster RAI --task-role-arn=arn:aws:iam::{redacted}:role/Prefect_Container_Role --execution-role-arn=arn:aws:iam::{redacted}:role/Prefect_Task_Execution_Role --log-level DEBUG --label rai-fargate-local --label s3-flow-storage --name rai-fargate
One other thing I can say after reading the thread: I have flows that pass a networkConfiguration and flows that don't. Neither works after upgrading to 0.14.12. I also tried explicitly setting the launch type to "FARGATE"; that also did not work.
Jim Crist-Harif
Did you ever set a custom task definition template before? We never set `compatibilities` in prefect itself, so it's not clear where this change is coming from.
We did, however, cache task definitions, and only registered a new task definition when the version of the flow changed. Because run configs can now be configured on individual flow runs, we can no longer do this (it turns out registering a new definition for each run isn't that bad anyway), but previously an existing task definition would have persisted until the flow was re-registered.
Robert Bastian
I never set a custom task definition, I don’t believe.
I’ve always relied on it being generated.
Jim Crist-Harif
Hmmm
When you upgraded prefect, did it also bump the boto3 version? Not sure if a change in that would matter, just trying to eliminate other options.
Alternatively, if you downgrade prefect (but keep the rest of your environment exactly the same) do things go back to working?
Robert Bastian
Regarding boto3: I don’t believe so because I’ve locked boto3 in the container that runs the ECS agent. Downgrading to 0.14.6 gets flows running again. I was going to try 0.14.11 just to narrow it a bit more.
@Jim Crist-Harif Sorry, I misspoke. Here is my Dockerfile for the Agent:
Copy code
FROM python:3.9-slim-buster

ENV PREFECT_VERSION=0.14.6

RUN apt-get update && apt-get install -y gcc

RUN pip install prefect[aws]==${PREFECT_VERSION}

COPY agent.py /agent.py
COPY --from=arpaulnet/s6-overlay-stage:2.0 / /

ENTRYPOINT ["/init"]

CMD ["python", "agent.py"]
I don’t explicitly install boto3. I’m relying on the Prefect “extras”
In the container runtime - where the task executes, I do this: RUN pip install boto3==1.16.44
Jim Crist-Harif
It might be a boto3/botocore change then. If you could inspect the environment that was successfully running the agent to determine the working version of boto3 & botocore, and then try it with prefect 0.14.12 but with the same boto3 and botocore versions as before that'd be useful.
👀 1
This should only affect the agent environment, so it shouldn't matter what image you're using to run your flows.
Zanie
The only thing I can find in AWS-land is https://github.com/aws/aws-cli/issues/3983
Jim Crist-Harif
Hmmm, so maybe we should be setting `requiresCompatibilities` on our task definitions. Still doesn't explain why the compatibilities changed for Robert when prefect was upgraded though (since we never set those and still don't).
Robert Bastian
Jim -
Copy code
root@f88173e2ea59:/# python
Python 3.9.2 (default, Mar 12 2021, 19:04:51)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import boto3
>>> print(boto3.__version__)
1.17.27
>>> import botocore
>>> print(botocore.__version__)
1.20.27
>>> import prefect
>>> print(prefect.__version__)
0.14.12
Zanie
I also discovered this useful changelog (it's otherwise very hard to find where AWS documents API changes): https://awsapichanges.info/archive/service/ecs/
Robert Bastian
0.14.6:
Copy code
root@9b11bf76222c:/# python
Python 3.9.2 (default, Mar 12 2021, 19:04:51)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import prefect
>>> print(prefect.__version__)
0.14.6
>>> import boto3
>>> print(boto3.__version__)
1.17.27
>>> import botocore
>>> print (botocore.__version__)
1.20.27
Later I’ll do a completely new flow from scratch to see if it works on 0.14.12.
Jim Crist-Harif
That's unexpected. The image you're using with an older version of prefect (0.14.6) is using the (almost) latest version of boto3 (released 3 days ago)?
(not that there's anything wrong with that, I was just expecting there to be a difference in versions)
While I'm still curious why things changed, it'd also be good to see if we can get things working for you on 0.14.12. Can you try adding a custom task definition to set `requiresCompatibilities` on your task definition? Something like this should work:
Copy code
import yaml

from prefect.run_configs import ECSRun

definition = yaml.safe_load(
"""
networkMode: awsvpc
cpu: 1024
memory: 2048
requiresCompatibilities:
  - FARGATE
containerDefinitions:
  - name: flow
"""
)

# Assumes an existing `flow` object.
flow.run_config = ECSRun(task_definition=definition)
Robert Bastian
FYI - 0.14.11 still works also, so this is definitely related to 0.14.12.
@Jim Crist-Harif this worked:
Copy code
definition = yaml.safe_load(
    """
    networkMode: awsvpc
    cpu: 1024
    memory: 2048
    requiresCompatibilities:
        - FARGATE
    containerDefinitions:
        - name: flow
    executionRoleArn: arn:aws:iam::{redacted}:role/Prefect_Task_Execution_Role
    """
)

RUN_CONFIG = ECSRun(
    image='{redacted}.dkr.ecr.us-east-1.amazonaws.com/prefect-aws:latest',
    task_definition=definition,
    labels=["s3-flow-storage"],
    memory="512",
    cpu="256",
)
It did take me 23 revisions to get the YAML just right (I hate YAML, btw). I also don’t like that the name must equal “flow”. That doesn’t seem intuitive.
Jim Crist-Harif
Excellent, glad to hear it. Still confused about what changed in 0.14.12 that broke things for you, but we can set `requiresCompatibilities` ourselves (I think) and everything should still work. Should get a PR in for the next release.
> I hate YAML, btw
YAML isn't required here, it was just the quickest way I could copy over the existing definition and add the bit you needed. A nested dict/list structure in Python would have worked just as well. If you specify a custom definition via `task_definition_path`, then the file must contain YAML (or JSON, which is a subset of YAML).
> I also don't like that the name must equal "flow". That doesn't seem intuitive.
Since task definitions may contain multiple containers, we need a way to know which container holds the flow so prefect can fill in the image/command/etc. To make it easier for users to define sidecar containers and leave the main container undefined, we use the name `flow` as a designator rather than assuming the first container definition. This is done for both the ECS and k8s agents. Note that if you don't customize the containers at all, you can leave the `containerDefinitions` field out completely.
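For reference, a sketch of the same definition as a plain Python structure (assuming an existing `flow` object, as in the YAML version above):
Copy code
from prefect.run_configs import ECSRun

# Equivalent to the YAML example above; "flow" is the container name prefect
# looks for when filling in the image/command.
definition = {
    "networkMode": "awsvpc",
    "cpu": 1024,
    "memory": 2048,
    "requiresCompatibilities": ["FARGATE"],
    "containerDefinitions": [{"name": "flow"}],
}

flow.run_config = ECSRun(task_definition=definition)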
👍 1
Robert Bastian
@Jim Crist-Harif OK, I went back and examined the task definitions submitted by 0.14.6 and 0.14.12 and used a diff tool to compare them. The crux seems to be that 0.14.6 set the task execution role in the task definition while 0.14.12 did not. Because the role was set, ECS validated that the flow was compatible with FARGATE. When the role is not there, FARGATE is not an option, because (I assume) you need special permissions in the execution role. I verified this by removing `requiresCompatibilities` from the YAML but leaving the task execution role, and flows work on 0.14.12. Interestingly, setting the task execution role in the RUN_CONFIG does not fix the issue. BTW, when you specify FARGATE in `requiresCompatibilities` you are forced to set the execution role ARN.
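A minimal boto3 sketch of that comparison (the family name and role ARN are placeholders):
Copy code
import boto3

ecs = boto3.client("ecs")

base = {
    "family": "compat-test",  # placeholder family name
    "networkMode": "awsvpc",
    "cpu": "256",
    "memory": "512",
    "containerDefinitions": [
        {"name": "flow", "image": "prefecthq/prefect:0.14.12", "essential": True}
    ],
}

# Without an execution role (mirrors what 0.14.12 generated):
print(ecs.register_task_definition(**base)["taskDefinition"]["compatibilities"])

# With one (mirrors 0.14.6); the ARN below is a placeholder:
print(
    ecs.register_task_definition(
        **base,
        executionRoleArn="arn:aws:iam::123456789012:role/Prefect_Task_Execution_Role",
    )["taskDefinition"]["compatibilities"]
)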
Jim Crist-Harif
> Because the role was set, ECS validated that the flow was compatible with FARGATE. When the role is not there, FARGATE is not an option because you need (I assume) special permissions in the execution role.
Ah, interesting. That doesn't happen in my test cluster (0.14.12 deploys fine on Fargate there), so I wonder if there's some cluster-level setting that leads to this behavior? ECS has too many knobs. Since `task_role_arn`/`execution_role_arn` are top-level settings on both the agent and `ECSRun` objects, we moved to setting these at runtime so user-provided templates wouldn't need them. It's annoying that `execution_role_arn` seems to be required for at least your configuration. Thanks for the info, this is all useful for figuring out how we should resolve this.
What type of ECS cluster are you using? (The doc here refers to them as "Networking only", "EC2 Linux + Networking", and "EC2 Windows + Networking").
I've been testing fargate usage with a "Networking only" cluster with all the default settings, and things seem to work fine. Trying to figure out what differs between our setups.
Robert Bastian
Copy code
[rbastian@E007254-MAR18 ecs (master)]$ aws ecs describe-clusters --clusters RAI
{
    "clusters": [
        {
            "clusterArn": "arn:aws:ecs:us-east-1:{redacted}:cluster/RAI",
            "clusterName": "RAI",
            "status": "ACTIVE",
            "registeredContainerInstancesCount": 0,
            "runningTasksCount": 0,
            "pendingTasksCount": 0,
            "activeServicesCount": 0,
            "statistics": [],
            "tags": [],
            "settings": [
                {
                    "name": "containerInsights",
                    "value": "enabled"
                }
            ],
            "capacityProviders": [],
            "defaultCapacityProviderStrategy": []
        }
    ],
    "failures": []
}
Even after setting the cluster's default capacity provider to FARGATE and adding FARGATE as a capacity provider, I still need to set the execution_role_arn on the task definition.
Jim Crist-Harif
I have `"capacityProviders": ["FARGATE", "FARGATE_SPOT"]`, but other than that we're identical. How did you create a cluster without capacity providers? When creating a new cluster via the console with no other configuration, this is what I get.
Ah, with the CLI you don't get any by default. That's annoying. OK, assuming that with no `capacityProviders` I can replicate your issue, I think I have enough to go on now. Thanks for your help in getting to the bottom of this issue!
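For reference, a boto3 sketch of attaching them to an existing cluster after the fact (the cluster name is a placeholder):
Copy code
import boto3

ecs = boto3.client("ecs")

# Attach the Fargate capacity providers that console-created clusters get by default.
ecs.put_cluster_capacity_providers(
    cluster="RAI",
    capacityProviders=["FARGATE", "FARGATE_SPOT"],
    defaultCapacityProviderStrategy=[{"capacityProvider": "FARGATE", "weight": 1}],
)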
👍 1
Robert Bastian
I destroyed my cluster and recreated it from the console. I have both FARGATE and FARGATE_SPOT as capacity providers but no default capacity provider strategy. I still get the same issue: I must have a task definition with an execution role specified when I register the flow. I'm going to back down to 0.14.11 until there is a better workaround than adding a task definition to all my flows.
Jim Crist-Harif
Was that still with a custom task definition, or using the default provided by prefect? Do you get an error saying the task definition is invalid and requires an execution role? Or an error at execution time where e.g. you lack permissions to pull the image (what the execution role is for).
Robert Bastian
The error is "Task definition does not support launch_type FARGATE", the original error. I am using the default task definition YAML. I am using S3 storage with stored_as_script=True. My agent has the execution_role_arn and task_role_arn specified. The only way I can get 0.14.12 to work is to use a custom task definition with the execution_role_arn specified.
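A sketch of that storage setup (the key and local path are hypothetical):
Copy code
from prefect.storage import S3

# stored_as_script=True uploads the flow's source file rather than a pickle.
STORAGE = S3(
    bucket="prefect-rai-dev",
    key="flows/my_flow.py",          # hypothetical S3 key
    stored_as_script=True,
    local_script_path="my_flow.py",  # hypothetical local path
)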
Jim Crist-Harif
Hmmm, I wonder if a default execution role was never created for you since you used the CLI instead? The docs here are a bit vague (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_execution_IAM_role.html), specifically looking at:
> An Amazon ECS task execution role is automatically created for you in the Amazon ECS console first-run experience
Using the default cluster created from the console everything works fine for me, trying to figure out what's different about your setup leading to this.
Robert Bastian
My new cluster was created using the console.
Jim Crist-Harif
right, but there may be some other bit of state that's missing 🤷
In the link I posted above, can you follow the instructions after "You can use the following procedure to check and see if your account already has the Amazon ECS task execution role" to see if that role exists?
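A quick boto3 check for the role, for reference:
Copy code
import boto3
from botocore.exceptions import ClientError

iam = boto3.client("iam")
try:
    role = iam.get_role(RoleName="ecsTaskExecutionRole")
    print(role["Role"]["Arn"])
except ClientError as exc:
    # "NoSuchEntity" means the default role was never created in this account.
    print(exc.response["Error"]["Code"])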
👀 1
Robert Bastian
I do not have the default task execution role. I think we created it manually and called it something else. It looks like it has to be named specifically “ecsTaskExecutionRole”.
Jim Crist-Harif
I wonder if ECS does something implicit if that role exists, which is why I don't get the errors you're getting.
Robert Bastian
I don't have IAM permissions in our AWS tenant. I need to get some help to get it created, then I will retest.
👍 1
Jim Crist-Harif
Thanks for working through this with me, I think we're getting closer to understanding what the issue is here.
Robert Bastian
H2H!
OK, so I changed my task execution role to be named "ecsTaskExecutionRole". This still didn't help. The weird part is that the execution_role_arn I specify does not need to exist: I can put anything, including "junk", to make it work. It just cannot be null. I'm going to drop and recreate the cluster again and see if the behavior changes.
Dropping and recreating the cluster from the console does not help.
Jim Crist-Harif
This is so weird
Hmmm. Perhaps for now I'll revert to setting those on the generated definition & at runtime. This would alleviate your current issue, and only users that were specifying their own definition would run into these limitations (if at all).
Robert Bastian
Can you verify that your task definition in AWS does not have an execution role ARN set?
Jim Crist-Harif
Copy code
{
  "taskDefinition": {
    "taskDefinitionArn": "...",
    "containerDefinitions": [
      {
        "name": "flow",
        "image": "prefecthq/prefect:0.14.12",
        "cpu": 0,
        "portMappings": [],
        "essential": true,
        "environment": [
          {
            "name": "PREFECT__CONTEXT__IMAGE",
            "value": "prefecthq/prefect:0.14.12"
          }
        ],
        "mountPoints": [],
        "volumesFrom": []
      }
    ],
    "family": "prefect-test-ecs",
    "networkMode": "awsvpc",
    "revision": 8,
    "volumes": [],
    "status": "ACTIVE",
    "requiresAttributes": [
      {
        "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
      },
      {
        "name": "ecs.capability.task-eni"
      }
    ],
    "placementConstraints": [],
    "compatibilities": [
      "EC2",
      "FARGATE"
    ],
    "cpu": "1024",
    "memory": "2048"
  }
}
Robert Bastian
Does your AWS user have the policy for the task execution role assigned to it? My task definition has this attribute:
Copy code
"registeredBy": "arn:aws:iam::070551638384:user/prefect-service"
Jim Crist-Harif
nope
Robert Bastian
I did some more testing on this issue and here is what I found. I switched my storage from S3 to GitHub and changed my run_config to omit the image name, and the flow ran as expected. I then switched my storage back to S3, still leaving the image out of the run_config, and the flow ran as expected. Putting the image in the run_config, regardless of storage type, causes the error:
Copy code
Task definition does not support launch_type FARGATE.
You can see in the working task definition below that FARGATE is listed, but the only real difference I can see is that the image is now the default prefect image.
Copy code
{
  "taskDefinition": {
    "taskDefinitionArn": "arn:aws:ecs:us-east-1:070551638384:task-definition/prefect-github-say-hello-flow:8",
    "containerDefinitions": [
      {
        "name": "flow",
        "image": "prefecthq/prefect:0.14.12",
        "cpu": 0,
        "portMappings": [],
        "essential": true,
        "environment": [
          {
            "name": "PREFECT__CONTEXT__IMAGE",
            "value": "prefecthq/prefect:0.14.12"
          }
        ],
        "mountPoints": [],
        "volumesFrom": []
      }
    ],
    "family": "prefect-github-say-hello-flow",
    "networkMode": "awsvpc",
    "revision": 8,
    "volumes": [],
    "status": "INACTIVE",
    "requiresAttributes": [
      {
        "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
      },
      {
        "name": "ecs.capability.task-eni"
      }
    ],
    "placementConstraints": [],
    "compatibilities": [
      "EC2",
      "FARGATE"
    ],
    "cpu": "1024",
    "memory": "2048",
    "registeredAt": 1617140231.937,
    "deregisteredAt": 1617140232.827,
    "registeredBy": "arn:aws:iam::070551638384:user/prefect-service"
  }
}
Just to clarify, my image on ECR is also 0.14.12.
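For reference, a sketch of the variant that worked (no `image` argument, so the agent falls back to the default prefect image; labels are from the earlier config):
Copy code
from prefect.run_configs import ECSRun

# Omitting `image` here avoided the "does not support launch_type FARGATE" error.
RUN_CONFIG = ECSRun(labels=["s3-flow-storage"], memory="512", cpu="256")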
Jim Crist-Harif
Finally was able to reproduce the issue locally, should be fixed by https://github.com/PrefectHQ/prefect/pull/4325 and out in the next release. Thanks for your patience and work investigating this issue.