# ask-community
Robert Bastian
The upgrade from 0.14.6 to 0.14.12 broke my ECS/Fargate implementation. From what I can tell, `requiresCompatibilities` isn't getting set correctly in the task definition that the agent registers with ECS via boto3. My flows worked on 0.14.6, but with the ECS revamp in 0.14.12 they all get this error:
Copy code
[2021-03-15 17:48:45,887] ERROR - rai-fargate | Error while deploying flow
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/prefect/agent/agent.py", line 414, in deploy_and_update_flow_run
    deployment_info = self.deploy_flow(flow_run)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/prefect/agent/ecs/agent.py", line 322, in deploy_flow
    resp = self.ecs_client.run_task(taskDefinition=taskdef_arn, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/botocore/client.py", line 676, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.InvalidParameterException: An error occurred (InvalidParameterException) when calling the RunTask operation: Task definition does not support launch_type FARGATE.
I checked the registered task definition and I can see:
Copy code
"compatibilities": [
    "EC2"
  ],
When I do the same with 0.14.6 I see:
Copy code
"compatibilities": [
    "EC2",
    "FARGATE"
  ],
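For anyone checking the same thing, a minimal boto3 sketch (the family name is a placeholder for whatever task definition the agent registered):
Copy code
import boto3

# Placeholder family name -- substitute the task definition the agent registered.
ecs = boto3.client("ecs", region_name="us-east-1")
resp = ecs.describe_task_definition(taskDefinition="prefect-my-flow")

# ECS computes "compatibilities" from the definition's properties; FARGATE only
# appears when the definition satisfies Fargate's requirements.
print(resp["taskDefinition"]["compatibilities"])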
Thx!
Zanie
Hi @Robert Bastian -- there's a tracking issue for this at https://github.com/PrefectHQ/prefect/issues/4243
Would you mind explaining a bit more about your setup? We don't manually set the compatibilities.
Robert Bastian
@Zanie Please let me know what else I can provide: Here is my ECSRun:
Copy code
RUN_CONFIG = ECSRun(
    labels=["s3-flow-storage", "rai-fargate-local"],
    image="{redacted}.<http://dkr.ecr.us-east-1.amazonaws.com/prefect-aws:latest|dkr.ecr.us-east-1.amazonaws.com/prefect-aws:latest>",
    memory="512",
    cpu="256",
)
Storage:
Copy code
STORAGE = S3(bucket="prefect-rai-dev")
Agent:
Copy code
prefect agent ecs start --token {token} --cluster RAI --task-role-arn=arn:aws:iam::{redacted}:role/Prefect_Container_Role --execution-role-arn=arn:aws:iam::{redacted}:role/Prefect_Task_Execution_Role --log-level DEBUG --label rai-fargate-local --label s3-flow-storage --name rai-fargate
One other thing I can say after reading the thread: I have flows that pass a networkConfiguration and flows that don't. Neither works after upgrading to 0.14.12. I also tried explicitly setting the launch type to "FARGATE"; that also did not work.
Jim Crist-Harif
Did you ever set a custom task definition template before? We never set `compatibilities` in prefect itself, so it's not clear where this change is coming from.
We did, however, cache task definitions, and only registered a new task definition when the version of the flow changed. Because run configs can now be configured on individual flow runs, we can no longer do this (it turns out registering a new definition for each run isn't that bad anyway), but previously an existing task definition would have persisted until the flow was re-registered.
Robert Bastian
I never set a custom task definition, I don’t believe.
I’ve always relied on it being generated.
Jim Crist-Harif
Hmmm
When you upgraded prefect, did it also bump the boto3 version? Not sure if a change in that would matter, just trying to eliminate other options.
Alternatively, if you downgrade prefect (but keep the rest of your environment exactly the same) do things go back to working?
Robert Bastian
Regarding boto3: I don’t believe so because I’ve locked boto3 in the container that runs the ECS agent. Downgrading to 0.14.6 gets flows running again. I was going to try 0.14.11 just to narrow it a bit more.
@Jim Crist-Harif Sorry, I misspoke. Here is my Dockerfile for the Agent:
Copy code
FROM python:3.9-slim-buster

ENV PREFECT_VERSION=0.14.6

RUN apt-get update && apt-get install -y gcc

RUN pip install prefect[aws]==${PREFECT_VERSION}

COPY agent.py /agent.py
COPY --from=arpaulnet/s6-overlay-stage:2.0 / /

ENTRYPOINT ["/init"]

CMD ["python", "agent.py"]
I don’t explicitly install boto3. I’m relying on the Prefect “extras”
In the container runtime - where the task executes, I do this: RUN pip install boto3==1.16.44
Jim Crist-Harif
It might be a boto3/botocore change then. If you could inspect the environment that was successfully running the agent to determine the working version of boto3 & botocore, and then try it with prefect 0.14.12 but with the same boto3 and botocore versions as before that'd be useful.
👀 1
This should only affect the agent environment, so it shouldn't matter what image you're using to run your flows.
Zanie
The only thing I can find in AWS-land is https://github.com/aws/aws-cli/issues/3983
Jim Crist-Harif
Hmmm, so maybe we should be setting `requiresCompatibilities` on our task definitions. Still doesn't explain why the compatibilities changed for Robert when prefect was upgraded though (since we never set those and still don't).
Robert Bastian
Jim -
Copy code
root@f88173e2ea59:/# python
Python 3.9.2 (default, Mar 12 2021, 19:04:51)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import boto3
>>> print(boto3.__version__)
1.17.27
>>> import botocore
>>> print(botocore.__version__)
1.20.27
>>> import prefect
>>> print(prefect.__version__)
0.14.12
Zanie
I also discovered this useful changelog (it's otherwise very hard to find where AWS documents API changes): https://awsapichanges.info/archive/service/ecs/
Robert Bastian
0.14.6:
Copy code
root@9b11bf76222c:/# python
Python 3.9.2 (default, Mar 12 2021, 19:04:51)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import prefect
>>> print(prefect.__version__)
0.14.6
>>> import boto3
>>> print(boto3.__version__)
1.17.27
>>> import botocore
>>> print (botocore.__version__)
1.20.27
Later I’ll do a completely new flow from scratch to see if it works on 0.14.12.
Jim Crist-Harif
That's unexpected. The image you're using with an older version of prefect (0.14.6) is using the (almost) latest version of boto3 (released 3 days ago)?
(not that there's anything wrong with that, I was just expecting there to be a difference in versions)
While I'm still curious why things changed, it'd also be good to see if we can get things working for you on 0.14.12. Can you try adding a custom task definition to set `requiresCompatibilities` on your task definition? Something like this should work:
Copy code
import yaml

from prefect.run_configs import ECSRun

definition = yaml.safe_load(
"""
networkMode: awsvpc
cpu: 1024
memory: 2048
requiresCompatibilities:
  - FARGATE
containerDefinitions:
  - name: flow
"""
)

# Assumes an existing `flow` object.
flow.run_config = ECSRun(task_definition=definition)
Robert Bastian
FYI - 0.14.11 still works also, so this is definitely related to 0.14.12.
@Jim Crist-Harif this worked:
Copy code
definition = yaml.safe_load(
    """
    networkMode: awsvpc
    cpu: 1024
    memory: 2048
    requiresCompatibilities:
        - FARGATE
    containerDefinitions:
        - name: flow
    executionRoleArn: arn:aws:iam::{redacted}:role/Prefect_Task_Execution_Role
    """
)

RUN_CONFIG = ECSRun(
    image='{redacted}.dkr.ecr.us-east-1.amazonaws.com/prefect-aws:latest',
    task_definition=definition,
    labels=["s3-flow-storage"],
    memory="512",
    cpu="256",
)
It did take me 23 revisions to get the YAML just right (I hate YAML, btw). I also don’t like that the name must equal “flow”. That doesn’t seem intuitive.
Jim Crist-Harif
Excellent, glad to hear it. Still confused about what changed in 0.14.12 that broke things for you, but we can set `requiresCompatibilities` ourselves (I think) and everything should still work. Should get a PR in for the next release.
> I hate YAML, btw
YAML isn't required here, it was just the quickest way I could copy over the existing definition and add the bit you needed. A nested dict/list structure in Python would have worked just as well. If you specify a custom definition via `task_definition_path`, then the file must contain YAML (or JSON, which is a subset of YAML).
> I also don't like that the name must equal "flow". That doesn't seem intuitive.
Since task definitions may contain multiple containers, we need a way to know which container holds the flow so prefect can fill in the image/command/etc. To make it easier for users to define sidecar containers and leave the main container undefined, we use the name `flow` as a designator rather than assuming the first container definition. This is done for both the ECS and k8s agents. Note that if you don't customize the containers at all, you can leave the `containerDefinitions` field out completely.
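For reference, a sketch of the same definition as a plain Python structure (assuming an existing `flow` object, as in the YAML version above):
Copy code
from prefect.run_configs import ECSRun

# Equivalent to the YAML example above; "flow" is the container name prefect
# looks for when filling in the image/command.
definition = {
    "networkMode": "awsvpc",
    "cpu": 1024,
    "memory": 2048,
    "requiresCompatibilities": ["FARGATE"],
    "containerDefinitions": [{"name": "flow"}],
}

flow.run_config = ECSRun(task_definition=definition)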
👍 1
Robert Bastian
@Jim Crist-Harif OK, I went back and examined the task definitions submitted by 0.14.6 and 0.14.12 and used a diff tool to compare them. The crux seems to be that 0.14.6 set the task execution role in the task definition while 0.14.12 did not. Because the role was set, ECS validated that the flow was compatible with FARGATE. When the role is not there, FARGATE is not an option, because (I assume) you need special permissions in the execution role. I verified this by removing `requiresCompatibilities` from the YAML but leaving the task execution role, and flows work on 0.14.12. Interestingly, setting the task execution role in the RUN_CONFIG does not fix the issue. BTW, when you specify FARGATE in `requiresCompatibilities` you are forced to set the execution role ARN.
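A minimal boto3 sketch of that comparison (the family name and role ARN are placeholders):
Copy code
import boto3

ecs = boto3.client("ecs")

base = {
    "family": "compat-test",  # placeholder family name
    "networkMode": "awsvpc",
    "cpu": "256",
    "memory": "512",
    "containerDefinitions": [
        {"name": "flow", "image": "prefecthq/prefect:0.14.12", "essential": True}
    ],
}

# Without an execution role (mirrors what 0.14.12 generated):
print(ecs.register_task_definition(**base)["taskDefinition"]["compatibilities"])

# With one (mirrors 0.14.6); the ARN below is a placeholder:
print(
    ecs.register_task_definition(
        **base,
        executionRoleArn="arn:aws:iam::123456789012:role/Prefect_Task_Execution_Role",
    )["taskDefinition"]["compatibilities"]
)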
Jim Crist-Harif
> Because the role was set, ECS validated that the flow was compatible with FARGATE. When the role is not there, FARGATE is not an option because you need (I assume) special permissions in the execution role.
Ah, interesting. That doesn't happen in my test cluster (0.14.12 deploys fine on Fargate there), so I wonder if there's some cluster-level setting that leads to this behavior? ECS has too many knobs. Since `task_role_arn`/`execution_role_arn` are top-level settings on both the agent and `ECSRun` objects, we moved to setting these at runtime so user-provided templates wouldn't need them. It's annoying that `execution_role_arn` seems to be required for at least your configuration. Thanks for the info, this is all useful for figuring out how we should resolve this.
What type of ECS cluster are you using? (The doc here refers to them as "Networking only", "EC2 Linux + Networking", and "EC2 Windows + Networking").
I've been testing fargate usage with a "Networking only" cluster with all the default settings, and things seem to work fine. Trying to figure out what differs between our setups.
Robert Bastian
Copy code
[rbastian@E007254-MAR18 ecs (master)]$ aws ecs describe-clusters --clusters RAI
{
    "clusters": [
        {
            "clusterArn": "arn:aws:ecs:us-east-1:{redacted}:cluster/RAI",
            "clusterName": "RAI",
            "status": "ACTIVE",
            "registeredContainerInstancesCount": 0,
            "runningTasksCount": 0,
            "pendingTasksCount": 0,
            "activeServicesCount": 0,
            "statistics": [],
            "tags": [],
            "settings": [
                {
                    "name": "containerInsights",
                    "value": "enabled"
                }
            ],
            "capacityProviders": [],
            "defaultCapacityProviderStrategy": []
        }
    ],
    "failures": []
}
Even after setting the cluster's default capacity provider to FARGATE and adding FARGATE as a capacity provider, I still need to set the execution_role_arn on the task definition.
Jim Crist-Harif
I have `"capacityProviders": ["FARGATE", "FARGATE_SPOT"]`, but other than that we're identical. How did you create a cluster without capacity providers? When creating a new cluster via the console with no other configuration, this is what I get.
Ah, with the CLI you don't get any by default. That's annoying. OK, assuming that with no `capacityProviders` I can replicate your issue, I think I have enough to go on now. Thanks for your help in getting to the bottom of this issue!
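For reference, a boto3 sketch of attaching them to an existing cluster after the fact (the cluster name is a placeholder):
Copy code
import boto3

ecs = boto3.client("ecs")

# Attach the Fargate capacity providers that console-created clusters get by default.
ecs.put_cluster_capacity_providers(
    cluster="RAI",
    capacityProviders=["FARGATE", "FARGATE_SPOT"],
    defaultCapacityProviderStrategy=[{"capacityProvider": "FARGATE", "weight": 1}],
)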
👍 1
Robert Bastian
I destroyed my cluster and recreated it from the console. I have both FARGATE and FARGATE_SPOT as capacity providers but no default capacity provider strategy. I still get the same issue: I must have a task definition with an execution role specified when I register the flow. I'm going to back down to 0.14.11 until there is a better workaround than adding a task definition to all my flows.
Jim Crist-Harif
Was that still with a custom task definition, or using the default provided by prefect? Do you get an error saying the task definition is invalid and requires an execution role? Or an error at execution time where e.g. you lack permissions to pull the image (what the execution role is for).
Robert Bastian
The error is "Task definition does not support launch_type FARGATE", the original error. I am using the default task definition YAML. I am using S3 storage with stored_as_script=True. My agent has the execution_role_arn and task_role_arn specified. The only way I can get 0.14.12 to work is to use a custom task definition with the execution_role_arn specified.
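A sketch of that storage setup (the key and local path are hypothetical):
Copy code
from prefect.storage import S3

# stored_as_script=True uploads the flow's source file rather than a pickle.
STORAGE = S3(
    bucket="prefect-rai-dev",
    key="flows/my_flow.py",          # hypothetical S3 key
    stored_as_script=True,
    local_script_path="my_flow.py",  # hypothetical local path
)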
Jim Crist-Harif
Hmmm, I wonder if a default execution role was never created for you since you used the CLI instead? The docs here are a bit vague (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_execution_IAM_role.html), specifically looking at:
> An Amazon ECS task execution role is automatically created for you in the Amazon ECS console first-run experience
Using the default cluster created from the console everything works fine for me, trying to figure out what's different about your setup leading to this.
Robert Bastian
My new cluster was created using the console.
Jim Crist-Harif
right, but there may be some other bit of state that's missing 🤷
In the link I posted above, can you follow the instructions after "You can use the following procedure to check and see if your account already has the Amazon ECS task execution role" to see if that role exists?
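A quick boto3 check for the role, for reference:
Copy code
import boto3
from botocore.exceptions import ClientError

iam = boto3.client("iam")
try:
    role = iam.get_role(RoleName="ecsTaskExecutionRole")
    print(role["Role"]["Arn"])
except ClientError as exc:
    # "NoSuchEntity" means the default role was never created in this account.
    print(exc.response["Error"]["Code"])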
👀 1
Robert Bastian
I do not have the default task execution role. I think we created it manually and called it something else. It looks like it has to be named specifically “ecsTaskExecutionRole”.
Jim Crist-Harif
I wonder if ECS does something implicit if that role exists, which is why I don't get the errors you're getting.
Robert Bastian
I don't have IAM permissions in our AWS tenant. I need to get some help to get it created, then I will retest.
👍 1
Jim Crist-Harif
Thanks for working through this with me, I think we're getting closer to understanding what the issue is here.
Robert Bastian
H2H!
OK, so I changed my task execution role to be named "ecsTaskExecutionRole". This still didn't help. The weird part is that the execution_role_arn I specify does not need to exist: I can put anything, including "junk", to make it work. It just cannot be null. I'm going to drop and recreate the cluster again and see if the behavior changes.
Dropping and recreating the cluster from the console does not help.
Jim Crist-Harif
This is so weird
Hmmm. Perhaps for now I'll revert to setting those on the generated definition & at runtime. This would alleviate your current issue, and only users that were specifying their own definition would run into these limitations (if at all).
Robert Bastian
Can you verify that your task definition in AWS does not have an execution role ARN set?
Jim Crist-Harif
Copy code
{
  "taskDefinition": {
    "taskDefinitionArn": "...",
    "containerDefinitions": [
      {
        "name": "flow",
        "image": "prefecthq/prefect:0.14.12",
        "cpu": 0,
        "portMappings": [],
        "essential": true,
        "environment": [
          {
            "name": "PREFECT__CONTEXT__IMAGE",
            "value": "prefecthq/prefect:0.14.12"
          }
        ],
        "mountPoints": [],
        "volumesFrom": []
      }
    ],
    "family": "prefect-test-ecs",
    "networkMode": "awsvpc",
    "revision": 8,
    "volumes": [],
    "status": "ACTIVE",
    "requiresAttributes": [
      {
        "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
      },
      {
        "name": "ecs.capability.task-eni"
      }
    ],
    "placementConstraints": [],
    "compatibilities": [
      "EC2",
      "FARGATE"
    ],
    "cpu": "1024",
    "memory": "2048"
  }
}
Robert Bastian
Does your AWS user have the policy for the task execution role assigned to it? My task definition has this attribute:
Copy code
"registeredBy": "arn:aws:iam::070551638384:user/prefect-service"
Jim Crist-Harif
nope
Robert Bastian
I did some more testing on this issue and here is what I found. I switched my storage from S3 to GitHub and changed my run_config to omit the image name, and the flow ran as expected. I then switched my storage back to S3, still leaving the image out of the run_config, and the flow ran as expected. Putting the image in the run_config, regardless of storage type, causes the error:
Copy code
Task definition does not support launch_type FARGATE.
You can see in the working task definition below that FARGATE is listed, but the only real difference I can see is that the image is now the default prefect image.
Copy code
{
  "taskDefinition": {
    "taskDefinitionArn": "arn:aws:ecs:us-east-1:070551638384:task-definition/prefect-github-say-hello-flow:8",
    "containerDefinitions": [
      {
        "name": "flow",
        "image": "prefecthq/prefect:0.14.12",
        "cpu": 0,
        "portMappings": [],
        "essential": true,
        "environment": [
          {
            "name": "PREFECT__CONTEXT__IMAGE",
            "value": "prefecthq/prefect:0.14.12"
          }
        ],
        "mountPoints": [],
        "volumesFrom": []
      }
    ],
    "family": "prefect-github-say-hello-flow",
    "networkMode": "awsvpc",
    "revision": 8,
    "volumes": [],
    "status": "INACTIVE",
    "requiresAttributes": [
      {
        "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
      },
      {
        "name": "ecs.capability.task-eni"
      }
    ],
    "placementConstraints": [],
    "compatibilities": [
      "EC2",
      "FARGATE"
    ],
    "cpu": "1024",
    "memory": "2048",
    "registeredAt": 1617140231.937,
    "deregisteredAt": 1617140232.827,
    "registeredBy": "arn:aws:iam::070551638384:user/prefect-service"
  }
}
Just to clarify, my image on ECR is also 0.14.12.
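For reference, a sketch of the variant that worked (no `image` argument, so the agent falls back to the default prefect image; labels are from the earlier config):
Copy code
from prefect.run_configs import ECSRun

# Omitting `image` here avoided the "does not support launch_type FARGATE" error.
RUN_CONFIG = ECSRun(labels=["s3-flow-storage"], memory="512", cpu="256")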
Jim Crist-Harif
Finally was able to reproduce the issue locally, should be fixed by https://github.com/PrefectHQ/prefect/pull/4325 and out in the next release. Thanks for your patience and work investigating this issue.