Prefect Community #ask-community

Matthew Millendorf

04/07/2021, 6:12 PM

Hello, I am looking to use Prefect to help scale some large batch processes with ECS (Fargate) and Dask and am wondering if anyone has any insight on doing this or could point me to some resources. Additionally, I am having some difficulty figuring out why my submitted Flow’s ECS Task immediately becomes INVALID when provisioning resources.

Kevin Kho

04/07/2021, 6:19 PM

Hi @Matthew Millendorf, what Prefect version are you on?

Kevin Kho

04/07/2021, 6:20 PM

Is the error you’re seeing something like this?

Matthew Millendorf

04/07/2021, 6:26 PM

Hi @Kevin Kho, thanks for the prompt response. I did see this thread and unfortunately it’s not the same error. On the terminal where I am running my agent, there is no error; the last INFO log is ‘Deploying flow run <uuid>’. In my ECS console, the Task Definition immediately becomes INACTIVE and the status goes from PROVISIONING to PENDING almost immediately.

Matthew Millendorf

04/07/2021, 6:29 PM

And I am on prefect 0.14.12 btw

Kevin Kho

04/07/2021, 6:32 PM

OK, the issue above was fixed in 0.14.15, but I think you have a separate issue anyway. Let me ask the team and get back to you. Is this a flow you were able to successfully run before?

Kevin Kho

04/07/2021, 6:36 PM

Where is your Agent running?

Matthew Millendorf

04/07/2021, 6:36 PM

Okay cool, thanks for the help. Yep, I’m actually just using the Flow from this PR - pretty barebones. https://github.com/PrefectHQ/prefect/pull/3585

Matthew Millendorf

04/07/2021, 6:39 PM

Additionally, just curious if this is the optimal way to find a solution to my problem. Trying to run some large parallelized Flows. Since Dask handles the execution across many nodes, I figured I could use Fargate to provide a bunch of containers and thus, that nice Prefect map function can run across many containers and get some good speed ups.

Matthew Millendorf

04/07/2021, 6:40 PM

Wondering if you’ve seen better ways to scale Flows in this manner.

Kevin Kho

04/07/2021, 6:42 PM

Are you using your own task definition? or the default Prefect one?

Kevin Kho

04/07/2021, 6:42 PM

Yes this sounds like the way to go to scale your flows
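
For reference, the pattern discussed here (a mapped task fanned out over Dask workers on Fargate) might look roughly like the sketch below on Prefect 0.14.x. It assumes the optional dask-cloudprovider package is installed and AWS credentials are configured. The flow name, the `process` task, and the worker count are made up for illustration.

```python
from prefect import Flow, task
from prefect.executors import DaskExecutor


@task
def process(item: int) -> int:
    # Placeholder for one unit of batch work.
    return item * 2


with Flow("batch-map-example") as flow:  # hypothetical flow name
    # .map creates one task run per element; with a Dask executor
    # these runs can be scheduled across the cluster's workers.
    results = process.map(list(range(100)))

# Ask Prefect to create a temporary Dask cluster on Fargate for each
# flow run. n_workers=4 is an arbitrary illustrative value.
flow.executor = DaskExecutor(
    cluster_class="dask_cloudprovider.aws.FargateCluster",
    cluster_kwargs={"n_workers": 4},
)
```

The executor only controls where the task runs execute; the flow run itself is still launched by whichever agent and run config the flow uses.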

Kevin Kho

04/07/2021, 6:59 PM

Is your flow using a lot of memory from the start?

Matthew Millendorf

04/07/2021, 8:11 PM

So I started with my own task definition and runtime parameters but am now back to barebones where all values for the ECS run config and the ECS agent are default except for the name of my cluster and the launch type as fargate. Same issue still persists. Just verifying, the ECS agent does not need to be running on AWS correct? I have it running locally.

Kevin Kho

04/07/2021, 8:12 PM

yeah that should be fine. I think it may be a memory issue and you need to up the resources.

Kevin Kho

04/07/2021, 8:13 PM

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-cpu-memory-error.html

Kevin Kho

04/07/2021, 8:23 PM

Does your agent pick up the flow (on the local command line)?

Kevin Kho

04/07/2021, 8:24 PM

https://aws.amazon.com/premiumsupport/knowledge-center/ecs-tasks-stuck-pending-state/ this says you may need to increase container instance resources as well

Matthew Millendorf

04/07/2021, 8:25 PM

just maxed out CPU and Memory on the task definition and still having the issue.

[2021-04-07 20:23:19,677] INFO - agent | Waiting for flow runs...
[2021-04-07 20:24:05,617] INFO - agent | Found 1 flow run(s) to submit for execution.
[2021-04-07 20:24:05,712] INFO - agent | Deploying flow run '840c7de3-e8a0-4d28-892c-4f66f7f0ddbe'

yep local agent is picking it up. I see my task on my ECS console, it is just stuck in pending state.
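
When a task sits in PENDING or stops right away, the ECS API usually records a reason for it. Below is a minimal sketch for pulling that out with boto3, assuming AWS credentials are configured. The helper names `summarize_task` and `describe` are made up here, and the cluster and task ARN would be whatever the ECS console shows.

```python
from typing import Any, Dict, List


def summarize_task(task: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the status and failure-reason fields out of one task entry
    from an ECS describe_tasks response."""
    return {
        "lastStatus": task.get("lastStatus"),
        "stoppedReason": task.get("stoppedReason"),
        "containers": [
            {
                "name": c.get("name"),
                "exitCode": c.get("exitCode"),
                "reason": c.get("reason"),
            }
            for c in task.get("containers", [])
        ],
    }


def describe(cluster: str, task_arn: str) -> List[Dict[str, Any]]:
    """Fetch one task from ECS and summarize it (needs boto3 and credentials)."""
    import boto3  # imported lazily so summarize_task works without it

    ecs = boto3.client("ecs")
    response = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])
    return [summarize_task(t) for t in response["tasks"]]
```

The `stoppedReason` and per-container `reason` fields are often the quickest way to tell a resource problem apart from an error raised inside the container.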

Matthew Millendorf

04/07/2021, 8:26 PM

Okay, I’ll give that a shot.

Kevin Kho

04/07/2021, 8:28 PM

OK, that’s good. So it’s a bit clearer that the issue is on the ECS side.

Matthew Millendorf

04/07/2021, 8:30 PM

Yeah. My thought, though, is that at the company I am at, we use Fargate quite a bit, so I’m pretty much re-using task definitions and runtime parameters from our repository here. That’s why I’m a bit confused. But just clarifying: if the agent sends the task to ECS, it’s got to be on the ECS side? I’m a bit curious about the agent overwriting some fields for the task definition and runtime parameters. I notice sometimes the Prefect image gets used and not the one I specified, and the same goes for some other params.

Kevin Kho

04/07/2021, 8:39 PM

I see. Yes, I’m more inclined to believe this is on the ECS side because the flow run was deployed. Yes, you can override some task definition parameters through the Prefect RunConfig. Check examples here. Maybe you want the last example?
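
As a rough illustration of overriding task settings through the run config on Prefect 0.14.x: the image, CPU, memory, and cluster name below mirror values that appear elsewhere in this thread, while the flow and task names are placeholders.

```python
from prefect import Flow, task
from prefect.run_configs import ECSRun


@task
def say_hello():
    print("hello")


with Flow("hello-flow") as flow:  # placeholder flow name
    say_hello()

flow.run_config = ECSRun(
    image="prefecthq/prefect:latest-python3.6",
    cpu="1024",     # task-level CPU units (1 vCPU)
    memory="2048",  # task-level memory in MiB
    # Extra keyword arguments forwarded to the ECS run_task call.
    run_task_kwargs={"cluster": "prefect-cluster"},
)
```

Settings left out of the run config are filled in from elsewhere (for example the agent's defaults), which could explain seeing a different image than expected.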

Kevin Kho

04/07/2021, 8:40 PM

I will mention though that debugging ECS is a pain point on our front. There was someone with different issues last Monday. This is on our radar and we intend to produce a minimum working example in a blog post in the next week.

Matthew Millendorf

04/07/2021, 8:42 PM

Cool cool - well, thank you for the help, Kevin.

Kevin Kho

04/07/2021, 8:42 PM

I will add it on my list to ping you when I finish that

Matthew Millendorf

04/07/2021, 8:43 PM

great thanks!

Eddie Obropta

04/09/2021, 5:59 PM

Hi @Kevin Kho, I work with Matthew. I also attempted a super simple flow without a custom task definition and got a similar issue.

Eddie Obropta

04/09/2021, 5:59 PM

I tried with Prefect Cloud instead of Prefect Server.

Eddie Obropta

04/09/2021, 6:01 PM

This was the task definition that Prefect generated:

{
  "ipcMode": null,
  "executionRoleArn": "arn:aws:iam::XXXXX:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "dnsSearchDomains": null,
      "environmentFiles": null,
      "logConfiguration": null,
      "entryPoint": null,
      "portMappings": [],
      "command": null,
      "linuxParameters": null,
      "cpu": 0,
      "environment": [
        {
          "name": "PREFECT__CONTEXT__IMAGE",
          "value": "prefecthq/prefect:latest-python3.6"
        }
      ],
      "resourceRequirements": null,
      "ulimits": null,
      "dnsServers": null,
      "mountPoints": [],
      "workingDirectory": null,
      "secrets": null,
      "dockerSecurityOptions": null,
      "memory": null,
      "memoryReservation": null,
      "volumesFrom": [],
      "stopTimeout": null,
      "image": "prefecthq/prefect:latest-python3.6",
      "startTimeout": null,
      "firelensConfiguration": null,
      "dependsOn": null,
      "disableNetworking": null,
      "interactive": null,
      "healthCheck": null,
      "essential": true,
      "links": null,
      "hostname": null,
      "extraHosts": null,
      "pseudoTerminal": null,
      "user": null,
      "readonlyRootFilesystem": null,
      "dockerLabels": null,
      "systemControls": null,
      "privileged": null,
      "name": "flow"
    }
  ],
  "placementConstraints": [],
  "memory": "2048",
  "taskRoleArn": null,
  "compatibilities": [
    "EC2",
    "FARGATE"
  ],
  "taskDefinitionArn": "arn:aws:ecs:us-west-2:XXXXXXX:task-definition/prefect-hello-flow:4",
  "family": "prefect-hello-flow",
  "requiresAttributes": [
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.task-eni"
    }
  ],
  "pidMode": null,
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "networkMode": "awsvpc",
  "cpu": "1024",
  "revision": 4,
  "status": "INACTIVE",
  "inferenceAccelerators": null,
  "proxyConfiguration": null,
  "volumes": [],
  "statusString": "(INACTIVE)"
}

I “xxx’d” out our aws account number.

Eddie Obropta

04/09/2021, 6:02 PM

It’s the default ecs task execution role in IAM:

Eddie Obropta

04/09/2021, 6:03 PM

prefect agent ecs start --cluster arn:aws:ecs:us-west-2:XXXXX:cluster/prefect-cluster --execution-role-arn arn:aws:iam::XXXXX:role/ecsTaskExecutionRole --label Raptor-Macbook-Raptor-2.local

And here’s how I started my ECS agent. ^ I think with those messages you have everything you need to help us debug.

Eddie Obropta

04/09/2021, 6:05 PM

I think it’s just throwing a “ModuleNotFoundError” the moment the container spins up, but I have no idea where that is getting set or needs to be overridden.

Kevin Kho

04/09/2021, 6:05 PM

The error seems like it might be the Dockerfile. Could you share that?

Eddie Obropta

04/09/2021, 6:06 PM

So the only place I specify a Docker image is in my hello_flow.py file here:

flow.run_config = ECSRun(
    image="prefecthq/prefect:latest-python3.6")

Eddie Obropta

04/09/2021, 6:06 PM

^ and it’s the image from prefect. No modifications.

Kevin Kho

04/09/2021, 6:07 PM

Ok I will look into this

Eddie Obropta

04/09/2021, 6:09 PM

Sweet. Thanks!

Eddie Obropta

04/09/2021, 6:12 PM

Here’s a better screenshot of Prefect Cloud. Realized that was a small screenshot I previously sent.

🙏 1

Kevin Kho

04/09/2021, 6:20 PM

@Eddie Obropta, there is one more thing that will help me. Could you give me the output of prefect diagnostics?

Kevin Kho

04/09/2021, 6:22 PM

I assume the agent is running on the same machine that registered the flow?

Eddie Obropta

04/09/2021, 6:25 PM

Yep, I registered the flow from my Mac and have the ECS agent running in another terminal window on my Mac:

$ prefect diagnostics
{
  "config_overrides": {
    "cloud": {
      "agent": {
        "auth_token": true
      }
    }
  },
  "env_vars": [],
  "system_information": {
    "platform": "Darwin-19.6.0-x86_64-i386-64bit",
    "prefect_backend": "cloud",
    "prefect_version": "0.14.15",
    "python_version": "3.6.8"
  }
}

Kevin Kho

04/09/2021, 6:29 PM

thanks!

👍 1

Kevin Kho

04/09/2021, 7:25 PM

Hi @Eddie Obropta, the error we’re seeing here happens because there is a default storage called LocalStorage. By default LocalStorage will save your serialized flow in Users/username/. If you combine LocalRun with LocalStorage, what happens is you’re telling Prefect that the serialized script lives in that Users/username location. The agent will find that serialized Flow and unpickle it and run it.
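
A small stand-alone analogy for the point above, not Prefect's actual code: plain pickle and two temporary directories stand in for the storage and for two separate filesystems (say, a laptop and a container). The `save`, `load`, and `demo` names and the stored location are made up for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def save(obj, root: Path, location: str) -> None:
    """Serialize obj to root/location, creating parent folders."""
    path = root / location
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(pickle.dumps(obj))


def load(root: Path, location: str):
    """Deserialize whatever is stored at root/location."""
    return pickle.loads((root / location).read_bytes())


def demo() -> dict:
    location = "flows/hello-flow.pkl"  # hypothetical stored location
    with tempfile.TemporaryDirectory() as laptop, \
            tempfile.TemporaryDirectory() as container:
        # "Registering" writes the file on the laptop only.
        save({"name": "hello-flow"}, Path(laptop), location)
        # Loading on the same filesystem works.
        same_machine = load(Path(laptop), location)
        # The other filesystem never received the file.
        try:
            load(Path(container), location)
            other_machine = "found"
        except FileNotFoundError:
            other_machine = "missing"
    return {"same_machine": same_machine, "other_machine": other_machine}
```

The recorded location is only meaningful on the machine that wrote the file, so a process on the same machine can load the flow while one with a different filesystem cannot.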

Kevin Kho

04/09/2021, 7:29 PM

So now what happens is that we’re using the default LocalStorage along with the ECSRun config. It downloads the image and tries to find that serialized flow inside the Docker container. But it doesn’t live there because it’s on LocalStorage. For your specific example, we need to configure a Storage that your Agent can reach. Since you’re already on AWS, we recommend S3 Storage. Specifying the S3 storage will store your script when you register the flow. Now your ECSRun will be able to grab that script correctly (the error is because it incorrectly tried to grab it from Users/username/).
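
A minimal sketch of that suggestion on Prefect 0.14.x, assuming the flow from earlier in the thread. The bucket name and project name are hypothetical, and the task bodies are placeholders.

```python
from prefect import Flow, task
from prefect.run_configs import ECSRun
from prefect.storage import S3


@task
def say_hello():
    print("hello")


with Flow("hello-flow") as flow:  # placeholder flow name
    say_hello()

# Upload the flow to S3 at registration time so the ECS task can
# download it, instead of looking for a path on the laptop.
flow.storage = S3(bucket="my-prefect-flows")  # hypothetical bucket name
flow.run_config = ECSRun(image="prefecthq/prefect:latest-python3.6")

# flow.register(project_name="my-project")  # hypothetical project name
```

The ECS task would also need permission to read from the bucket, for example through a task role.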

Kevin Kho

04/09/2021, 7:31 PM

I have spoken with the core team and I will be creating an issue to have more descriptive messages around the usage of the default LocalStorage.

Eddie Obropta

04/09/2021, 7:48 PM

Hi Kevin! Sweet. We’ll give that a shot. Thanks for digging in.

Kevin Kho

04/09/2021, 7:50 PM

Here is our issue for this: https://github.com/PrefectHQ/prefect/issues/4386

👏 1

Kevin Kho

04/09/2021, 7:52 PM

And I haven’t forgotten about Matthew! Will ping him when I get a tutorial together 😅
