# ask-community
m
Hello, I am looking to use Prefect to help scale some large batch processes with ECS (Fargate) and Dask and am wondering if anyone has any insight on doing this or could point me to some resources.  Additionally, I am having some difficulty figuring out why my submitted Flow’s ECS Task immediately becomes INVALID when provisioning resources.
k
Hi @Matthew Millendorf, what Prefect version are you on?
Is the error you’re seeing something like this?
m
Hi @Kevin Kho, thanks for the prompt response. I did see this thread and unfortunately it’s not the same error. On the terminal where I am running my agent, there is no error; the last INFO log is ‘Deploying flow run <uuid>’. In my ECS Console, the Task Definition immediately becomes INACTIVE and the status goes from PROVISIONING to PENDING almost immediately.
And I am on prefect 0.14.12 btw
k
Ok, the issue above was fixed in 0.14.15, but I think you have a separate issue anyway. Let me ask the team and get back to you. Is this a flow you were able to successfully run before?
Where is your Agent running?
m
Okay cool, thanks for the help. Yep, I’m actually just using the Flow from this PR - pretty barebones. https://github.com/PrefectHQ/prefect/pull/3585
Additionally, just curious if this is the optimal way to find a solution to my problem. Trying to run some large parallelized Flows. Since Dask handles the execution across many nodes, I figured I could use Fargate to provide a bunch of containers and thus, that nice Prefect map function can run across many containers and get some good speed ups.
Wondering if you’ve seen better ways to scale Flows in this manner.
k
Are you using your own task definition? or the default Prefect one?
Yes this sounds like the way to go to scale your flows
Is your flow using a lot of memory from the start?
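For reference, a minimal sketch of the setup described above: a mapped task fanned out over a temporary Dask cluster on Fargate. It assumes Prefect 0.14.x and the separate dask-cloudprovider package. The flow name, the `process` task, the helper names, the image, and the worker count are hypothetical placeholders, not values from this thread.

```python
def fargate_dask_kwargs(image, n_workers):
    """Keyword arguments passed through to dask_cloudprovider's FargateCluster."""
    return {"image": image, "n_workers": n_workers}


def build_mapped_flow(items, image, n_workers=4):
    """Build a flow whose mapped task runs on a Dask cluster created on Fargate."""
    # Imported lazily so the helper above can be used without Prefect installed.
    from prefect import Flow, task
    from prefect.executors import DaskExecutor

    @task
    def process(item):
        # Hypothetical per-item work; replace with the real batch step.
        return item * 2

    with Flow("mapped-batch") as flow:
        # One task run per element; Dask spreads these across its workers.
        process.map(items)

    # The executor creates the cluster when the flow run starts
    # and tears it down when the run finishes.
    flow.executor = DaskExecutor(
        cluster_class="dask_cloudprovider.aws.FargateCluster",
        cluster_kwargs=fargate_dask_kwargs(image, n_workers),
    )
    return flow
```

The image given to the Dask workers would need the same Python packages as the image that runs the flow itself.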
m
So I started with my own task definition and runtime parameters but am now back to barebones where all values for the ECS run config and the ECS agent are default except for the name of my cluster and the launch type as fargate. Same issue still persists. Just verifying, the ECS agent does not need to be running on AWS correct? I have it running locally.
k
yeah that should be fine. I think it may be a memory issue and you need to up the resources.
Does your agent pick up the flow (on the local command line)?
https://aws.amazon.com/premiumsupport/knowledge-center/ecs-tasks-stuck-pending-state/ this says you may need to increase container instance as well
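Before raising resources, it can help to confirm that the CPU and memory pair is one Fargate accepts. The sketch below encodes the task-level combinations documented for Fargate up to 4 vCPU (CPU in units, memory in MiB); larger sizes on newer platform versions are not covered. The function name is made up for illustration, and passing this check only rules out an invalid size. It does not by itself explain a task stuck in PENDING.

```python
# Task-level CPU units mapped to the memory values (MiB) Fargate accepts for them.
FARGATE_MEMORY_BY_CPU = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}


def is_valid_fargate_size(cpu, memory):
    """Return True if the cpu/memory pair is a supported Fargate task size.

    Accepts ints or numeric strings, since task definitions store both as strings.
    """
    return int(memory) in FARGATE_MEMORY_BY_CPU.get(int(cpu), [])


if __name__ == "__main__":
    print(is_valid_fargate_size("1024", "2048"))
    print(is_valid_fargate_size(256, 4096))
```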
m
just maxed out CPU and Memory on the task definition and still having the issue.
[2021-04-07 20:23:19,677] INFO - agent | Waiting for flow runs...
[2021-04-07 20:24:05,617] INFO - agent | Found 1 flow run(s) to submit for execution.
[2021-04-07 20:24:05,712] INFO - agent | Deploying flow run '840c7de3-e8a0-4d28-892c-4f66f7f0ddbe'
Yep, the local agent is picking it up. I see my task on my ECS console; it is just stuck in the pending state.
Okay, I’ll give that a shot.
k
Ok, that’s good, so it’s a bit clearer that the problem is on the ECS side.
m
Yeah. My thought is, at the company I am at, we use Fargate quite a bit, so I’m pretty much re-using task definitions and runtime parameters from our repository here. That’s why I’m a bit confused. But just clarifying, if the agent sends the task to ECS, it’s gotta be on the ECS side? I’m a bit curious about the agent overwriting some fields for the task definition and runtime parameters. I notice sometimes the Prefect image gets used and not the one I specified, and the same for some other params.
k
I see. Yes, I’m more inclined to believe this is on the ECS side because the flow run was deployed. And yes, you can override some task definition parameters through the Prefect RunConfig. Check examples here. Maybe you want the last example?
I will mention though that debugging ECS is a pain point on our front. There was someone with different issues last Monday. This is on our radar and we intend to produce a minimum working example in a blog post in the next week.
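A hedged sketch of overriding task definition fields through the run config, assuming the Prefect 0.14.x `ECSRun` arguments `task_definition`, `image`, and `run_task_kwargs`. The helper names and the size defaults are illustrative only. The flow's container is named `flow`, matching the task definition Prefect generated later in this thread.

```python
def custom_task_definition(image, cpu="1024", memory="2048"):
    """Return a minimal Fargate task definition for a Prefect flow run."""
    return {
        "networkMode": "awsvpc",
        "requiresCompatibilities": ["FARGATE"],
        "cpu": cpu,
        "memory": memory,
        # The container that runs the flow is the one named "flow".
        "containerDefinitions": [{"name": "flow", "image": image}],
    }


def ecs_run_with_overrides(image, cluster):
    """Build an ECSRun that carries a custom task definition and target cluster."""
    # Imported lazily so custom_task_definition works without Prefect installed.
    from prefect.run_configs import ECSRun

    return ECSRun(
        image=image,
        task_definition=custom_task_definition(image),
        # Extra arguments forwarded to the ECS run_task call.
        run_task_kwargs={"cluster": cluster},
    )
```

Setting the image on the run config as well as in the task definition keeps the two from disagreeing about which image runs.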
m
Cool cool - well, thank you for the help, Kevin.
k
I will add it to my list to ping you when I finish that.
m
great thanks!
e
Hi @Kevin Kho, I work with Matthew. I also attempted a super simple flow without a custom task definition and got a similar issue.
I tried with Prefect Cloud instead of Prefect Server.
This was the task definition that Prefect generated:
{
  "ipcMode": null,
  "executionRoleArn": "arn:aws:iam::XXXXX:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "dnsSearchDomains": null,
      "environmentFiles": null,
      "logConfiguration": null,
      "entryPoint": null,
      "portMappings": [],
      "command": null,
      "linuxParameters": null,
      "cpu": 0,
      "environment": [
        {
          "name": "PREFECT__CONTEXT__IMAGE",
          "value": "prefecthq/prefect:latest-python3.6"
        }
      ],
      "resourceRequirements": null,
      "ulimits": null,
      "dnsServers": null,
      "mountPoints": [],
      "workingDirectory": null,
      "secrets": null,
      "dockerSecurityOptions": null,
      "memory": null,
      "memoryReservation": null,
      "volumesFrom": [],
      "stopTimeout": null,
      "image": "prefecthq/prefect:latest-python3.6",
      "startTimeout": null,
      "firelensConfiguration": null,
      "dependsOn": null,
      "disableNetworking": null,
      "interactive": null,
      "healthCheck": null,
      "essential": true,
      "links": null,
      "hostname": null,
      "extraHosts": null,
      "pseudoTerminal": null,
      "user": null,
      "readonlyRootFilesystem": null,
      "dockerLabels": null,
      "systemControls": null,
      "privileged": null,
      "name": "flow"
    }
  ],
  "placementConstraints": [],
  "memory": "2048",
  "taskRoleArn": null,
  "compatibilities": [
    "EC2",
    "FARGATE"
  ],
  "taskDefinitionArn": "arn:aws:ecs:us-west-2:XXXXXXX:task-definition/prefect-hello-flow:4",
  "family": "prefect-hello-flow",
  "requiresAttributes": [
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.task-eni"
    }
  ],
  "pidMode": null,
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "networkMode": "awsvpc",
  "cpu": "1024",
  "revision": 4,
  "status": "INACTIVE",
  "inferenceAccelerators": null,
  "proxyConfiguration": null,
  "volumes": [],
  "statusString": "(INACTIVE)"
}
I “xxx’d” out our aws account number.
It’s the default ecs task execution role in IAM:
prefect agent ecs start --cluster arn:aws:ecs:us-west-2:XXXXX:cluster/prefect-cluster --execution-role-arn arn:aws:iam::XXXXX:role/ecsTaskExecutionRole --label Raptor-Macbook-Raptor-2.local
And here’s how I started my ECS agent. ^ I think with those messages you have everything you need to help us debug.
I think it’s just throwing a “ModuleNotFoundError” the moment the container spins up, but I have no idea where that is getting set or where it needs to be overridden.
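When comparing task definitions like the one above, a small helper that pulls out the handful of fields that matter can save some scrolling. This is only a sketch: the function name and the choice of fields are assumptions, and the sample dictionary is a trimmed copy of the JSON posted in this thread.

```python
def summarize_task_definition(task_def):
    """Extract the fields most relevant when debugging a Fargate task definition."""
    containers = task_def.get("containerDefinitions", [])
    return {
        "family": task_def.get("family"),
        "status": task_def.get("status"),
        "launch_types": task_def.get("requiresCompatibilities", []),
        "cpu": task_def.get("cpu"),
        "memory": task_def.get("memory"),
        # Map each container name to the image it runs.
        "images": {c.get("name"): c.get("image") for c in containers},
        # Without a log configuration, container output is not shipped anywhere.
        "has_log_config": any(c.get("logConfiguration") for c in containers),
        "has_task_role": task_def.get("taskRoleArn") is not None,
    }


# Trimmed copy of the task definition from this thread.
sample = {
    "family": "prefect-hello-flow",
    "status": "INACTIVE",
    "requiresCompatibilities": ["FARGATE"],
    "cpu": "1024",
    "memory": "2048",
    "taskRoleArn": None,
    "containerDefinitions": [
        {
            "name": "flow",
            "image": "prefecthq/prefect:latest-python3.6",
            "logConfiguration": None,
        }
    ],
}

if __name__ == "__main__":
    for key, value in summarize_task_definition(sample).items():
        print(f"{key}: {value}")
```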
k
The error seems like it might be the Dockerfile. Could you share that?
e
So the only place I specify a Docker image is in my hello_flow.py file here:
flow.run_config = ECSRun(
    image="prefecthq/prefect:latest-python3.6")
^ and it’s the image from prefect. No modifications.
k
Ok I will look into this
e
Sweet. Thanks!
Here’s a better screenshot of Prefect Cloud. Realized that was a small screenshot I previously sent.
k
@Eddie Obropta, there is one more thing that will help me. Could you give me the output of prefect diagnostics?
I assume the agent is running on the same machine that registered the flow?
e
Yep, I registered the flow from my Mac and have the ECS agent running in another terminal window on my Mac:
$ prefect diagnostics
{
  "config_overrides": {
    "cloud": {
      "agent": {
        "auth_token": true
      }
    }
  },
  "env_vars": [],
  "system_information": {
    "platform": "Darwin-19.6.0-x86_64-i386-64bit",
    "prefect_backend": "cloud",
    "prefect_version": "0.14.15",
    "python_version": "3.6.8"
  }
}
k
thanks!
Hi @Eddie Obropta, the error we’re seeing here happens because there is a default storage called LocalStorage. By default, LocalStorage saves your serialized flow in Users/username/. If you combine LocalRun with LocalStorage, you’re telling Prefect that the serialized flow lives in that Users/username location. The agent will find that serialized flow, unpickle it, and run it.
In your case, the default LocalStorage is being used along with the ECSRun config. It downloads the image and tries to find that serialized flow inside the Docker container, but the flow doesn’t live there because it’s in LocalStorage on your machine. For your specific example, we need to configure a Storage that your Agent can reach. Since you’re already on AWS, we recommend S3 Storage. Specifying the S3 storage will store your script when you register the flow. Now your ECSRun will be able to grab that script correctly (the error is because it incorrectly tried to grab it from Users/username/).
I have spoken with the core team and I will be creating an issue to have more descriptive messages around the usage of the default LocalStorage.
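A minimal sketch of that change, assuming Prefect 0.14.x, where the S3 storage class is `prefect.storage.S3` and `ECSRun` accepts `image` and `task_role_arn`. The helper names are hypothetical, and the bucket is whatever S3 bucket you already have. The task role is an assumption about your AWS setup: the flow's container needs some way to read from the bucket, and the task definition shared earlier has no task role.

```python
def s3_and_ecs_kwargs(bucket, image, task_role_arn=None):
    """Collect the arguments for S3 storage and the ECS run config in one place."""
    run_kwargs = {"image": image}
    if task_role_arn is not None:
        # Role assumed by the flow's container, e.g. to read the flow from S3.
        run_kwargs["task_role_arn"] = task_role_arn
    return {"storage": {"bucket": bucket}, "run_config": run_kwargs}


def configure_flow_for_ecs(flow, bucket, image, task_role_arn=None):
    """Attach S3 storage and an ECS run config to an existing Prefect flow."""
    # Imported lazily so the helper above can be used without Prefect installed.
    from prefect.run_configs import ECSRun
    from prefect.storage import S3

    settings = s3_and_ecs_kwargs(bucket, image, task_role_arn)
    # The flow is uploaded to the bucket when flow.register(...) is called.
    flow.storage = S3(**settings["storage"])
    flow.run_config = ECSRun(**settings["run_config"])
    return flow
```

After this, registering the flow again uploads it to the bucket, and the ECS task downloads it from there instead of looking for a path that only exists on the machine that registered it.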
e
Hi Kevin! Sweet. We’ll give that a shot. Thanks for digging in.
k
And I haven’t forgotten about Matthew! Will ping him when I get a tutorial together 😅