# prefect-community
c
I'm running flows on ECS using Fargate; is it normal to see `RUNNING` tasks like those shown below which have been running for days? Are those truly still running? Am I being billed for idle compute here?
a
I think that the `RUNNING` state indeed indicates that those ECS tasks are still in progress. Can you cross-check and match the ECS tasks with the flow runs in your Prefect Cloud UI? Did the corresponding flow runs finish without any issues? If you are on Prefect Cloud, you can send us the flow run ID so that we can cross-check on our end as well. Regarding billing, you can attach cost allocation tags and then check in your AWS Billing dashboard exactly what you are billed for and how much. To attach tags, I think you would need to either modify the existing cluster or create a new one; with the CLI it can be done using:
```bash
aws ecs create-cluster --cluster-name prefectEcsCluster --tags key=keyname,value=actualValue
```
Then, you would need to use the --propagateTags flag when starting an ECS service for the agent.
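As a rough sketch of that setup with the AWS CLI (the cluster, service, tag names, and network IDs below are placeholders; the CLI spelling of the flag is `--propagate-tags`, and the tags only show up in billing reports once they are activated as cost allocation tags in the Billing console):
```bash
# Tag the cluster that the agent and flow runs will use (placeholder names).
aws ecs create-cluster \
  --cluster-name prefectEcsCluster \
  --tags key=project,value=prefect

# Start the agent as an ECS service and propagate tags from the task
# definition onto its running tasks so they can be tracked in cost reports.
aws ecs create-service \
  --cluster prefectEcsCluster \
  --service-name prefect-agent \
  --task-definition prefect-agent:1 \
  --desired-count 1 \
  --launch-type FARGATE \
  --propagate-tags TASK_DEFINITION \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-0123456789abcdef0],securityGroups=[sg-0123456789abcdef0],assignPublicIp=ENABLED}"

# Once the tags are activated as cost allocation tags, costs can be filtered
# by tag, e.g. with Cost Explorer (dates and tag values are placeholders).
aws ce get-cost-and-usage \
  --time-period Start=2022-02-01,End=2022-03-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --filter '{"Tags": {"Key": "project", "Values": ["prefect"]}}'
```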
also: can you send a summary of your setup? 1. Prefect Cloud or Server? 2. Do you use the Fargate or EC2 launch type? 3. How did you start the agent and the flow runs? 4. Any chance you can share one full flow definition - one of those that hangs in a running state on ECS?
c
1. Prefect Cloud 2. Fargate 3. we manage the agent through terraform
```json
{
  "ipcMode": null,
  "executionRoleArn": "arn:aws:iam::792470144447:role/prefect-ecs-execution-role",
  "containerDefinitions": [
    {
      "dnsSearchDomains": null,
      "environmentFiles": null,
      "logConfiguration": {
        "logDriver": "awslogs",
        "secretOptions": null,
        "options": {
          "awslogs-group": "/ecs/prefect-tasks",
          "awslogs-region": "us-west-2",
          "awslogs-stream-prefix": "constantino_schillebeeckx-salesforce_extract"
        }
      },
      "entryPoint": null,
      "portMappings": [],
      "command": null,
      "linuxParameters": null,
      "cpu": 0,
      "environment": [
        {
          "name": "PREFECT__CONTEXT__IMAGE",
          "value": "<http://792470144447.dkr.ecr.us-west-2.amazonaws.com/dwh:cleanup_iam|792470144447.dkr.ecr.us-west-2.amazonaws.com/dwh:cleanup_iam>"
        }
      ],
      "resourceRequirements": null,
      "ulimits": null,
      "dnsServers": null,
      "mountPoints": [],
      "workingDirectory": null,
      "secrets": null,
      "dockerSecurityOptions": null,
      "memory": null,
      "memoryReservation": null,
      "volumesFrom": [],
      "stopTimeout": null,
      "image": "<http://792470144447.dkr.ecr.us-west-2.amazonaws.com/dwh:cleanup_iam|792470144447.dkr.ecr.us-west-2.amazonaws.com/dwh:cleanup_iam>",
      "startTimeout": null,
      "firelensConfiguration": null,
      "dependsOn": null,
      "disableNetworking": null,
      "interactive": null,
      "healthCheck": null,
      "essential": true,
      "links": null,
      "hostname": null,
      "extraHosts": null,
      "pseudoTerminal": null,
      "user": null,
      "readonlyRootFilesystem": null,
      "dockerLabels": null,
      "systemControls": null,
      "privileged": null,
      "name": "flow"
    }
  ],
  "placementConstraints": [],
  "memory": "16384",
  "taskRoleArn": "arn:aws:iam::792470144447:role/prefect-ecs-task-role",
  "compatibilities": [
    "EC2",
    "FARGATE"
  ],
  "taskDefinitionArn": "arn:aws:ecs:us-west-2:792470144447:task-definition/prefect-salesforce-extract:88",
  "family": "prefect-salesforce-extract",
  "requiresAttributes": [
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.logging-driver.awslogs"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.execution-role-awslogs"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.ecr-auth"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.task-iam-role"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.execution-role-ecr-pull"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.task-eni"
    }
  ],
  "pidMode": null,
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "networkMode": "awsvpc",
  "runtimePlatform": null,
  "cpu": "2048",
  "revision": 88,
  "status": "INACTIVE",
  "inferenceAccelerators": null,
  "proxyConfiguration": null,
  "volumes": [],
  "statusString": "(INACTIVE)"
}
```
Note we're using `prefecthq/prefect:0.15.13-python3.8` for the agent.
👍 1
```
PREFECT__CONTEXT__FLOW_ID      e9aaf08c-4beb-4b36-b40d-0a73700f03e7
PREFECT__CONTEXT__FLOW_RUN_ID  352ba0ba-2ea8-4ec2-8acb-fad120376b8d
```
for the above definition
uh oh
a
Do you happen to know why someone tried to cancel this flow run? It looks like someone or some process tried to cancel it but it didn't work - the flow run stayed in a Cancelling state and it still doesn't have an end time... Something went wrong here for sure, and good catch that you found it now rather than after months.
c
I just cancelled it because it had been running for 13 days (as shown above)
👍 1
a
maybe you could cancel those runs manually, and in the worst case set the state to Cancelled via the API and manually stop those zombie ECS tasks. And to avoid this in the future, maybe you can add an Automation to automatically cancel a flow run if it doesn't finish within X time (the max duration of your normal flow run, e.g. 4 hours)
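For the manual cleanup part, a minimal AWS CLI sketch for finding and stopping the lingering ECS tasks might look like the following (cluster name and task ARN are placeholders):
```bash
# List tasks still reported as RUNNING on the cluster (placeholder name).
aws ecs list-tasks \
  --cluster prefectEcsCluster \
  --desired-status RUNNING

# Inspect when each task started, to spot the ones running for days.
aws ecs describe-tasks \
  --cluster prefectEcsCluster \
  --tasks arn:aws:ecs:us-west-2:123456789012:task/prefectEcsCluster/abcdef1234567890 \
  --query 'tasks[].{arn:taskArn,started:startedAt,status:lastStatus}'

# Stop a zombie task once its flow run has been marked Cancelled.
aws ecs stop-task \
  --cluster prefectEcsCluster \
  --task arn:aws:ecs:us-west-2:123456789012:task/prefectEcsCluster/abcdef1234567890 \
  --reason "Prefect flow run cancelled manually"
```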
oh sorry, I must have been misled because the timestamp of the Cancelling state is the 8th of February rather than today
c
I was just gonna ask if there's a way to configure the max run time of a flow. What type of automation are you suggesting? Another flow that checks in on ECS?
a
We have flow SLA failure automations that allow you to cancel a flow run if it doesn't finish within e.g. 4 hours: https://docs.prefect.io/orchestration/concepts/automations.html#flow-sla-failure
so it looks like only one of the mapped tasks got stuck for some reason
but you need to configure such an SLA Automation for each flow separately; there's no way to set it once for all flows
c
sadness - ok thanks for all the help - I'll have to build around this
a
Understandable, sorry to hear about this issue and good that you found it!
🙌 1
k
Was this really running for 13 days? I think you can check for open database connections, because those tend to keep containers running even after flow execution. But normally it would be completed on the Prefect end. This looks like there was some activity; otherwise Prefect would mark it as failed (no heartbeat).
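As a hypothetical illustration of that check, if ECS Exec were enabled on the task (the task definition above does not enable it, so this is purely a sketch), you could open a shell in the still-running container and look for established database connections:
```bash
# Open a shell inside the running flow container. Requires the task to be
# launched with --enable-execute-command, the task role to allow SSM
# messages, and the Session Manager plugin installed locally. The container
# name "flow" matches the task definition above; cluster and task ARN are
# placeholders.
aws ecs execute-command \
  --cluster prefectEcsCluster \
  --task arn:aws:ecs:us-west-2:123456789012:task/prefectEcsCluster/abcdef1234567890 \
  --container flow \
  --interactive \
  --command "/bin/sh"

# Inside the container (if ss is available), list established TCP connections,
# e.g. a lingering connection to a database on port 5432.
ss -tnp state established
```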