Thread
#prefect-community
    Constantino Schillebeeckx
    7 months ago
    I'm running flows on ECS using Fargate; is it normal to see RUNNING tasks like those shown below, which have been running for days? Are those truly still running? Am I being billed for idle compute here?
    Anna Geller
    7 months ago
    I think that RUNNING state indeed indicates that those ECS tasks are still in progress. Can you cross-check and match the ECS tasks with the flow runs in your Prefect Cloud UI? Did the corresponding flow runs finish without any issues? If you are on Prefect Cloud, you can send us the flow run ID so that we can cross-check on our end as well. Regarding billing, you can attach cost allocation tags and then check in your AWS Billing dashboard what exactly you are billed for and how much. To attach tags, I think you would need to either modify the existing cluster or create a new one; with the CLI it can be done using:
    aws ecs create-cluster --cluster-name prefectEcsCluster --tags key=keyname,value=actualValue
    Then, you would need to propagate those tags (the --propagate-tags flag in the AWS CLI) when starting the ECS service for the agent.
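    Put together, attaching and propagating tags with the AWS CLI might look like the sketch below (cluster name and tag values are placeholders, and the task definition name is the one from later in this thread):

    ```shell
    # Tag the cluster so its costs can be filtered in the AWS Billing dashboard.
    # Cluster name and tag key/value below are placeholders.
    aws ecs create-cluster \
        --cluster-name prefectEcsCluster \
        --tags key=team,value=data-engineering

    # When launching a task (or the agent's service), propagate tags from the
    # task definition so per-task costs show up under the same tags.
    aws ecs run-task \
        --cluster prefectEcsCluster \
        --task-definition prefect-salesforce-extract:88 \
        --propagate-tags TASK_DEFINITION
    ```
    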
    also: can you send a summary of your setup? 1. Prefect Cloud or Server? 2. Do you use the Fargate or EC2 launch type? 3. How did you start the agent and the flow runs? 4. Any chance you can share the full flow definition of one of the flows that hangs in a Running state on ECS?
    Constantino Schillebeeckx
    7 months ago
    1. Prefect Cloud 2. Fargate 3. we manage the agent through Terraform
    {
      "ipcMode": null,
      "executionRoleArn": "arn:aws:iam::792470144447:role/prefect-ecs-execution-role",
      "containerDefinitions": [
        {
          "dnsSearchDomains": null,
          "environmentFiles": null,
          "logConfiguration": {
            "logDriver": "awslogs",
            "secretOptions": null,
            "options": {
              "awslogs-group": "/ecs/prefect-tasks",
              "awslogs-region": "us-west-2",
              "awslogs-stream-prefix": "constantino_schillebeeckx-salesforce_extract"
            }
          },
          "entryPoint": null,
          "portMappings": [],
          "command": null,
          "linuxParameters": null,
          "cpu": 0,
          "environment": [
            {
              "name": "PREFECT__CONTEXT__IMAGE",
              "value": "792470144447.dkr.ecr.us-west-2.amazonaws.com/dwh:cleanup_iam"
            }
          ],
          "resourceRequirements": null,
          "ulimits": null,
          "dnsServers": null,
          "mountPoints": [],
          "workingDirectory": null,
          "secrets": null,
          "dockerSecurityOptions": null,
          "memory": null,
          "memoryReservation": null,
          "volumesFrom": [],
          "stopTimeout": null,
          "image": "792470144447.dkr.ecr.us-west-2.amazonaws.com/dwh:cleanup_iam",
          "startTimeout": null,
          "firelensConfiguration": null,
          "dependsOn": null,
          "disableNetworking": null,
          "interactive": null,
          "healthCheck": null,
          "essential": true,
          "links": null,
          "hostname": null,
          "extraHosts": null,
          "pseudoTerminal": null,
          "user": null,
          "readonlyRootFilesystem": null,
          "dockerLabels": null,
          "systemControls": null,
          "privileged": null,
          "name": "flow"
        }
      ],
      "placementConstraints": [],
      "memory": "16384",
      "taskRoleArn": "arn:aws:iam::792470144447:role/prefect-ecs-task-role",
      "compatibilities": [
        "EC2",
        "FARGATE"
      ],
      "taskDefinitionArn": "arn:aws:ecs:us-west-2:792470144447:task-definition/prefect-salesforce-extract:88",
      "family": "prefect-salesforce-extract",
      "requiresAttributes": [
        {
          "targetId": null,
          "targetType": null,
          "value": null,
          "name": "com.amazonaws.ecs.capability.logging-driver.awslogs"
        },
        {
          "targetId": null,
          "targetType": null,
          "value": null,
          "name": "ecs.capability.execution-role-awslogs"
        },
        {
          "targetId": null,
          "targetType": null,
          "value": null,
          "name": "com.amazonaws.ecs.capability.ecr-auth"
        },
        {
          "targetId": null,
          "targetType": null,
          "value": null,
          "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
        },
        {
          "targetId": null,
          "targetType": null,
          "value": null,
          "name": "com.amazonaws.ecs.capability.task-iam-role"
        },
        {
          "targetId": null,
          "targetType": null,
          "value": null,
          "name": "ecs.capability.execution-role-ecr-pull"
        },
        {
          "targetId": null,
          "targetType": null,
          "value": null,
          "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
        },
        {
          "targetId": null,
          "targetType": null,
          "value": null,
          "name": "ecs.capability.task-eni"
        }
      ],
      "pidMode": null,
      "requiresCompatibilities": [
        "FARGATE"
      ],
      "networkMode": "awsvpc",
      "runtimePlatform": null,
      "cpu": "2048",
      "revision": 88,
      "status": "INACTIVE",
      "inferenceAccelerators": null,
      "proxyConfiguration": null,
      "volumes": [],
      "statusString": "(INACTIVE)"
    }
    Note we're using prefecthq/prefect:0.15.13-python3.8 for the agent
    PREFECT__CONTEXT__FLOW_ID	e9aaf08c-4beb-4b36-b40d-0a73700f03e7
    PREFECT__CONTEXT__FLOW_RUN_ID	352ba0ba-2ea8-4ec2-8acb-fad120376b8d
    for the above definition
    uh-oh
    Anna Geller
    7 months ago
    do you happen to know why someone tried to cancel this flow run? For some reason, it looks like someone or some process tried to cancel it but it didn't work - the flow run stayed in a Cancelling state and still doesn't have an end time... Something went wrong here for sure, and good catch that you found it now rather than after months.
    Constantino Schillebeeckx
    7 months ago
    I just cancelled it because it had been running for 13 days (as shown above)
    Anna Geller
    7 months ago
    Maybe you could cancel those runs manually and, in the worst case, set the state to Cancelled via the API and manually stop those zombie ECS tasks. And to avoid this in the future, you could add an Automation to automatically cancel a flow run if it doesn't finish within X time (the max duration of your normal flow run, e.g. 4 hours)
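    Setting the state via the API could be sketched like this. This stdlib-only sketch just builds the request body rather than sending it; the set_flow_run_states mutation shape is an assumption based on Prefect 0.x's GraphQL schema (verify against your API version), and the flow run ID is the one shared earlier in this thread:

    ```python
    import json

    # Prefect Cloud's (0.x) GraphQL endpoint; requests are sent here with an
    # Authorization header carrying an API token.
    API_URL = "https://api.prefect.io"

    def build_cancel_payload(flow_run_id: str, version: int) -> dict:
        """Build a GraphQL request body that sets a flow run to Cancelled.

        The mutation name and input shape below are assumptions based on the
        Prefect Server 0.x schema - double-check them before relying on this.
        """
        mutation = """
        mutation($input: set_flow_run_states_input!) {
          set_flow_run_states(input: $input) {
            states { id status message }
          }
        }
        """
        state = {
            "type": "Cancelled",
            "message": "Cancelled manually: exceeded max expected runtime",
        }
        return {
            "query": mutation,
            "variables": {
                "input": {
                    "states": [
                        {
                            "flow_run_id": flow_run_id,
                            "version": version,
                            "state": state,
                        }
                    ]
                }
            },
        }

    payload = build_cancel_payload("352ba0ba-2ea8-4ec2-8acb-fad120376b8d", version=3)
    # Send with your auth token, e.g.:
    # requests.post(API_URL, json=payload, headers={"Authorization": f"Bearer {token}"})
    body = json.dumps(payload)
    ```

    You would still need to stop the zombie ECS task itself (e.g. `aws ecs stop-task`), since marking the run Cancelled in Prefect does not kill the container.
    
    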
    oh sorry, I must have been misled because the timestamp of the Cancelling state is 8th of February rather than today
    Constantino Schillebeeckx
    7 months ago
    I was just gonna ask if there's a way to configure the max run time of a flow. What type of automation are you suggesting? Another flow that checks in on ECS?
    Anna Geller
    7 months ago
    we have flow SLA failure automations that allow you to cancel a flow run if it didn't finish within e.g. 4 hours https://docs.prefect.io/orchestration/concepts/automations.html#flow-sla-failure
    so it looks like only one of the mapped tasks got stuck for some reason
    but you need to configure such an Automation SLA for each flow run separately; there's no way to set it once for all flows
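    Since there is no single global setting, one generic fallback is a small watchdog that flags any run still in a Running state past a maximum duration. A minimal stdlib sketch (the run data and the 4-hour limit are illustrative; in a real watchdog the runs would come from the Prefect API):

    ```python
    from datetime import datetime, timedelta, timezone

    # Illustrative limit, echoing the 4-hour example from the thread.
    MAX_RUNTIME = timedelta(hours=4)

    def find_stuck_runs(runs, now=None):
        """Return IDs of runs still Running past the maximum allowed duration.

        `runs` is an iterable of dicts with 'id', 'state', and 'start_time'
        (timezone-aware datetimes), supplied by the caller.
        """
        now = now or datetime.now(timezone.utc)
        return [
            r["id"]
            for r in runs
            if r["state"] == "Running" and now - r["start_time"] > MAX_RUNTIME
        ]

    # Example data: one healthy run, one that has been running for 13 days.
    now = datetime(2022, 2, 21, tzinfo=timezone.utc)
    runs = [
        {"id": "ok-run", "state": "Running", "start_time": now - timedelta(hours=1)},
        {"id": "zombie-run", "state": "Running", "start_time": now - timedelta(days=13)},
    ]
    stuck = find_stuck_runs(runs, now=now)
    print(stuck)  # -> ['zombie-run']
    ```

    Each flagged ID could then be cancelled via the API and its ECS task stopped, which is essentially the "build around this" Constantino mentions below.
    
    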
    Constantino Schillebeeckx
    7 months ago
    sadness - ok thanks for all the help - I'll have to build around this
    Anna Geller
    7 months ago
    Understandable, sorry to hear about this issue and good that you found it!
    Kevin Kho
    7 months ago
    Was this really running for 13 days? I think you can check for open database connections, because those tend to keep containers running even after flow execution. But normally the flow run would be marked completed on the Prefect end. This looks like there was some activity; otherwise Prefect would have marked it as failed (no heartbeat)
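    The connection-cleanup idea can be sketched generically; sqlite3 stands in here for whatever database the flow actually talks to, and the point is that `closing(...)` guarantees the connection is released even if the task body raises:

    ```python
    import sqlite3
    from contextlib import closing

    # A connection that is never closed can keep a container's process alive
    # after the flow body finishes. Wrapping it in closing() guarantees
    # close() runs even on error, so nothing lingers when the task exits.
    def run_query(db_path: str) -> int:
        with closing(sqlite3.connect(db_path)) as conn:
            with conn:  # commits on success, rolls back on exception
                conn.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
                conn.execute("INSERT INTO t (x) VALUES (1)")
            (count,) = conn.execute("SELECT COUNT(*) FROM t").fetchone()
        return count  # connection is closed by the time we return

    result = run_query(":memory:")
    print(result)  # -> 1
    ```
    
    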