# ask-community
Mark McDonald:
Hi - we have a flow that occasionally gets stuck in a running state until we manually cancel it. We tried using the automations feature, but we encountered this situation last night and it didn't work
the automation should cancel the flow if it does not finish after 65 minutes
as you can see, it ran past 65 minutes, so we manually cancelled it - the automation didn't seem to do anything
Kevin Kho:
Hey @Mark McDonald, is the flow expected to get stuck in a running state sometimes?
About the automation, I’ll bring that up to the team
Mark McDonald:
thanks @Kevin Kho - it's not expected to get stuck in a running state. I wish I could explain why it happens. I run this flow every hour, 24 hours a day, 7 days a week. About once or twice a week, a flow run just seems to get stuck in the running state, inexplicably.
Kevin Kho:
I see this happen when Dask itself freezes due to being resource constrained. Any signs of that going on?
Mark McDonald:
interesting - from our container insights, it looks like there is plenty of available cpu and memory. I will try to dig into this deeper today
Zanie:
Hey @Mark McDonald -- Could you do me a favor and give me the flow run id that was not cancelled, and then in the Interactive API, query for the hook and share the one for that automation? e.g.
query {
  hook {
    action_id
    id
    event_tags
    event_type
  }
}
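For anyone following along, the same hook query can also be issued from Python with the Prefect 1.x client (a sketch, assuming Prefect Cloud credentials are already configured; the `extract_hooks` helper is illustrative, not part of Prefect):

```python
# Sketch: issue the hook query and pull the hook records out of the
# GraphQL response. `prefect.Client.graphql` is the Prefect 1.x client
# method; the extract_hooks helper below is a hypothetical convenience.

HOOK_QUERY = """
query {
  hook {
    action_id
    id
    event_tags
    event_type
  }
}
"""

def fetch_hooks():
    # Requires `pip install prefect` (1.x) and a configured API key;
    # not executed here so the sketch stays self-contained.
    from prefect import Client
    return Client().graphql(HOOK_QUERY)

def extract_hooks(response: dict) -> list:
    """Return the list of hook records from a GraphQL response dict."""
    return response.get("data", {}).get("hook", [])

# Example with a canned response shaped like the one in this thread:
sample = {"data": {"hook": [{"id": "abc", "event_type": "FlowSLAFailedEvent"}]}}
print(extract_hooks(sample)[0]["event_type"])  # -> FlowSLAFailedEvent
```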
Mark McDonald:
sure - @Zanie this is the flow run that was not cancelled by the automation 7f066f3b-7e0e-453a-a925-6a5f5d1ee485
this is the response for the automation
{
  "data": {
    "hook": [
      {
        "action_id": "e9e1ef7a-3b1a-4526-88de-c4318021194a",
        "id": "066c67bb-f3cf-465a-93e0-2fa39e37a2e7",
        "event_tags": {
          "flow_sla_config_id": [
            "24426ad8-c36a-41f0-a97f-3e0f1ea25efe"
          ]
        },
        "event_type": "FlowSLAFailedEvent"
      }
    ]
  }
}
Zanie:
Great, thanks - I'll look into some logs and get back to you
I'm continuing to investigate this, just fyi
Hey @Mark McDonald -- just to confirm, this run started after you created the automation right?
Could you also show me:
query {
  flow_sla_config {
    id
    kind
    flow_groups {
      id
    }
    duration_seconds
  }
}
Mark McDonald:
@Zanie confirmed
{
  "data": {
    "flow_sla_config": [
      {
        "id": "4766b0e8-f8bb-46b3-9bbe-2aac4ec14082",
        "kind": "STARTED_NOT_FINISHED",
        "flow_groups": [
          {
            "id": "0fb8a078-86f5-4812-b7df-f86d462feb9d"
          }
        ],
        "duration_seconds": 3600
      },
      {
        "id": "24426ad8-c36a-41f0-a97f-3e0f1ea25efe",
        "kind": "STARTED_NOT_FINISHED",
        "flow_groups": [
          {
            "id": "49876534-8f63-45e6-96cd-b09ba1344fc8"
          }
        ],
        "duration_seconds": 3900
      }
    ]
  }
}
I think the one with a duration of 3600 is what I initially created, and then I updated it to a duration of 3900 (3900 seconds = the 65 minutes above)
again, this was all done before the flow that should have been cancelled ran
Zanie:
Hey @Mark McDonald -- so it looks like the flow group 49876534-8f63-45e6-96cd-b09ba1344fc8 does not exist, which would be why your SLA was not enforced. Your flow run belongs to the flow group 22f322dd-0201-4769-9246-2a1b6551527c -- did you delete the flow group after creating the automation and register a new one?
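The diagnosis here can be checked mechanically by joining the two query results: the hook's `flow_sla_config_id` points at an SLA config, and that config's `flow_groups` must contain the flow run's flow group for the SLA to fire. A minimal sketch (the `sla_covers_run` helper is hypothetical; the ids are the ones from this thread):

```python
# Cross-check an automation hook against the flow run it should cover.
# If the referenced SLA config's flow_groups do not contain the run's
# flow group, the SLA will never fire for that run.

def sla_covers_run(hook: dict, sla_configs: list, run_flow_group_id: str) -> bool:
    config_ids = hook["event_tags"].get("flow_sla_config_id", [])
    for cfg in sla_configs:
        if cfg["id"] in config_ids:
            group_ids = {g["id"] for g in cfg["flow_groups"]}
            if run_flow_group_id in group_ids:
                return True
    return False

# Data as reported in this thread:
hook = {
    "event_tags": {"flow_sla_config_id": ["24426ad8-c36a-41f0-a97f-3e0f1ea25efe"]},
}
sla_configs = [
    {"id": "24426ad8-c36a-41f0-a97f-3e0f1ea25efe",
     "flow_groups": [{"id": "49876534-8f63-45e6-96cd-b09ba1344fc8"}]},
]
# The stuck run actually belonged to this flow group:
run_flow_group = "22f322dd-0201-4769-9246-2a1b6551527c"
print(sla_covers_run(hook, sla_configs, run_flow_group))  # -> False
```

A `False` here reproduces the mismatch: the SLA config was pointing at a flow group in a different project than the run that got stuck.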
Mark McDonald:
so - that's an interesting point, we have multiple projects for our different environments. We have a uat and a prod project that have separate flows with the same name
so, because the UI doesn't allow you to select the project, I assumed you were querying based on flow name
afaik, we never deleted a flow group for this particular flow. We redeploy the flow approximately 1x per week, but the flow group never changes
ahh - ok, I think I had the wrong project @Zanie - sorry about that. I didn't see the "project name" in the top right corner when I set this up
consider this issue closed - I will let you know if the automation is successful the next time this flow runs past the SLA
thank you for the help, and apologies for the false alarm
Zanie:
No problem! Glad we got it sorted out.