andrr
03/22/2022, 12:50 PM
• The pod is stuck in the Running state, and the last message in its logs is DEBUG - prefect.CloudFlowRunner | Checking flow run state...
• The flow in Prefect Cloud is stuck in the Cancelling state and the pod is stuck in the Running state in the Kubernetes cluster.
Context:
• Prefect version 0.15.13
• Private Azure AKS cluster
• We've tried setting PREFECT__CLOUD__HEARTBEAT_MODE to "thread", but that only made things worse (more pods stuck in the Running state). Now we have PREFECT__CLOUD__HEARTBEAT_MODE set to "process" and use tini -- prefect execute flow-run as PID 1 to handle zombie processes.
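For context, a minimal sketch of how such a heartbeat setting is commonly applied to a Prefect 0.15.x flow, assuming a KubernetesRun run config (confirmed later in the thread); the flow name and image are illustrative, and the tini entrypoint lives in that image rather than in this snippet:

from prefect import Flow
from prefect.run_configs import KubernetesRun

with Flow("db-transfer") as flow:  # hypothetical flow name
    ...

# The heartbeat runs as a subprocess of "prefect execute flow-run";
# the image's entrypoint ("tini --") reaps any zombie children.
flow.run_config = KubernetesRun(
    image="myregistry.azurecr.io/flows:latest",  # assumed image name
    env={"PREFECT__CLOUD__HEARTBEAT_MODE": "process"},
)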
It seems like the problem is that the heartbeat process does not detect the flow's transition to the Cancelling or Cancelled state.
I appreciate any help, thanks!
Kevin Kho
andrr
03/22/2022, 2:27 PM
The pod stays in the Running state. I have to inspect the pod with the kubectl describe pod command, read the PREFECT__CONTEXT__FLOW_RUN_ID env variable, look that ID up in the Prefect Cloud interface, confirm that the flow run is in the Cancelling or Cancelled state, and then delete the pod manually.
andrr
03/22/2022, 2:35 PM
kubectl get pod prefect-job-27524014--1-7zddl
NAME READY STATUS RESTARTS AGE
prefect-job-27524014--1-7zddl 1/1 Running 0 150m
I see the following lines in the logs of that pod:
kubectl logs prefect-job-27524014--1-7zddl
...
[2022-03-22 12:01:52+0000] DEBUG - prefect.CloudFlowRunner | Checking flow run state...
The flow has the Cancelled state in Prefect Cloud:
[22 March 2022 2:15pm]: Andrr marked flow run as Cancelled
The pod has been stuck in the Running state for more than 2 hours:
kubectl exec prefect-job-27524014--1-7zddl -- ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 2300 528 ? Ss 12:01 0:00 tini -- prefect execute flow-run
root 15 0.2 0.4 1261416 139424 ? Sl 12:01 0:21 /usr/local/bin/python /usr/local/bin/prefect execute flow-run
root 18 0.1 0.1 139372 50532 ? Sl 12:01 0:11 /usr/local/bin/python -m prefect heartbeat flow-run -i 5b47d02d-xxx-xxx-0303a69f8824
root 41 0.0 0.0 8584 3240 ? Rs 14:39 0:00 ps aux
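As a side note, the manual check described at 2:27 PM can be scripted; this is only a sketch that uses kubectl's jsonpath output instead of describe, with the pod name taken from the example above:

POD=prefect-job-27524014--1-7zddl
# Read the flow run ID from the pod's environment
kubectl get pod "$POD" -o jsonpath='{.spec.containers[0].env[?(@.name=="PREFECT__CONTEXT__FLOW_RUN_ID")].value}'
# Look that ID up in Prefect Cloud; if the run is already Cancelling/Cancelled,
# remove the stuck pod by hand
kubectl delete pod "$POD"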
andrr
03/22/2022, 2:36 PM
DEBUG - prefect.CloudFlowRunner | Checking flow run state...
Anna Geller
andrr
03/22/2022, 3:24 PM
Does "No heartbeat detected from the remote task" mean the same thing as prefect.CloudFlowRunner | Checking flow run state...? Also, we do not have out-of-memory issues.
"1) Setting heartbeat mode to threads"
We've tried this, but after that every manually cancelled flow just got stuck in the Running state. Then we switched back to the process heartbeat mode, and now only some of the cancelled flows' pods get stuck in the Running state, not all of them.
Also, we use KubernetesRun instead of UniversalRun; could that be a major factor?
andrr
03/22/2022, 3:34 PM
"Are you on Cloud or Server?"
We use Cloud.
"If you are on Cloud, can you send us the flow run ID?"
5b47d02d-1eb6-4c4b-9167-0303a69f8824
"Any chance you could send us your flow or just explain what your flow is doing?"
Inside the flow we transfer data between databases, with intermediate saving to a temporary file: long-running SQL statements executed using snowflake.connector.
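To illustrate the kind of work described here, a rough, hypothetical sketch of such a task; the connection parameters, query handling, and function name are made up rather than taken from the actual flow:

import csv
import tempfile

import snowflake.connector
from prefect import task

@task
def extract_to_temp_file(query: str) -> str:
    # Run a long SQL statement in Snowflake and stage the rows in a temporary file
    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="***"  # assumed credentials
    )
    try:
        cursor = conn.cursor()
        cursor.execute(query)  # long-running SQL statement
        with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
            writer = csv.writer(tmp)
            for row in cursor:
                writer.writerow(row)
            return tmp.name  # a downstream task loads this file into the target database
    finally:
        conn.close()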
andrr
03/22/2022, 3:42 PM
All tasks in the flow are plain Python functions defined with def and decorated with @task (no ShellTask or DatabricksSubmitRun). Example:
from prefect import task

@task
def example():
    ...  # task body omitted
Kevin Kho
andrr
03/22/2022, 4:30 PM
The Cancelled state was triggered manually by us. I'm not sure about the Cancelling state.
andrr
03/22/2022, 4:35 PM
Isn't the Cancelled state part of the global Finished state (docs)? The expected behavior is that the heartbeat process detects that the flow was marked as cancelled and shuts down the pod. Or maybe I'm missing something?
Anna Geller
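For what it's worth, that can be checked directly against the Prefect 0.15.x state classes; a tiny sketch (the message string is arbitrary):

from prefect.engine.state import Cancelled, Finished

# Cancelled is a subclass of Finished, so a run marked Cancelled counts as finished
print(issubclass(Cancelled, Finished))               # expected: True
print(Cancelled("cancelled by user").is_finished())  # expected: True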
andrr
03/22/2022, 5:07 PM
Anna Geller
andrr
03/22/2022, 5:11 PM
andrr
03/23/2022, 9:38 AM
"The ideal end goal would be to not have to Cancel this flow run and make sure it either completes successfully or fails."
Can we mark a flow that runs for more than X minutes (the time this flow is expected to take) as Failed? Right now we can only mark it with the Cancelling state.
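One possible direction (an assumption, not something confirmed in this thread): Prefect 0.15.x tasks accept a timeout in seconds, after which the task is failed, so a run that overshoots its expected duration can end in a Failed state instead of waiting for a manual cancel. A minimal sketch with an illustrative limit:

from prefect import task

@task(timeout=60 * 60)  # fail this task if it runs for more than 1 hour (illustrative value)
def long_running_transfer():
    ...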