
andrr

03/22/2022, 12:50 PM
Hey all, 👋 We face several problems with flows that run in the Kubernetes cluster. • Pods often get stuck in the
Running
state with the last message in logs
DEBUG - prefect.CloudFlowRunner | Checking flow run state...
• The flow in Prefect Cloud gets stuck in the
Cancelling
state and the pod gets stuck in the
Running
state in the Kubernetes cluster. Context: • Prefect version
0.15.13
• Private Azure AKS cluster • We've tried setting
PREFECT__CLOUD__HEARTBEAT_MODE
to
"thread"
, but it only got worse (more pods stuck in the
Running
state). Now we have
PREFECT__CLOUD__HEARTBEAT_MODE
with
"process"
value and
tini -- prefect execute flow-run
as PID 1 to handle zombie processes. It seems the problem is with the heartbeat process detecting the change to
Cancelling
or
Cancelled
states of the flow. I appreciate any help, thanks 🙂

Kevin Kho

03/22/2022, 2:22 PM
Hey @andrr, is the pod actually dying here?

andrr

03/22/2022, 2:27 PM
Hey, no, the pod is stuck in the
Running
state. I have to inspect the pod with the
kubectl describe pod
command and check the
PREFECT__CONTEXT__FLOW_RUN_ID
env variable. Then I look up this ID in the Prefect Cloud UI, confirm that the flow run is in the
Cancelling
or
Cancelled
state, and then delete the pod manually.
Here is an example that I have right now:
kubectl get pod prefect-job-27524014--1-7zddl
NAME                            READY   STATUS    RESTARTS   AGE
prefect-job-27524014--1-7zddl   1/1     Running   0          150m
I see the following lines in the logs of that pod:
kubectl logs prefect-job-27524014--1-7zddl
...
[2022-03-22 12:01:52+0000] DEBUG - prefect.CloudFlowRunner | Checking flow run state...
The flow has the
Cancelled
state in Prefect Cloud:
[22 March 2022 2:15pm]: Andrr marked flow run as Cancelled
The pod has been stuck in the
Running
state for more than 2 hours:
kubectl exec prefect-job-27524014--1-7zddl -- ps aux

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   2300   528 ?        Ss   12:01   0:00 tini -- prefect execute flow-run
root        15  0.2  0.4 1261416 139424 ?      Sl   12:01   0:21 /usr/local/bin/python /usr/local/bin/prefect execute flow-run
root        18  0.1  0.1 139372 50532 ?        Sl   12:01   0:11 /usr/local/bin/python -m prefect heartbeat flow-run -i 5b47d02d-xxx-xxx-0303a69f8824
root        41  0.0  0.0   8584  3240 ?        Rs   14:39   0:00 ps aux
This is the last line in the pod log:
DEBUG - prefect.CloudFlowRunner | Checking flow run state...
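The manual check described above (read `PREFECT__CONTEXT__FLOW_RUN_ID` from the pod, compare it with the cancelled runs in Prefect Cloud, then delete the pod) could be scripted roughly as follows. This is only a sketch: it assumes the pod list is available as the JSON produced by `kubectl get pods -o json`, the set of cancelled run IDs is supplied by hand, and the function names and the sample run ID are hypothetical.

```python
import json
from typing import List, Optional, Set

FLOW_RUN_ID_VAR = "PREFECT__CONTEXT__FLOW_RUN_ID"


def flow_run_id_from_pod(pod: dict) -> Optional[str]:
    """Return the flow run ID found in the pod's container env, if any."""
    for container in pod.get("spec", {}).get("containers", []):
        for env in container.get("env", []):
            if env.get("name") == FLOW_RUN_ID_VAR:
                return env.get("value")
    return None


def pods_safe_to_delete(pod_list: dict, cancelled_run_ids: Set[str]) -> List[str]:
    """Names of pods still in the Running phase whose flow run is cancelled."""
    names = []
    for pod in pod_list.get("items", []):
        if pod.get("status", {}).get("phase") != "Running":
            continue
        if flow_run_id_from_pod(pod) in cancelled_run_ids:
            names.append(pod["metadata"]["name"])
    return names


# Hand-written sample in the shape of `kubectl get pods -o json`;
# the run ID value is a placeholder, not a real one.
SAMPLE_POD_LIST = json.loads("""
{
  "items": [
    {
      "metadata": {"name": "prefect-job-27524014--1-7zddl"},
      "status": {"phase": "Running"},
      "spec": {"containers": [{"env": [
        {"name": "PREFECT__CONTEXT__FLOW_RUN_ID", "value": "example-run-id"}
      ]}]}
    }
  ]
}
""")
```

Calling `pods_safe_to_delete(SAMPLE_POD_LIST, {"example-run-id"})` returns the names that could then be passed to `kubectl delete pod`. Keeping the deletion itself as a separate manual step seems safer while the root cause is still unknown.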

Anna Geller

03/22/2022, 3:08 PM
This Discourse topic explains the problem a bit more and provides some solutions you may try: https://discourse.prefect.io/t/flow-is-failing-with-an-error-message-no-heartbeat-detected-from-the-remote-task/79
1. Are you on Cloud or Server?
2. If you are on Cloud, can you send us the flow run ID?
3. Any chance you could send us your flow or just explain what your flow is doing? Is it doing work in some external service such as AWS Batch or Databricks?

andrr

03/22/2022, 3:24 PM
Hey Anna, thanks for the reply! I've checked the provided link. Does the error from the link
No heartbeat detected from the remote task
mean the same as
prefect.CloudFlowRunner | Checking flow run state...
? Also, we do not have any out-of-memory issues. 1) Setting heartbeat mode to threads: we've tried this, but after that every manually cancelled flow run got stuck in the
Running
state. Then we switched back to
process
heartbeat mode and now only some of the cancelled flows' pods get stuck in the
Running
state, not all of them. Also, we use
KubernetesRun
instead of
UniversalRun
, could that be a significant factor?
Are you on Cloud or Server?
We use Cloud
If you are on Cloud, can you send us the flow run ID?
5b47d02d-1eb6-4c4b-9167-0303a69f8824
Any chance you could send us your flow or just explain what your flow is doing?
Inside the flow we transfer data between databases, saving it to an intermediate temporary file. It runs long-running SQL statements using snowflake.connector.
We define the tasks inside the flow with just the
@task
decorator and a plain Python
def
(no
ShellTask
or
DatabricksSubmitRun
). Example:
from prefect import task

@task
def example():
    ...

Kevin Kho

03/22/2022, 4:14 PM
Is the cancellation here triggered by you, or do you mean Prefect just marks those as cancelled?

andrr

03/22/2022, 4:30 PM
The
Cancelled
state was triggered manually by us. I'm not sure about the
Cancelling
state.
Am I right that the
Cancelled
state is a part of the global
Finished
state (docs)? The expected behavior is that the heartbeat process detects that the flow run was marked as cancelled and shuts down the pod. Or am I missing something?
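To make that expected behavior concrete, here is a minimal sketch of a watchdog that polls a run state and stops the worker process once the state is Cancelling or Cancelled. It is not Prefect's actual heartbeat implementation; `watch_and_stop`, `get_state`, and the timing parameters are hypothetical names chosen for illustration, and `get_state` stands in for whatever call asks Prefect Cloud for the current state.

```python
import subprocess
import time
from typing import Callable

STOP_STATES = {"Cancelling", "Cancelled"}


def watch_and_stop(
    proc: subprocess.Popen,
    get_state: Callable[[], str],
    poll_s: float = 1.0,
    grace_s: float = 5.0,
) -> bool:
    """Poll the run state and stop `proc` once the run is being cancelled.

    Returns True if the watchdog stopped the process, False if the
    process exited on its own first.
    """
    while proc.poll() is None:
        if get_state() in STOP_STATES:
            proc.terminate()  # ask politely first
            try:
                proc.wait(timeout=grace_s)
            except subprocess.TimeoutExpired:
                proc.kill()  # escalate if the worker ignores the request
                proc.wait()
            return True
        time.sleep(poll_s)
    return False
```

If the worker is the only long-lived process in the container, stopping it lets the container exit, and the pod then leaves the Running state. A pod that stays Running after cancellation would suggest that either the state check or the termination step is not taking effect.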

Anna Geller

03/22/2022, 5:03 PM
@andrr Thanks for sharing the ID, the logs were helpful. Let me recap and check whether I understood the problem you are facing. Your flow is executing some long-running Snowflake query triggered from a Kubernetes flow run pod on Azure AKS. The query cannot complete for some reason, and both the task run and the flow run stay in a Running state until the flow run gets Cancelled by the Flow SLA, which is set to 1 hour. Questions/things you may try or check:
1. Can it be that you are not closing the DB connection? Can you DM us your flow to check?
2. If the query takes longer than 1 hour to complete, you need to remove or modify your SLA automation.
3. Is the problem that the flow run actually finishes all work successfully but only fails to clean up the underlying flow run pod? Can you say whether the actual work (the Snowflake query etc.) was performed successfully in the flow run "5b47d02d-1eb6-4c4b-9167-0303a69f8824"?
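On the first question, one way to make sure a connection is always closed, even when the query raises, is to wrap both the connection and the cursor in `contextlib.closing`. The sketch below uses `sqlite3` purely as a stand-in so it runs anywhere; it assumes your Snowflake connection object offers the same `cursor()` and `close()` methods, and `run_query` is a hypothetical helper name.

```python
import sqlite3
from contextlib import closing
from typing import Any, Callable, List


def run_query(connect: Callable[[], Any], query: str) -> List[tuple]:
    """Run one query and always close the cursor and the connection."""
    with closing(connect()) as conn:
        with closing(conn.cursor()) as cur:
            cur.execute(query)
            return cur.fetchall()


def connect_stand_in() -> sqlite3.Connection:
    """Stand-in connection factory; swap in your real connect call."""
    return sqlite3.connect(":memory:")
```

`run_query(connect_stand_in, "SELECT 1")` returns the rows and leaves no open connection behind, whether the query succeeds or fails.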

andrr

03/22/2022, 5:07 PM
Anna, thanks for the detailed answer, I need some time to discuss it with the team.
šŸ‘ 2

Anna Geller

03/22/2022, 5:10 PM
The ideal end goal would be to not have to Cancel this flow run and make sure it either completes successfully or fails.

andrr

03/22/2022, 5:11 PM
Got it, thank you a lot!
The ideal end goal would be to not have to Cancel this flow run and make sure it either completes successfully or fails.
Can we mark a flow that runs for more than X minutes (the time within which this flow is expected to finish) as a
Failed
flow? Right now we can only mark it with the
Cancelling
state.
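One possible approach to the question above, sketched with the standard library only: run the work under a deadline and raise an exception when it is exceeded, on the assumption that an unhandled exception inside a task ends the run as Failed rather than Cancelled. `run_with_deadline` and `DeadlineExceeded` are hypothetical names. Prefect tasks also accept a `timeout` argument, which may be the more direct route; check the docs for your version.

```python
import concurrent.futures
import time


class DeadlineExceeded(Exception):
    """Raised when the wrapped call does not finish in time."""


def run_with_deadline(fn, timeout_s: float, *args, **kwargs):
    """Call fn(*args, **kwargs) and raise DeadlineExceeded after timeout_s.

    Caveat: the worker thread cannot be killed, so fn keeps running in the
    background after the deadline; only the caller stops waiting for it.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError as exc:
        raise DeadlineExceeded(f"call exceeded {timeout_s} seconds") from exc
    finally:
        pool.shutdown(wait=False)  # do not block on the still-running call
```

Because the background call is not interrupted, a long-running query would still need to be cancelled on the database side for the pod to exit cleanly.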