Hey all wave We face several problems with flows that run in Prefect Community #ask-community

andrr

03/22/2022, 12:50 PM

Hey all, 👋 We're facing several problems with flows that run in the Kubernetes cluster.
• Pods often get stuck in the `Running` state, with the last message in the logs being `DEBUG - prefect.CloudFlowRunner | Checking flow run state...`
• The flow in Prefect Cloud gets stuck in the `Cancelling` state and the pod gets stuck in the `Running` state in the Kubernetes cluster.

Context:
• Prefect version `0.15.13`
• Private Azure AKS cluster
• We've tried setting `PREFECT__CLOUD__HEARTBEAT_MODE` to `"thread"`, but it only got worse (more pods stuck in the `Running` state). Now we have `PREFECT__CLOUD__HEARTBEAT_MODE` set to `"process"` and `tini -- prefect execute flow-run` as PID 1 to handle zombie processes.

It seems like the problem is with the heartbeat process detecting the change to the `Cancelling`/`Cancelled` states of the flow. I appreciate any help, thanks 🙂

Kevin Kho

03/22/2022, 2:22 PM

Hey @andrr, is the pod actually dying here?

andrr

03/22/2022, 2:27 PM

Hey, no, the pod is stuck in the `Running` state. I have to inspect the pod with the `kubectl describe pod` command and read the `PREFECT__CONTEXT__FLOW_RUN_ID` env variable, then I look up this ID in the Prefect Cloud interface, check that the flow has the `Cancelling`/`Cancelled` state, and then delete the pod manually.
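The manual lookup described above could be partly scripted. What follows is only a minimal sketch, not something posted in the thread: it assumes `kubectl` is on the PATH and that the flow run ID is a plain `value` on the container's `env` entry, as the `kubectl describe pod` check suggests. The function names and the `default` namespace are hypothetical.

```python
import json
import subprocess

# Env variable that carries the flow run ID, as mentioned in the thread.
FLOW_RUN_ENV = "PREFECT__CONTEXT__FLOW_RUN_ID"


def flow_run_ids(pod_list: dict) -> dict:
    """Map pod name -> flow run ID from `kubectl get pods -o json` output."""
    result = {}
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        for container in pod["spec"].get("containers", []):
            for env in container.get("env", []):
                # Skip entries populated via valueFrom (no literal value).
                if env.get("name") == FLOW_RUN_ENV and "value" in env:
                    result[name] = env["value"]
    return result


def running_pods(namespace: str = "default") -> dict:
    """Ask kubectl for pods in the Running phase and extract their flow run IDs."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace,
         "--field-selector=status.phase=Running", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return flow_run_ids(json.loads(out))
```

Each returned ID would still need to be checked against Prefect Cloud before deleting a pod; this sketch only replaces the `kubectl describe pod` step.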

andrr

03/22/2022, 2:35 PM

Here is an example that I have right now:

kubectl get pod prefect-job-27524014--1-7zddl
NAME                            READY   STATUS    RESTARTS   AGE
prefect-job-27524014--1-7zddl   1/1     Running   0          150m

I see the following lines in the logs of that pod:

kubectl logs prefect-job-27524014--1-7zddl
...
[2022-03-22 12:01:52+0000] DEBUG - prefect.CloudFlowRunner | Checking flow run state...

The flow has the `Cancelled` state in Prefect Cloud:

[22 March 2022 2:15pm]: Andrr marked flow run as Cancelled

The pod has been stuck in the `Running` state for more than 2 hours:

kubectl exec prefect-job-27524014--1-7zddl -- ps aux

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   2300   528 ?        Ss   12:01   0:00 tini -- prefect execute flow-run
root        15  0.2  0.4 1261416 139424 ?      Sl   12:01   0:21 /usr/local/bin/python /usr/local/bin/prefect execute flow-run
root        18  0.1  0.1 139372 50532 ?        Sl   12:01   0:11 /usr/local/bin/python -m prefect heartbeat flow-run -i 5b47d02d-xxx-xxx-0303a69f8824
root        41  0.0  0.0   8584  3240 ?        Rs   14:39   0:00 ps aux

andrr

03/22/2022, 2:36 PM

This is the last line in the pod log:

DEBUG - prefect.CloudFlowRunner | Checking flow run state...

Anna Geller

03/22/2022, 3:08 PM

This Discourse topic explains the problem a bit more and provides some solutions you may try: https://discourse.prefect.io/t/flow-is-failing-with-an-error-message-no-heartbeat-detected-from-the-remote-task/79
1. Are you on Cloud or Server?
2. If you are on Cloud, can you send us the flow run ID?
3. Any chance you could send us your flow or just explain what your flow is doing? Is it doing work in some external service such as AWS Batch or Databricks?

andrr

03/22/2022, 3:24 PM

Hey Anna, thanks for the reply! I've checked the provided link. Does the error from the link, `No heartbeat detected from the remote task`, mean the same as `prefect.CloudFlowRunner | Checking flow run state...`? Also, we do not have out-of-memory issues.

1) Setting heartbeat mode to threads
We've tried this, but after that every manually cancelled flow just got stuck in the `Running` state. Then we switched back to the `process` heartbeat mode, and now only some of the cancelled flows' pods get stuck in the `Running` state, not all of them.

Also, we use `KubernetesRun` instead of `UniversalRun`. Could that be a major factor?

andrr

03/22/2022, 3:34 PM

> Are you on Cloud or Server?

We use Cloud.

> If you are on Cloud, can you send us the flow run ID?

5b47d02d-1eb6-4c4b-9167-0303a69f8824

> Any chance you could send us your flow or just explain what your flow is doing?

Inside the flow, we transfer data between databases, saving it to a temporary file as an intermediate step. These are long-running SQL executions using snowflake.connector.

andrr

03/22/2022, 3:42 PM

We define the tasks inside the flow with just the `@task` decorator and a Python `def` (no `ShellTask`, `DatabricksSubmitRun`). Example:

@task
def example():
    ...

Kevin Kho

03/22/2022, 4:14 PM

Is the cancellation here triggered by you, or do you mean Prefect just marks those as cancelled?

andrr

03/22/2022, 4:30 PM

The `Cancelled` state was triggered manually by us. I'm not sure about the `Cancelling` state.

andrr

03/22/2022, 4:35 PM

Am I right that the `Cancelled` state is a part of the global `Finished` state (docs)? The expected behavior is that the heartbeat process detects that the flow is marked as cancelled and shuts down the pod. Or maybe I'm missing something?

Anna Geller

03/22/2022, 5:03 PM

@andrr Thanks for sharing the ID, the logs were helpful. Let me recap and check whether I understood the problem you are facing. Your flow is executing some long-running Snowflake query triggered from a Kubernetes flow run pod on Azure AKS. The query cannot complete for some reason and both the task run and the flow run keep staying in a Running state until the flow run gets Cancelled by the Flow SLA, which is set for 1 hour.

Questions/things you may try or check:
1. Can it be that you are not closing the DB connection? Can you DM us your flow to check?
2. If the query takes longer than 1 hour to complete, you need to remove or modify your SLA automation.
3. Is the problem that the flow run actually finishes all work successfully but only fails to clean up the underlying flow run pod? Can you say if the actual work, the Snowflake query etc., was performed successfully in the flow run "5b47d02d-1eb6-4c4b-9167-0303a69f8824"?
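On the first question (unclosed DB connections), one common pattern is to wrap the connection and cursor in `contextlib.closing` so they are released even when the query raises. The sketch below is an illustration only: it uses the standard-library `sqlite3` module as a stand-in for `snowflake.connector`, on the assumption that both follow the Python DB-API; `run_query` and the `connect` factory parameter are hypothetical names, not taken from the flow in question.

```python
import sqlite3
from contextlib import closing


def run_query(query: str, connect=lambda: sqlite3.connect(":memory:")):
    """Run a query and always close the cursor and connection afterwards.

    `connect` is any zero-argument factory returning a DB-API connection;
    sqlite3 is used here only as a stand-in for the real database driver.
    """
    with closing(connect()) as conn:
        with closing(conn.cursor()) as cur:
            cur.execute(query)
            # Fetch before leaving the block, while the cursor is still open.
            return cur.fetchall()
```

If the real task body follows this shape, a failing or interrupted query should not leave a connection open after the task function returns.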

andrr

03/22/2022, 5:07 PM

Anna, thanks for the detailed answer, I need some time to discuss it with the team.

👍 2

Anna Geller

03/22/2022, 5:10 PM

The ideal end goal would be to not have to Cancel this flow run and make sure it either completes successfully or fails.

andrr

03/22/2022, 5:11 PM

Got it, thank you a lot!

andrr

03/23/2022, 9:38 AM

> The ideal end goal would be to not have to Cancel this flow run and make sure it either completes successfully or fails.

Can we mark a flow that runs for more than X minutes (the time within which this flow is expected to finish) as a `Failed` flow? Right now we can only mark it with the `Cancelling` state.
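The thread does not answer this question. As a rough sketch of one possible direction, an external script could compare a run's start time against a deadline and then set the state through the Prefect 1.x client. Everything here is an assumption rather than a confirmed recommendation: `exceeded_deadline` and `mark_failed` are hypothetical helper names, and setting the state from outside may not by itself stop the pod, given the heartbeat behavior described earlier in the thread.

```python
from datetime import datetime, timedelta, timezone


def exceeded_deadline(start_time: datetime, max_minutes: int, now: datetime = None) -> bool:
    """Return True if the run has been going for longer than max_minutes."""
    now = now or datetime.now(timezone.utc)
    return now - start_time > timedelta(minutes=max_minutes)


def mark_failed(flow_run_id: str, reason: str) -> None:
    """Set a flow run to Failed via the Prefect 1.x client (assumed installed)."""
    # Imported lazily so the deadline helper works without Prefect installed.
    from prefect import Client
    from prefect.engine.state import Failed

    Client().set_flow_run_state(flow_run_id=flow_run_id, state=Failed(reason))
```

A scheduler or cron job could call `exceeded_deadline` with each run's start time and invoke `mark_failed` only for runs past their limit.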
