
Thomas Opsomer

11/28/2022, 3:41 PM
Hello šŸ™‚ We regularly experience strange behaviour in Prefect 1.0 where some tasks fail or get stuck, but Prefect doesn't report them as failed. The logs look like this:
...
some logs from the running task
...
Downloading flow-name/2022-11-25t14-47-58-727718-00-00 from bucket
Beginning Flow run for 'flow-name'
Task 'accounts.link': Starting task run...
...
Flow run RUNNING: terminal tasks are incomplete.
Here is a flow_run_id that leads to a flow stuck like that: bd95ac1b-89c3-497c-bf66-272a918e1be0
Another one if it's of any help to investigate: c92e008e-aff5-41c5-ae72-1f10694ad7f7 šŸ™‚

Bianca Hoch

11/28/2022, 4:28 PM
Hello Thomas, thank you for reporting this. Do you by chance have any task concurrency limits set up for your flows?

Thomas Opsomer

11/28/2022, 4:31 PM
No, not at the flow level, only for some tasks.
But it seems that flows get stuck on any kind of task. In the first one it's on a task with a concurrency limit, but not in the second one...
?

Bianca Hoch

11/29/2022, 2:23 PM
Hello Thomas, apologies for the delay on this, we're still looking into root causes. I do have a few additional questions for you, if you don't mind:
ā€¢ Do you have Lazarus enabled for these flows?
ā€¢ Is the workflow itself completing successfully (or failing) while the states of the tasks/flows are not updating?
ā€¢ How many flows are affected by this? Is it just the two that you mentioned here?
In the meantime, here is an article that can help alleviate the hangups while we investigate. These queries can be used to clear out runs that are stuck in a transient state (Running or Cancelling): https://discourse.prefect.io/t/how-can-i-remove-flow-task-runs-stuck-in-a-running-or-canceling-state/1855
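[Editor's note] The linked article boils down to forcing stuck runs into a terminal state via the Prefect 1.x GraphQL API. A minimal sketch of that idea, assuming a payload shape for the `set_flow_run_states` mutation (the exact shape should be taken from the Discourse article; the flow run id is the one quoted in this thread, and no request is actually sent here):

```python
def failed_state_input(flow_run_id, message):
    """Build the `input` payload for Prefect 1.x's set_flow_run_states
    mutation, forcing a run stuck in Running into a terminal Failed state.
    The payload shape is an assumption; see the linked article for the
    canonical queries."""
    return {
        "states": [
            {
                "flow_run_id": flow_run_id,
                "state": {"type": "Failed", "message": message},
            }
        ]
    }

payload = failed_state_input(
    "bd95ac1b-89c3-497c-bf66-272a918e1be0",  # stuck run from this thread
    "Manually failed: terminal tasks were incomplete",
)
# With an authenticated Prefect 1.x client this would be sent roughly as:
#   prefect.Client().graphql(mutation, variables={"input": payload})
print(payload["states"][0]["state"]["type"])  # -> Failed
```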

Thomas Opsomer

11/29/2022, 3:14 PM
Hello Bianca, thanks for your reply šŸ™‚
ā€¢ Yes, Lazarus is enabled; it's on by default, no?
ā€¢ No, the workflow is stuck: the task seems to have failed or stopped, the flow is running, and the task is still shown in blue as running, but nothing is happening.
ā€¢ I can't say exactly, but every week this happens on about 2 to 5 flows (out of 30 running every weekend).
I had to restart the first flow. Do you want me to keep the second one in the "stuck" state while you're investigating?

Bianca Hoch

11/29/2022, 3:57 PM
Yup, Lazarus is enabled by default. Although, you can disable it, which is why I figured I'd ask šŸ˜… .
Thanks for the additional context! If you want to leave that flow as 'stuck', it would help. Unless it is blocking your production workflows, in that case definitely clear it out using the steps outlined in that article.

Thomas Opsomer

11/30/2022, 9:19 AM
Ok I can keep it stuck a bit more then :)

Bianca Hoch

11/30/2022, 3:13 PM
Hello Thomas, I do have a suggestion we can try here. Could you enable version locking for the flows which are being impacted by this? I did see some odd state changes for task run 0f46b0c1-314e-4716-aa02-cfbfa3ad4015. The logs show duplicates of the message "Task 'contacts.generate': Starting task run..." on the 28th (following the restart of the flow). Enabling version locking is a safeguard to ensure that the work runs only once.
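[Editor's note] Version locking in Prefect 1.x Cloud is toggled per flow. A hedged sketch of the GraphQL call; the mutation name `enable_flow_version_lock` is an assumption based on the 1.x Cloud API and should be verified against your server's schema, and the flow id is a placeholder taken from later in this thread:

```python
# Assumed Prefect 1.x Cloud mutation for enabling version locking on a flow.
ENABLE_VERSION_LOCK = """
mutation($flowId: UUID!) {
  enable_flow_version_lock(input: {flow_id: $flowId}) {
    success
  }
}
"""

variables = {"flowId": "6e2217a0-5771-401f-a17d-636751d7acf1"}  # placeholder flow id
# With an authenticated 1.x client this would be sent roughly as:
#   prefect.Client().graphql(ENABLE_VERSION_LOCK, variables=variables)
print("enable_flow_version_lock" in ENABLE_VERSION_LOCK)  # -> True
```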

Thomas Opsomer

11/30/2022, 4:06 PM
actually version locking is enabled by default on all our flows šŸ™‚
we enabled this feature to avoid having the same tasks run in duplicate

Bianca Hoch

11/30/2022, 5:41 PM
šŸ¤” Hmm..we'll have to take a look at this further then. I'll raise this to the team as well to see what could be happening here.
šŸ‘ 1

Thomas Opsomer

11/30/2022, 6:27 PM
alright thanks šŸ˜‰

Bianca Hoch

11/30/2022, 8:22 PM
Just so that we can keep tabs on this, would you mind filing a GitHub issue here? Setting the label "v1" on the issue will ensure it gets triaged appropriately.

Thomas Opsomer

12/01/2022, 10:23 AM
Alright, will do! However, I'll have to resume the flow that's currently stuck. But I'll have other ones next week if needed šŸ™‚
Hello Bianca, to follow up: here are some flows that were affected by the issue:
ā€¢ flow 6e2217a0-5771-401f-a17d-636751d7acf1 at task b8b2d8e8-fcc7-4a3b-a998-741264b45f23. This flow is finished, but we can clearly see in the task logs that the task was running and at some point Prefect kind of restarted it:
Task 'accounts.link': Starting task run...
and then we have a strange log:
Task 'accounts.link': Finished task run for task with final state: 'Running'
And then we had to manually restart it...
ā€¢ flow 4b2ed23e-32fd-4794-bdf3-4a05e8b27869 at task 9c16ce6e-7cea-4edb-96d1-8cfe21bd9813. This one isn't resumed yet, so the stuck task still appears as "running".
ā€¢ Also, sometimes we have another strange log:
INFO CloudTaskRunner Task 'accounts.link_ngram': Finished task run for task with final state: 'ClientFailed'
I can't find the flow run ids now, but we had a few of them. And each time the flow doesn't fail.
Another one šŸ™‚
ā€¢ flow 5d3a9f91-1f87-49d0-8bb8-0d3848d656e3 at task 092a3f27-db56-4dea-aa47-ad11fafb8cd2
and:
ā€¢ flow 8715221f-a8a8-4e92-852b-c8d460623d20 at task c3384b23-4193-4b24-9a4d-228bfdb270bf
up

Bianca Hoch

12/06/2022, 4:22 PM
Hey Thomas, thanks for posting these runs. Could you share a bit more about your runtime and the flow itself? What environment is the flow running in, where is it hosted, etc?
Looking at the logs of the flow runs and task runs, I'm seeing duplicated records for different events that were generated seconds apart.

Thomas Opsomer

12/06/2022, 4:26 PM
What kind of "environment" information do you want? We're using Prefect 1.4.0. The agent and flows are running on k8s (in GKE).

Bianca Hoch

12/06/2022, 4:28 PM
Has GKE autopilot been disabled by chance? That is another step we have recommended to people previously to ensure that Prefect and Kubernetes do not fall out of sync.

Thomas Opsomer

12/06/2022, 4:31 PM
Yes, autopilot is disabled.

Bianca Hoch

12/06/2022, 4:37 PM
Noted. Have you also examined the logs of the agent responsible for running these flows by chance?
Also, to be thorough, I'd suggest updating the additional context section of the GitHub issue as well. I know it's a bit redundant, but it will help with troubleshooting.
Cross posting here just for reference: https://github.com/PrefectHQ/prefect/issues/7712
Just noticed that another user referenced this post, as they were having a similar issue. Mason's explanation of why flows get stuck in a Running state is a good one.

Thomas Opsomer

12/06/2022, 4:54 PM
No, I haven't looked at the agent logs. Usually we don't notice when a flow gets stuck, so I don't know when to look, and we don't keep the logs anywhere... However, I can see that the agent restarted this weekend. Could the restart be related to the different issues?
Thanks for the link to the other thread. Actually, we used to have issues like the one with "no heartbeat detected", so I know that if anything happens to the pod running the flow, it's complicated for Prefect to find out about it. We used to hit this with pods rescheduled due to autoscaling. But I don't think the current issue is related, because I don't see any k8s events touching the pods, plus there are the weird "restarting" logs. I'm going to monitor the agent, though, because the fact that it restarted is a bit strange to me.
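[Editor's note] For the "no heartbeat detected" situation mentioned above, a known Prefect 1.x mitigation is switching the heartbeat from its default subprocess to a thread via the `PREFECT__CLOUD__HEARTBEAT_MODE` setting. A minimal sketch; `KubernetesRun` matches the GKE setup described in this thread, and the run-config usage is shown in comments rather than executed:

```python
# Prefect 1.x heartbeat modes: "process" (default), "thread", or "off".
# Running the heartbeat in a thread avoids spurious heartbeat failures
# when the heartbeat subprocess dies while the task keeps running.
env = {"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"}

# In the flow definition this would look roughly like:
#   from prefect.run_configs import KubernetesRun
#   flow.run_config = KubernetesRun(env=env)
print(env["PREFECT__CLOUD__HEARTBEAT_MODE"])  # -> thread
```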