I have Flows that are seemingly being restarted in...
# prefect-community
d
I have Flows that are seemingly being restarted in the middle of a long-running task, and I'm trying to track down why. An abbreviated log is below. By 19:39 the run has been explicitly killed and rescheduled by Cloud, but before that, at 19:22, there's a seemingly spontaneous and unexplained second "Beginning Flow run". Any suggestions on how to narrow down the cause of this?
TIMESTAMP                         LEVEL    MESSAGE
2020-05-19T18:58:29.732839+00:00  INFO     Submitted for execution: Job prefect-job-35b1423b
2020-05-19T19:04:28.107917+00:00  INFO     Beginning Flow run for 'compute_**_flow'
2020-05-19T19:04:28.240831+00:00  INFO     Starting flow run.
2020-05-19T19:04:28.241095+00:00  DEBUG    Flow 'compute_**_flow': Handling state change from Scheduled to Running
2020-05-19T19:05:12.954233+00:00  INFO     Task 'compute_**_task': Starting task run...
2020-05-19T19:05:12.95458+00:00   DEBUG    Task 'compute_**_task': Handling state change from Pending to Running
2020-05-19T19:05:13.210047+00:00  DEBUG    Task 'compute_**_task': Calling task.run() method...
2020-05-19T19:22:07.863766+00:00  INFO     Beginning Flow run for 'compute_**_flow'
2020-05-19T19:22:08.577243+00:00  INFO     Task 'compute_**_task': Starting task run...
2020-05-19T19:22:08.578027+00:00  DEBUG    Task 'compute_**_task': task is already running.
2020-05-19T19:22:08.59477+00:00   INFO     Task 'compute_**_task': finished task run for task with final state: 'Running'
2020-05-19T19:24:34.702197+00:00  ERROR    Marked "Failed" by a Zombie Killer process.
2020-05-19T19:39:33.646426+00:00  INFO     Rescheduled by a Lazarus process. This is attempt 1.
2020-05-19T19:39:56.331103+00:00  INFO     Submitted for execution: Job prefect-job-21a433ec
2020-05-19T19:42:33.754737+00:00  INFO     Beginning Flow run for 'compute_**_flow'
2020-05-19T19:42:33.869824+00:00  INFO     Starting flow run.
...
z
Hi @Dan DiPasquo! Each task run sends a heartbeat back to Cloud to indicate its health. If a task misses more than four heartbeats in a row, it's marked as a zombie and failed so that it can be rescheduled by the Lazarus process. We most commonly see tasks fail as zombies when the infrastructure they're executing on is strained, but it's hard to say without further insight into your infrastructure. If that's desirable behavior and you were just curious about it, great! If not, we also have the option to disable this. To do so, you can either navigate to the Settings tab for your flow in the UI or execute the GraphQL mutation linked below. https://docs.prefect.io/orchestration/concepts/flows.html#toggling-heartbeats
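To make the heartbeat rule above concrete, here is a rough sketch of the "more than four missed heartbeats in a row" check. It is only an illustration, not Cloud's actual implementation: the function names, the 30-second interval, and the way missed beats are counted are all assumptions for the example.

```python
from datetime import datetime, timedelta

# Assumed values for illustration only; the real interval used by Cloud
# is not stated in this thread.
HEARTBEAT_INTERVAL = timedelta(seconds=30)
MAX_MISSED = 4  # "more than four heartbeats in a row" marks a zombie


def missed_heartbeats(last_heartbeat, now, interval=HEARTBEAT_INTERVAL):
    """Count whole heartbeat intervals that elapsed with no heartbeat."""
    elapsed = now - last_heartbeat
    return max(0, int(elapsed / interval))


def is_zombie(last_heartbeat, now, interval=HEARTBEAT_INTERVAL,
              max_missed=MAX_MISSED):
    """True when more than `max_missed` consecutive heartbeats were missed."""
    return missed_heartbeats(last_heartbeat, now, interval) > max_missed


last = datetime(2020, 5, 19, 19, 22, 0)
print(is_zombie(last, last + timedelta(seconds=120)))  # four missed: still healthy
print(is_zombie(last, last + timedelta(seconds=150)))  # five missed: zombie
```

Under these assumed numbers, a task blocked long enough to skip five beats in a row would be failed and handed to Lazarus, which fits the strained-infrastructure case described above.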
d
Hi @Zachary Hughes, thanks. I see and understand that the Flow is being killed and restarted by Zombie Killer/Lazarus starting at 2020-05-19T19:24. What I don't understand is why the flow is starting again before that, at 2020-05-19T19:22. In total this flow was started (per the "Beginning Flow run" log messages) 3 times: (1) original start, (2) ???, (3) Zombie Killer/Lazarus kill/restart.
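For reference, the three starts can be pulled out of an exported log with a small script like the one below. This is a sketch that assumes the whitespace-separated TIMESTAMP / LEVEL / MESSAGE layout shown in the log excerpt; `flow_starts` is a hypothetical helper, not part of Prefect.

```python
from datetime import datetime

# A few lines copied from the log excerpt in this thread.
LOG = """\
2020-05-19T19:04:28.107917+00:00  INFO     Beginning Flow run for 'compute_**_flow'
2020-05-19T19:22:07.863766+00:00  INFO     Beginning Flow run for 'compute_**_flow'
2020-05-19T19:24:34.702197+00:00  ERROR    Marked "Failed" by a Zombie Killer process.
2020-05-19T19:42:33.754737+00:00  INFO     Beginning Flow run for 'compute_**_flow'
"""


def flow_starts(log_text):
    """Return the timestamps of every 'Beginning Flow run' message."""
    starts = []
    for line in log_text.splitlines():
        parts = line.split(None, 2)  # timestamp, level, message
        if len(parts) == 3 and parts[2].startswith("Beginning Flow run"):
            # Only matching lines are parsed; their timestamps carry
            # six fractional digits, which fromisoformat accepts.
            starts.append(datetime.fromisoformat(parts[0]))
    return starts


starts = flow_starts(LOG)
for earlier, later in zip(starts, starts[1:]):
    print(f"{earlier.time()} -> {later.time()}: gap {later - earlier}")
```

The gap between start (1) and start (2) is the unexplained one, since the Zombie Killer message only appears after start (2).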
z
Okay, gotcha. If possible, do you mind DMing me the flow run ID associated with these logs? That would help me dig a bit deeper into this behavior.
d
Yes, let me find that. Thank you!