# ask-community
c
Hey all, we ran into a production issue today with Prefect, where the workers all suddenly went "late". The workers seem to be alive, but they weren't polling (they appear to hang). When I press Enter in the cmd window (hosting the worker), it all of a sudden starts working, but without that it's all frozen. What can be causing this? How can we make it robust? Using Prefect 2.18.3 Server.
n
hi @Charles - how exactly are you running the worker?
c
The process worker runs on a remote Windows machine, via cmd.exe. To start the worker, we have a script that activates the Prefect virtual environment and runs `prefect worker start`. The work pool is also a Prefect-managed work pool with one work queue (default) that all of the Prefect workers poll from.
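For context, a startup script for that kind of setup typically looks something like this (paths and pool name are placeholders, not the actual script from this thread):

```
:: start_worker.bat - placeholder paths and pool name
call C:\prefect\venv\Scripts\activate.bat
prefect worker start --pool my-process-pool
```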
n
sorry, I'm not familiar with windows so much, but I'd guess that `cmd.exe` is not what you want for a long-running service like a worker, and that it's the cause of this behavior:
> When I press enter in the cmd window (hosting the worker), it all of a sudden starts working. But without it, it's all frozen
on unix systems you'd use a thing like `systemd`, and google seems to be telling me that NSSM is the windows analogue
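For reference, installing the worker as a Windows service with NSSM looks roughly like this (service name, paths, and pool name are illustrative, not taken from this thread):

```
:: run from an elevated cmd prompt; adjust paths to your environment
nssm install prefect-worker "C:\prefect\venv\Scripts\prefect.exe" worker start --pool my-process-pool
nssm set prefect-worker AppDirectory "C:\prefect"
nssm set prefect-worker AppStdout "C:\prefect\logs\worker.log"
nssm set prefect-worker AppStderr "C:\prefect\logs\worker-error.log"
nssm start prefect-worker
```

A service installed this way isn't tied to an interactive console session and can be configured to start automatically with the machine.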
c
Gotcha. Is there a way to increase robustness in the meantime? For example:
• If deployments get sent over to a process worker that is hanging/non-responsive, can Prefect have a failover mechanism to reroute those deployments to another worker that may be working?
• Can Prefect do any mitigation by using the heartbeat of a process worker to determine the responsiveness of a worker?
Thanks for the tip about converting to a service, I'll look into NSSM as well! We just want to ensure robustness in our production system so that we don't miss any flow runs.
n
> can Prefect have a failover mechanism to reroute those sent deployments to another worker that may be working
this would happen by default. any worker that is listening to `--pool some-pool` (from the `prefect worker start` command) will pick up runs scheduled in that pool
the number one recommendation for robustness would be to use a tool that's meant to run a long-lived process as a service, since I'd guess using cmd.exe is analogous to running `prefect worker start` on my laptop, i.e. if I close my laptop, the process stops
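Concretely, redundancy just means pointing more than one worker at the same pool, e.g. from two different machines (pool and worker names below are illustrative):

```
:: machine A
prefect worker start --pool my-process-pool --name worker-a

:: machine B
prefect worker start --pool my-process-pool --name worker-b
```

Whichever worker polls first picks up a scheduled run, so if one worker is down the other still picks up newly scheduled runs in that pool.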
c
So when the deployment run has already been picked up by that faulty, unresponsive process worker, is there some sort of timeout process that automatically runs it on another worker that's listening to that pool once it is late? I agree that running the process as a service can work as well, but I was wondering, if that also gets into an unresponsive state, how failover with multiple process workers would behave.
n
i would suggest a slight paradigm shift. it’s not that runs are sent to workers; runs are scheduled under a work pool, and then workers that are both functional and subscribed to that pool can discover those scheduled runs. if it’s a useful analogy: if flow runs are messages, the prefect server’s scheduling service is a publisher, the work pool is a topic, and the worker is a consumer. if no worker picks up a run, then after some designated time past the scheduled time of the run, it will become a Late run, which is just a cue for humans or automations to respond to the absence of a run execution.
is that helpful?
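To make that concrete: triggering a deployment only schedules a run into the deployment's work pool; any healthy worker polling that pool can pick it up (the deployment name here is a placeholder):

```
:: schedules a run in the work pool; no specific worker is targeted
prefect deployment run "my-flow/my-deployment"
```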
c
Awesome, that makes a lot of sense 🙂 I'm asking because we've seen conditions where:
• We have a deployment run in a work pool (picked up by process worker A)
• Process worker A runs, and is then unresponsive (we killed it)
• The task is still 'running' according to the UI, and isn't moved to process worker B
I'm assuming this can be mitigated if we had a timeout for the run, so that when it enters a retry state it will be picked up by B (a sketch of that configuration is below).
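For illustration only, a minimal sketch of that timeout/retry configuration in Prefect 2, assuming a simple flow (the flow name and values are placeholders); whether a retried run actually lands on a different worker depends on how the original run failed:

```python
from prefect import flow


# placeholder flow: fail the run if it exceeds an hour, then ask Prefect
# to retry it once after a short delay
@flow(timeout_seconds=3600, retries=1, retry_delay_seconds=60)
def my_flow():
    ...
```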