# ask-community
c
Hey all, we ran into a production issue today with Prefect, where the workers all suddenly went "late". The workers seem to be alive, but they weren't polling (they appear to hang). When I press Enter in the cmd window (hosting the worker), it all of a sudden starts working, but without that it's all frozen. What can be causing this? How can we make it robust? Using Prefect 2.18.3 Server.
n
hi @Charles - how exactly are you running the worker?
c
The process worker runs on a remote Windows machine, via cmd.exe. To start the worker, we have a script that activates the Prefect virtual environment and runs `prefect worker start`. The work pool is also a Prefect-managed work pool with one work queue (default) that all of the Prefect workers poll from.
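For context, a startup script for that kind of setup typically looks something like this (paths and pool name are placeholders, not the actual script from this thread):

```
:: start_worker.bat - placeholder paths and pool name
call C:\prefect\venv\Scripts\activate.bat
prefect worker start --pool my-process-pool
```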
n
sorry, I'm not familiar with windows so much, but I'd guess that `cmd.exe` is not what you want for a long-running service like a worker, and that it's the cause of this behavior:
> When I press enter in the cmd window (hosting the worker), it all of a sudden starts working. But without it, it's all frozen
on unix systems you'd use a thing like `systemd`, and google seems to be telling me that NSSM is the windows analogue
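For reference, installing the worker as a Windows service with NSSM looks roughly like this (service name, paths, and pool name are illustrative, not taken from this thread):

```
:: run from an elevated cmd prompt; adjust paths to your environment
nssm install prefect-worker "C:\prefect\venv\Scripts\prefect.exe" worker start --pool my-process-pool
nssm set prefect-worker AppDirectory "C:\prefect"
nssm set prefect-worker AppStdout "C:\prefect\logs\worker.log"
nssm set prefect-worker AppStderr "C:\prefect\logs\worker-error.log"
nssm start prefect-worker
```

A service installed this way isn't tied to an interactive console session and can be configured to start automatically with the machine.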
c
Gotcha. Is there a way to increase robustness in the meantime? For example:
• If deployments get sent over to a process worker that is hanging/non-responsive, can Prefect have a failover mechanism to reroute those deployments to another worker that may be working?
• Can Prefect do any mitigation by using the heartbeat of a process worker to determine the responsiveness of a worker?
Thanks for the tip about converting to a service, I'll look into NSSM as well! We just want to ensure robustness in our production system so that we don't miss any flow runs.
n
> can Prefect have a failover mechanism to reroute those sent deployments to another worker that may be working
this would happen by default. any worker that is listening to `--pool some-pool` (from the `prefect worker start` command) will pick up runs scheduled in that pool
the number one recommendation for robustness would be to use a tool that's meant to run a long-lived process as a service, since I'd guess using cmd.exe is analogous to running `prefect worker start` on my laptop, i.e. if I close my laptop, the process stops
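Concretely, redundancy just means pointing more than one worker at the same pool, e.g. from two different machines (pool and worker names below are illustrative):

```
:: machine A
prefect worker start --pool my-process-pool --name worker-a

:: machine B
prefect worker start --pool my-process-pool --name worker-b
```

Whichever worker polls first picks up a scheduled run, so if one worker is down the other still picks up newly scheduled runs in that pool.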
c
So when the deployment run has already been picked up by that faulty, unresponsive process worker, is there some sort of timeout process that automatically runs it on another worker that's listening to that pool once it is late? I agree that running the process as a service can work as well, but I was wondering, if that also gets into an unresponsive state, how failover with multiple process workers would behave.
n
i would suggest a slight paradigm shift. it’s not that runs are sent to workers; runs are scheduled under a work pool, and then workers that are both functional and subscribed to that pool can discover those scheduled runs. if it’s a useful analogy: if flow runs are messages, the prefect server’s scheduling service is a publisher, the work pool is a topic, and the worker is a consumer. if no worker picks up a run, then after some designated time past the scheduled time of the run, it will become a Late run, which is just a cue for humans or automations to respond to the absence of a run execution.
is that helpful?
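To make that concrete: triggering a deployment only schedules a run into the deployment's work pool; any healthy worker polling that pool can pick it up (the deployment name here is a placeholder):

```
:: schedules a run in the work pool; no specific worker is targeted
prefect deployment run "my-flow/my-deployment"
```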
c
Awesome, that makes a lot of sense 🙂 I'm asking because we've seen conditions where:
• We have a deployment run in a work pool (picked up by process worker A)
• Process worker A runs, and is then unresponsive (we killed it)
• The task is still 'running' according to the UI, and isn't moved to process worker B
I'm assuming this can be mitigated if we had a timeout for the run, so that when it enters a retry state it will be picked up by B (a sketch of that configuration is below).
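For illustration only, a minimal sketch of that timeout/retry configuration in Prefect 2, assuming a simple flow (the flow name and values are placeholders); whether a retried run actually lands on a different worker depends on how the original run failed:

```python
from prefect import flow


# placeholder flow: fail the run if it exceeds an hour, then ask Prefect
# to retry it once after a short delay
@flow(timeout_seconds=3600, retries=1, retry_delay_seconds=60)
def my_flow():
    ...
```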