# ask-community
j
I can't get my work queues to start properly. I'm doing this on a RHEL VM with a service file that just calls
prefect agent start -q main
When I look at the logs, it shows the agent starting and checking for runs, but it doesn't find any. Meanwhile, the work queue in Prefect Cloud is unhealthy with a bunch of late runs. I've tried reauthenticating to Prefect Cloud from the terminal, restarting the agent, and upgrading to the latest Prefect version. Any other advice?
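For reference, this is roughly how I've been poking at it from the VM (the service name below is a placeholder for ours):
# confirm the agent service is up and tail its logs
systemctl status prefect-agent && journalctl -u prefect-agent -f
# ask Prefect what it knows about the queue and its upcoming runs
prefect work-queue inspect main
prefect work-queue preview main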
w
Hmm, can you double-check that your agent is connected to the right workspace? The workspace ID in your Cloud URL and the workspace ID in your profile on the VM should be the same. You can see your API URL on the VM by running
prefect config view
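The line to compare is PREFECT_API_URL; for Cloud it should look something like this, with your own IDs in place of the placeholders:
PREFECT_API_URL='https://api.prefect.cloud/api/accounts/<account-id>/workspaces/<workspace-id>'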
j
Yep, those IDs match. Any other ideas are much appreciated!
s
I'm also having the same issue. The last healthy run I had was at 11 AM; since then, all of my flows have been stuck in Late status, so I'm assuming this is a more widespread issue.
We're running on K8s infra on GCP, so it wouldn't be a RHEL-related issue.
j
When I disabled the concurrency limit for the work queue, it became healthy again!
s
Was this on the Cloud side? We currently have flow run concurrency set to unlimited.
Wondering if adding a limit and then removing it will trigger it. Which limit were you at?
j
I set it to 5, and when I cleared it, things worked again. And yeah, this was on the Cloud side, right where your screenshot shows.
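If it's easier to do from the CLI than the Cloud UI, I believe the equivalent is roughly this (assuming the queue is named main):
# temporarily add a concurrency limit to the work queue, then clear it
prefect work-queue set-concurrency-limit main 5
prefect work-queue clear-concurrency-limit main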
s
Ok, yeah, I did the same and they are catching up now. I'm not savvy enough to root cause this, but @Will Raphaelson, would you be able to set up an investigation?
j
Okay glad that worked!
w
Yeah, I'll raise this to our on-call engineers, thanks.
s
Thank you James and Will!
a
Hey James and Steven! Would either of you be willing to DM me with your account ID or workspace ID?
s
Flagging that this happened again this morning; the last successful run was at 8:12 AM. Have you had any luck investigating? If you need more info, maybe we can schedule a call for Monday? Our fix has been to restart our Kubernetes agent cluster, not the work queue concurrency poke that James suggested above. @Andrew Brookins
Also, I don't think this is a bug report per se, but we pushed a code change while runs were in the Late state. Our CI/CD updates our deployments, and that caused all the Late runs to vanish, whereas I would expect the behavior to be similar to applying a deployment for scheduled flows, where the schedule is maintained. Our setup can now handle late runs for backfilling, but our late runs were removed, so they never got the chance to properly catch up.
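For context, the CI/CD step is essentially just re-applying the deployments, something like this (the flow path, names, and queue are placeholders, not our real ones):
# rebuild the deployment definition and apply it to Prefect Cloud
prefect deployment build flows/etl.py:etl_flow -n etl -q main --apply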
@Andrew Brookins Has anything been found in relation to this? We had another outage for an hour this morning, ~8-9 AM.
@Andrew Brookins - We had another outage last night. It seems like this is related to the scheduler. The main problem is that this doesn't trigger a restart automatically; normally this should cause a failed healthcheck, but I don't see that happening.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/_internal/concurrency/services.py", line 120, in _run
    async with self._lifespan():
  File "/usr/local/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/site-packages/prefect/events/worker.py", line 30, in _lifespan
    async with self._client:
  File "/usr/local/lib/python3.10/site-packages/prefect/events/clients.py", line 118, in __aenter__
    await self._reconnect()
  File "/usr/local/lib/python3.10/site-packages/prefect/events/clients.py", line 136, in _reconnect
    self._websocket = await self._connect.__aenter__()
  File "/usr/local/lib/python3.10/site-packages/websockets/legacy/client.py", line 637, in __aenter__
    return await self
  File "/usr/local/lib/python3.10/site-packages/websockets/legacy/client.py", line 655, in __await_impl_timeout__
    return await self.__await_impl__()
  File "/usr/local/lib/python3.10/site-packages/websockets/legacy/client.py", line 659, in __await_impl__
    _transport, _protocol = await self._create_connection()
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1036, in create_connection
    infos = await self._ensure_resolved(
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1418, in _ensure_resolved
    return await loop.getaddrinfo(host, port, family=family, type=type,
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 863, in getaddrinfo
    return await self.run_in_executor(
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 821, in run_in_executor
    executor.submit(func, *args), loop=self)
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 167, in submit
    raise RuntimeError('cannot schedule new futures after shutdown')
RuntimeError: cannot schedule new futures after shutdown
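For now, the only reliable fix on our end is bouncing the agent pods, roughly like this (the deployment and namespace names here are placeholders):
# restart the Prefect agent deployment so it reconnects cleanly
kubectl rollout restart deployment/prefect-agent -n prefect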
a
Hey Steven! I looked over our logs during a couple of these periods of time, but I couldn't learn enough to build a theory (it's sometimes hard without the payloads, etc.). However, based on the behavior, I believe it's a case of this issue: https://github.com/PrefectHQ/prefect/issues/9394. We're taking a close look at client-side resilience and will be working on a resolution ASAP.