James Ashby

04/28/2023, 5:29 PM
I can't get my work queues to start properly. I'm doing this on a RHEL VM with a service file that just calls prefect agent start -q main. When I look at the logs, it shows the agent starting and checking for runs, but it doesn't find any. Meanwhile, the work queue in Prefect Cloud is unhealthy with a bunch of late runs. I've tried reauthenticating to Prefect Cloud from the terminal, restarting the agent, and upgrading to the latest Prefect version. Any other advice?
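For context, the service file itself is nothing special; a unit along these lines is what I mean (the path, user, and restart settings below are illustrative rather than my exact values):

# /etc/systemd/system/prefect-agent.service (illustrative; adjust ExecStart
# to wherever prefect is installed)
[Unit]
Description=Prefect agent polling the "main" work queue
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=prefect
ExecStart=/usr/local/bin/prefect agent start -q main
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target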

Will Raphaelson

04/28/2023, 5:31 PM
hmm, can you double-check that your agent is connected to the right workspace? The workspace ID in your Cloud URL and the workspace ID in your profile on the VM should be the same. You can see your API URL on the VM by running prefect config view.
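One quick way to compare them, assuming a Prefect Cloud API URL of the usual .../accounts/<account_id>/workspaces/<workspace_id> form:

prefect config view | grep PREFECT_API_URL
# the workspace ID is the last path segment of PREFECT_API_URL; it should
# match the workspace ID shown in your browser's Prefect Cloud URL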

James Ashby

04/28/2023, 5:35 PM
yep, those IDs match. Any other ideas are much appreciated!

Steven Trimboli

04/28/2023, 5:35 PM
I am also having the same issue. The last healthy run I had was at 11AM; since then, all of my flows have been stuck in Late status, so I'm assuming this is a more widespread issue.
We are running on K8s infrastructure on GCP, so it would not be a RHEL-related issue.

James Ashby

04/28/2023, 5:42 PM
When I disabled the concurrency limit for the work queue, it became healthy again!

Steven Trimboli

04/28/2023, 5:44 PM
Was this on the Cloud side? We're currently at unlimited flow run concurrency.
Wondering if adding a limit and then removing it will trigger the same thing. Which limit were you at?

James Ashby

04/28/2023, 5:55 PM
I set it to 5, and when I cleared it, it worked again. And yeah, this was on the Cloud side, right, like in your screenshot.
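For anyone who'd rather poke it from the CLI than the Cloud UI, recent Prefect 2.x releases have work-queue subcommands roughly along these lines (the queue name here is just our example):

prefect work-queue set-concurrency-limit main 5
prefect work-queue clear-concurrency-limit main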

Steven Trimboli

04/28/2023, 5:56 PM
Ok, yeah, I did the same and they're catching up now. I'm not savvy enough to root-cause this, but @Will Raphaelson, would you be able to set up an investigation?

James Ashby

04/28/2023, 5:56 PM
Okay glad that worked!

Will Raphaelson

04/28/2023, 5:58 PM
Yeah, I'll raise this with our on-call engineers, thanks.

Steven Trimboli

04/28/2023, 5:59 PM
Thank you James and Will!

Andrew Brookins

04/28/2023, 7:30 PM
Hey James and Steven! Would either of you be willing to DM me with your account ID or workspace ID?
👍 2

Steven Trimboli

04/29/2023, 3:20 PM
Flagging that this happened again this morning. The last successful run was at 8:12AM. Have you had any luck investigating? If you need more info, maybe we can schedule a call for Monday? Our fix has been to restart our Kubernetes agent cluster, not the work pool concurrency poke that James suggested above. @Andrew Brookins
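For reference, the restart on our side is just a standard rollout restart of the agent deployment; the namespace and deployment names below are placeholders for however your agent is deployed:

kubectl -n prefect rollout restart deployment/prefect-agent
kubectl -n prefect rollout status deployment/prefect-agent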
Also, I don't think this is a bug report per se, but we pushed a code change while runs were in the Late status, and our CI/CD updated our deployments, which caused all the Late runs to vanish; I would have expected the behavior to be similar to redeploying scheduled flows, where the schedule is maintained. Our setup can now handle late runs for backfilling, but our late runs were removed, so they never got the chance to properly catch up.
@Andrew Brookins has anything been found in relation to this? We had another outage for about an hour this morning, ~8-9AM.
@Andrew Brookins - we had another outage last night. It seems like this is related to the scheduler. The main problem is that this doesn't trigger a restart automatically; normally this should cause a failed healthcheck, but I don't see that happening.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/_internal/concurrency/services.py", line 120, in _run
    async with self._lifespan():
  File "/usr/local/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/site-packages/prefect/events/worker.py", line 30, in _lifespan
    async with self._client:
  File "/usr/local/lib/python3.10/site-packages/prefect/events/clients.py", line 118, in __aenter__
    await self._reconnect()
  File "/usr/local/lib/python3.10/site-packages/prefect/events/clients.py", line 136, in _reconnect
    self._websocket = await self._connect.__aenter__()
  File "/usr/local/lib/python3.10/site-packages/websockets/legacy/client.py", line 637, in __aenter__
    return await self
  File "/usr/local/lib/python3.10/site-packages/websockets/legacy/client.py", line 655, in __await_impl_timeout__
    return await self.__await_impl__()
  File "/usr/local/lib/python3.10/site-packages/websockets/legacy/client.py", line 659, in __await_impl__
    _transport, _protocol = await self._create_connection()
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1036, in create_connection
    infos = await self._ensure_resolved(
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1418, in _ensure_resolved
    return await loop.getaddrinfo(host, port, family=family, type=type,
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 863, in getaddrinfo
    return await self.run_in_executor(
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 821, in run_in_executor
    executor.submit(func, *args), loop=self)
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 167, in submit
    raise RuntimeError('cannot schedule new futures after shutdown')
RuntimeError: cannot schedule new futures after shutdown

Andrew Brookins

05/06/2023, 1:14 AM
Hey Steven! I looked over our logs during a couple of these time periods, but I couldn't learn enough to build a theory (that's sometimes hard without the payloads, etc.). However, based on the behavior, I believe it's a case of this issue: https://github.com/PrefectHQ/prefect/issues/9394. We're taking a close look at client-side resilience and will be working on a resolution ASAP.