Lee Mendelowitz

01/29/2023, 4:18 PM
I’ve set up Automations in my workspace for when a work queue becomes unhealthy or healthy, and I’ve noticed a lot of false alarms. We have 1 agent running Prefect 2.7.9 servicing 2 work queues. We get about 5-6 unhealthy work queue notifications a day, but every time I check, the queues are healthy. Some behavior I’ve observed:
• A work queue unhealthy notification comes in, followed by a healthy work queue notification around a minute later
• A work queue healthy notification comes in with no corresponding unhealthy work queue notification preceding it
• Only one of the two work queues gets an unhealthy notification, even though they are serviced by the same agent.
I don’t see any error messages in the logs for the Prefect 2 agents. Any ideas how to diagnose or reduce the false alarms?

Will Raphaelson

01/30/2023, 3:05 AM
Thanks Lee - a work queue can be marked unhealthy under two different conditions: 1) a work queue that was being polled hasn’t been polled in 60 seconds or more, or 2) the work queue has at least one flow run in it marked late. Is either of these conditions satisfied when you’re getting the notifications? Whether or not this is a bug, we’ll be enabling users to set their own custom work queue health definitions in the near future.
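For anyone checking a queue by hand, the two conditions above can be sketched as a small predicate. This only restates the rule as described in this thread; it is not Prefect's actual implementation, and the function name, parameters, and the POLL_TIMEOUT constant are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: the 60-second figure comes from the description in this thread.
POLL_TIMEOUT = timedelta(seconds=60)


def is_unhealthy(
    last_polled: Optional[datetime],
    late_run_count: int,
    now: datetime,
) -> bool:
    """Return True if either condition described in the thread holds.

    Hypothetical helper: `last_polled` is None for a queue that was never
    polled, in which case the stale-poll condition does not apply.
    """
    # Condition 1: a previously polled queue has not been polled for 60s or more.
    stale_poll = last_polled is not None and (now - last_polled) >= POLL_TIMEOUT
    # Condition 2: at least one flow run in the queue is marked late.
    has_late_runs = late_run_count > 0
    return stale_poll or has_late_runs


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # A queue polled 10 seconds ago that holds one late run counts as unhealthy.
    print(is_unhealthy(now - timedelta(seconds=10), late_run_count=1, now=now))
```

Under this reading, a single run that is late while waiting on a slow start is enough to flip the queue to unhealthy, even when the agent is polling normally.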

Lee Mendelowitz

01/30/2023, 3:20 AM
Thanks for the details. We do have some late flow runs for deployments that run on AWS ECS because it can take ~90 seconds for those flows to start. However, the timestamps on those flow runs don’t line up with when we’ve been receiving the unhealthy work queue notifications, so I’m not sure that this is what’s triggering them.

Will Raphaelson

01/30/2023, 3:21 AM
Well, the work queue health evaluation isn’t quite real time, which could explain the delta there. Let me circle up with the team on this tomorrow and noodle on ways forward. Thanks for raising it.
👍 1
Hey @Lee Mendelowitz, thought about this a bit. I do think we’ll be able to address this in a cleaner way in the coming month or two with more highly customizable work queue health policies. That said, I think for now posting a custom trigger via the API might help. The docs are here, you could raise the threshold and within values to only fire the trigger if it saw two work queue unhealthy events in 10 minutes, for example. So the first work queue unhealthy event (fired when the run was deemed late for taking 90 seconds to spin up) wouldn’t trigger the alarm, but if two happened, it would.
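To make the threshold and within idea concrete, here is a minimal sketch of the counting logic: fire only when at least N events land inside a sliding time window. It illustrates the semantics described above and is not Prefect's trigger implementation or its API payload; the class and method names are hypothetical, and it assumes events arrive in time order.

```python
from collections import deque
from datetime import datetime, timedelta


class ThresholdWithinTrigger:
    """Hypothetical helper: fires when `threshold` events occur within `within`."""

    def __init__(self, threshold: int, within: timedelta) -> None:
        self.threshold = threshold
        self.within = within
        self._seen: deque = deque()  # timestamps still inside the window

    def observe(self, event_time: datetime) -> bool:
        """Record one unhealthy event and report whether the alarm should fire."""
        self._seen.append(event_time)
        # Drop events that are older than the window relative to the newest one.
        while self._seen and event_time - self._seen[0] > self.within:
            self._seen.popleft()
        return len(self._seen) >= self.threshold


if __name__ == "__main__":
    trigger = ThresholdWithinTrigger(threshold=2, within=timedelta(minutes=10))
    start = datetime(2023, 1, 30, 9, 0)
    print(trigger.observe(start))                         # first event alone
    print(trigger.observe(start + timedelta(minutes=5)))  # second event in window
```

With a threshold of 2 and a 10-minute window, an isolated unhealthy event caused by one slow ECS start stays quiet, while two events close together still raise the alarm. This sketch does not reset after firing, so a real setup may behave differently on repeated events.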

Jean-Michel Provencher

03/15/2023, 11:41 AM
@Ton Steijvers I'll disable the Work Queue notifications in the meantime, we've been getting the same false alarms.
👍 1