# prefect-cloud
s
Hello, flagging an issue regarding gaps in schedules, and wondering if anyone else has run into it. We have a critical flow that is scheduled to run every 10 minutes. Every few weeks, we discover a gap in scheduled runs. In the screenshot of all flows attached here, you can see the typical 10 minute runs and the gap between 12:30 and 1:40. There is no failure and there is no run to reference to alert on. I am wondering a) if anyone else has run into this, b) how users could alert on this, and c) how we are supposed to troubleshoot something like this. We are running on GCP Kubernetes infrastructure.
c
Hi Steven, can you confirm that the flows didn't run during those times? Just want to rule out that the flows were deleted (possibly indirectly by deleting something else). As for alerting, what I'd recommend is an Automation with a `posture: Proactive` trigger that sends a notification if it hasn't observed a `prefect.flow-run.Completed` event from the `load-shift-assignments` flow. How you'd match that flow depends on your setup and how you've got things organized, but the fundamental trigger is:
{
  "posture": "Proactive",
  "match_related": {
    "prefect.resource.name": "load-shift-assignments",
    "prefect.resource.role": "flow"
  },
  "expect": ["prefect.flow-run.Completed"],
  "threshold": 1,
  "within": 600
}
Breaking that down:
`"posture": "Proactive"` says "trigger when an event doesn't happen".
`"match_related": {...}` says "match any events that have a related resource named 'load-shift-assignments' in the role 'flow'" (you can take a look at your event feed to see the events for these flow runs; you might want to pick a deployment or work pool instead).
`"expect": ["prefect.flow-run.Completed"]` says "I'm expecting a flow run Completed event".
`"threshold": 1` says "I'm expecting one event".
`"within": 600` says "I'm expecting an event every 600 seconds" (you may want to pad this a bit to account for late start times, etc.).
Taking that all together, the trigger fires when it hasn't seen 1 `prefect.flow-run.Completed` event for the `load-shift-assignments` flow within the last 10 minutes. The trigger can fire at most once every 10 minutes, then resets until the next window.
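To get a feel for what that window means, here is a rough offline sketch of the same idea in plain Python. It is not how Prefect evaluates triggers internally; it just scans a list of Completed timestamps and reports any stretch longer than the window. The function name `find_gaps`, the date, and the 60 seconds of padding are all made up for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(completed_at, within_seconds):
    """Return (previous, next) pairs of Completed timestamps that are
    further apart than the window. Illustrative helper, not a Prefect API."""
    window = timedelta(seconds=within_seconds)
    ordered = sorted(completed_at)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > window]


# Hypothetical data shaped like the report above: runs every 10 minutes
# up to 12:30, then nothing until 13:40.
day = datetime(2024, 1, 1)
runs = [day.replace(hour=12, minute=m) for m in (0, 10, 20, 30)]
runs += [day.replace(hour=13, minute=40), day.replace(hour=13, minute=50)]

# 600 seconds for the schedule plus an assumed 60 seconds of padding.
gaps = find_gaps(runs, within_seconds=660)
for previous, following in gaps:
    print(f"no Completed event between {previous:%H:%M} and {following:%H:%M}")
```

With the padded window, the regular 10 minute spacing is not flagged and only the long stretch is reported. The real trigger would notify you during the gap rather than after the fact.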
s
Thanks for the response Chris, I can confirm that they did not run at all at that time.
I will implement that check. Is it possible to do this with a tag rather than a specific flow? I believe I have seen `match_related` take a `prefect.resource.role` of `tag` before.
c
Yep totally, that would look like:
"match_related": {
    "prefect.resource.id": "prefect.tag.my-tag-name",
    "prefect.resource.role": "tag"
}
The trigger above would be saying "I expect 1 Completed event for any flow with the tag `my-tag-name` every 10 minutes". If you want to break that out so you can get separate alerts for each flow that misses its window, it would look something like this:
"match_related": {
    "prefect.resource.id": "prefect.tag.my-tag-name",
    "prefect.resource.role": "tag"
},
"for_each": ["related:flow:prefect.resource.id"]
That `for_each` syntax is a little obscure, but the idea is that `for_each` will establish separate event counters for each combination of event labels you give it. The `related:flow:` prefix is saying "keep track of a different count for a label in a related flow", so this will keep track of a separate count for each flow (based on its ID). You could use `"related:deployment:prefect.resource.id"` if you wanted to keep track of separate counters by deployment.
The English version of that last trigger is "For each of my flows tagged 'my-tag-name', make sure we get a flow run Completed event every 10 minutes"
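Since the pieces of that trigger are spread across a few messages, here is a small Python sketch that assembles them into one body. The helper name `tagged_completion_trigger` and the extra 60 seconds of padding are assumptions for illustration, and this only builds the trigger portion; a full automation also needs actions and so on, which are not covered here.

```python
import json


def tagged_completion_trigger(tag, within_seconds, per="flow"):
    """Assemble the proactive trigger body from this thread.
    Illustrative helper, not part of Prefect."""
    return {
        "posture": "Proactive",
        "match_related": {
            "prefect.resource.id": f"prefect.tag.{tag}",
            "prefect.resource.role": "tag",
        },
        # One counter per related flow (or deployment), keyed by resource ID.
        "for_each": [f"related:{per}:prefect.resource.id"],
        "expect": ["prefect.flow-run.Completed"],
        "threshold": 1,
        "within": within_seconds,
    }


# 10 minute schedule plus an assumed 60 seconds of padding for late starts.
trigger = tagged_completion_trigger("my-tag-name", within_seconds=600 + 60)
print(json.dumps(trigger, indent=2))
```

Passing `per="deployment"` switches the counters to be per deployment, matching the `related:deployment:prefect.resource.id` variant mentioned above.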
s
amazing - thank you!
k
Thanks @Chris Guidry. Do you have any idea why this is occurring? Anything we should check on our end?
z
Hey @KG! I can look into this. I wouldn’t expect any gaps in schedules. Would you be able to share your deployment and workspace id? Feel free to DM if you’re not comfortable sharing publicly