# prefect-community
b
Hello! Do you have any tips for monitoring my Prefect flows that fail before they can even begin?
k
What is your setup? k8s? ECS?
b
For example, the docker image has an incorrect entrypoint. This results in an obvious failure which is visible from the web UI, but is unable to trigger any state handlers (which would send a slack notification)
EKS.
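For context, a minimal sketch of the kind of Slack-notifying state handler that never gets a chance to fire in this scenario, assuming the Prefect 1.x API and a placeholder webhook URL:

```python
import requests
from prefect import task, Flow

# Placeholder: substitute your own Slack incoming webhook URL.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"


def notify_on_failure(obj, old_state, new_state):
    """State handler: post to Slack when the flow enters a Failed state.

    Handlers only run once the flow-runner process has started, so a broken
    image entrypoint fails before any handler can fire.
    """
    if new_state.is_failed():
        requests.post(
            SLACK_WEBHOOK_URL,
            json={"text": f"Flow '{obj.name}' failed: {new_state.message}"},
        )
    return new_state


@task
def say_hello():
    print("hello")


with Flow("example-flow", state_handlers=[notify_on_failure]) as flow:
    say_hello()
```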
k
Oh, have you seen Automations? That's Cloud identifying the state change and then sending a message
a
maybe you just used it as an example, but ideally you shouldn't change the entrypoint
b
I know πŸ™‚ πŸ™‚ πŸ™‚
a
and I didn't fully understand the problem - do you mean you want to get notified when the flow cannot move into a Running state due to issues in the execution layer? in that case, as Kevin mentioned, the SLA Automation makes sense
b
A very well meaning SRE overrode our entrypoint yesterday, and we had 100% flow failure for about 12 hours, until we noticed it in the UI. This is admittedly a weird scenario that shouldn't happen, but it did...
a
ouch! I understand...
b
Automations sounds like it might do it! I will check this out. Thank you both!
(FWIW, my idea was a k8s cronjob to query the graphql endpoint... but the less code I have to write, the better)
πŸ’― 1
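For reference, that cronjob idea might look roughly like the sketch below, assuming Prefect Cloud's GraphQL API with a Hasura-style `flow_run` query (field names and operators may differ by API version; the endpoint and token handling are placeholders):

```python
import datetime
import os

import requests

# Placeholders: adjust for Prefect Cloud vs. a self-hosted Prefect Server.
API_URL = "https://api.prefect.io/graphql"
API_TOKEN = os.environ["PREFECT_API_TOKEN"]

# Look for runs that were scheduled a while ago but never reached Running.
QUERY = """
query ($cutoff: timestamptz) {
  flow_run(
    where: {
      state: {_in: ["Scheduled", "Submitted"]},
      scheduled_start_time: {_lt: $cutoff}
    }
  ) {
    id
    name
    state
    scheduled_start_time
  }
}
"""


def find_stuck_runs(max_age_minutes: int = 30):
    cutoff = (
        datetime.datetime.utcnow() - datetime.timedelta(minutes=max_age_minutes)
    ).isoformat()
    resp = requests.post(
        API_URL,
        json={"query": QUERY, "variables": {"cutoff": cutoff}},
        headers={"Authorization": f"Bearer {API_TOKEN}"},
    )
    resp.raise_for_status()
    return resp.json()["data"]["flow_run"]


if __name__ == "__main__":
    stuck = find_stuck_runs()
    if stuck:
        # Alerting (Slack webhook, PagerDuty, etc.) would go here.
        print(f"{len(stuck)} flow runs appear stuck: {stuck}")
```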
a
I wonder whether you could catch things like this in your CI?
πŸ€” 1
b
Well, CI does the flow registration, which passed. I suppose we could have a little hello world flow that is actually invoked from CI, although I worry it wouldn't match realistic conditions.
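A hello-world smoke test of that sort could look something like this sketch, assuming the Prefect 1.x API and a placeholder image tag for whatever CI just built:

```python
from prefect import task, Flow
from prefect.run_configs import KubernetesRun


@task
def say_hello():
    # Trivial task: if this runs at all, the image entrypoint and the
    # flow-runner plumbing are working end to end.
    print("hello from the smoke-test flow")


with Flow("ci-smoke-test") as flow:
    say_hello()

# Placeholder: point the run at the image the CI pipeline just built.
flow.run_config = KubernetesRun(image="my-registry/my-image:ci-candidate")

if __name__ == "__main__":
    # Local execution for a quick sanity check; CI could instead register
    # this flow and trigger a run on the cluster via the API.
    flow.run()
```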
a
how does your image get built? I could ask a colleague who is an expert in containerization whether it's possible to prevent overriding the entrypoint somehow
cc @jawnsy
b
From AWS CodeBuild, we have a Makefile that runs the docker build commands.
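One lightweight guard that could slot into a build step like that: a sketch that inspects the freshly built image and fails the build if the entrypoint has been changed (the image tag and expected entrypoint below are placeholders):

```python
import json
import subprocess
import sys

# Placeholders: the image tag CI just built and the entrypoint you expect.
IMAGE = "my-registry/my-image:ci-candidate"
EXPECTED_ENTRYPOINT = ["tini", "-g", "--", "entrypoint.sh"]


def image_entrypoint(image: str):
    """Return the image's configured entrypoint via `docker inspect`."""
    out = subprocess.check_output(
        ["docker", "inspect", "--format", "{{json .Config.Entrypoint}}", image]
    )
    return json.loads(out)


if __name__ == "__main__":
    actual = image_entrypoint(IMAGE)
    if actual != EXPECTED_ENTRYPOINT:
        print(f"Unexpected entrypoint: {actual!r} (expected {EXPECTED_ENTRYPOINT!r})")
        sys.exit(1)
    print("Entrypoint looks good.")
```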
a
I'd say you can try Automations for now. We can see whether Jonathan has a better idea πŸ™‚
b
Thanks a bunch!
πŸ‘ 1
j
Thanks for tagging me in, Anna πŸ˜„ @Billy McMonagle Just for my own edification, are you using a cluster dedicated to Prefect, or is it shared with other workloads? For Prefect 2, we’re looking to add support for detecting/surfacing problems like this (by watching events in the namespace and reporting those to Prefect Cloud), but we don’t have anything implemented yet. It also might not amount to complete monitoring of cluster health, since cluster-level monitoring requires elevated permissions. In the meantime, I’d suggest using CloudWatch to monitor for pod restart events in your cluster, as those would indicate that something is unhealthy. There are a number of other helpful metrics there as well, such as failed nodes and cluster resource utilization
❀️ 1
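A rough sketch of that CloudWatch idea, assuming Container Insights is enabled on the cluster and using boto3 (the cluster and namespace names are placeholders):

```python
import datetime

import boto3

# Placeholders: your EKS cluster name and the namespace Prefect runs in.
CLUSTER_NAME = "my-eks-cluster"
NAMESPACE = "prefect"

cloudwatch = boto3.client("cloudwatch")


def recent_pod_restarts(minutes: int = 15) -> float:
    """Sum of container restarts in the namespace over the last `minutes`.

    Uses the Container Insights metric `pod_number_of_container_restarts`;
    a non-zero value is a hint that something in the cluster is unhealthy.
    """
    now = datetime.datetime.utcnow()
    stats = cloudwatch.get_metric_statistics(
        Namespace="ContainerInsights",
        MetricName="pod_number_of_container_restarts",
        Dimensions=[
            {"Name": "ClusterName", "Value": CLUSTER_NAME},
            {"Name": "Namespace", "Value": NAMESPACE},
        ],
        StartTime=now - datetime.timedelta(minutes=minutes),
        EndTime=now,
        Period=300,
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in stats["Datapoints"])


if __name__ == "__main__":
    print(f"Container restarts in the last 15 minutes: {recent_pod_restarts()}")
```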
b
Hi @jawnsy ... I wouldn't say 100% of the work in this cluster is Prefect, but it's pretty close. I can ask SRE to take a look at the metrics. They suggested that we consider using Datadog synthetics, which I don't totally understand, but it seems like it could work.
πŸ‘ 1