
Billy McMonagle

04/27/2022, 3:27 PM
Hello! Do you have any tips for monitoring my Prefect flows that fail before they can even begin?

Kevin Kho

04/27/2022, 3:27 PM
What is your setup? k8s? ECS?

Billy McMonagle

04/27/2022, 3:27 PM
For example, the Docker image has an incorrect entrypoint. This results in an obvious failure that is visible in the web UI, but it can't trigger any state handlers (which would send a Slack notification).
EKS.

Kevin Kho

04/27/2022, 3:28 PM
Oh, have you seen Automations? With those, Cloud identifies the state change and then sends a message.

Anna Geller

04/27/2022, 3:28 PM
maybe you just used it as an example, but ideally you shouldn't change the entrypoint

Billy McMonagle

04/27/2022, 3:29 PM
I know 🙂 🙂 🙂

Anna Geller

04/27/2022, 3:29 PM
and I didn't fully understand the problem - do you mean you want to get notified when the flow cannot move into a Running state due to issues in the execution layer? in that case, as Kevin mentioned, the SLA Automation makes sense

Billy McMonagle

04/27/2022, 3:30 PM
A very well-meaning SRE overrode our entrypoint yesterday, and we had 100% flow failure for about 12 hours, until we noticed it in the UI. This is admittedly a weird scenario that shouldn't happen, but it did...

Anna Geller

04/27/2022, 3:30 PM
ouch! I understand...

Billy McMonagle

04/27/2022, 3:31 PM
automations sounds like it might do it! I will check this out. Thank you both!
(FWIW, my idea was a k8s cronjob to query the graphql endpoint... but the less code I have to write, the better)
💯 1
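
The CronJob idea above could look roughly like the sketch below: poll the GraphQL API for recently failed flow runs and forward a summary to Slack. This is a minimal sketch, not a Prefect-provided tool. The query shape and field names (`flow_run`, `state`, `state_timestamp`, `state_message`) and the Bearer-token header are assumptions about the Prefect 1 Cloud API and should be checked in the interactive API tab of the UI. The environment variable names `PREFECT_GRAPHQL_URL`, `PREFECT_API_KEY`, and `SLACK_WEBHOOK_URL` are hypothetical.

```python
import json
import os
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumed query shape (Hasura-style filters); verify the field names
# against your own Prefect GraphQL schema before relying on this.
FAILED_RUNS_QUERY = """
query ($since: timestamptz!) {
  flow_run(where: {state: {_eq: "Failed"}, state_timestamp: {_gte: $since}}) {
    id
    name
    state_message
  }
}
"""


def build_payload(now, lookback_minutes):
    """Build the GraphQL request body for runs that failed in the lookback window."""
    since = now - timedelta(minutes=lookback_minutes)
    return {"query": FAILED_RUNS_QUERY, "variables": {"since": since.isoformat()}}


def format_alert(response):
    """Return a Slack message for the failed runs, or None if there are none."""
    runs = (response.get("data") or {}).get("flow_run") or []
    if not runs:
        return None
    lines = [f"{len(runs)} failed flow run(s):"]
    for run in runs:
        message = run.get("state_message") or "no message"
        lines.append(f"- {run['name']} ({run['id']}): {message}")
    return "\n".join(lines)


def post_json(url, body, headers=None):
    """POST a JSON body and return the raw response bytes."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", **(headers or {})},
    )
    with urllib.request.urlopen(request, timeout=30) as reply:
        return reply.read()


def main():
    # Hypothetical environment variable names; set them in the CronJob spec.
    api_url = os.environ["PREFECT_GRAPHQL_URL"]
    api_key = os.environ["PREFECT_API_KEY"]
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]

    payload = build_payload(datetime.now(timezone.utc), lookback_minutes=15)
    raw = post_json(api_url, payload, {"Authorization": f"Bearer {api_key}"})
    alert = format_alert(json.loads(raw))
    if alert:
        # Slack incoming webhooks accept a JSON body with a "text" field.
        post_json(webhook_url, {"text": alert})
```

A CronJob would call `main()` on a schedule that matches the lookback window (every 15 minutes here), so each failed run is reported about once. Automations cover the same ground without any code to maintain, which is why they were suggested first.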

Anna Geller

04/27/2022, 3:31 PM
I wonder whether you could catch things like this in your CI?
🤔 1

Billy McMonagle

04/27/2022, 3:32 PM
Well, CI does the flow registration, which passed. I suppose we could have a little hello world flow that is actually invoked from CI, although I worry it wouldn't match realistic conditions.

Anna Geller

04/27/2022, 3:34 PM
how does your image get built? I could ask a colleague who is an expert in containerization whether it's possible to prevent overriding the entrypoint somehow
cc @jawnsy

Billy McMonagle

04/27/2022, 3:35 PM
From AWS Codebuild, we have a makefile that runs the docker build commands.
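
Since the image is built by a makefile in CodeBuild, one way to catch an overridden entrypoint in CI is to inspect the freshly built image and fail the build if its entrypoint differs from the expected one. The sketch below assumes the override happened in the image itself (not in the Kubernetes job spec, which this check would not see) and that the `docker` CLI is available in the build environment. `EXPECTED_ENTRYPOINT` is a hypothetical placeholder, not the real entrypoint of any Prefect image.

```python
import json
import subprocess

# Hypothetical placeholder: replace with the entrypoint your base image declares.
EXPECTED_ENTRYPOINT = ["/path/to/expected-entrypoint.sh"]


def read_entrypoint(image):
    """Return the image's configured entrypoint as a list (or None if unset)."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{json .Config.Entrypoint}}", image],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def check_entrypoint(actual, expected):
    """Return an error message if the entrypoints differ, otherwise None."""
    if actual != expected:
        return f"entrypoint mismatch: expected {expected!r}, got {actual!r}"
    return None


def run_check(image):
    """Return a process exit code: 0 if the entrypoint matches, 1 otherwise."""
    error = check_entrypoint(read_entrypoint(image), EXPECTED_ENTRYPOINT)
    if error:
        print(error)
        return 1
    return 0
```

A makefile target could call `run_check` with the image tag right after `docker build` and exit with its return value, so a changed entrypoint fails the build before the flows are registered.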

Anna Geller

04/27/2022, 3:37 PM
I'd say you can try Automations for now. We can see whether Jonathan has a better idea 🙂

Billy McMonagle

04/27/2022, 3:38 PM
Thanks a bunch!
👍 1

jawnsy

04/27/2022, 3:44 PM
Thanks for tagging me in, Anna 😄 @Billy McMonagle Just for my own edification, are you using a cluster dedicated for Prefect, or is it shared with other workloads? For Prefect 2, we're looking to add support for detecting/surfacing problems like this (by watching events in the namespace and reporting those to Prefect Cloud), but we don't have anything implemented yet. It also might not be complete monitoring of cluster health, since cluster-level monitoring requires elevated permissions. In the meantime, I'd suggest using CloudWatch to monitor for pod restart events in your cluster, as that would indicate that something is unhealthy. There are a number of other helpful metrics there as well, such as failed nodes and cluster resource utilization.
❤️ 1

Billy McMonagle

04/27/2022, 3:49 PM
Hi @jawnsy ... I wouldn't say 100% of the work in this cluster is Prefect, but it's pretty close. I can ask SRE to take a look at metrics. They suggested that we consider using Datadog Synthetics, which I don't totally understand, but it seems like it could work.
👍 1