Billy McMonagle

    Billy McMonagle

    4 months ago
    Hello! Do you have any tips for monitoring my Prefect flows that fail before they can even begin?
    Kevin Kho

    4 months ago
    What is your setup? k8s? ECS?
    Billy McMonagle

    4 months ago
    For example, the Docker image has an incorrect entrypoint. This results in an obvious failure that is visible in the web UI, but the run never gets far enough to trigger any state handlers (which would send a Slack notification).
    EKS.
    Kevin Kho

    4 months ago
    Oh, have you seen Automations? With those, Cloud itself identifies the state change and then sends a message.
    Anna Geller

    4 months ago
    maybe you just used it as an example, but ideally you shouldn't change the entrypoint
    Billy McMonagle

    4 months ago
    I know 🙂 🙂 🙂
    Anna Geller

    4 months ago
    and I didn't fully understand the problem - do you mean you want to get notified when the flow cannot move into a Running state due to issues in the execution layer? in that case, as Kevin mentioned, the SLA Automation makes sense
    Billy McMonagle

    4 months ago
    A very well meaning SRE overrode our entrypoint yesterday, and we had 100% flow failure for about 12 hours, until we noticed it in the UI. This is admittedly a weird scenario that shouldn't happen, but it did...
    Anna Geller

    4 months ago
    ouch! I understand...
    Billy McMonagle

    4 months ago
    automations sounds like it might do it! I will check this out. Thank you both!
    (FWIW, my idea was a k8s cronjob to query the graphql endpoint... but the less code I have to write, the better)
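The cronjob idea mentioned above could look roughly like the sketch below. It is only a sketch under assumptions: the endpoint URL, the `flow_run` field names (`state`, `state_timestamp`, `state_message`), and the `timestamptz` variable type are assumed from the Hasura-style Prefect 1 Cloud GraphQL API and should be checked against the schema in the Interactive API tab. The helper names (`build_payload`, `failed_runs`, `fetch`) are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Assumption: Prefect Cloud 1.x GraphQL endpoint; verify for your tenant.
API_URL = "https://api.prefect.io"

# Assumption: Hasura-style filter on flow_run; field names may differ in your schema.
QUERY = """
query FailedRuns($since: timestamptz!) {
  flow_run(where: {state: {_eq: "Failed"}, state_timestamp: {_gte: $since}}) {
    id
    name
    state_message
  }
}
"""


def build_payload(since: datetime) -> dict:
    """Build the GraphQL request body for runs that failed at or after `since`."""
    return {"query": QUERY, "variables": {"since": since.isoformat()}}


def failed_runs(response: dict) -> list:
    """Extract the list of failed flow runs from a decoded GraphQL response."""
    if response.get("errors"):
        raise RuntimeError(f"GraphQL errors: {response['errors']}")
    return (response.get("data") or {}).get("flow_run", [])


def fetch(payload: dict, api_key: str) -> dict:
    """POST the payload to the API and return the decoded JSON response."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)
```

A cronjob would call `failed_runs(fetch(build_payload(since), api_key))` with `since` set to the previous run time, then post to Slack if the list is non-empty. Automations avoid having to maintain any of this.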
    Anna Geller

    4 months ago
    I wonder whether you could catch things like this in your CI?
    Billy McMonagle

    4 months ago
    Well, CI does the flow registration, which passed. I suppose we could have a little hello world flow that is actually invoked from CI, although I worry it wouldn't match realistic conditions.
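A lighter alternative to invoking a hello world flow from CI might be a check that the built image still has the entrypoint you expect. This is a different technique from the one proposed above, and it only helps if the override lives in the image itself rather than in the Kubernetes job spec. The sketch below uses `docker inspect` with a Go template, which is a real Docker CLI feature; the function names and the expected entrypoint value are hypothetical and would need to match your own image.

```python
import json
import subprocess


def parse_entrypoint(raw: str) -> list:
    """Parse the JSON printed by `docker inspect`; `null` means no entrypoint."""
    value = json.loads(raw)
    return value or []


def image_entrypoint(image: str) -> list:
    """Return the ENTRYPOINT configured in a locally available image."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{json .Config.Entrypoint}}", image],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_entrypoint(result.stdout)


def entrypoint_matches(actual: list, expected: list) -> bool:
    """True when the image entrypoint is exactly the expected argument list."""
    return list(actual) == list(expected)
```

In a CI step, `entrypoint_matches(image_entrypoint(image), expected)` returning `False` would fail the build before the image is ever used by a flow run.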
    Anna Geller

    4 months ago
    how does your image get built? I could ask a colleague who is an expert in containerization whether it's possible to prevent overriding the entrypoint somehow
    cc @jawnsy
    Billy McMonagle

    4 months ago
    From AWS CodeBuild, we have a Makefile that runs the docker build commands.
    Anna Geller

    4 months ago
    I'd say you can try Automations for now. We can see whether Jonathan has a better idea 🙂
    Billy McMonagle

    4 months ago
    Thanks a bunch!
    jawnsy

    4 months ago
    Thanks for tagging me in, Anna 😄 @Billy McMonagle Just for my own edification, are you using a cluster dedicated to Prefect, or is it shared with other workloads? For Prefect 2, we’re looking to add support for detecting/surfacing problems like this (by watching events in the namespace and reporting those to Prefect Cloud), but we don’t have anything implemented yet. It also might not be complete monitoring of cluster health, since cluster-level monitoring requires elevated permissions. In the meantime, I’d suggest using CloudWatch to monitor for pod restart events in your cluster, as that would indicate that something is unhealthy. There are a number of other helpful metrics there as well, such as failed nodes and cluster resource utilization.
    Billy McMonagle

    4 months ago
    Hi @jawnsy ... I wouldn't say 100% of the work in this cluster is Prefect, but it's pretty close. I can ask SRE to take a look into metrics. They suggested that we consider using Datadog Synthetics, which I don't totally understand but which seems like it could work.