Good morning, I have a question about the Prefect ...
# ask-community
b
Good morning, I have a question about the Prefect Cloud incident earlier today...
(If there is a more appropriate place to make this as a support request, I'm happy to do that instead)
Today at ~8:30am EST, we experienced some failures. Specifically, the Prefect API was unavailable:
HTTPSConnectionPool(host='api.prefect.io', port=443): Max retries exceeded with url: / (Caused by ReadTimeoutError("HTTPSConnectionPool(host='api.prefect.io', port=443): Read timed out. (read timeout=15)"))
In addition to the failures, several flows hung around for about 2 hours (you can see the tall bars). Some completed successfully after that time, and one failed.
I understand that outages happen and this one wasn't particularly painful. However, my flow's state handlers (for failures) did not fire. I'd like to know if there's anything I can do to be more resilient in this situation.
FWIW, here is a link to the incident: https://prefect.status.io/pages/incident/5f33ff702715c204c20d6da1/61c08c578e1cce053fe15882 For next time, I have subscribed to the webhook + RSS notifications, which is awesome.
z
Hey Billy, it’s possible that there is a bug in handling failures caused by API issues that results in the state handlers not being called.
I’d be interested in investigating a fix that ensures your state handler is called in this case, so you can add handling for service disruptions.
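As a rough illustration of what "handling for service disruptions" could look like on the user side, here is a minimal sketch of a retry-with-backoff wrapper around a notification call. It does not address a handler that is never called in the first place; it only makes the call more tolerant of transient errors once it does run. The names `call_with_retries` and `flaky_notify` are hypothetical, and the default `retry_on` tuple is an assumption: it would need to be set to whatever exception types the real webhook client raises.

```python
import time
from typing import Callable, List, Tuple, Type


def call_with_retries(
    fn: Callable[[], None],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call fn up to `attempts` times; return True on success, False if all fail."""
    for attempt in range(attempts):
        try:
            fn()
            return True
        except retry_on:
            if attempt < attempts - 1:
                # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
                sleep(base_delay * (2 ** attempt))
    return False


# Demo with a hypothetical notifier that fails twice before succeeding.
calls = {"count": 0}


def flaky_notify() -> None:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated outage")


delays: List[float] = []  # record the requested delays instead of really sleeping
ok = call_with_retries(flaky_notify, sleep=delays.append)
print(ok, calls["count"], delays)
```

Passing `sleep` as a parameter keeps the demo instant; in real use the default `time.sleep` would apply.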
b
Hey Michael, yes, I agree that's likely what is happening. If the webhook can't fire because the service is down, that's one thing, but if I can fix something on my end I will. I also don't 100% understand what happens when such a failure is encountered mid-flow-run as opposed to "the flow run can't start".
My state handler code is:
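from typing import Callable, Union

import prefect
from prefect import Flow, Task
from prefect.tasks.notifications import SlackTask
from prefect.utilities.notifications import callback_factory


def make_slack_failure_notification(slack_webhook_function: Callable):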
    """
    Returns a callable which can be supplied to the callback_factory.
    """

    def _slack_failure_notification(obj, new_state):
        """
        State handler that posts to Slack on flow or task failure.
        """
        msg = _construct_failure_message(obj)
        SlackTask().run(message=msg, webhook_url=slack_webhook_function())

    return _slack_failure_notification


def _construct_failure_message(obj: Union[Flow, Task]):
    """
    Given Flow/Task object, get contextual info and return failure message
    """
    project_name = prefect.context.get("project_name")
    flow_name = prefect.context.get("flow_name")
    if isinstance(obj, Flow):
        run_type = "flow"
        run_id = prefect.context.get("flow_run_id")
        run_url = f"<https://cloud.prefect.io/flow-run/{run_id}>"
        task_name = ""
    elif isinstance(obj, Task):
        run_type = "task"
        run_id = prefect.context.get("task_run_id")
        run_url = f"<https://cloud.prefect.io/task-run/{run_id}>"
        task_name = f".{obj.name}"

    msg = (
        f"Error in {run_type} {project_name}.{flow_name}{task_name}\n"
        f"For more, see: {run_url}"
    )
    return msg


def make_failure_callback(slack_webhook_function: Callable):
    """
    Return a callback function which can be passed as a state handler
    to a Prefect Flow or Task.
    Failure messages will be posted to the provided webhook url.
    """
    return callback_factory(
        fn=make_slack_failure_notification(slack_webhook_function),
        check=lambda state: state.is_failed(),
    )
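
For readers unfamiliar with the pattern, the sketch below mimics in plain Python how a `callback_factory`-style wrapper gates a notification function behind a `check` predicate. `FakeState`, `fake_callback_factory`, `notify` and the `sent` list are hypothetical stand-ins, not Prefect's own classes, and the three-argument handler signature is an assumption based on Prefect 1.x state handlers.

```python
from typing import Any, Callable, List


class FakeState:
    """Hypothetical stand-in for a Prefect state; only models is_failed()."""

    def __init__(self, failed: bool) -> None:
        self._failed = failed

    def is_failed(self) -> bool:
        return self._failed


def fake_callback_factory(
    fn: Callable[[Any, FakeState], None],
    check: Callable[[FakeState], bool],
) -> Callable[[Any, FakeState, FakeState], FakeState]:
    """Build a state handler that calls fn only when check(new_state) is true."""

    def state_handler(obj: Any, old_state: FakeState, new_state: FakeState) -> FakeState:
        if check(new_state):
            fn(obj, new_state)
        # The handler hands the (unchanged) new state back to its caller.
        return new_state

    return state_handler


sent: List[str] = []  # records notifications instead of posting to Slack


def notify(obj: Any, new_state: FakeState) -> None:
    sent.append(f"Error in {obj}")


handler = fake_callback_factory(fn=notify, check=lambda state: state.is_failed())

handler("my-flow", FakeState(failed=False), FakeState(failed=True))   # fires
handler("my-flow", FakeState(failed=False), FakeState(failed=False))  # skipped
print(sent)
```

In this pattern the notification runs only when something actually invokes the handler with a failed state. If the process never reaches that transition, for example while it is stuck waiting on an API call, nothing here can fire, which may be relevant to the behaviour described in this thread.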
z
Would you mind opening an issue on GitHub?
y
Oh I actually had this issue too at the same time. I'm just running the documentation tutorial code.
b
Hey @Zanie sorry for the delay, I've just opened the issue here: https://github.com/PrefectHQ/prefect/issues/5261
z
Thank you! We’ll dig into it when we can.