# ask-community
s
Looking for some help debugging a failed Flow in Prefect Cloud. I have a Flow that maps over ~1500 items, then combines the results to dump into a DB. Each of those items has a retry set and a result handler pushing the result object to S3. The location where the Flow was executed had a network issue, where it couldn't write to S3 for a brief period that appears to coincide with when the mapped Task was in the middle of executing. The main landing page for the Flow highlights the most recent error messages, which we can click through to see more details. In this case I was investigating the `inquisitive-wildcat` Flow Run, and the downstream Tasks all indicate they had a failure. However, when I navigate to that Flow Run and look at the mapped Task, I cannot locate the exact mapped instance where these failures happened. It seems the only way to jump to that specific error message is to navigate from the main Flow page?

1. Shouldn't the failed mapped Task show up as "FAILED" in the UI?
2. Is the Task considered "FAILED" if there was a problem in the Result Handler? On the immediate next Task that consumes the output of the mapped Task, it seems Prefect sent a `None` object, which then caused an exception and finally failed the Flow Run.
3. Why is Prefect sending a `None` to a downstream Task when a mapped Task's output had a failure?
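For reference, a minimal sketch of the flow shape being described, assuming the Prefect 0.x-era API (result handlers, `task.map`). The task names, the bucket, and the per-item work are placeholders, not the original code:

```python
from datetime import timedelta

from prefect import task, Flow
from prefect.engine.result_handlers import S3ResultHandler

# Hypothetical bucket; the result handler writes each task's result
# object to S3 as the task finishes.
s3_handler = S3ResultHandler(bucket="my-results-bucket")

@task(max_retries=3, retry_delay=timedelta(seconds=30),
      result_handler=s3_handler)
def process_item(item):
    # Stand-in for the real per-item work on each of the ~1500 items.
    return item * 2

@task
def combine_and_load(results):
    # Consumes the full list of mapped results; per the report above,
    # a child whose result could not be handled may arrive here as None.
    print(f"loading {len(results)} rows into the DB")

with Flow("mapped-pipeline") as flow:
    items = list(range(1500))
    results = process_item.map(items)
    combine_and_load(results)
```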
j
Hey Scott, I had a similar issue with a high-volume mapping task. Which execution approach are you using?
s
This particular instance was using `KubernetesJobEnvironment` with a `LocalDaskExecutor`.
j
Are you running Kubernetes in AWS, Azure, or GCP?
s
The K8s cluster this particular job was running on was an on-prem instance in a laboratory (where the network is notorious for being unstable), so the fact that it had a network problem is not surprising. I'm more interested in how we can easily identify when that happens in Prefect Cloud, and how to catch failures in the Result Handler.
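One way to surface these failures more loudly might be a task state handler, sketched here against the Prefect 0.x API. Whether a Result Handler write failure actually transitions the task to `Failed` depends on the version, so treat that as an assumption to verify:

```python
from prefect import task
from prefect.engine.state import Failed

def alert_on_failure(task_obj, old_state, new_state):
    # Fires on every state transition; here we only react to failures.
    if isinstance(new_state, Failed):
        # Replace the print with a real notification (Slack, PagerDuty, ...);
        # new_state.message usually carries the underlying exception text.
        print(f"{task_obj.name} failed: {new_state.message}")
    return new_state

@task(state_handlers=[alert_on_failure])
def process_item(item):
    return item * 2
```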
j
What is most likely happening, based on my experience, is that the network is preventing metadata from being sent to Prefect Cloud, which is why you have no data. The way I fixed the problem was to refactor my code base to reduce the number of items to map over, which reduces the amount of network traffic. Also, I have not used the `KubernetesJobEnvironment`, but I have used the `DaskKubernetesEnvironment`, found it unstable for intense data processing, and switched to a `RemoteEnvironment` within Kubernetes.
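A sketch of the suggested switch, assuming the Prefect 0.x environments API, where `RemoteEnvironment` runs the flow in-process on the agent and takes the executor as an import-path string (the `flow` object is the one from the sketch above; the executor kwargs are an assumption):

```python
from prefect.environments import RemoteEnvironment

# Run the flow wherever the agent lives, keeping the LocalDaskExecutor.
flow.environment = RemoteEnvironment(
    executor="prefect.engine.executors.LocalDaskExecutor",
    executor_kwargs={"scheduler": "threads"},  # assumed scheduler choice
)
```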
s
I'll give that a shot, thanks!
Seems I need to rethink this... `RemoteEnvironment` doesn't let you specify a K8s `job_spec.yaml` file. It'd be wonderful if `KubernetesJobEnvironment` inherited from `RemoteEnvironment` rather than `Environment`. Guess I'll hack around with that later today.
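For contrast, a sketch of the 0.x-era `KubernetesJobEnvironment`, which does accept a custom Job manifest via `job_spec_file` (the file path here is hypothetical), which is the capability `RemoteEnvironment` lacks:

```python
from prefect.environments import KubernetesJobEnvironment

# Launches each flow run as a K8s Job built from the given manifest.
flow.environment = KubernetesJobEnvironment(
    job_spec_file="job_spec.yaml",  # hypothetical custom Job spec
)
```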
l
@Marvin archive “Observations regarding high volume mapped pipelines returning None on failure”
l
^ Sorry for the necro post, but I'm archiving this convo and opening an issue about this, as I heard about it elsewhere too. Interested if you have any more findings!