# prefect-community
b
Hey all. Does anyone know offhand why a flow wouldn’t get picked up after it’s been submitted? The last version of it got picked up. I’ve deleted the flow, bumped the version tag, rebuilt, and still nothing. It gets Lazarus-killed 3 times and errors out. The agent is running other flows fine.
This is via Cloud, fwiw
j
9 out of 10 times that I've had this problem, it has been because of a mismatch in labels. Check the labels on your agent and on the flow.
One way I've been caught with this is changing the type of `Storage` between versions of a flow. The labels for the flow will change based on the `Storage` you use.
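To make the label mechanics concrete, here is a minimal sketch assuming the Prefect 0.12.x APIs being discussed (the flow name, registry URL, and image name are placeholders): labels on the flow’s environment must be a subset of the agent’s labels, and some `Storage` types add labels of their own at registration time.

```python
from prefect import Flow
from prefect.environments import LocalEnvironment
from prefect.environments.storage import Docker, Local

flow = Flow("my-flow")  # placeholder flow name

# Local storage adds a hostname label at registration, so only an agent
# carrying that same label will pick the run up.
# flow.storage = Local()

# Docker storage adds no hostname label; the flow only carries the labels
# declared on its environment.
flow.storage = Docker(registry_url="my-registry", image_name="my-flow")
flow.environment = LocalEnvironment(labels=["prod"])

# The agent must be started with at least the same labels, e.g.:
#   prefect agent start kubernetes --label prod
```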
b
Ah, it’s probably the latter I guess. I added retries to a task then had to add a result store for an upstream task which I guess would change the storage?
Tags are the same as we’ve always used.
Hm. I used a `PrefectResult` to store this, no new labels seem to have been added. The only label is the one we’ve always run with (`prod`).
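For reference, a minimal sketch of the kind of change being described here (retries on a downstream task plus a result store on its upstream), assuming Prefect 0.12.x task APIs; the task names and retry delay are placeholders:

```python
from datetime import timedelta

from prefect import Flow, task
from prefect.engine.results import PrefectResult

# Upstream task: its output now has an explicit result store so a downstream
# retry can re-read it. PrefectResult keeps the value in Prefect Cloud and
# does not change the flow's Storage or labels.
@task(result=PrefectResult())
def extract():
    return 42

# Downstream task with retries added.
@task(max_retries=3, retry_delay=timedelta(minutes=1))
def load(value):
    print(value)

with Flow("my-flow") as flow:
    load(extract())
```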
c
Usually this is a symptom of something going wrong in your execution environment, for example an image pull error or an unschedulable pod — we’re working on enhancements to the various agents to elevate these errors
b
Interesting, ok. Looking now, I see that registering the flow only pushed ~1 MB to ECR, which is usually the first step; the second step pushes the whole 191 MB.
@Dylan ^ thanks
d
@Brian Mesick are you running on Kubernetes?
b
Yes
d
Can you find container logs for your submitted flow runs?
In GKE, when I run into a problem like this, I can usually find the dead job
and the audit logs give me some good clues as to what’s going on
b
It never seemed to spin up containers, no pods or jobs showed up
Or if they spun up they were cleaned up before I could find them
d
If you have a resource manager running from your Prefect k8s agent, try disabling it
the dead jobs should show up
then you can look at their logs
otherwise, try digging into the prefect agent’s deployment logs
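As an illustration of what “find the dead job and look at its logs” could look like programmatically, here is a hedged sketch using the official `kubernetes` Python client; the namespace and the `prefect-job-` name prefix are assumptions about how the agent names its jobs, not something confirmed in this thread.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
batch = client.BatchV1Api()
core = client.CoreV1Api()

namespace = "default"  # assumption: wherever the agent deploys flow-run jobs

# Look for jobs the agent created for flow runs (assumed "prefect-job-" prefix).
for job in batch.list_namespaced_job(namespace).items:
    if not job.metadata.name.startswith("prefect-job-"):
        continue
    print(job.metadata.name, job.status)

    # Read logs from the pods the job spawned, if any still exist.
    labels = job.spec.selector.match_labels or {}
    selector = ",".join(f"{k}={v}" for k, v in labels.items())
    for pod in core.list_namespaced_pod(namespace, label_selector=selector).items:
        print(core.read_namespaced_pod_log(pod.metadata.name, namespace))
```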
b
In the agent logs we just see the `Found 1 flow run(s)…` then `Deploying flow run …` where the GUID matches the run that I would expect. Let me try to figure out how to disable the resource manager here. Any idea why those logs wouldn’t percolate up?
d
The logs from the flow run itself should get sent to cloud
Since they’re run on a different node than the agent, they wouldn’t appear in the agent container
I was looking to see if the agent was throwing errors, but it appears everything is working properly
b
Right, but then I would expect them to show up in Cloud?
But I guess if it’s not bootstrapping correctly it couldn’t get there.
d
Correct
this sounds like the job is dying
Which for me is usually an imagePullBackoff error
But that’s hard to see right now
(see Chris’s comment about improving this logging experience)
We’re going to do some more work on agents so they’re a little more involved
Hopefully that will help with this issue
b
Cool, better logging around this would definitely be helpful. It sounds like we’ve run into it before.
d
Agreed! Let me know if you get access to the job logs
b
Going to be a bit, we need to push through a PR to turn off the resource manager on the cluster.
So I’m seeing what looks like the same error as someone else had on here a couple of months ago, `AttributeError: 'SSLSocket' object has no attribute 'connection'`, which they fixed by pinning the Snowflake adapter version.
I’m not sure why that would suddenly show up for us now, in the middle of several dev cycles on the same flow without changing versions of anything.
But we are on old versions of Prefect in the agent and containers
d
Are you using different versions of Prefect for development and in the containers?
b
Maybe, but this flow was running in this container (`prefecthq/prefect:0.12.5-python3.8`) a day or two ago
Minus the few changes I’ve been focused on (switching from backoff to Prefect’s retry signal being the big one). I’m guessing some other dependency may have updated and bumped the urllib3 requirement.
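One way to catch this kind of dependency drift is to log the resolved package versions from inside the flow’s container. A minimal sketch, assuming Prefect 0.12.x and the standard `pkg_resources` module; the package list here is just an example:

```python
import pkg_resources
import prefect
from prefect import task

@task
def log_versions():
    logger = prefect.context.get("logger")
    # Example packages to check; adjust to whatever the flow depends on.
    for name in ("prefect", "urllib3", "snowflake-connector-python"):
        try:
            logger.info("%s==%s", name, pkg_resources.get_distribution(name).version)
        except pkg_resources.DistributionNotFound:
            logger.info("%s not installed", name)
```

Adding a task like this to the flow (or running it once in the container) makes it obvious when the image’s resolved versions differ from the ones used in development.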
d
I’m glad you were able to diagnose the issue! I believe our containers still `pip install` dependencies. If so, something may have changed as you pushed up a new version of your container while working on the flow.
b
@Dylan I was able to get past my issue by pinning urllib3 to Prefect’s lowest pinned version, but I think it would probably make sense for you all to pin the max version that will actually work as well (there is currently no upper bound), as that would have made this whole thing pretty obvious.
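For illustration, this is the kind of bounded pin being suggested; the version numbers below are purely hypothetical placeholders, not Prefect’s actual requirements.

```python
# Hypothetical setup.py fragment showing an upper-bounded pin.
install_requires = [
    "urllib3 >= 1.0, < 2.0",  # placeholder bounds, not Prefect's real pins
]
```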
d
Thanks for that suggestion!