Brian Mesick, 08/19/2020, 5:14 PM
Hey all. Does anyone know offhand why a flow wouldn’t get picked up after it’s been submitted? The last version of it got picked up. I’ve deleted the flow, bumped the version tag, rebuilt, and still nothing. It gets Lazarus-killed 3 times and errors out. The agent is running other flows fine.
This is via Cloud, fwiw

james.lamb, 08/19/2020, 5:21 PM
9 out of 10 times that I've had this problem, it has been because of a mismatch in labels. Check the labels on your agent and on the flow.
One way I've been caught with this is changing the type of Storage between versions of a flow. The labels for the flow will change based on the Storage you use.
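As a rough sketch of what I mean (import paths and names assume the 0.12-era API; the registry, image, and project are placeholders):
```
from prefect import Flow, task
from prefect.environments import RemoteEnvironment      # 0.12-era import paths;
from prefect.environments.storage import Docker         # later releases moved these

@task
def say_hello():
    print("hello")

with Flow("labels-example") as flow:
    say_hello()

# The Storage type can change the flow's labels (Local storage, for example,
# adds a hostname label), so switching Storage between versions can leave the
# flow with labels that no agent advertises.
flow.storage = Docker(registry_url="<your ECR registry>", image_name="labels-example")
flow.environment = RemoteEnvironment(labels=["prod"])  # must be a subset of the agent's labels

flow.register(project_name="<your project>")
```
The agent then has to be started with a matching label (something like prefect agent start kubernetes -l prod), or the run just sits there until Lazarus gives up on it.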

Brian Mesick, 08/19/2020, 5:23 PM
Ah, it’s probably the latter, I guess. I added retries to a task, then had to add a result store for an upstream task, which I guess would change the storage?
Tags are the same as we’ve always used.
Hm. I used a PrefectResult to store this, and no new labels seem to have been added. The only label is the one we’ve always run with (prod).
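For reference, the change was roughly this (a simplified sketch; the task names are made up):
```
from datetime import timedelta

from prefect import Flow, task
from prefect.engine.results import PrefectResult

# Upstream task: its output needs a Result so the retrying downstream task can
# re-read its input when it retries.
@task(result=PrefectResult())
def extract():
    return {"rows": 42}

# Downstream task: this is the one I added retries to.
@task(max_retries=3, retry_delay=timedelta(minutes=1))
def load(data):
    print(f"loading {data}")

with Flow("retry-example") as flow:
    load(extract())
```
As far as I can tell, PrefectResult stores the value in Cloud itself rather than in external storage, so it shouldn’t change the flow’s Storage or labels.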

Chris White, 08/19/2020, 5:48 PM
Usually this is a symptom of something going wrong in your execution environment, for example an image pull error or an unschedulable pod — we’re working on enhancements to the various agents to elevate these errors

Brian Mesick, 08/19/2020, 6:02 PM
Interesting, ok. Looking now, I see that registering the flow only pushed ~1 MB to ECR, which is usually the first step; the second step pushes the whole 191 MB.
@Dylan ^ thanks

Dylan, 08/20/2020, 4:08 PM
@Brian Mesick are you running on Kubernetes?

Brian Mesick, 08/20/2020, 4:08 PM
Yes

Dylan, 08/20/2020, 4:11 PM
Can you find container logs for your submitted flow runs?
In GKE, when I run into a problem like this, I can usually find the dead job, and the audit logs give me some good clues as to what’s going on.
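If the console is a pain, a rough sketch with the kubernetes Python client can surface the same information (the namespace here is an assumption; point it at wherever your agent creates flow-run jobs):
```
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in the cluster

namespace = "default"  # assumption: wherever your Prefect agent creates flow-run jobs
batch = client.BatchV1Api()
core = client.CoreV1Api()

# Recently created jobs; dead flow-run jobs show up here unless something
# (like the resource manager) has already cleaned them up.
for job in batch.list_namespaced_job(namespace).items:
    print(job.metadata.name, "succeeded:", job.status.succeeded, "failed:", job.status.failed)

# Pods stuck in the usual suspects: ImagePullBackOff, ErrImagePull, CrashLoopBackOff.
for pod in core.list_namespaced_pod(namespace).items:
    for status in pod.status.container_statuses or []:
        waiting = status.state.waiting
        if waiting is not None:
            print(pod.metadata.name, status.name, waiting.reason, waiting.message)
```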

Brian Mesick, 08/20/2020, 4:12 PM
It never seemed to spin up containers; no pods or jobs showed up.
Or if they did spin up, they were cleaned up before I could find them.

Dylan, 08/20/2020, 4:14 PM
If you have a resource manager running from your Prefect k8s agent, try disabling it. The dead jobs should show up, and then you can look at their logs. Otherwise, try digging into the Prefect agent’s deployment logs.

Brian Mesick, 08/20/2020, 4:16 PM
In the agent logs we just see the Found 1 flow run(s)… then Deploying flow run … where the guid matches the run that I would expect. Let me try to figure out how to disable the resource manager here. Any idea why those logs wouldn’t percolate up?

Dylan, 08/20/2020, 4:17 PM
The logs from the flow run itself should get sent to cloud
Since they’re run on a different node than the agent, they wouldn’t appear in the agent container
I was looking to see if the agent was throwing errors, but it appears everything is working properly

Brian Mesick, 08/20/2020, 4:18 PM
Right, but then I would expect them to show up in Cloud?
But I guess if it’s not bootstrapping correctly it couldn’t get there.

Dylan, 08/20/2020, 4:21 PM
Correct
this sounds like the job is dying
Which for me is usually an imagePullBackoff error
But that’s hard to see right now
(see Chris’s comment about improving this logging experience)
We’re going to do some more work on agents so they’re a little more involved
Hopefully that will help with this issue

Brian Mesick, 08/20/2020, 4:29 PM
Cool, better logging around this would definitely be helpful. It sounds like we’ve run into it before.

Dylan, 08/20/2020, 4:39 PM
Agreed! Let me know if you get access to the job logs

Brian Mesick, 08/20/2020, 4:41 PM
Going to be a bit; we need to push through a PR to turn off the resource manager on the cluster.
So I’m seeing what looks like the same error someone else had on here a couple of months ago:
AttributeError: 'SSLSocket' object has no attribute 'connection'
which they fixed by pinning the Snowflake adapter version.
I’m not sure why that would suddenly show up for us now, in the middle of several dev cycles on the same flow without changing versions of anything.
But we are on old versions of Prefect in the agent and containers

Dylan, 08/20/2020, 7:01 PM
Are you using different versions of Prefect for development and in the containers?

Brian Mesick, 08/20/2020, 7:03 PM
Maybe, but this flow was running in this container (prefecthq/prefect:0.12.5-python3.8) a day or two ago, minus the few changes I’ve been focused on (switching from backoff to the Prefect retry signal being the big one). I’m guessing some other dependency may have updated and bumped the urllib3 requirement.
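Something like this in the flow would at least confirm which versions actually end up in the image (just a sketch; log whichever packages you care about):
```
import importlib.metadata

import prefect
from prefect import task

@task
def log_versions():
    # Logs the installed versions of a few interesting packages from inside
    # the flow-run container, so they show up in Cloud with the rest of the logs.
    logger = prefect.context.get("logger")
    for pkg in ("prefect", "urllib3", "snowflake-connector-python"):
        try:
            logger.info("%s==%s", pkg, importlib.metadata.version(pkg))
        except importlib.metadata.PackageNotFoundError:
            logger.info("%s is not installed", pkg)
```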

Dylan, 08/20/2020, 7:06 PM
I’m glad you were able to diagnose the issue! I believe our containers still pip install dependencies. If so, something may have changed as you pushed up a new version of your container while working on the flow.

Brian Mesick, 08/20/2020, 8:52 PM
@Dylan I was able to get past my issue by pinning urllib3 to Prefect’s lowest pinned version, but I think it would probably make sense for you all to pin the max version that will actually work as well (there is currently no upper bound), as that would have made this whole thing pretty obvious.
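For anyone who hits this later, the workaround looked roughly like this (a sketch; the pin shown is illustrative, not Prefect’s actual lower bound):
```
from prefect import Flow, task
from prefect.environments.storage import Docker  # 0.12-era import path

@task
def noop():
    pass

with Flow("pinned-deps-example") as flow:
    noop()

flow.storage = Docker(
    registry_url="<your ECR registry>",
    image_name="pinned-deps-example",
    python_dependencies=[
        # Pin urllib3 explicitly so a transitive dependency can't drag in a
        # release that Prefect's unbounded requirement doesn't actually work with.
        "urllib3==1.24.3",  # illustrative version only; match Prefect's own lower bound
    ],
)
```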

Dylan, 08/20/2020, 8:55 PM
Thanks for that suggestion!