# prefect-community
b
Hey all. Does anyone know offhand why a flow wouldn’t get picked up after it’s been submitted? The last version of it got picked up. I’ve deleted the flow, bumped the version tag, rebuilt, and still nothing. It gets Lazarus-killed 3 times and errors out. The agent is running other flows fine.
This is via Cloud, fwiw
j
9 out of 10 times that I've had this problem, it has been because of a mismatch in labels. Check the labels on your agent and on the flow.
One way I've been caught with this is changing the type of `Storage` between versions of a flow. The labels for the flow will change based on the `Storage` you use.
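To make the label mechanics concrete, here is a minimal sketch assuming the Prefect 0.12.x APIs being discussed (the flow name, registry URL, and image name are placeholders): labels on the flow’s environment must be a subset of the agent’s labels, and some `Storage` types add labels of their own at registration time.

```python
from prefect import Flow
from prefect.environments import LocalEnvironment
from prefect.environments.storage import Docker, Local

flow = Flow("my-flow")  # placeholder flow name

# Local storage adds a hostname label at registration, so only an agent
# carrying that same label will pick the run up.
# flow.storage = Local()

# Docker storage adds no hostname label; the flow only carries the labels
# declared on its environment.
flow.storage = Docker(registry_url="my-registry", image_name="my-flow")
flow.environment = LocalEnvironment(labels=["prod"])

# The agent must be started with at least the same labels, e.g.:
#   prefect agent start kubernetes --label prod
```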
b
Ah, it’s probably the latter I guess. I added retries to a task then had to add a result store for an upstream task which I guess would change the storage?
Tags are the same as we’ve always used.
Hm. I used a `PrefectResult` to store this, no new labels seem to have been added. The only label is the one we’ve always run with (`prod`).
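For reference, a minimal sketch of the kind of change being described here (retries on a downstream task plus a result store on its upstream), assuming Prefect 0.12.x task APIs; the task names and retry delay are placeholders:

```python
from datetime import timedelta

from prefect import Flow, task
from prefect.engine.results import PrefectResult

# Upstream task: its output now has an explicit result store so a downstream
# retry can re-read it. PrefectResult keeps the value in Prefect Cloud and
# does not change the flow's Storage or labels.
@task(result=PrefectResult())
def extract():
    return 42

# Downstream task with retries added.
@task(max_retries=3, retry_delay=timedelta(minutes=1))
def load(value):
    print(value)

with Flow("my-flow") as flow:
    load(extract())
```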
c
Usually this is a symptom of something going wrong in your execution environment, for example an image pull error or an unschedulable pod — we’re working on enhancements to the various agents to elevate these errors
b
Interesting, ok. Looking now, I see that registering the flow only pushed ~1 MB to ECR, which is usually the first step; the second step pushes the whole 191 MB.
@Dylan ^ thanks
d
@Brian Mesick are you running on Kubernetes?
b
Yes
d
Can you find container logs for your submitted flow runs?
In GKE, when I run into a problem like this, I can usually find the dead job
and the audit logs give me some good clues as to what’s going on
b
It never seemed to spin up containers, no pods or jobs showed up
Or if they spun up they were cleaned up before I could find them
d
If you have a resource manager running from your Prefect k8s agent, try disabling it
the dead jobs should show up
then you can look at their logs
otherwise, try digging into the prefect agent’s deployment logs
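As an illustration of what “find the dead job and look at its logs” could look like programmatically, here is a hedged sketch using the official `kubernetes` Python client; the namespace and the `prefect-job-` name prefix are assumptions about how the agent names its jobs, not something confirmed in this thread.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
batch = client.BatchV1Api()
core = client.CoreV1Api()

namespace = "default"  # assumption: wherever the agent deploys flow-run jobs

# Look for jobs the agent created for flow runs (assumed "prefect-job-" prefix).
for job in batch.list_namespaced_job(namespace).items:
    if not job.metadata.name.startswith("prefect-job-"):
        continue
    print(job.metadata.name, job.status)

    # Read logs from the pods the job spawned, if any still exist.
    labels = job.spec.selector.match_labels or {}
    selector = ",".join(f"{k}={v}" for k, v in labels.items())
    for pod in core.list_namespaced_pod(namespace, label_selector=selector).items:
        print(core.read_namespaced_pod_log(pod.metadata.name, namespace))
```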
b
In the agent logs we just see the `Found 1 flow run(s)…` then `Deploying flow run …` where the GUID matches the run that I would expect. Let me try to figure out how to disable the resource manager here. Any idea why those logs wouldn’t percolate up?
d
The logs from the flow run itself should get sent to cloud
Since they’re run on a different node than the agent, they wouldn’t appear in the agent container
I was looking to see if the agent was throwing errors, but it appears everything is working properly
b
Right, but then I would expect them to show up in Cloud?
But I guess if it’s not bootstrapping correctly it couldn’t get there.
d
Correct
this sounds like the job is dying
Which for me is usually an imagePullBackoff error
But that’s hard to see right now
(see Chris’s comment about improving this logging experience)
We’re going to do some more work on agents so they’re a little more involved
Hopefully that will help with this issue
b
Cool, better logging around this would definitely be helpful. It sounds like we’ve run into it before.
d
Agreed! Let me know if you get access to the job logs
b
Going to be a bit, we need to push through a PR to turn off the resource manager on the cluster.
So I’m seeing what looks like the same error as someone else had on here a couple of months ago, `AttributeError: 'SSLSocket' object has no attribute 'connection'`, which they fixed by pinning the Snowflake adapter version.
I’m not sure why that would suddenly show up for us now, in the middle of several dev cycles on the same flow without changing versions of anything.
But we are on old versions of Prefect in the agent and containers
d
Are you using different versions of Prefect for development and in the containers?
b
Maybe, but this flow was running in this container (`prefecthq/prefect:0.12.5-python3.8`) a day or two ago
Minus the few changes I’ve been focused on (switching from backoff to Prefect’s retry signal being the big one). I’m guessing some other dependency may have updated and bumped the urllib3 requirement.
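One way to catch this kind of dependency drift is to log the resolved package versions from inside the flow’s container. A minimal sketch, assuming Prefect 0.12.x and the standard `pkg_resources` module; the package list here is just an example:

```python
import pkg_resources
import prefect
from prefect import task

@task
def log_versions():
    logger = prefect.context.get("logger")
    # Example packages to check; adjust to whatever the flow depends on.
    for name in ("prefect", "urllib3", "snowflake-connector-python"):
        try:
            logger.info("%s==%s", name, pkg_resources.get_distribution(name).version)
        except pkg_resources.DistributionNotFound:
            logger.info("%s not installed", name)
```

Adding a task like this to the flow (or running it once in the container) makes it obvious when the image’s resolved versions differ from the ones used in development.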
d
I’m glad you were able to diagnose the issue! I believe our containers still `pip install` dependencies. If so, something may have changed as you pushed up a new version of your container while working on the flow.
b
@Dylan I was able to get past my issue by pinning urllib3 to Prefect’s lowest pinned version, but I think it would probably make sense for you all to pin the max version that will actually work as well (there is currently no upper bound), as that would have made this whole thing pretty obvious.
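For illustration, this is the kind of bounded pin being suggested; the version numbers below are purely hypothetical placeholders, not Prefect’s actual requirements.

```python
# Hypothetical setup.py fragment showing an upper-bounded pin.
install_requires = [
    "urllib3 >= 1.0, < 2.0",  # placeholder bounds, not Prefect's real pins
]
```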
d
Thanks for that suggestion!