# ask-community
b
Hi all, here’s a weird case we’re running into today. A flow running on K8s that hasn’t changed in a few weeks has started failing consistently with:
```
Container 'flow' state: terminated
		Exit Code:: 2
		Reason: Error
```
The job shows up as submitted, then ~30 seconds later we get 4 of those errors. Other nearly identical flows are working. The pods don’t hang around long enough to get logs (if they even start). Has anyone else run into this?
We’re not doing anything particularly fancy in the configuration - KubernetesRun / LocalDaskExecutor / Docker storage:
```python
run_config = KubernetesRun(cpu_request=1, memory_request="2Gi", labels=["prod"])
executor = LocalDaskExecutor(scheduler="threads", num_workers=4)

storage = Docker(
    registry_url="<redacted>.dkr.ecr.us-east-1.amazonaws.com",
    image_name="...",
    image_tag="v2",
    base_image="python:3.8",
    python_dependencies=[
        ...
    ],
)
```
Agent logs show it being scheduled, but nothing else.
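For context, the run config / executor / storage above get attached to the flow and registered in the usual Prefect 1.x way - roughly like this, where the task and project names are simplified placeholders rather than our real ones:
```python
from prefect import Flow, task

@task
def placeholder_task():
    # stand-in for the real tasks in the flow
    pass

with Flow("example-flow") as flow:
    placeholder_task()

# attach the run_config / executor / storage objects defined above, then register
flow.run_config = run_config
flow.executor = executor
flow.storage = storage
flow.register(project_name="example-project")
```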
n
Hi @Brian Mesick - where are you seeing those initial logs?
b
Hi @nicholas they’re the flow run logs.
Missed the first line, which is:
Pod prefect-job-33... failed.
n
Ah ok, thanks! Is there a possibility that your k8s cluster is hitting resource constraints?
b
In theory, though it shouldn’t have been doing anything at that point, and other jobs with the same resource requests have succeeded while that one continues to fail.
n
Got it - and you mentioned that the pods aren't around long enough to generate any logs - I'm guessing you mean from inspecting the job logs themselves?
b
Correct
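(For anyone who finds this later: we were trying to pull container logs through the Kubernetes API before the pods disappeared, roughly along these lines - the pod name and namespace here are placeholders:)
```python
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
core = client.CoreV1Api()

# Try to read the 'flow' container's logs from the short-lived job pod;
# previous=True asks for the logs of the last terminated container.
try:
    logs = core.read_namespaced_pod_log(
        name="prefect-job-33...",  # placeholder for the real pod name
        namespace="default",
        container="flow",
        previous=True,
    )
    print(logs)
except ApiException as exc:
    print(f"Could not fetch logs: {exc.reason}")
```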
We’re pushing a new version now, as that sometimes seems to clear these things up - just curious if it was a known issue.
n
Hm, it's not, but I'd be very curious what you find when pushing a new version.
b
We did recently run into a CPU constraint causing jobs to not schedule, but the logs for that were pretty clear that the cluster couldn’t scale.
n
Yeah that sounds straightforward enough - did anything with the k8s template on your agent change?
b
I don’t believe so. The agent has been running for a couple of days and had run this job once or twice before without issue.
I’d expect that error code 2 is coming from Docker / Prefect running the flow. Any idea what that result code might mean?
n
Oh, that's interesting - I think that error code is related to a bad call in the container startup command.
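Just to illustrate what I mean (a generic example, not your exact container command): a bad flag passed to the Python interpreter triggers a usage error, and that surfaces as exit code 2, same as what k8s is reporting:
```python
import subprocess
import sys

# Generic illustration: invoke the interpreter with a bogus flag, the way a
# malformed container startup command might, and inspect the exit code.
result = subprocess.run([sys.executable, "--definitely-not-a-real-flag"])
print(result.returncode)  # 2 - a command-line usage error
```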
b
Hmm, is that something Prefect is doing behind the scenes?
FWIW pushing a new image with the same code and versions solved the problem. ¯\_(ツ)_/¯
n
Hi @Brian Mesick - sorry for the slow response over here! If you see this start to crop up again, let me know. It could be that there's an image pull backoff or something (though that would normally be reflected in the logs).
To answer your earlier question: yes sort of, in that Prefect submits jobs and has to init the container somehow. The specific line where that error comes from is here: https://github.com/PrefectHQ/prefect/blob/master/src/prefect/agent/kubernetes/agent.py#L272
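Paraphrasing the check around that line (this is a sketch using the official kubernetes Python client, not the exact Prefect source; the namespace and label selector are assumptions):
```python
from kubernetes import client, config

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
core = client.CoreV1Api()

# Find the job's pod(s) and surface any failed container states, including an
# image pull backoff, which would otherwise be easy to miss.
pods = core.list_namespaced_pod(
    namespace="default",  # assumed namespace
    label_selector="prefect.io/identifier=<job-identifier>",  # assumed label
)
for pod in pods.items:
    for status in pod.status.container_statuses or []:
        terminated = status.state.terminated
        if terminated is not None and terminated.exit_code != 0:
            print(
                f"Pod {pod.metadata.name} failed.\n"
                f"\tContainer '{status.name}' state: terminated\n"
                f"\t\tExit Code:: {terminated.exit_code}\n"
                f"\t\tReason: {terminated.reason}"
            )
        waiting = status.state.waiting
        if waiting is not None and waiting.reason in ("ErrImagePull", "ImagePullBackOff"):
            print(f"Pod {pod.metadata.name}: image pull is failing ({waiting.reason})")
```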
b
Thanks, knowing that it’s coming from the agent is interesting if nothing else. Maybe there’s some opportunity for more logging in there - there seem to be gaps sometimes when jobs fail to start.