# ask-community
b
Hi all, here’s a weird case we’re running into today. A flow running on K8s that hasn’t changed in a few weeks has started failing consistently with:
```
Container 'flow' state: terminated
		Exit Code:: 2
		Reason: Error
```
The job shows up as submitted, then ~30 seconds later we get 4 of those errors. Other nearly identical flows are working. The pods don’t hang around long enough to get logs (if they even start). Has anyone else run into this?
We’re not doing anything particularly fancy in the configuration - KubernetesRun / LocalDaskExecutor / Docker storage:
```python
run_config = KubernetesRun(cpu_request=1, memory_request="2Gi", labels=["prod"])
executor = LocalDaskExecutor(scheduler="threads", num_workers=4)

storage = Docker(
    registry_url="<redacted>.dkr.ecr.us-east-1.amazonaws.com",
    image_name="...",
    image_tag="v2",
    base_image="python:3.8",
    python_dependencies=[
        ...
    ],
)
```
Agent logs show it being scheduled, but nothing else.
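For context, the run config / executor / storage above get attached to the flow and registered in the usual Prefect 1.x way - roughly like this, where the task and project names are simplified placeholders rather than our real ones:
```python
from prefect import Flow, task

@task
def placeholder_task():
    # stand-in for the real tasks in the flow
    pass

with Flow("example-flow") as flow:
    placeholder_task()

# attach the run_config / executor / storage objects defined above, then register
flow.run_config = run_config
flow.executor = executor
flow.storage = storage
flow.register(project_name="example-project")
```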
n
Hi @Brian Mesick - where are you seeing those initial logs?
b
Hi @nicholas they’re the flow run logs.
Missed the first line, which is:
Pod prefect-job-33... failed.
n
Ah ok, thanks! Is there a possibility that your k8s cluster is hitting resource constraints?
b
In theory, though it shouldn’t have been doing anything at that point, and other jobs with the same resource requests have succeeded while that one continues to fail.
n
Got it - and you mentioned that the pods aren't around long enough to generate any logs - I'm guessing you mean from inspecting the job logs themselves?
b
Correct
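(For anyone who finds this later: we were trying to pull container logs through the Kubernetes API before the pods disappeared, roughly along these lines - the pod name and namespace here are placeholders:)
```python
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
core = client.CoreV1Api()

# Try to read the 'flow' container's logs from the short-lived job pod;
# previous=True asks for the logs of the last terminated container.
try:
    logs = core.read_namespaced_pod_log(
        name="prefect-job-33...",  # placeholder for the real pod name
        namespace="default",
        container="flow",
        previous=True,
    )
    print(logs)
except ApiException as exc:
    print(f"Could not fetch logs: {exc.reason}")
```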
We’re pushing a new version now, as that sometimes seems to clear these things up - just curious if it was a known issue.
n
Hm, it's not, but I'd be very curious what you find when pushing a new version.
b
We did recently run into a CPU constraint causing jobs to not schedule, but the logs for that were pretty clear that the cluster couldn’t scale.
n
Yeah that sounds straightforward enough - did anything with the k8s template on your agent change?
b
I don’t believe so. The agent has been running for a couple of days and had run this job once or twice before without issue.
I’d expect that error code 2 is coming from Docker / Prefect running the flow. Any idea what that result code might mean?
n
Oh, that's interesting - I think that error code is related to a bad call in the container startup command.
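Just to illustrate what I mean (a generic example, not your exact container command): a bad flag passed to the Python interpreter triggers a usage error, and that surfaces as exit code 2, same as what k8s is reporting:
```python
import subprocess
import sys

# Generic illustration: invoke the interpreter with a bogus flag, the way a
# malformed container startup command might, and inspect the exit code.
result = subprocess.run([sys.executable, "--definitely-not-a-real-flag"])
print(result.returncode)  # 2 - a command-line usage error
```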
b
Hmm, is that something Prefect is doing behind the scenes?
FWIW pushing a new image with the same code and versions solved the problem. ¯\_(ツ)_/¯
n
Hi @Brian Mesick - sorry for the slow response over here! If you see this start to crop up again, let me know. It could be that there's an image pull backoff or something (though that would normally be reflected in the logs).
To answer your earlier question: yes sort of, in that Prefect submits jobs and has to init the container somehow. The specific line where that error comes from is here: https://github.com/PrefectHQ/prefect/blob/master/src/prefect/agent/kubernetes/agent.py#L272
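Paraphrasing the check around that line (this is a sketch using the official kubernetes Python client, not the exact Prefect source; the namespace and label selector are assumptions):
```python
from kubernetes import client, config

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
core = client.CoreV1Api()

# Find the job's pod(s) and surface any failed container states, including an
# image pull backoff, which would otherwise be easy to miss.
pods = core.list_namespaced_pod(
    namespace="default",  # assumed namespace
    label_selector="prefect.io/identifier=<job-identifier>",  # assumed label
)
for pod in pods.items:
    for status in pod.status.container_statuses or []:
        terminated = status.state.terminated
        if terminated is not None and terminated.exit_code != 0:
            print(
                f"Pod {pod.metadata.name} failed.\n"
                f"\tContainer '{status.name}' state: terminated\n"
                f"\t\tExit Code:: {terminated.exit_code}\n"
                f"\t\tReason: {terminated.reason}"
            )
        waiting = status.state.waiting
        if waiting is not None and waiting.reason in ("ErrImagePull", "ImagePullBackOff"):
            print(f"Pod {pod.metadata.name}: image pull is failing ({waiting.reason})")
```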
b
Thanks, knowing that it’s coming from the agent is interesting if nothing else. Maybe there’s some opportunity for more logging in there - there seem to be gaps sometimes when jobs fail to start.