# ask-community
a
Hello! I'm using a Kubernetes worker (with Prefect 2.14.17) and jobs that trigger subflows are crashing ⬇️
Discovered type 'kubernetes' for work pool 'default'.
Worker 'KubernetesWorker e764a855-8b77-47b6-8aa0-362493a75652' started!
10:12:52.074 | INFO    | prefect.flow_runs.worker - Worker 'KubernetesWorker e764a855-8b77-47b6-8aa0-362493a75652' submitting flow run '28a16102-2bf6-42d4-8b9d-1da08abd4a30'
10:12:52.750 | INFO    | prefect.flow_runs.worker - Creating Kubernetes job...
10:12:52.879 | INFO    | prefect.flow_runs.worker - Job 'fabulous-dugong-jdsqj': Pod has status 'Pending'.
10:12:52.938 | INFO    | prefect.flow_runs.worker - Completed submission of flow run '28a16102-2bf6-42d4-8b9d-1da08abd4a30'
10:13:52.850 | ERROR   | prefect.flow_runs.worker - Job 'fabulous-dugong-jdsqj': Pod never started.
10:13:53.200 | INFO    | prefect.flow_runs.worker - Job event 'SuccessfulCreate' at 2024-01-30 10:12:52+00:00: Created pod: fabulous-dugong-jdsqj-2hv4c
10:13:53.202 | INFO    | prefect.flow_runs.worker - Pod event 'Scheduled' at 2024-01-30 10:12:52.808564+00:00: Successfully assigned data-platform/fabulous-dugong-jdsqj-2hv4c to ip-10-2-0-84.eu-west-1.compute.internal
10:13:53.205 | INFO    | prefect.flow_runs.worker - Pod event 'Pulled' at 2024-01-30 10:12:53+00:00: Container image "406151221390.dkr.ecr.eu-west-1.amazonaws.com/data-platform/metabase-reports:0.0.1-dev.464_7e69dd0" already present on machine
10:13:53.206 | INFO    | prefect.flow_runs.worker - Pod event 'Created' at 2024-01-30 10:12:53+00:00: Created container prefect-job
10:13:53.208 | INFO    | prefect.flow_runs.worker - Pod event 'Started' at 2024-01-30 10:12:53+00:00: Started container prefect-job
10:13:53.408 | INFO    | prefect.flow_runs.worker - Reported flow run '28a16102-2bf6-42d4-8b9d-1da08abd4a30' as crashed: Flow run infrastructure exited with non-zero status code -1.
I took a look at the code, but I don't really understand what's going on, to be honest
The only workaround I found is putting insanely high values for the watch timeouts:
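A minimal sketch (not from the thread) of what that kind of override can look like when set per-deployment through the Kubernetes work pool's job variables; the flow, deployment, and image names below are placeholders:

```python
# Sketch of the workaround being described: raise the Kubernetes worker's
# watch timeouts via job variables on a deployment. Names are placeholders.
from prefect import flow


@flow
def my_flow():
    ...


if __name__ == "__main__":
    my_flow.deploy(
        name="my-deployment",
        work_pool_name="default",                      # work pool from the logs above
        image="registry.example.com/my-image:latest",  # assumes the flow code is already baked in
        build=False,
        push=False,
        job_variables={
            # The pod watch timeout defaults to 60s, which matches the
            # one-minute gap before "Pod never started" in the logs above.
            "pod_watch_timeout_seconds": 600,
            "job_watch_timeout_seconds": 600,
        },
    )
```

The same variables can also be raised as defaults in the work pool's base job template, though as noted later in the thread, drastically increasing these timeouts only masks the underlying problem.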
b
Yes, I have noticed the same thing with this latest update. I had to downgrade the worker to use tag 2.14.15. It seems some sort of bug was introduced.
n
hi @Adrien Besnard and @Brandon - thank you for the reports. i am looking into this now!
are either of you able to provide the version of prefect-kubernetes that you have running when encountering the error?
a
So you mean within the base image used by the job, or within the worker image?
n
the worker itself, are you using the helm chart to deploy the worker?
a
So I'm not using the official Helm Chart (I have a task to move to it). Here are the versions:
5:28 $ kubectl exec -it prefect-worker-56d4bdc94d-dcz67 -- bash
root@prefect-worker-56d4bdc94d-dcz67:/opt/prefect# pip list | grep prefect
prefect                   2.14.17
prefect-kubernetes        0.3.3

[notice] A new release of pip is available: 23.3.1 -> 23.3.2
[notice] To update, run: pip install --upgrade pip
root@prefect-worker-56d4bdc94d-dcz67:/opt/prefect#
n
thank you!
a
I don't know if that's relevant, but I discovered that my base image (the one the worker runs the jobs with) is stuck at Prefect 2.14.4: I think one of its dependencies is pinning it to an outdated version.
I'll update it and see if that changes anything.
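One illustrative way to find which dependency is holding prefect back is to scan the installed package metadata inside the image; this snippet is an assumption about how you might check, not something from the thread:

```python
# Sketch (not from the thread): list installed distributions whose declared
# requirements mention "prefect", to see what is pinning it at 2.14.4.
import re
from importlib.metadata import distributions

for dist in distributions():
    for req in dist.requires or []:
        # Extract the requirement's package name and compare it to "prefect".
        name = re.match(r"[A-Za-z0-9._-]+", req)
        if name and name.group(0).lower() == "prefect":
            print(f"{dist.metadata['Name']}: {req}")
```

Any distribution printing a constraint with an upper bound on prefect would explain the stale version.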
b
I see this is also being tracked here https://github.com/PrefectHQ/prefect-kubernetes/issues/106 @Nate
n
To update: we've identified an issue with 2.14.17 specifically related to the default host of the worker's health endpoint, which has been fixed and released in 2.14.18 as of a couple of minutes ago. This likely is not what both of you are encountering (but mentioning it on the off chance it is), and the released fix is not intended to address the issue linked above, which I suspect may be related to your issues.
@Adrien Besnard can you confirm whether you encountered the same issues as in the logs you shared above on 2.14.16?
a
Unfortunately I was on 2.14.17 😞
n
I was curious whether the issue also existed in 2.14.16. I don't think that the fix we released today (2.14.18) will solve your problem, but you could try upgrading. Otherwise, we should probably create an issue on prefect-kubernetes to track this specific issue, since drastically increasing the watch timeouts definitely doesn't seem acceptable.
j
I just ran into this myself on our AKS cluster. Upgraded from prefect 2.14.16 to 2.14.17 yesterday, which led to all kinds of issues; tried going to 2.14.18 this afternoon and that didn’t fix it, and was finally able to get things stable again in our environment by pinning the worker image to prefect 2.14.16.
n
hi @Jacob Hurlbut - can you explain what issues you ran into? Unfortunately, a couple of issues related to k8s workers have been reported recently; 2.14.18 fixed one of them (which, AFAICT, has not been reported in this thread).
j
Sure! It was super inconsistent; some flows would crash before the pod could start. Some flows would run tasks (per the task run viewer) but then the infrastructure would crash mid-execution. Then sometimes it would work perfectly fine (I don’t have the exact percentage handy, but the failure rate was probably between 30% and 40% with no discernible pattern)
Worth noting: to get our workspace back up and running again, the only thing that needed to be pinned to 2.14.16 was the worker (via the helm config); the actual flows are still running using customized docker images that are based on the prefecthq/prefect:2.14.18-python3.12-kubernetes image.
n
Okay, thank you for the context! We'll be looking into this. I'll just call out something strange I'm noticing here, which is that the "Pod never started" error is coming through after we clearly go through the Started / Created / Pulled events. cc @Uriel Mandujano
u
We recently addressed an issue around hanging Kubernetes connections in prefect-kubernetes v0.3.3, which we shipped in prefect v2.14.17. It's possible that something in that change is affecting flow runs. Do the prefect-worker logs or pod logs have any more information about what's going on?
j
I just tried to sift through the logs from the worker and flow run pods and couldn’t find anything useful. It just bails out with an UnfinishedRun exception with “Task run {ID} received abort during orchestration: The enclosing flow must be running to begin task execution. Task run is in PENDING state.” right after beginning a task (in the case of a flow dying mid-run)
u
I think we have a grip on what's going on, and we're working on a couple of fixes to prefect-kubernetes. For now, anyone using the helm chart should pin their chart version to avoid the releases that ship prefect-kubernetes v0.3.3 (2024.1.30 and 2024.1.25), and anyone directly using prefect-kubernetes should avoid v0.3.3. Our current goal is to include the fix in tomorrow's release!
❤️ 2
We just rolled out prefect-kubernetes v0.3.4 and released helm chart v2024.2.1, which bundles the new prefect-kubernetes release with prefect v2.14.19. We expect this release to address the issue, so please try it out at your convenience. Feel free to reply to this thread if you notice any more odd behavior coming from your Prefect Kubernetes workers!
a
Hello! I upgraded and indeed I no longer have any issues. Thanks a lot!
🙌 2
b
Amazing