# ask-community
a
Hello! I'm using a Kubernetes worker (with Prefect 2.14.17) and jobs that trigger subflows are crashing ⬇️
Discovered type 'kubernetes' for work pool 'default'.
Worker 'KubernetesWorker e764a855-8b77-47b6-8aa0-362493a75652' started!
10:12:52.074 | INFO    | prefect.flow_runs.worker - Worker 'KubernetesWorker e764a855-8b77-47b6-8aa0-362493a75652' submitting flow run '28a16102-2bf6-42d4-8b9d-1da08abd4a30'
10:12:52.750 | INFO    | prefect.flow_runs.worker - Creating Kubernetes job...
10:12:52.879 | INFO    | prefect.flow_runs.worker - Job 'fabulous-dugong-jdsqj': Pod has status 'Pending'.
10:12:52.938 | INFO    | prefect.flow_runs.worker - Completed submission of flow run '28a16102-2bf6-42d4-8b9d-1da08abd4a30'
10:13:52.850 | ERROR   | prefect.flow_runs.worker - Job 'fabulous-dugong-jdsqj': Pod never started.
10:13:53.200 | INFO    | prefect.flow_runs.worker - Job event 'SuccessfulCreate' at 2024-01-30 10:12:52+00:00: Created pod: fabulous-dugong-jdsqj-2hv4c
10:13:53.202 | INFO    | prefect.flow_runs.worker - Pod event 'Scheduled' at 2024-01-30 10:12:52.808564+00:00: Successfully assigned data-platform/fabulous-dugong-jdsqj-2hv4c to ip-10-2-0-84.eu-west-1.compute.internal
10:13:53.205 | INFO    | prefect.flow_runs.worker - Pod event 'Pulled' at 2024-01-30 10:12:53+00:00: Container image "406151221390.dkr.ecr.eu-west-1.amazonaws.com/data-platform/metabase-reports:0.0.1-dev.464_7e69dd0" already present on machine
10:13:53.206 | INFO    | prefect.flow_runs.worker - Pod event 'Created' at 2024-01-30 10:12:53+00:00: Created container prefect-job
10:13:53.208 | INFO    | prefect.flow_runs.worker - Pod event 'Started' at 2024-01-30 10:12:53+00:00: Started container prefect-job
10:13:53.408 | INFO    | prefect.flow_runs.worker - Reported flow run '28a16102-2bf6-42d4-8b9d-1da08abd4a30' as crashed: Flow run infrastructure exited with non-zero status code -1.
I took a look at the code, but I don't really understand what's going on, to be honest
The only workaround I found is putting insanely high values for the watch timeouts:
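A minimal sketch (not from the thread) of what that kind of override can look like when set per-deployment through the Kubernetes work pool's job variables; the flow, deployment, and image names below are placeholders:

```python
# Sketch of the workaround being described: raise the Kubernetes worker's
# watch timeouts via job variables on a deployment. Names are placeholders.
from prefect import flow


@flow
def my_flow():
    ...


if __name__ == "__main__":
    my_flow.deploy(
        name="my-deployment",
        work_pool_name="default",                      # work pool from the logs above
        image="registry.example.com/my-image:latest",  # assumes the flow code is already baked in
        build=False,
        push=False,
        job_variables={
            # The pod watch timeout defaults to 60s, which matches the
            # one-minute gap before "Pod never started" in the logs above.
            "pod_watch_timeout_seconds": 600,
            "job_watch_timeout_seconds": 600,
        },
    )
```

The same variables can also be raised as defaults in the work pool's base job template, though as noted later in the thread, drastically increasing these timeouts only masks the underlying problem.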
b
Yes, I have noticed the same thing with this latest update. I had to downgrade the worker to use tag 2.14.15. It seems some sort of bug was introduced.
n
hi @Adrien Besnard and @Brandon - thank you for the reports. i am looking into this now!
are either of you able to provide the version of prefect-kubernetes that you have running when encountering the error?
a
So you mean within the base image used by the job, or within the worker image?
n
the worker itself, are you using the helm chart to deploy the worker?
a
So I'm not using the official Helm Chart (I have a task to move to it). Here are the versions:
5:28 $ kubectl exec -it prefect-worker-56d4bdc94d-dcz67 -- bash
root@prefect-worker-56d4bdc94d-dcz67:/opt/prefect# pip list | grep prefect
prefect                   2.14.17
prefect-kubernetes        0.3.3

[notice] A new release of pip is available: 23.3.1 -> 23.3.2
[notice] To update, run: pip install --upgrade pip
root@prefect-worker-56d4bdc94d-dcz67:/opt/prefect#
n
thank you!
a
I don't know if that's relevant, but I discovered that my base image (the one the worker runs the jobs with) is stuck at Prefect 2.14.4: I think one of its dependencies is pinning it to an outdated version.
I'll update it and see if that changes anything.
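One illustrative way to find which dependency is holding prefect back is to scan the installed package metadata inside the image; this snippet is an assumption about how you might check, not something from the thread:

```python
# Sketch (not from the thread): list installed distributions whose declared
# requirements mention "prefect", to see what is pinning it at 2.14.4.
import re
from importlib.metadata import distributions

for dist in distributions():
    for req in dist.requires or []:
        # Extract the requirement's package name and compare it to "prefect".
        name = re.match(r"[A-Za-z0-9._-]+", req)
        if name and name.group(0).lower() == "prefect":
            print(f"{dist.metadata['Name']}: {req}")
```

Any distribution printing a constraint with an upper bound on prefect would explain the stale version.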
b
I see this is also being tracked here https://github.com/PrefectHQ/prefect-kubernetes/issues/106 @Nate
n
To update: we've identified an issue with 2.14.17 specifically related to the default host of the worker's health endpoint, which has been fixed and released in 2.14.18 as of a couple of minutes ago. This likely is not what both of you are encountering (but mentioning it on the off chance it is), and the released fix is not intended to address the issue linked above, which I suspect may be related to your issues.
@Adrien Besnard can you confirm whether you encountered the same issues as in the logs you shared above on 2.14.16?
a
Unfortunately I was on 2.14.17 😞
n
I was curious whether the issue also existed in 2.14.16. I don't think that the fix we released today (2.14.18) will solve your problem, but you could try upgrading. Otherwise, we should probably create an issue on prefect-kubernetes to track this specific issue, since drastically increasing the watch timeouts definitely doesn't seem acceptable.
j
I just ran into this myself on our AKS cluster. Upgraded from prefect 2.14.16 to 2.14.17 yesterday, which led to all kinds of issues; tried going to 2.14.18 this afternoon and that didn’t fix it, and was finally able to get things stable again in our environment by pinning the worker image to prefect 2.14.16.
n
hi @Jacob Hurlbut - can you explain what issues you ran into? Unfortunately, a couple of issues related to k8s workers have been reported recently; 2.14.18 fixed one of them (which, AFAICT, has not been reported in this thread).
j
Sure! It was super inconsistent; some flows would crash before the pod could start. Some flows would run tasks (per the task run viewer) but then the infrastructure would crash mid-execution. Then sometimes it would work perfectly fine (I don’t have the exact percentage handy, but the failure rate was probably between 30% and 40% with no discernible pattern)
Worth noting: to get our workspace back up and running again, the only thing that needed to be pinned to 2.14.16 was the worker (via the helm config); the actual flows are still running using customized docker images that are based on the prefecthq/prefect:2.14.18-python3.12-kubernetes image.
n
Okay, thank you for the context! We'll be looking into this. I'll just call out something strange I'm noticing here, which is that the "Pod never started" error is coming through after we clearly go through the Started / Created / Pulled events. cc @Uriel Mandujano
u
We recently addressed an issue around hanging Kubernetes connections in prefect-kubernetes v0.3.3, which we shipped in prefect v2.14.17. It's possible that something in that change is affecting flow runs. Do the prefect-worker logs or pod logs have any more information about what's going on?
j
I just tried to sift through the logs from the worker and flow run pods and couldn’t find anything useful. It just bails out with an UnfinishedRun exception with “Task run {ID} received abort during orchestration: The enclosing flow must be running to begin task execution. Task run is in PENDING state.” right after beginning a task (in the case of a flow dying mid-run)
u
I think we have a grip on what's going on, and we're working on a couple of fixes to prefect-kubernetes. For now, anyone using the helm chart should pin their chart version to avoid the releases that ship prefect-kubernetes v0.3.3 (2024.1.30 and 2024.1.25), and anyone directly using prefect-kubernetes should avoid v0.3.3. Our current goal is to include the fix in tomorrow's release!
❤️ 2
We just rolled out prefect-kubernetes v0.3.4 and released helm chart v2024.2.1, which bundles the new prefect-kubernetes release with prefect v2.14.19. We expect this release to address the issue, so please try it out at your convenience. Feel free to reply to this thread if you notice any more odd behavior coming from your Prefect Kubernetes workers!
a
Hello! I upgraded and indeed I no longer have any issues. Thanks a lot!
🙌 2
b
Amazing