# marvin-ai
@Marvin I am using Azure Kubernetes Service, and after a version upgrade my Prefect workers have suddenly spiked in memory. I had a limit of 1Gi RAM before, which was never an issue, but now the worker gets OOMKilled even with 10Gi assigned; memory spikes immediately on start. Can you find anything related to this? The worker was updated from 3.3.6 to 3.4.20.
thought for 451.6 seconds
Thanks for the report; that sounds frustrating. I just searched our issues, discussions, and release notes, and I'm not seeing a known regression in 3.4.x that would explain a Kubernetes worker jumping from ~1Gi to >10Gi on startup. Links for reference:

- Prefect 3.4 release notes: 3.4.x release notes
- Prefect issues: prefect issues
- Prefect Kubernetes integration issues: prefect-kubernetes issues
- Worker concepts: Workers

To help us pinpoint this (and get you a workaround fast), can you share a few details?

- Worker type and how it's started
  - Are you running a Kubernetes worker (i.e., pointing the worker at a Kubernetes work pool)?
  - The exact command used to start the worker (from the container entrypoint or the manifest).
- Image and versions (one way to check these is sketched right after this list)
  - Which image tag are you using (e.g., `prefecthq/prefect:3.4.20-python3.11`)?
  - Which Python version is in the image?
  - Which version of `prefect-kubernetes` (if installed), and any other Prefect extras?
- Deployment/Helm
  - Are you using the Prefect Helm chart? If so, which chart version and which values for the worker?
- Logs and timing
  - Does the memory spike before or after the worker prints that it has started and is polling the pool?
  - Any unusual log lines at DEBUG level on startup?
  - Does downgrading the worker back to 3.3.6 immediately stop the spike?
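For the version questions, one way to pull those details from a running worker pod is sketched below; the namespace and pod name are placeholders, so substitute your own from `kubectl get pods`:

```bash
# Placeholders: swap in your namespace and the worker pod name.
kubectl exec -n prefect prefect-worker-abc123 -- prefect version
kubectl exec -n prefect prefect-worker-abc123 -- pip show prefect prefect-kubernetes
```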
Fast isolation steps you can try now

1) Quick bisect to find the first bad version
   - Try 3.4.0 and 3.4.10 for the worker image. If 3.3.6 is fine and 3.4.20 spikes, this will tell us where the regression started.

2) Run the worker with DEBUG logging to see if we can spot where it blows up
```bash
prefect worker start --pool <your-kubernetes-pool> --log-level DEBUG
```
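If the worker is managed by the Prefect Helm chart and you'd rather not edit its start command, bumping the log level via an environment variable should achieve the same thing; the deployment and namespace names below are placeholders:

```bash
# Set Prefect's log level on the worker container (names are placeholders).
kubectl set env deployment/prefect-worker -n prefect PREFECT_LOGGING_LEVEL=DEBUG
```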
3) Minimal local repro (optional)
   - If you can, run a worker container locally with the same tag and env (pointed at your Cloud/Server) to see whether the spike reproduces off-cluster. That helps us determine whether this is cluster/client related or package-level. A rough docker sketch follows the code block below.

4) Lightweight memory sampling during startup (optional)
   - If you can wrap the start command to sample the worker process's memory during its first 60 seconds, it helps narrow down the exact moment it spikes:
```bash
python - <<'PY'
# Sample the worker process's resident memory (RSS) once per second for 60s.
import os, subprocess, time
p = subprocess.Popen(["prefect", "worker", "start", "--pool", os.environ["PREFECT_POOL"]])
def rss_mb(pid):
    # Read VmRSS from /proc (Linux-only, which is fine inside the worker container).
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024  # kB -> MB
    return 0.0
for i in range(60):
    time.sleep(1)
    if p.poll() is not None:  # stop early if the worker already exited
        break
    print(f"[{i:02d}s] rss={rss_mb(p.pid):.1f}MB", flush=True)
p.wait()
PY
```
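For step 3, a rough local repro could look like the sketch below; it assumes Docker locally, your Prefect API URL and key, and the same image tag you run on AKS (the tag, pool name, and 1 GiB cap here are placeholders that mimic your old limit):

```bash
# Start the worker locally with a 1 GiB memory cap to mimic the old limit.
docker run --rm --name prefect-worker-test -m 1g \
  -e PREFECT_API_URL="<your-api-url>" \
  -e PREFECT_API_KEY="<your-api-key>" \
  prefecthq/prefect:3.4.20-python3.11 \
  prefect worker start --pool <your-kubernetes-pool>

# In a second terminal, watch memory while the worker starts up.
docker stats prefect-worker-test
```

If the container gets OOM-killed (exit code 137) here too, that points at the package rather than anything AKS-specific.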
Immediate workaround
- If you're blocked, pin the worker image back to 3.3.6 while we investigate (one way to do that is sketched at the end of this message).

If you can provide the details above (image tag, startup command, worker type/pool, logs around startup, and whether a bisect isolates a specific 3.4.x version), I'll dig in further and escalate. If we confirm a regression, I'll open an issue with a minimal repro to get it addressed quickly.
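If it helps, pinning via kubectl could look like this; the deployment and container names are placeholders (with the Helm chart, set the equivalent image tag in your values instead), and adjust the Python suffix to match your current image:

```bash
# Roll the worker image back to 3.3.6 (deployment/container names are placeholders).
kubectl set image deployment/prefect-worker -n prefect \
  prefect-worker=prefecthq/prefect:3.3.6-python3.11
```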