# ask-community
k
Hey all! Running into an issue with our Prefect deployments that I'm wondering if anyone else has seen or has any thoughts on. We have a large quantity of flows that are running/queued, but each flow itself should not take very long. For some reason, a large number of these flows are entering a Running state and then never exiting it. The associated Kubernetes Job for those flows gets cleaned up, but the flow run itself still shows as Running and is blocking the queue
n
hi @Kiley Roberson - if I'm understanding correctly, this is a relatively common failure mode where (for example) infra disappears and never reports back, so Prefect thinks the run is still running, which can cause concurrency slots to be hogged. can you think of any reasons why your flow run infra would go off the map (OOM etc)?
b
Hey Nate, I work with Kiley and am helping out trying to figure this out. We've certainly seen some examples where we get a "Job reached backoff limit" exception on our Kubernetes pod, which I believe is usually an OOM. But there are other examples where there are no logs from the pod at all, and when we go look for it, it doesn't exist in our Kubernetes cluster. If I'm understanding correctly, is what you're saying that if a pod dies, Prefect loses track of it and thus never completes/fails/crashes? Because I've seen tons of examples where we hit an OOM and Prefect is able to handle it and fail the flow. Not sure why this happens sometimes but not others.
n
gotcha, so it's k8s. we are working on improving the Kubernetes worker to handle retries better (0.6.0a2). one thing you might want to look into is the heartbeat automation. any details about specific cases would be helpful for us!
> If I'm understanding correctly, is what you're saying that if a pod dies, Prefect loses track of it and thus never completes/fails/crashes?
a couple caveats: yes, if
• the python process doesn't have time to clean up when the pod dies
• you are not using the heartbeat automation that marks the run as crashed
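(Editor's note: for readers following along, the heartbeat automation referenced here is roughly a proactive event trigger: it expects a `prefect.flow-run.heartbeat` event for each running flow within some window, and marks the run Crashed when one doesn't arrive. The sketch below approximates the shape of the example in the Prefect docs; field names and values are illustrative and may differ across Prefect versions, so check the current docs before using it.)

```json
{
  "name": "Crash zombie flows",
  "trigger": {
    "type": "event",
    "posture": "Proactive",
    "expect": ["prefect.flow-run.heartbeat"],
    "match": {"prefect.resource.id": ["prefect.flow-run.*"]},
    "for_each": ["prefect.resource.id"],
    "threshold": 1,
    "within": 90
  },
  "actions": [
    {
      "type": "change-flow-run-state",
      "state": "CRASHED",
      "message": "Flow run marked as crashed due to missing heartbeats."
    }
  ]
}
```

The `within` window should comfortably exceed the heartbeat interval (e.g. a few missed beats), or healthy runs will be crashed spuriously.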
b
ok that looks interesting. can I create this automation via the UI?
or does this need to be created on a per-flow basis?
also do we need to do anything to ensure we are firing heartbeats? or is that done by default?
n
you can do it via the UI like any other automation, and the docs section I linked explains which env var to set to emit heartbeats
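(Editor's note: a minimal sketch of the env var in question, assuming it is `PREFECT_RUNNER_HEARTBEAT_FREQUENCY` as named later in the thread. In practice you would set this in the environment of the flow-run infrastructure, e.g. via the Kubernetes work pool's base job template or a deployment's job variables, not in local Python; the snippet below just illustrates the setting itself.)

```python
import os

# Illustrative only: PREFECT_RUNNER_HEARTBEAT_FREQUENCY is the number of
# seconds between heartbeat events emitted by the flow run's runner.
# Unset, no heartbeats are emitted. On Kubernetes, put this in the
# flow-run pod's environment (base job template / job variables).
os.environ["PREFECT_RUNNER_HEARTBEAT_FREQUENCY"] = "30"

print(os.environ["PREFECT_RUNNER_HEARTBEAT_FREQUENCY"])
```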
b
alright, I'll give that a shot. I had tried creating an automation to just close flows that had run for > 24 hours, but that did not work (it wasn't based on heartbeats, though)
n
cool! feel free to create a discussion / issue if something doesn't work as expected
b
we set up the automation and it seems to be doing stuff, so that is a good start! One thing I'm worried about, though: we're on the free tier and have run into rate limits in the past. I'm curious if this will contribute to that? would you recommend just expanding the interval?
[attachment: Screenshot 2025-03-20 at 11.41.00 AM.png]
getting pretty consistent rate limit errors now, but our chart shows that we're mostly under the limit
the one anomaly was ~30 min ago
n
heartbeats are based on events, and the client-side emission of those events does not contribute directly to orchestration rate limits. however, how fast Prefect Cloud can receive those events is rate limited based on your orchestration rate limit for task run events
what's your interval set at?
b
@Kiley Roberson
k
Not totally sure where I should look to find that number; is that something that is specific to a work pool, or is it just our default rate limit in the Rate Limit tab?
b
is it heartbeat interval you're asking about?
k
we set PREFECT_RUNNER_HEARTBEAT_FREQUENCY to 90
šŸ‘ 1
n
yes, i.e. how often you are emitting events you want Prefect Cloud to see
hmm, 90s is relatively infrequent, so I would be very surprised if the introduction of heartbeats was causing rate limit breaches. let me get a hold of someone who has access to account details (I do not)
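(Editor's note: a quick back-of-the-envelope on heartbeat event volume supports this. The concurrent-run count below is a made-up number for illustration, not from this thread; plug in your own peak concurrency.)

```python
# Estimate heartbeat events per minute at a 90s interval.
concurrent_runs = 200          # hypothetical peak concurrency
heartbeat_interval_s = 90      # PREFECT_RUNNER_HEARTBEAT_FREQUENCY

events_per_minute = concurrent_runs * 60 / heartbeat_interval_s
print(round(events_per_minute, 1))  # ~133 events/minute at 200 runs
```

Even a few hundred concurrent runs heartbeating every 90s produces on the order of a hundred-odd events per minute, which is small next to the event traffic a busy workspace already generates from task run state changes.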
b
thanks Nate. Would love to get on a call with someone who can help us parse out what is going on here; the rate limit chart is pretty limited and we're honestly pretty confused, plus we want to explore upgrading to Pro