# ask-community
k
Hey all! Running into an issue with our Prefect deployments that I'm wondering if anyone else has seen or has any thoughts on. We have a large quantity of flows that are running/queued, but each flow itself should not take very long. For some reason, a large number of these flows are entering a Running state and then never exiting it. The associated Kubernetes Job for those flows gets cleaned up, but the flow run itself still shows as Running and is blocking the queue
n
hi @Kiley Roberson - if I'm understanding correctly, this is a relatively common failure mode where (for example) infra disappears and never reports back, so Prefect thinks the run is still running, which can cause concurrency slots to be hogged. can you think of any reasons why your flow run infra would go off the map (OOM etc)?
b
Hey Nate, I work with Kiley and am helping out trying to figure this out. We've certainly seen some examples where we get a "Job reached backoff limit" exception on our Kubernetes pod, which I believe is usually an OOM. But there are other examples where there are no logs from the pod at all, and when we go look for it, it doesn't exist in our Kubernetes cluster. If I'm understanding correctly, is what you're saying that if a pod dies, Prefect loses track of it and thus never completes/fails/crashes? Because I've seen tons of examples where we hit an OOM and Prefect is able to handle it and fail the flow. Not sure why this happens sometimes but not others.
n
gotcha, so it's k8s. we are working on improving the Kubernetes worker to handle retries better (0.6.0a2). one thing you might want to look into is the heartbeat automation. any details about specific cases would be helpful for us!
> If I'm understanding correctly, is what you're saying that if a pod dies, Prefect loses track of it and thus never completes/fails/crashes?
a couple caveats: yes, if
• the python process doesn't have time to clean up when the pod dies
• you are not using the heartbeat automation that marks the run as crashed
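(Editor's note: for readers following along, the heartbeat automation referenced here is roughly a proactive event trigger: it expects a `prefect.flow-run.heartbeat` event for each running flow within some window, and marks the run Crashed when one doesn't arrive. The sketch below approximates the shape of the example in the Prefect docs; field names and values are illustrative and may differ across Prefect versions, so check the current docs before using it.)

```json
{
  "name": "Crash zombie flows",
  "trigger": {
    "type": "event",
    "posture": "Proactive",
    "expect": ["prefect.flow-run.heartbeat"],
    "match": {"prefect.resource.id": ["prefect.flow-run.*"]},
    "for_each": ["prefect.resource.id"],
    "threshold": 1,
    "within": 90
  },
  "actions": [
    {
      "type": "change-flow-run-state",
      "state": "CRASHED",
      "message": "Flow run marked as crashed due to missing heartbeats."
    }
  ]
}
```

The `within` window should comfortably exceed the heartbeat interval (e.g. a few missed beats), or healthy runs will be crashed spuriously.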
b
ok that looks interesting. can I create this automation via the UI?
or does this need to be created on a per-flow basis?
also do we need to do anything to ensure we are firing heartbeats? or is that done by default?
n
you can do it via the UI like any other automation, and the docs section I linked explains which env var to set to emit heartbeats
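(Editor's note: a minimal sketch of the env var in question, assuming it is `PREFECT_RUNNER_HEARTBEAT_FREQUENCY` as named later in the thread. In practice you would set this in the environment of the flow-run infrastructure, e.g. via the Kubernetes work pool's base job template or a deployment's job variables, not in local Python; the snippet below just illustrates the setting itself.)

```python
import os

# Illustrative only: PREFECT_RUNNER_HEARTBEAT_FREQUENCY is the number of
# seconds between heartbeat events emitted by the flow run's runner.
# Unset, no heartbeats are emitted. On Kubernetes, put this in the
# flow-run pod's environment (base job template / job variables).
os.environ["PREFECT_RUNNER_HEARTBEAT_FREQUENCY"] = "30"

print(os.environ["PREFECT_RUNNER_HEARTBEAT_FREQUENCY"])
```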
b
alright, I'll give that a shot. I had tried creating an automation to just close flows that had run for > 24 hours, but that did not work (it wasn't based on heartbeats, though)
n
cool! feel free to create a discussion / issue if something doesn't work as expected
b
we set up the automation and it seems to be doing stuff, so that is a good start! One thing I'm worried about, though: we're on the free tier and have run into rate limits in the past. I'm curious if this will contribute to that? would you recommend just expanding the interval?
[attachment: Screenshot 2025-03-20 at 11.41.00 AM.png]
getting pretty consistent rate limit errors now, but our chart shows that we're mostly under the limit
the one anomaly was ~30 min ago
n
heartbeats are based on events, and the client-side emission of those events does not contribute directly to orchestration rate limits. however, how fast Prefect Cloud can receive those events is rate limited based on your orchestration rate limit for task run events
what's your interval set at?
b
@Kiley Roberson
k
Not totally sure where I should look to find that number; is that something that is specific to a work pool, or is it just our default rate limit in the Rate Limit tab?
b
is it heartbeat interval you're asking about?
k
we set PREFECT_RUNNER_HEARTBEAT_FREQUENCY to 90
šŸ‘ 1
n
yes, i.e. how often you are emitting events you want Prefect Cloud to see
hmm, 90s is relatively infrequent, so I would be very surprised if the introduction of heartbeats was causing rate limit breaches. let me get a hold of someone who has access to account details (I do not)
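(Editor's note: a quick back-of-the-envelope on heartbeat event volume supports this. The concurrent-run count below is a made-up number for illustration, not from this thread; plug in your own peak concurrency.)

```python
# Estimate heartbeat events per minute at a 90s interval.
concurrent_runs = 200          # hypothetical peak concurrency
heartbeat_interval_s = 90      # PREFECT_RUNNER_HEARTBEAT_FREQUENCY

events_per_minute = concurrent_runs * 60 / heartbeat_interval_s
print(round(events_per_minute, 1))  # ~133 events/minute at 200 runs
```

Even a few hundred concurrent runs heartbeating every 90s produces on the order of a hundred-odd events per minute, which is small next to the event traffic a busy workspace already generates from task run state changes.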
b
thanks Nate. Would love to get on a call with someone who can help us parse out what is going on here; the rate limit chart is pretty limited and we're honestly pretty confused, plus we want to explore upgrading to Pro