# marvin-in-the-wild
Tita:
@Marvin all of my runs are suddenly stuck into "late" status. What's going on and how to debug the issue?
Marvin:
Based on the information from the GitHub issues and Prefect documentation, here are some potential reasons and steps to debug runs that are stuck in the Late state:

1. Prevent running old flow runs: there's a known issue where flows that didn't start on time are marked as Late and still get picked up and run much later. This can cause unexpected behavior, especially if the flow is very late (e.g., 25 days or more). GitHub issue #9054 discusses configuring a maximum threshold for late runs and transitioning them to a Failed or Crashed state instead of running them.
2. Capacity limiter errors: errors like "this borrower is already holding one of this CapacityLimiter's tokens" can cause agents to stop processing new tasks, effectively stalling your runs. Check whether your agents are hanging and no longer picking up newly scheduled flow runs, as detailed in issue #10632.
3. Work pool configuration: ensure that your work pools are configured correctly. Running deployments in pools other than the default one might cause agents to raise warnings and not execute the flow run, as explained in issue #9569.
4. Debugging late flow runs: the Prefect documentation provides a way to reschedule or delete late flow runs. Here's a code snippet to help you reschedule or delete them:

    from datetime import datetime, timedelta, timezone

    from prefect import get_client
    from prefect.client.schemas.filters import DeploymentFilter, FlowRunFilter
    from prefect.client.schemas.sorting import FlowRunSort
    from prefect.states import Scheduled


    async def reschedule_late_flow_runs(
        deployment_name: str,
        delay: timedelta,
        most_recent_n: int,
        delete_remaining: bool = True,
    ) -> list:
        states = ["Late"]
        async with get_client() as client:
            # find Late runs of this deployment whose expected start time has already passed
            flow_runs = await client.read_flow_runs(
                flow_run_filter=FlowRunFilter(
                    state=dict(name=dict(any_=states)),
                    expected_start_time=dict(before_=datetime.now(timezone.utc)),
                ),
                deployment_filter=DeploymentFilter(name={"like_": deployment_name}),
                sort=FlowRunSort.START_TIME_DESC,
                limit=most_recent_n if not delete_remaining else None,
            )

            if not flow_runs:
                print(f"No flow runs found in states: {states!r}")
                return []

            rescheduled_flow_runs = []
            for i, flow_run in enumerate(flow_runs):
                # delete the late run; recreate only the most recent N with a delay
                await client.delete_flow_run(flow_run_id=flow_run.id)
                if i < most_recent_n:
                    new_run = await client.create_flow_run_from_deployment(
                        deployment_id=flow_run.deployment_id,
                        state=Scheduled(scheduled_time=flow_run.expected_start_time + delay),
                    )
                    rescheduled_flow_runs.append(new_run)

            return rescheduled_flow_runs


    # Example usage
    if __name__ == "__main__":
        from asyncio import run

        rescheduled_flow_runs = run(
            reschedule_late_flow_runs(
                deployment_name="your-deployment-name",
                delay=timedelta(hours=6),
                most_recent_n=3,
            )
        )
        print(f"Rescheduled {len(rescheduled_flow_runs)} flow runs")

By following these steps, you should be able to identify and resolve the issue of your runs being stuck in the Late state. If the problem persists, please refer to the GitHub issues and documentation linked below for more context and potential fixes.
- Prevent running late runs: https://github.com/PrefectHQ/prefect/issues/9054
- Capacity limiter errors: https://github.com/PrefectHQ/prefect/issues/10632
- Agents canceling wrong work pools: https://github.com/PrefectHQ/prefect/issues/9569
- Reschedule late flow runs: documentation provided above.
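If you'd rather stop very late runs from executing at all (the behavior requested in issue #9054), a variation on the snippet above can move them to a Cancelled state instead of rescheduling them. This is only an illustrative sketch against the same Prefect client APIs used above; `cancel_very_late_flow_runs` and its `max_lateness` threshold are made up for this example, not built-in Prefect features.

    from datetime import datetime, timedelta, timezone

    from prefect import get_client
    from prefect.client.schemas.filters import FlowRunFilter
    from prefect.states import Cancelled


    async def cancel_very_late_flow_runs(max_lateness: timedelta) -> int:
        """Move Late runs older than `max_lateness` into Cancelled so they never execute."""
        cutoff = datetime.now(timezone.utc) - max_lateness
        async with get_client() as client:
            late_runs = await client.read_flow_runs(
                flow_run_filter=FlowRunFilter(
                    state=dict(name=dict(any_=["Late"])),
                    expected_start_time=dict(before_=cutoff),
                ),
            )
            for flow_run in late_runs:
                # force the transition so the run can't be picked up later
                await client.set_flow_run_state(
                    flow_run_id=flow_run.id,
                    state=Cancelled(message="Exceeded maximum allowed lateness"),
                    force=True,
                )
            return len(late_runs)

Calling it with, say, `max_lateness=timedelta(days=1)` cancels anything that has been Late for more than a day while leaving recently late runs alone.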
Nate:
the most direct answer with lots of late runs is: "check your worker, it might have died"
if runs continue to be scheduled but they're not getting picked up, maybe no one's there to pick them up
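One rough way to confirm whether anything is polling a work pool is to read the pool and its queues back through the client, as in the sketch below. The helper name and pool name are placeholders, and the `last_polled` field may not be populated for every pool type; with a Prefect-managed pool there is no worker on your side to restart, so this mainly helps rule out paused pools or queues.

    from prefect import get_client


    async def check_pool_polling(work_pool_name: str) -> None:
        """Print a work pool's status and when each of its queues was last polled."""
        async with get_client() as client:
            pool = await client.read_work_pool(work_pool_name)
            print(f"pool {pool.name!r}: type={pool.type}, paused={pool.is_paused}")
            queues = await client.read_work_queues(work_pool_name=work_pool_name)
            for queue in queues:
                # a missing or stale last_polled timestamp suggests nothing is picking up runs
                print(f"  queue {queue.name!r}: paused={queue.is_paused}, last polled {queue.last_polled}")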
Tita:
@Nate thanks for the pointer. How do we make sure they're getting picked up? I looked at the home dashboard and it says the work queues are ready.
I'm using Prefect-managed resources
Nate:
> I'm using Prefect-managed resources
hmm, do you mean that you're using a Prefect-managed work pool?
Tita:
yes
Nate:
would you be willing to DM a flow run URL?
Tita:
just dm'ed you. Thanks!
Nate:
hey @Tita I don't see anything so weird. I'm wondering about concurrency slots being filled, though; I see some of your pools have concurrency limits set, so I would check:
• what is your shared concurrency limit? often it's 10
• do you have any old runs hanging around? (go to the `Runs` page and filter for runs in `Running` state; a quick client-side check is sketched below)
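As a rough sketch of that second check, counting runs currently in `Running` state against your limit, using the same client APIs as the snippet above (the helper name and the default limit of 10 are just illustrative):

    from prefect import get_client
    from prefect.client.schemas.filters import FlowRunFilter


    async def count_running_flow_runs(limit: int = 10) -> None:
        """List runs currently in Running state and compare the count against a concurrency limit."""
        async with get_client() as client:
            running = await client.read_flow_runs(
                flow_run_filter=FlowRunFilter(state=dict(name=dict(any_=["Running"]))),
            )
            print(f"{len(running)} flow runs in Running state (limit: {limit})")
            for flow_run in running:
                # long-forgotten runs listed here may be holding concurrency slots
                print(f"  {flow_run.name}  started={flow_run.start_time}")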
Tita:
total concurrency limit is 10. Btw, just had a call with a Prefect rep, Mihir. The problem is solved for now: what we did was remove most of the crashed, failed, and late runs. If a run is failed or crashed, does it still count toward the concurrent runs? And does it still consume memory?
Nate:
the ones in `Running` state will count toward your limit, and whether or not they're consuming memory depends on whether the container is actually alive and running. unfortunately, sometimes the Prefect API loses track of a container while it's running, leaving the flow run in a `Running` state even though the container has exited. does that answer your question?
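If you do find stale runs like that, one illustrative way to clear them is to force them into a Crashed state so they stop holding concurrency slots. This is a sketch, not an official recipe: the helper name and the one-day staleness cutoff are arbitrary, and `force=True` bypasses normal orchestration rules, so use it only on runs you're sure are dead.

    from datetime import datetime, timedelta, timezone

    from prefect import get_client
    from prefect.client.schemas.filters import FlowRunFilter
    from prefect.states import Crashed


    async def crash_stale_running_runs(older_than: timedelta = timedelta(days=1)) -> int:
        """Force Running-state runs that started before a cutoff into Crashed."""
        cutoff = datetime.now(timezone.utc) - older_than
        async with get_client() as client:
            stale = await client.read_flow_runs(
                flow_run_filter=FlowRunFilter(
                    state=dict(name=dict(any_=["Running"])),
                    start_time=dict(before_=cutoff),
                ),
            )
            for flow_run in stale:
                # force=True skips orchestration rules so the stuck run is released
                await client.set_flow_run_state(
                    flow_run_id=flow_run.id,
                    state=Crashed(message="Marked crashed: infrastructure lost"),
                    force=True,
                )
            return len(stale)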
Tita:
got it, thanks!