# ask-community
t
We’ve noticed that Prefect is sometimes unable to clean up after flows or tasks that fail unexpectedly without allowing Prefect exception hooks to run (for example, if a debugged local run is abruptly exited, or if a flow run crashes on infrastructure). This can leave “zombie” flows that show as “Running” in the UI for days when in reality they stopped long ago. A zombie can also occupy a task concurrency slot that will never be freed, because the task exited abruptly before its cleanup code could execute. We need a way to clean up the flows and task concurrency slots that are left as zombies. Does anyone from Prefect have a cleanup flow, or does anyone have some cleanup code they are willing to share? CC: @Marvin
m
Ah, the delightful world of "zombie" flows and tasks. It seems like they have more lifetimes than a cat! If you are using Prefect 2.x and encounter these "zombie" flows, you might need to manually delete the pending flow runs that were left dangling after an agent's abrupt stop. You can find a reference for this issue in this documentation. For Prefect 3.x, managing task and flow concurrency involves setting global concurrency limits with slot-decay options. This approach helps prevent slots from being perpetually occupied by releasing them over time; you can configure `slot_decay_per_second` to control the pace at which slots are released. Details on this can be found here. In both versions, consider using a script to cancel any flow runs stuck in a "Running" state for an unreasonable period. Here's a snippet to cancel such flows, adapted from the Prefect 2.x documentation:
```python
from prefect import get_client
from prefect.client.schemas.objects import StateType

async def cancel_flow_runs(flow_runs):
    async with get_client() as client:
        for idx, flow_run in enumerate(flow_runs):
            print(f"[{idx + 1}] Cancelling flow run '{flow_run.name}' with ID '{flow_run.id}'")
            state_updates = {"name": "Cancelled", "type": StateType.CANCELLED}
            state = flow_run.state.copy(update=state_updates)
            await client.set_flow_run_state(flow_run.id, state, force=True)  # force past transition checks
```
By automating this sort of clean-up, you give those over-occupied task concurrency slots a chance to breathe again. I promise, even AI doesn't enjoy being kept in a zombie-like, "always working, never resting" state! If you wish to dive deeper into any specific areas, feel free to nudge me again!
b
Hi Tom! For cleanup, have you considered using an automation to enforce SLAs? I.e., if a flow stays in Running for more than 10 hours, cancel the run (see the sketch below).
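For reference, a minimal sketch of such an automation, assuming the Prefect 3.x automations API (`Automation`, `EventTrigger`, `Posture`, `CancelFlowRun`); the automation name and the 10-hour window are illustrative, and the same automation can be built in the UI instead:
```python
from datetime import timedelta

from prefect.automations import Automation
from prefect.events.actions import CancelFlowRun
from prefect.events.schemas.automations import EventTrigger, Posture

# Proactive trigger: for each flow run that enters Running, expect a
# terminal event within 10 hours; if none arrives, cancel the run.
Automation(
    name="cancel-zombie-flow-runs",
    trigger=EventTrigger(
        after={"prefect.flow-run.Running"},
        expect={
            "prefect.flow-run.Completed",
            "prefect.flow-run.Failed",
            "prefect.flow-run.Cancelled",
            "prefect.flow-run.Crashed",
        },
        for_each={"prefect.resource.id"},
        posture=Posture.Proactive,
        threshold=1,
        within=timedelta(hours=10),
    ),
    actions=[CancelFlowRun()],
).create()
```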
Otherwise, if you'd rather handle this with the client, here is an example you can adapt to your needs. As it stands, it looks for flow runs stuck in "bad states" and then deletes them. I'd imagine you could tweak it to filter by flow run start time (checking for runs older than some threshold) and then force them into a failed state, or delete them; a sketch of that filtering follows.
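A minimal sketch of that filtering, assuming the client-side filters available in Prefect 2.x/3.x and reusing `cancel_flow_runs` from Marvin's snippet above; the 10-hour cutoff is illustrative:
```python
import asyncio
from datetime import datetime, timedelta, timezone

from prefect import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterStartTime,
    FlowRunFilterState,
    FlowRunFilterStateName,
)

async def find_zombie_runs(stuck_after: timedelta = timedelta(hours=10)):
    # Runs still marked "Running" that started before the cutoff are
    # assumed to be zombies left behind by an abrupt exit.
    cutoff = datetime.now(timezone.utc) - stuck_after
    async with get_client() as client:
        return await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    name=FlowRunFilterStateName(any_=["Running"])
                ),
                start_time=FlowRunFilterStartTime(before_=cutoff),
            )
        )

if __name__ == "__main__":
    zombies = asyncio.run(find_zombie_runs())
    asyncio.run(cancel_flow_runs(zombies))  # from the snippet above
```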
t
Hi Bianca, an automation sounds like it might be a good idea. Thanks for the code pointer!