# prefect-cloud
j
Hi all, we have two process work pools in Prefect Cloud to run timing-sensitive flows. They stopped executing new flows yesterday around 01:00 UTC. I somehow managed to unblock one of them yesterday, but the second one is still stuck. I can see thousands of late flow runs in the pool. The pool has status Ready. Same for the queue. And the worker appears as Online. I tried restarting workers, moving flows to on-demand ECS push pools, and increasing CPU/memory resources, but no luck. I also tried to replicate this locally by running:
PREFECT_LOGGING_LEVEL=DEBUG prefect worker start --pool $POOL_NAME --work-queue default --type process --with-healthcheck --install-policy never --prefetch-seconds 30
which does not discover any flow runs:
Worker 'ProcessWorker 2f013cba-dfd4-4492-8887-dd8205a1da23' started!
13:42:11.306 | DEBUG   | prefect.worker.process.processworker 2f013cba-dfd4-4492-8887-dd8205a1da23 - Worker synchronized with the Prefect API server.
13:42:16.310 | DEBUG   | prefect.utilities.services.critical_service_loop - Starting run of 'get_and_submit_flow_runs'
13:42:16.311 | DEBUG   | prefect.worker.process.processworker 2f013cba-dfd4-4492-8887-dd8205a1da23 - Querying for flow runs scheduled before 2024-08-05T13:42:46.311643+00:00
13:42:16.313 | DEBUG   | prefect.utilities.services.critical_service_loop - Starting run of 'sync_with_backend'
13:42:16.314 | DEBUG   | prefect.utilities.services.critical_service_loop - Starting run of 'check_for_cancelled_flow_runs'
13:42:16.314 | DEBUG   | prefect.worker.process.processworker 2f013cba-dfd4-4492-8887-dd8205a1da23 - Checking for cancelled flow runs...
13:42:16.464 | DEBUG   | prefect.worker.process.processworker 2f013cba-dfd4-4492-8887-dd8205a1da23 - Discovered 0 scheduled_flow_runs
13:42:16.705 | DEBUG   | prefect.worker.process.processworker 2f013cba-dfd4-4492-8887-dd8205a1da23 - Worker synchronized with the Prefect API server.
13:42:25.569 | DEBUG   | prefect.utilities.services.critical_service_loop - Starting run of 'get_and_submit_flow_runs'
13:42:25.570 | DEBUG   | prefect.worker.process.processworker 2f013cba-dfd4-4492-8887-dd8205a1da23 - Querying for flow runs scheduled before 2024-08-05T13:42:55.570226+00:00
13:42:25.723 | DEBUG   | prefect.worker.process.processworker 2f013cba-dfd4-4492-8887-dd8205a1da23 - Discovered 0 scheduled_flow_runs
13:42:37.262 | DEBUG   | prefect.utilities.services.critical_service_loop - Starting run of 'get_and_submit_flow_runs'
13:42:37.263 | DEBUG   | prefect.worker.process.processworker 2f013cba-dfd4-4492-8887-dd8205a1da23 - Querying for flow runs scheduled before 2024-08-05T13:43:07.263616+00:00
13:42:37.419 | DEBUG   | prefect.worker.process.processworker 2f013cba-dfd4-4492-8887-dd8205a1da23 - Discovered 0 scheduled_flow_runs
Any idea what might be going on here?
workers are using
prefect==2.19.8
j
hey, it looks like you have a concurrency limit of 50 on that work pool. Both pending and running flow runs count as in-progress and take up concurrency slots. If all the concurrency slots are consumed, new runs won't be released to the worker. Can you check whether you have any in-progress runs and potentially remove them?
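to make the slot accounting concrete, here's a toy model (stdlib-only, not Prefect's actual implementation): PENDING and RUNNING runs both occupy slots, so stuck runs can starve newly scheduled ones even though the pool shows Ready and the worker is Online:

```python
# Toy model of work-pool concurrency-slot accounting. This is an
# illustration, NOT Prefect's real code: PENDING and RUNNING runs both
# hold slots, so stuck runs block newly scheduled ones from release.
from dataclasses import dataclass


@dataclass
class FlowRun:
    id: str
    state: str  # "SCHEDULED", "PENDING", "RUNNING", "COMPLETED"


def available_slots(runs: list[FlowRun], limit: int) -> int:
    """Slots left after counting every in-progress (PENDING/RUNNING) run."""
    in_progress = sum(r.state in ("PENDING", "RUNNING") for r in runs)
    return max(limit - in_progress, 0)


def releasable(runs: list[FlowRun], limit: int) -> list[FlowRun]:
    """Scheduled runs the server would hand to a worker on this poll."""
    slots = available_slots(runs, limit)
    return [r for r in runs if r.state == "SCHEDULED"][:slots]


# 50 stuck RUNNING runs against a limit of 50: nothing is released,
# so newly scheduled runs pile up as "Late" while the worker logs
# "Discovered 0 scheduled_flow_runs".
stuck = [FlowRun(f"old-{i}", "RUNNING") for i in range(50)]
new = [FlowRun(f"new-{i}", "SCHEDULED") for i in range(3)]
print(len(releasable(stuck + new, limit=50)))  # 0
print(len(releasable(new, limit=50)))          # 3 once stuck runs are cleared
```

this matches what your worker logs show: the worker is polling fine, but the server releases nothing while the slots are full.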
j
no running flows, at least not in the past 30 days
actually, i've tried removing the concurrency limit and I suddenly see this in our logs
i suppose there were old flow runs stuck in the running state. is it possible to clean our database? I don't see a way to yank them from the UI, as we're limited to the 30d interval
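for reference, the query we'd want is something like this (illustrative, stdlib-only sketch; the field names are assumptions, not the Prefect API schema): in-progress runs whose start time predates the 30-day window the UI exposes:

```python
# Illustrative sketch of the cleanup query: find runs still in
# PENDING/RUNNING that started before the UI's 30-day window.
# Dict field names ("state", "start_time") are assumptions for this
# example, not the Prefect API schema.
from datetime import datetime, timedelta, timezone


def stale_in_progress(runs, now=None, max_age_days=30):
    """Runs still holding concurrency slots that predate the 30d window."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        r for r in runs
        if r["state"] in ("PENDING", "RUNNING") and r["start_time"] < cutoff
    ]


now = datetime(2024, 8, 5, tzinfo=timezone.utc)
runs = [
    {"id": "a", "state": "RUNNING", "start_time": now - timedelta(days=90)},
    {"id": "b", "state": "RUNNING", "start_time": now - timedelta(days=2)},
    {"id": "c", "state": "COMPLETED", "start_time": now - timedelta(days=90)},
]
print([r["id"] for r in stale_in_progress(runs, now=now)])  # ['a']
```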
j
I'm glad things are running at least, let me take a look and see if I can find any old PENDING/RUNNING runs that might explain this
j
thanks for looking into that. let me know if you need anything from my end