
Joish

07/26/2023, 7:15 AM
Hello, I have a long-running job in which a parent flow initiates 30 child flows. Each child flow consists of 13 to 15 long-running tasks, each taking roughly 1 hour to 1 hour 15 minutes. The tasks run in parallel, but they start based on worker availability: we have 3 workers, so 3 tasks run at a time, and as each worker finishes, a new task is assigned to it. Note that we are on Kubernetes, and each flow run is represented as a Job in our K8s cluster.

Issue 1: Out of the 30 flows, 3 completed without any problems. The remaining 27 flows are still running, each with one task stuck in a Pending state. Examining the logs in both the Prefect Cloud UI and the K8s job pod showed no errors, yet the Prefect agent reported this error:
### Error trace-stack ###
17:59:25.058 | ERROR   | prefect.agent - An error occured while monitoring flow run 'fb822b8d-52ba-4781-a511-0a7bf3694308'. The flow run will not be marked as failed, but an issue may have occurred.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/agent.py", line 485, in _submit_run_and_capture_errors
    result = await infrastructure.run(task_status=task_status)
  File "/usr/local/lib/python3.10/site-packages/prefect/infrastructure/kubernetes.py", line 308, in run
    status_code = await run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/site-packages/prefect/infrastructure/kubernetes.py", line 668, in _watch_job
    for event in watch.stream(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/watch/watch.py", line 182, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 170074095 (170075178)
Has anyone encountered this issue before? Any suggestions on how to address it?

Issue 2: To resolve the pending tasks, I tried retrying them. The retry worked properly in the UI, but it created new jobs without cleaning up the old, unfinished jobs in the K8s cluster. Strangely, the newly created jobs were cleaned up correctly once their respective tasks completed.

Any insights or advice on dealing with this behaviour would be greatly appreciated. Thanks in advance.
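For context on the 410 in the trace above: the Kubernetes watch API returns HTTP 410 Gone when a watch's resourceVersion has been compacted away on the API server, which is common for watches that stay open as long as these hour-plus jobs do. The generic workaround is to catch the 410 and re-establish the watch from current state. Below is a minimal sketch using the kubernetes Python client; the function name and job-selection details are illustrative, and this is the general pattern, not Prefect's actual implementation (checking whether a newer Prefect release handles the 410 internally is also worthwhile).

```python
from kubernetes import client, watch

def watch_job(batch_v1: client.BatchV1Api, namespace: str, job_name: str):
    """Stream events for a Job, restarting the watch whenever the
    server reports an expired resourceVersion (HTTP 410)."""
    w = watch.Watch()
    resource_version = None
    while True:
        try:
            for event in w.stream(
                batch_v1.list_namespaced_job,
                namespace=namespace,
                field_selector=f"metadata.name={job_name}",
                resource_version=resource_version,
            ):
                job = event["object"]
                # Remember where we are so a restart resumes nearby.
                resource_version = job.metadata.resource_version
                if job.status.completion_time or (job.status.failed or 0) > 0:
                    return job
        except client.exceptions.ApiException as exc:
            if exc.status == 410:
                # Watch history was compacted; restart from current state.
                resource_version = None
                continue
            raise
```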

Deceivious

07/26/2023, 8:26 AM
That is interesting. We have a similar setup, but we don't run things in parallel ourselves; we let Prefect handle the parallelism via run_deployment calls. I have around 400+ scheduled runs right now, and it's been that way for over a week without seeing this issue. I believe runs only transition to the Pending / Running state once a job has been created by the worker, so it's interesting that your flows are all in the Running state. Either way, following this thread.
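For anyone unfamiliar with the pattern Deceivious describes: the parent flow calls run_deployment for each child instead of invoking the subflows directly, so each child becomes its own flow run (and its own K8s job), and the work queue's concurrency settings govern how many run at once. A minimal sketch, assuming Prefect 2.x and a hypothetical deployment named child-flow/k8s:

```python
from prefect import flow
from prefect.deployments import run_deployment

@flow
def parent():
    # Submit 30 child flow runs; each becomes its own K8s job, and
    # worker/work-queue concurrency decides how many run at a time.
    return [
        run_deployment(
            name="child-flow/k8s",    # hypothetical deployment name
            parameters={"shard": i},  # illustrative parameter
            timeout=0,  # return immediately; don't block on completion
        )
        for i in range(30)
    ]
```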

Joish

07/27/2023, 5:02 PM
#prefect-community Any insights or advice....

Deceivious

07/27/2023, 5:03 PM
What was the status of the older pods in Kubernetes?

Joish

07/27/2023, 5:28 PM
Running
I had to go and delete them manually
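If you need to clean these up programmatically rather than by hand, one option is to delete the leftover Job by its flow-run label once the run has finished in the UI. A minimal sketch, assuming the agent labels its jobs with prefect.io/flow-run-id (verify the actual label key with `kubectl get jobs --show-labels` before relying on it):

```python
from kubernetes import client, config

def delete_prefect_job(flow_run_id: str, namespace: str = "default") -> None:
    """Delete the leftover Job (and its pods) for a flow run that has
    already finished in the Prefect UI.

    Assumes jobs carry a prefect.io/flow-run-id label; check your
    cluster first, as the label key may differ by Prefect version.
    """
    config.load_kube_config()
    batch = client.BatchV1Api()
    jobs = batch.list_namespaced_job(
        namespace, label_selector=f"prefect.io/flow-run-id={flow_run_id}"
    )
    for job in jobs.items:
        batch.delete_namespaced_job(
            name=job.metadata.name,
            namespace=namespace,
            propagation_policy="Foreground",  # cascade-delete the pods too
        )
```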

Tanishq Hooda

09/25/2023, 9:43 AM
Hi @Joish, I'm facing a similar issue. Were you able to resolve it?

Joish

09/25/2023, 9:45 AM
Yeah, not completely though

Tanishq Hooda

09/25/2023, 9:46 AM

Joish

09/25/2023, 9:47 AM
Not sure on this, but let me check