# prefect-kubernetes
x
Hello Prefect team, We are having some complications with running long-running flows in Prefect Kubernetes, details in 🧵 Thanks!
Our flow runs can extend beyond 3 days because they are designed to process massive amounts of data. The way we made this work, each task initiated within the flow only processes a small batch of data, and each task typically takes less than 3 minutes to complete. The extended duration of the flow runs carries a higher risk of Kubernetes shutting down the associated pods, and we've already observed instances where pods were terminated after running for approximately 25 hours. In these cases, when the flow pod is shut down, the flow run fails to recover automatically: the Prefect UI simply continues to display the last task initiated by the flow (each designed to process a small data batch) as running, rather than recognizing it as a failure. To deal with this, we have introduced timeouts into our tasks:
```python
from prefect import task


@task(
    persist_result=True,
    retries=2,
    retry_delay_seconds=10,
    timeout_seconds=60 * 30,
)
def process_batch():
    # business logic
    ...
```
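Roughly, the overall flow looks like the sketch below; `process_all_data`, the batch count, and the loop are simplified placeholders rather than our real code. Our understanding is that `persist_result=True` should let a retried or resumed run skip batches that already completed.

```python
from prefect import flow, task


@task(
    persist_result=True,
    retries=2,
    retry_delay_seconds=10,
    timeout_seconds=60 * 30,
)
def process_batch(batch_index: int):
    # work on one small batch; normally finishes in under 3 minutes
    ...


@flow
def process_all_data(num_batches: int = 2000):
    # the full run can extend past 3 days even though each task is short
    for batch_index in range(num_batches):
        process_batch.submit(batch_index)
```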
But we found that this approach still doesn't address our issue: the task can still appear to run for a couple of hours without failing on its own. We did observe that when the pod gets removed, we can still manually pause the flow run and then resume it, and that brings up a new flow pod which continues running from the last unfinished task. Do you have any suggestions on steps we could take to enable the flow run to recover automatically upon flow pod removal?
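For reference, today we do that pause/resume by hand in the Cloud UI; a rough sketch of the same workaround scripted with the out-of-process helpers might look like the snippet below (assuming `pause_flow_run` / `resume_flow_run` can be imported from the top-level `prefect` package and called with a flow run id, and using a placeholder id):

```python
from prefect import pause_flow_run, resume_flow_run

# placeholder id of the stuck flow run whose pod was removed
FLOW_RUN_ID = "00000000-0000-0000-0000-000000000000"

# pause the run that Kubernetes orphaned ...
pause_flow_run(flow_run_id=FLOW_RUN_ID)

# ... then resume it; this is what brings up a new flow pod that
# continues from the last unfinished task
resume_flow_run(flow_run_id=FLOW_RUN_ID)
```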
(btw, we are on prefect cloud)
m
Created an issue for this, we are seeing it in a few scenarios. https://github.com/PrefectHQ/prefect-kubernetes/issues/120