# prefect-kubernetes
x
Hello Prefect team, We are having some complications with running long-running flows in Prefect Kubernetes, details in 🧵 Thanks!
Our flow runs can extend beyond 3 days because they are designed to process massive amounts of data. The way we made this work, each task initiated within the flow only processes a small batch of data, and each task typically takes less than 3 minutes to complete. The extended duration of the flow runs carries a higher risk of Kubernetes shutting down the associated pods, and we've already observed instances where pods were terminated after running for approximately 25 hours. In these cases, when the flow pod is shut down, the flow run fails to recover automatically: the Prefect UI simply continues to display the last task initiated by the flow (each designed to process a small data batch) as running, rather than recognizing it as a failure. To deal with this, we have introduced timeouts into our tasks:
```python
from prefect import task


@task(
    persist_result=True,
    retries=2,
    retry_delay_seconds=10,
    timeout_seconds=60 * 30,
)
def process_batch():
    # business logic
    ...
```
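Roughly, the overall flow looks like the sketch below; `process_all_data`, the batch count, and the loop are simplified placeholders rather than our real code. Our understanding is that `persist_result=True` should let a retried or resumed run skip batches that already completed.

```python
from prefect import flow, task


@task(
    persist_result=True,
    retries=2,
    retry_delay_seconds=10,
    timeout_seconds=60 * 30,
)
def process_batch(batch_index: int):
    # work on one small batch; normally finishes in under 3 minutes
    ...


@flow
def process_all_data(num_batches: int = 2000):
    # the full run can extend past 3 days even though each task is short
    for batch_index in range(num_batches):
        process_batch.submit(batch_index)
```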
But we found that this approach still doesn't address our issue: the task can still appear to run for a couple of hours without failing on its own. We did observe that when the pod gets removed, we can still manually pause the flow run and then resume it, and that brings up a new flow pod which continues running from the last unfinished task. Do you have any suggestions on steps we could take to enable the flow run to recover automatically upon flow pod removal?
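For reference, today we do that pause/resume by hand in the Cloud UI; a rough sketch of the same workaround scripted with the out-of-process helpers might look like the snippet below (assuming `pause_flow_run` / `resume_flow_run` can be imported from the top-level `prefect` package and called with a flow run id, and using a placeholder id):

```python
from prefect import pause_flow_run, resume_flow_run

# placeholder id of the stuck flow run whose pod was removed
FLOW_RUN_ID = "00000000-0000-0000-0000-000000000000"

# pause the run that Kubernetes orphaned ...
pause_flow_run(flow_run_id=FLOW_RUN_ID)

# ... then resume it; this is what brings up a new flow pod that
# continues from the last unfinished task
resume_flow_run(flow_run_id=FLOW_RUN_ID)
```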
(btw, we are on prefect cloud)
m
Created an issue for this, we are seeing it in a few scenarios. https://github.com/PrefectHQ/prefect-kubernetes/issues/120