Hello, We have deployed our worker pods on K8s, wh...
# ask-community
s
Hello, we have deployed our worker pods on K8s. Whenever a new deployment is done, it rolls out the older pods, and when this happens I see errors like:
10:31:12.107 | INFO    | prefect.worker.process.processworker bb9b7448-bd1f-4193-9bdc-b1e98a47771d - Found 1 flow runs awaiting cancellation.
10:31:12.136 | WARNING | prefect.worker.process.processworker bb9b7448-bd1f-4193-9bdc-b1e98a47771d - Unable to kill process 189: The process is running on a different host 'xxxxx'. Flow run cannot be cancelled by this worker.
After these, no flows are running and all of them are late. Only when I manually cancel the ones in pending state do the others start to run. Is there a way to fix this?
@Marvin can you assist with this?
m
Of course, I'm here to help. Could you please provide more details about what you need assistance with?
s
@Marvin We have deployed our worker pods on K8s. Whenever a new deployment is done, it rolls out the older pods, and when this happens I see errors like:
10:31:12.107 | INFO    | prefect.worker.process.processworker bb9b7448-bd1f-4193-9bdc-b1e98a47771d - Found 1 flow runs awaiting cancellation.
10:31:12.136 | WARNING | prefect.worker.process.processworker bb9b7448-bd1f-4193-9bdc-b1e98a47771d - Unable to kill process 189: The process is running on a different host 'xxxxx'. Flow run cannot be cancelled by this worker.
After these, no flows are running and all of them are late. Only when I manually cancel the ones in pending state do the others start to run. Is there a way to fix this?
m
It seems you're encountering an issue with Prefect workers on Kubernetes, where flow runs cannot be cancelled by a worker if they are running on a different host. This is a known challenge with Prefect's infrastructure cancellation process, which relies on a unique identifier scope to ensure the correct infrastructure is targeted. When the worker tries to cancel a flow run, it checks whether the process is running on the same host. If not, it cannot proceed with the cancellation.
Here are some key points and potential solutions:
1. Unique Identifier Scope: Prefect uses identifiers like the machine hostname and PID for processes. If these do not match, the worker cannot cancel the flow run. This is by design, to prevent incorrect cancellations. (See: Prefect Documentation on Cancellation.)
2. Known Issues: There have been reported issues where agents attempt to cancel flow runs in other work pools, leading to warnings and failures. One such issue is documented here.
3. Potential Solutions:
- Ensure that your worker pods are correctly scoped and configured to handle cancellations. This might involve ensuring that the hostname and PID are correctly mapped and accessible.
- Consider using a more robust deployment strategy that minimizes the need for cancellations, such as handling graceful shutdowns of pods or using persistent storage for stateful applications.
4. Manual Intervention: As a temporary workaround, manually cancelling pending flow runs can help resume operations, but this is not ideal for automated environments.
If these suggestions don't resolve your issue, you may want to explore more detailed configurations or reach out to Prefect support for further assistance.
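To make the hostname and PID check more concrete, here is a minimal Python sketch of how such a scoped identifier might behave. It is an illustration only, not Prefect's actual implementation: the function names, the `hostname:pid` string format, and the pod names are all assumptions made for the example. The one fact it leans on is that a Kubernetes pod's hostname defaults to the pod name, so a replacement pod created by a rollout reports a different hostname than the pod that started the process.

```python
import socket
from typing import Optional, Tuple


def make_infrastructure_pid(hostname: str, pid: int) -> str:
    # Hypothetical identifier format: "<hostname>:<pid>".
    return f"{hostname}:{pid}"


def parse_infrastructure_pid(infrastructure_pid: str) -> Tuple[str, int]:
    # Split on the last colon so the PID is always the final component.
    hostname, pid = infrastructure_pid.rsplit(":", 1)
    return hostname, int(pid)


def can_cancel_locally(
    infrastructure_pid: str, current_hostname: Optional[str] = None
) -> bool:
    # A worker should only signal a PID that was started on its own host;
    # the same PID number on another host is an unrelated process.
    if current_hostname is None:
        current_hostname = socket.gethostname()
    hostname, _ = parse_infrastructure_pid(infrastructure_pid)
    return hostname == current_hostname


# Process 189 was started by a pod that has since been rolled out
# (hypothetical pod names).
stale = make_infrastructure_pid("worker-old-abc", 189)
print(can_cancel_locally(stale, current_hostname="worker-new-xyz"))  # False
print(can_cancel_locally(stale, current_hostname="worker-old-abc"))  # True
```

Under these assumptions, the replacement pod can never satisfy the check for a run started by the old pod, which would match the "running on a different host" warning in the logs above.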