# ask-marvin
a
@Marvin Do Kubernetes work pools support cancellation?
m
thought for 155.9 seconds
Short answer: yes. In Prefect 3.x, Kubernetes work pools/workers support flow-run cancellation. When you cancel a run (via the UI or CLI), the Kubernetes worker deletes the Kubernetes Job for that run; Kubernetes then terminates the pod(s) using normal semantics (SIGTERM with the pod’s `terminationGracePeriodSeconds`, then SIGKILL if needed).

What to know:
- A running Kubernetes worker with proper RBAC is required to enforce the cancel. It must be able to delete Jobs (and, by owner reference, Pods) in the target namespace.
- Prefect relies on the run’s infrastructure identifier (`infrastructure_pid`). If that’s missing, Prefect can mark the run as cancelled but cannot enforce stopping the Job.
- You can customize shutdown behavior via your Job/Pod template (e.g., `terminationGracePeriodSeconds`).

How to cancel from the CLI:
```bash
prefect flow-run cancel <FLOW_RUN_ID>
```
Docs:
- Flow/Task cancellation in Prefect 3
- Kubernetes deployments and workers
- prefect-kubernetes integration
- Workers overview

If you want, I can share a minimal RBAC example for the worker service account and tips to troubleshoot cases where the pod doesn’t stop after cancelling.
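If you prefer to trigger the same cancellation from Python, here is a minimal sketch, assuming the 3.x client API (`get_client`, `set_flow_run_state`) and the `Cancelling` state constructor from `prefect.states`; treat it as illustrative rather than the one canonical method:

```python
import asyncio

from prefect.client.orchestration import get_client
from prefect.states import Cancelling

async def cancel_flow_run(flow_run_id: str) -> None:
    # Ask the orchestration API to move the run into Cancelling.
    # A running Kubernetes worker then enforces it by deleting the Job.
    async with get_client() as client:
        result = await client.set_flow_run_state(
            flow_run_id=flow_run_id,
            state=Cancelling(),
        )
        print(result)

asyncio.run(cancel_flow_run("<FLOW_RUN_ID>"))
```

This is roughly what `prefect flow-run cancel` does under the hood; the actual Job deletion still depends on a healthy worker with the RBAC described above.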
a
@Marvin What is `infrastructure_pid`?
m
thought for 445.9 seconds
Short answer: `infrastructure_pid` is the identifier Prefect stores on each flow run to uniquely reference the piece of infrastructure that’s executing it. Workers use it to enforce cancellations and cleanups.

Details:
- What it is: a string “infrastructure identifier” set by the worker when it starts your flow run. Despite the name, it’s not always a process PID; it’s whatever the worker needs to later find and terminate the run’s infrastructure. It’s persisted on the flow run so cancellation can work even if the worker restarts.
- Where it’s stored: on the flow run record as `infrastructure_pid` in the Prefect database and API response. You can see it via the CLI:
```bash
prefect flow-run inspect <FLOW_RUN_ID>
```
or via Python:

```python
import asyncio

from prefect.client.orchestration import get_client

async def main(flow_run_id: str):
    async with get_client() as client:
        fr = await client.read_flow_run(flow_run_id)
        print(fr.infrastructure_pid)

asyncio.run(main("<FLOW_RUN_ID>"))
```
- How it’s used: when you cancel a flow run, the worker retrieves `infrastructure_pid` and uses it to locate and terminate the underlying runtime (process, container, job, etc.). If it’s missing, Prefect will mark the run cancelled but cannot forcibly stop external infrastructure.
- What it looks like (examples; the exact format is worker-defined):
  - ProcessWorker: typically includes the host and OS process id (e.g., “hostname:pid” or a similar string).
  - DockerWorker: a container identifier (typically the container ID, optionally qualified by a Docker context/API).
  - KubernetesWorker: a job identifier (e.g., namespace/name or another job UID string).
  - In all cases, it’s a string returned by the worker as the `identifier` in its result.
- 3.x vs 2.x: the concept is unchanged; it remains the flow run’s infrastructure identifier for cancellation. What changed is ownership: in 3.x, workers and work pools replaced agents/infrastructure blocks; workers still set `infrastructure_pid` by returning a `BaseWorkerResult(identifier=..., status_code=...)` (see the sketch below).

Key references:
- Cancel workflows in Prefect 3.x: Cancel workflows
- Building custom workers (describes the worker result identifier): Developing a custom worker
- Worker base result type (note the `identifier` field): prefect.workers.base.BaseWorkerResult
- Flow run DB model (the field exists as `infrastructure_pid`): orm_models.FlowRun
- Flow run API responses (includes `infrastructure_pid`): schemas.responses.FlowRunResponse

If you’re trying to use or interpret a specific value you’re seeing, share the worker type and the value and I can help decode it.
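To make the 3.x point concrete, here is a minimal sketch of a custom worker returning the identifier, assuming the documented `BaseWorker`/`BaseWorkerResult` interface; the worker class, its `type`, and the identifier format are hypothetical:

```python
from prefect.workers.base import (
    BaseJobConfiguration,
    BaseWorker,
    BaseWorkerResult,
)

class MyWorkerResult(BaseWorkerResult):
    """Result type returned by MyWorker.run()."""

class MyWorker(BaseWorker):
    type = "my-worker"                        # hypothetical worker type
    job_configuration = BaseJobConfiguration  # default config for the sketch

    async def run(self, flow_run, configuration, task_status=None):
        # ... launch the infrastructure for `flow_run` here ...
        # Whatever string is returned as `identifier` is persisted on the
        # flow run as `infrastructure_pid` and used later for cancellation.
        return MyWorkerResult(
            identifier="cluster:namespace:job-name",  # worker-defined format
            status_code=0,
        )
```

The only contract that matters for cancellation is that `identifier` contains enough information for the same worker type to find and terminate the infrastructure later.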
a
@Marvin What can cause a cancellation not to trigger pod deletion?
m
thought for 189.3 seconds
Assuming you’re using Prefect 3.x with the Kubernetes worker: a cancel transitions the flow run to Cancelling, then the worker uses the flow run’s `infrastructure_pid` to find and delete the K8s Job/Pod. If the pod isn’t being deleted, it’s usually one of these:

- Missing `infrastructure_pid`: if the flow run never captured an identifier (e.g., job name/namespace), Prefect marks the run cancelled but cannot enforce deletion.
- Worker offline or not polling: cancellations are enforced by the worker. If it’s down, unhealthy, or pointed at a different work pool, it can’t perform the delete.
- Insufficient RBAC: the worker’s service account needs at least get/list/watch/create/delete on batch.jobs and v1.pods in the target namespace. Without delete, you’ll see 403 errors in the worker logs.
- Wrong cluster/namespace/context: the worker’s kubeconfig/context or work pool namespace doesn’t match where the Job/Pod actually lives, so the worker can’t find what to delete.
- Custom job template issues: if you’ve customized the Job template and it doesn’t create a standard Job that owns the Pod (ownerReferences), deleting the Job may not delete the Pod.
- Policies/finalizers that block deletion: finalizers, mutating/validating webhooks, or other cluster policies can leave resources stuck in Terminating. TTL controllers or “keep-on-completion” style policies can delay cleanup.
- Timing/race conditions: if cancellation happens before the worker sets `infrastructure_pid`, or the worker can’t immediately reach the API, deletion may not occur or may be delayed.

Quick checks to pinpoint the cause (a scripted version of the first check follows the list):
- Verify the flow run has an infrastructure identifier:
```bash
prefect flow-run inspect <FLOW_RUN_ID>
```
Look for `infrastructure_pid`; if it’s null/empty, the worker can’t enforce deletion.
- Confirm the worker is up and polling the right pool/namespace; review its logs around the time of cancellation for messages like “Received cancellation request” or any 403/404/connection errors.
- Test RBAC from the worker’s service account (or your context) in the target namespace:
```bash
kubectl auth can-i delete job -n <NAMESPACE>
kubectl auth can-i delete pods -n <NAMESPACE>
```
- Check that the actual resource names and namespace match what `infrastructure_pid` indicates:
```bash
kubectl get job -n <NAMESPACE>
kubectl get pods -n <NAMESPACE> -l job-name=<JOB_NAME>
```
- Inspect the Job/Pod for finalizers/owner refs:
```bash
kubectl get job <JOB_NAME> -n <NAMESPACE> -o yaml
kubectl get pod <POD_NAME> -n <NAMESPACE> -o yaml
```
- Try manual deletion to isolate RBAC/policy issues:
```bash
kubectl delete job <JOB_NAME> -n <NAMESPACE>
```
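If you want to script the first of these checks, here is a small sketch using only the client calls shown earlier (`get_client`, `read_flow_run`); the interpretation of an empty value follows the notes above:

```python
import asyncio

from prefect.client.orchestration import get_client

async def check_cancellation_readiness(flow_run_id: str) -> None:
    async with get_client() as client:
        fr = await client.read_flow_run(flow_run_id)
        print(f"state: {fr.state.name if fr.state else None}")
        print(f"infrastructure_pid: {fr.infrastructure_pid!r}")
        if not fr.infrastructure_pid:
            # With no identifier recorded, the worker has nothing to
            # delete, so cancellation cannot be enforced on the Job/Pod.
            print("No infrastructure_pid captured; enforcement is impossible.")

asyncio.run(check_cancellation_readiness("<FLOW_RUN_ID>"))
```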
Helpful references:
- Prefect 3.x cancellation overview: Cancel workflows
- Worker anatomy and cancellation (`BaseWorkerResult` `identifier`): Developing a custom worker and the `BaseWorkerResult` source
- Flow run model (includes `infrastructure_pid`): FlowRun ORM
- Kubernetes worker/integration docs: prefect-kubernetes integration

If you can share:
- worker type and versions (`prefect` and `prefect-kubernetes`)
- the `infrastructure_pid` value from `prefect flow-run inspect`
- worker logs around the cancellation
- the Job’s namespace/name and a snippet of its YAML (ownerRefs/finalizers)

…I can help pinpoint the exact root cause and fix.