# ask-marvin
a
@Marvin Do Kubernetes work pools support cancellation?
m
thought for 155.9 seconds
Short answer: yes. In Prefect 3.x, Kubernetes work pools/workers support flow-run cancellation. When you cancel a run (via the UI or CLI), the Kubernetes worker deletes the Kubernetes Job for that run; Kubernetes then terminates the pod(s) using normal semantics (SIGTERM with the pod’s `terminationGracePeriodSeconds`, then SIGKILL if needed).

What to know:
- A running Kubernetes worker with proper RBAC is required to enforce the cancel. It must be able to delete Jobs (and, by owner reference, Pods) in the target namespace.
- Prefect relies on the run’s infrastructure identifier (`infrastructure_pid`). If that’s missing, Prefect can mark the run as cancelled but cannot enforce stopping the Job.
- You can customize shutdown behavior via your Job/Pod template (e.g., `terminationGracePeriodSeconds`).

How to cancel from the CLI:
```bash
prefect flow-run cancel <FLOW_RUN_ID>
```
Docs:
- Flow/Task cancellation in Prefect 3
- Kubernetes deployments and workers
- prefect-kubernetes integration
- Workers overview

If you want, I can share a minimal RBAC example for the worker service account and tips to troubleshoot cases where the pod doesn’t stop after cancelling.
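If you prefer to trigger the same cancellation from Python, here is a minimal sketch, assuming the 3.x client API (`get_client`, `set_flow_run_state`) and the `Cancelling` state constructor from `prefect.states`; treat it as illustrative rather than the one canonical method:

```python
import asyncio

from prefect.client.orchestration import get_client
from prefect.states import Cancelling

async def cancel_flow_run(flow_run_id: str) -> None:
    # Ask the orchestration API to move the run into Cancelling.
    # A running Kubernetes worker then enforces it by deleting the Job.
    async with get_client() as client:
        result = await client.set_flow_run_state(
            flow_run_id=flow_run_id,
            state=Cancelling(),
        )
        print(result)

asyncio.run(cancel_flow_run("<FLOW_RUN_ID>"))
```

This is roughly what `prefect flow-run cancel` does under the hood; the actual Job deletion still depends on a healthy worker with the RBAC described above.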
a
@Marvin What is `infrastructure_pid`?
m
thought for 445.9 seconds
Short answer: `infrastructure_pid` is the identifier Prefect stores on each flow run to uniquely reference the piece of infrastructure that’s executing it. Workers use it to enforce cancellations and cleanups.

Details:
- What it is: a string “infrastructure identifier” set by the worker when it starts your flow run. Despite the name, it’s not always a process PID; it’s whatever the worker needs to later find and terminate the run’s infrastructure. It’s persisted on the flow run so cancellation can work even if the worker restarts.
- Where it’s stored: on the flow run record as `infrastructure_pid` in the Prefect database and API response. You can see it via the CLI:
```bash
prefect flow-run inspect <FLOW_RUN_ID>
```
or via Python:

```python
import asyncio

from prefect.client.orchestration import get_client

async def main(flow_run_id: str):
    async with get_client() as client:
        fr = await client.read_flow_run(flow_run_id)
        print(fr.infrastructure_pid)

asyncio.run(main("<FLOW_RUN_ID>"))
```
- How it’s used: when you cancel a flow run, the worker retrieves `infrastructure_pid` and uses it to locate and terminate the underlying runtime (process, container, job, etc.). If it’s missing, Prefect will mark the run cancelled but cannot forcibly stop external infrastructure.
- What it looks like (examples; the exact format is worker-defined):
  - ProcessWorker: typically includes the host and OS process id (e.g., “hostname:pid” or a similar string).
  - DockerWorker: a container identifier (typically the container ID, optionally qualified by a Docker context/API).
  - KubernetesWorker: a job identifier (e.g., namespace/name or another job UID string).
  - In all cases, it’s a string returned by the worker as the `identifier` in its result.
- 3.x vs 2.x: the concept is unchanged; it remains the flow run’s infrastructure identifier for cancellation. What changed is ownership: in 3.x, workers and work pools replaced agents/infrastructure blocks; workers still set `infrastructure_pid` by returning a `BaseWorkerResult(identifier=..., status_code=...)` (see the sketch below).

Key references:
- Cancel workflows in Prefect 3.x: Cancel workflows
- Building custom workers (describes the worker result identifier): Developing a custom worker
- Worker base result type (note the `identifier` field): prefect.workers.base.BaseWorkerResult
- Flow run DB model (the field exists as `infrastructure_pid`): orm_models.FlowRun
- Flow run API responses (includes `infrastructure_pid`): schemas.responses.FlowRunResponse

If you’re trying to use or interpret a specific value you’re seeing, share the worker type and the value and I can help decode it.
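To make the 3.x point concrete, here is a minimal sketch of a custom worker returning the identifier, assuming the documented `BaseWorker`/`BaseWorkerResult` interface; the worker class, its `type`, and the identifier format are hypothetical:

```python
from prefect.workers.base import (
    BaseJobConfiguration,
    BaseWorker,
    BaseWorkerResult,
)

class MyWorkerResult(BaseWorkerResult):
    """Result type returned by MyWorker.run()."""

class MyWorker(BaseWorker):
    type = "my-worker"                        # hypothetical worker type
    job_configuration = BaseJobConfiguration  # default config for the sketch

    async def run(self, flow_run, configuration, task_status=None):
        # ... launch the infrastructure for `flow_run` here ...
        # Whatever string is returned as `identifier` is persisted on the
        # flow run as `infrastructure_pid` and used later for cancellation.
        return MyWorkerResult(
            identifier="cluster:namespace:job-name",  # worker-defined format
            status_code=0,
        )
```

The only contract that matters for cancellation is that `identifier` contains enough information for the same worker type to find and terminate the infrastructure later.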
a
@Marvin What can cause a cancellation not to trigger pod deletion?
m
thought for 189.3 seconds
Assuming you’re using Prefect 3.x with the Kubernetes worker: a cancel transitions the flow run to Cancelling, then the worker uses the flow run’s `infrastructure_pid` to find and delete the K8s Job/Pod. If the pod isn’t being deleted, it’s usually one of these:

- Missing `infrastructure_pid`: if the flow run never captured an identifier (e.g., job name/namespace), Prefect marks the run cancelled but cannot enforce deletion.
- Worker offline or not polling: cancellations are enforced by the worker. If it’s down, unhealthy, or pointed at a different work pool, it can’t perform the delete.
- Insufficient RBAC: the worker’s service account needs at least get/list/watch/create/delete on batch.jobs and v1.pods in the target namespace. Without delete, you’ll see 403 errors in the worker logs.
- Wrong cluster/namespace/context: the worker’s kubeconfig/context or work pool namespace doesn’t match where the Job/Pod actually lives, so the worker can’t find what to delete.
- Custom job template issues: if you’ve customized the Job template and it doesn’t create a standard Job that owns the Pod (ownerReferences), deleting the Job may not delete the Pod.
- Policies/finalizers that block deletion: finalizers, mutating/validating webhooks, or other cluster policies can leave resources stuck in Terminating. TTL controllers or “keep-on-completion” style policies can delay cleanup.
- Timing/race conditions: if cancellation happens before the worker sets `infrastructure_pid`, or the worker can’t immediately reach the API, deletion may not occur or may be delayed.

Quick checks to pinpoint the cause (a scripted version of the first check follows the list):
- Verify the flow run has an infrastructure identifier:
```bash
prefect flow-run inspect <FLOW_RUN_ID>
```
Look for `infrastructure_pid`; if it’s null/empty, the worker can’t enforce deletion.
- Confirm the worker is up and polling the right pool/namespace; review its logs around the time of cancellation for messages like “Received cancellation request” or any 403/404/connection errors.
- Test RBAC from the worker’s service account (or your context) in the target namespace:
```bash
kubectl auth can-i delete job -n <NAMESPACE>
kubectl auth can-i delete pods -n <NAMESPACE>
```
- Check that the actual resource names and namespace match what `infrastructure_pid` indicates:
```bash
kubectl get job -n <NAMESPACE>
kubectl get pods -n <NAMESPACE> -l job-name=<JOB_NAME>
```
- Inspect the Job/Pod for finalizers/owner refs:
```bash
kubectl get job <JOB_NAME> -n <NAMESPACE> -o yaml
kubectl get pod <POD_NAME> -n <NAMESPACE> -o yaml
```
- Try manual deletion to isolate RBAC/policy issues:
```bash
kubectl delete job <JOB_NAME> -n <NAMESPACE>
```
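If you want to script the first of these checks, here is a small sketch using only the client calls shown earlier (`get_client`, `read_flow_run`); the interpretation of an empty value follows the notes above:

```python
import asyncio

from prefect.client.orchestration import get_client

async def check_cancellation_readiness(flow_run_id: str) -> None:
    async with get_client() as client:
        fr = await client.read_flow_run(flow_run_id)
        print(f"state: {fr.state.name if fr.state else None}")
        print(f"infrastructure_pid: {fr.infrastructure_pid!r}")
        if not fr.infrastructure_pid:
            # With no identifier recorded, the worker has nothing to
            # delete, so cancellation cannot be enforced on the Job/Pod.
            print("No infrastructure_pid captured; enforcement is impossible.")

asyncio.run(check_cancellation_readiness("<FLOW_RUN_ID>"))
```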
Helpful references:
- Prefect 3.x cancellation overview: Cancel workflows
- Worker anatomy and cancellation (`BaseWorkerResult` `identifier`): Developing a custom worker and the `BaseWorkerResult` source
- Flow run model (includes `infrastructure_pid`): FlowRun ORM
- Kubernetes worker/integration docs: prefect-kubernetes integration

If you can share:
- worker type and versions (`prefect` and `prefect-kubernetes`)
- the `infrastructure_pid` value from `prefect flow-run inspect`
- worker logs around the cancellation
- the Job’s namespace/name and a snippet of its YAML (ownerRefs/finalizers)

…I can help pinpoint the exact root cause and fix.