Adrien Besnard (10/16/2025, 3:42 PM):

Marvin (10/16/2025, 3:45 PM):
prefect flow-run cancel <FLOW_RUN_ID>
Docs:
- Flow/Task cancellation in Prefect 3
- Kubernetes deployments and workers
- prefect-kubernetes integration
- Workers overview
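For reference, the same cancellation can be requested from Python; a minimal sketch using the 3.x client (the worker still performs the actual teardown once the run enters Cancelling):

import asyncio

from prefect.client.orchestration import get_client
from prefect.states import Cancelling

async def cancel(flow_run_id: str) -> None:
    async with get_client() as client:
        # Move the run to Cancelling; the polling worker enforces it.
        await client.set_flow_run_state(flow_run_id=flow_run_id, state=Cancelling())

asyncio.run(cancel("<FLOW_RUN_ID>"))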
If you want, I can share a minimal RBAC example for the worker service account and tips to troubleshoot cases where the pod doesn’t stop after cancelling.

Adrien Besnard (10/16/2025, 5:01 PM):

Marvin (10/16/2025, 5:09 PM):
infrastructure_pid is the identifier Prefect stores on each flow run to uniquely reference the piece of infrastructure that’s executing it. Workers use it to enforce cancellations and cleanups.
Details:
- What it is
  - A string “infrastructure identifier” set by the worker when it starts your flow run. Despite the name, it’s not always a process PID; it’s whatever the worker needs to later find and terminate the run’s infrastructure.
  - It’s persisted on the flow run so cancellation can work even if the worker restarts.
- Where it’s stored
  - On the flow run record as infrastructure_pid, in the Prefect database and API responses.
  - You can see it via:
    - CLI:
      prefect flow-run inspect <FLOW_RUN_ID>
    - Python:
      import asyncio

      from prefect.client.orchestration import get_client

      async def main(flow_run_id: str):
          async with get_client() as client:
              fr = await client.read_flow_run(flow_run_id)
              print(fr.infrastructure_pid)

      asyncio.run(main("<FLOW_RUN_ID>"))
- How it’s used
  - When you cancel a flow run, the worker retrieves infrastructure_pid and uses it to locate and terminate the underlying runtime (process, container, job, etc.). If it’s missing, Prefect will mark the run cancelled but cannot forcibly stop external infrastructure.
- What it looks like (examples; the exact format is worker-defined)
  - ProcessWorker: typically includes the host and OS process id (e.g., “hostname:pid” or a similar string).
  - DockerWorker: a container identifier (typically the container ID, optionally qualified by a Docker context/API).
  - KubernetesWorker: a job identifier (e.g., namespace/name or another job UID string).
  - In all cases, it’s a string returned by the worker as the identifier in its result.
- 3.x vs 2.x
  - The concept is unchanged: it remains the flow run’s infrastructure identifier for cancellation.
  - What changed is ownership: in 3.x, workers and work pools replaced agents/infrastructure blocks; workers still set infrastructure_pid by returning a BaseWorkerResult(identifier=..., status_code=...). See the sketch below.
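To make that concrete, a minimal custom-worker sketch (class names are hypothetical; the pattern follows the “Developing a custom worker” docs referenced below, and exact signatures can vary by version):

import socket

from prefect.workers.base import (
    BaseJobConfiguration,
    BaseWorker,
    BaseWorkerResult,
)

class MyJobConfiguration(BaseJobConfiguration):
    pass  # worker-specific job settings would be declared here

class MyWorkerResult(BaseWorkerResult):
    """Carries the identifier Prefect stores as infrastructure_pid."""

class MyWorker(BaseWorker):
    type = "my-worker"
    job_configuration = MyJobConfiguration

    async def run(self, flow_run, configuration, task_status=None):
        # 1. Start the infrastructure (process, container, job, ...) here.
        pid = f"{socket.gethostname()}:12345"  # anything that lets us find it later
        # 2. Report readiness; the base worker persists this identifier on
        #    the flow run so cancellation survives worker restarts.
        if task_status:
            task_status.started(pid)
        # 3. Wait for the infrastructure to exit, then return the result.
        return MyWorkerResult(identifier=pid, status_code=0)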
Key references:
- Cancel workflows in Prefect 3.x: Cancel workflows
- Building custom workers (describes the worker result identifier): Developing a custom worker
- Worker base result type (note the identifier field): prefect.workers.base.BaseWorkerResult
- Flow run DB model (field exists as infrastructure_pid): orm_models.FlowRun
- Flow run API responses (includes infrastructure_pid): schemas.responses.FlowRunResponse
If you’re trying to use or interpret a specific value you’re seeing, share the worker type and the value and I can help decode it.

Adrien Besnard (10/16/2025, 5:15 PM):

Marvin (10/16/2025, 5:18 PM):
On cancellation, the Kubernetes worker uses infrastructure_pid to find and delete the K8s Job/Pod. If the pod isn’t being deleted, it’s usually one of these:
- Missing infrastructure_pid
  - If the flow run never captured an identifier (e.g., job name/namespace), Prefect marks the run cancelled but cannot enforce deletion.
- Worker offline or not polling
  - Cancellations are enforced by the worker. If it’s down, unhealthy, or pointed at a different work pool, it can’t perform the delete.
- Insufficient RBAC
  - The worker’s service account needs at least get/list/watch/create/delete on batch Jobs and core Pods in the target namespace. Without delete, you’ll see 403 errors in the worker logs. (A minimal RBAC sketch follows this list.)
- Wrong cluster/namespace/context
  - The worker’s kubeconfig/context or work pool namespace doesn’t match where the Job/Pod actually lives, so the worker can’t find what to delete.
- Custom job template issues
  - If you’ve customized the job template and it doesn’t create a standard Job that owns the Pod (via ownerReferences), deleting the Job may not delete the Pod.
- Policies/finalizers that block deletion
  - Finalizers, mutating/validating webhooks, or other cluster policies can leave resources stuck in Terminating. TTL controllers or “keep-on-completion” style policies can delay cleanup.
- Timing/race conditions
  - If cancellation happens before the worker sets infrastructure_pid, or the worker can’t immediately reach the API, deletion may not occur or may be delayed.
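As promised, a minimal RBAC sketch for the worker’s service account (role name, namespace, and service account are placeholders; some versions also want get/watch on pods/log for log streaming):

kubectl create role prefect-worker \
  --verb=get,list,watch,create,delete \
  --resource=jobs.batch,pods \
  -n <NAMESPACE>
kubectl create rolebinding prefect-worker \
  --role=prefect-worker \
  --serviceaccount=<NAMESPACE>:<WORKER_SERVICE_ACCOUNT> \
  -n <NAMESPACE>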
Quick checks to pinpoint the cause:
- Verify the flow run has an infrastructure identifier:
  prefect flow-run inspect <FLOW_RUN_ID>
  Look for infrastructure_pid. If it’s null/empty, the worker can’t enforce deletion.
- Confirm the worker is up and polling the right pool/namespace; review its logs around the time of cancellation for messages like “Received cancellation request” or any 403/404/connection errors.
- Test RBAC from the worker’s service account (or your context) in the target namespace:
  kubectl auth can-i delete job -n <NAMESPACE>
  kubectl auth can-i delete pods -n <NAMESPACE>
- Check that the actual resource names and namespace match what infrastructure_pid indicates:
  kubectl get job -n <NAMESPACE>
  kubectl get pods -n <NAMESPACE> -l job-name=<JOB_NAME>
- Inspect the Job/Pod for finalizers/owner references:
  kubectl get job <JOB_NAME> -n <NAMESPACE> -o yaml
  kubectl get pod <POD_NAME> -n <NAMESPACE> -o yaml
- Try manual deletion to isolate RBAC/policy issues:
  kubectl delete job <JOB_NAME> -n <NAMESPACE>
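If useful, a small Python sketch (same client API as shown earlier; field names per the Prefect 3.x flow run schema) that pulls the relevant fields in one go:

import asyncio

from prefect.client.orchestration import get_client

async def diagnose(flow_run_id: str) -> None:
    async with get_client() as client:
        fr = await client.read_flow_run(flow_run_id)
        # The state message often carries the worker’s cancellation errors.
        print("state:", fr.state.type.value if fr.state else None)
        print("state message:", fr.state.message if fr.state else None)
        print("infrastructure_pid:", fr.infrastructure_pid or "<missing>")

asyncio.run(diagnose("<FLOW_RUN_ID>"))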
Helpful references:
- Prefect 3.x cancellation overview: Cancel workflows
- Worker anatomy and cancellation (BaseWorkerResult identifier): Developing a custom worker; source: prefect.workers.base.BaseWorkerResult
- Flow run model (includes infrastructure_pid): FlowRun ORM
- Kubernetes worker/integration docs: prefect-kubernetes integration
If you can share:
- Worker type and versions (prefect and prefect-kubernetes)
- The infrastructure_pid value from prefect flow-run inspect
- Worker logs around the cancellation
- The Job’s namespace/name and a snippet of its YAML (ownerRefs/finalizers)
…I can help pinpoint the exact root cause and fix.