Victor Longui
08/16/2025, 1:14 AM

Marvin
08/16/2025, 1:20 AM
- Finished Job TTL: maps to the Kubernetes Job's spec.ttlSecondsAfterFinished. When this expires, Kubernetes garbage-collects the Job and its Pod(s).
- Job Watch Timeout Seconds: how long the Prefect worker watches the Job before it gives up watching. Does not delete the Job/Pod.
- Pod Watch Timeout Seconds: how long the worker waits for the Pod to appear/start. Also not a retention setting.
How to keep Pods around longer
1) Increase or disable the Job TTL
- In your work pool (UI): bump “Finished Job TTL” to the retention you want (e.g., 604800 for 7 days). Setting it blank/null removes the TTL entirely so Jobs/Pods remain until you delete them. Use sparingly unless you like clutter.
- Or via CLI:
- Inspect your pool to confirm current template/variables:
prefect work-pool inspect "<pool-name>"
- Get a baseline template, edit, and update:
prefect work-pool get-default-base-job-template --type kubernetes > base.json
# edit base.json: set a large value for finished_job_ttl or remove ttlSecondsAfterFinished
prefect work-pool update "<pool-name>" --base-job-template base.json
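For orientation, the relevant slice of base.json after the edit looks roughly like this. This is a sketch assuming the default Kubernetes worker template layout; the 7-day default shown for finished_job_ttl is just an example, and exact field names can shift between Prefect versions:
{
  "variables": {
    "properties": {
      "finished_job_ttl": { "type": "integer", "default": 604800 }
    }
  },
  "job_configuration": {
    "job_manifest": {
      "spec": { "ttlSecondsAfterFinished": "{{ finished_job_ttl }}" }
    }
  }
}
The template wires spec.ttlSecondsAfterFinished to the finished_job_ttl variable, so changing the variable's default (or deleting the templated field) is the whole edit.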
2) Verify the TTL on created Jobs (to ensure your change took effect)
kubectl -n <namespace> get job <job-name> -o jsonpath='{.spec.ttlSecondsAfterFinished}{"\n"}'
If you see a small number (or 0), Kubernetes will clean up quickly. If the field is absent, Kubernetes won’t auto-delete the finished Job.
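To audit every Job in the namespace at once instead of one at a time, custom columns are a convenient sketch (an empty TTL column means no auto-deletion is set):
kubectl -n <namespace> get jobs -o custom-columns='NAME:.metadata.name,TTL:.spec.ttlSecondsAfterFinished'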
3) Make sure logs are captured even if the Pod disappears
Set stream_output to true in your pool/deployment so the worker streams container logs to Prefect while the run is active. That way Prefect keeps the logs even if the Pod is later deleted.
Why your Pods might still be vanishing quickly
- A lower TTL is actually applied by your base job template or deployment overrides.
- Something else in the cluster is deleting Jobs/Pods (a cleanup controller or script). If the Job is being deleted, Pods go with it due to owner references; the kubectl check after this list shows which owner a given Pod belongs to.
- Rare/legacy path: if you’re still using the old KubernetesJob infrastructure anywhere, make sure any “delete on completion” option is off. Work pools generally rely on the K8s TTL instead.
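To see which owner will take a Pod down with it, inspect the Pod's owner references; a minimal sketch, with <namespace> and <pod-name> as placeholders:
kubectl -n <namespace> get pod <pod-name> \
  -o jsonpath='{.metadata.ownerReferences[*].kind}{" "}{.metadata.ownerReferences[*].name}{"\n"}'
If this prints Job <job-name>, deleting that Job removes the Pod via cascading deletion.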
Links, because proof beats promises:
- Kubernetes Jobs in Prefect (2.x): https://docs-2.prefect.io/latest/integrations/prefect-kubernetes/jobs/
- Work pools (2.x): https://docs-2.prefect.io/latest/concepts/work-pools/
If you want, share your pool name and a kubectl get job -o yaml from one of the finished runs and I'll sanity-check which TTL actually landed.
Victor Longui
08/20/2025, 1:17 PM

Marvin
08/20/2025, 1:18 PM
- Check how many non-Running (finished) Pods are still present:
kubectl -n <ns> get pods --field-selector=status.phase!=Running | wc -l
- Look for any cleanup controllers in your cluster that delete finished pods (a quick heuristic check is sketched after this list).
- Verify the Job still has a TTL set (so you know it’s not the Job deleting itself):
kubectl -n <ns> get job <job-name> -o jsonpath='{.spec.ttlSecondsAfterFinished}{"\n"}'
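A quick heuristic for spotting cleanup-style controllers; it is nothing more than a grep over workload names, and the search terms are guesses rather than an exhaustive list:
kubectl get deployments,daemonsets,cronjobs -A | grep -iE 'clean|janitor|reaper|sweep'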
What to do instead (so you still get logs)
- Stream logs to Prefect while the run is active. Set stream_output=true in your Kubernetes work pool so logs are captured in Prefect even if the Pod later vanishes.
- In the UI: Work Pools > your pool > Variables > set stream_output: true.
- Or via template edit:
prefect work-pool get-default-base-job-template --type kubernetes > base.json
# Edit base.json:
# - keep or increase "finished_job_ttl"
# - set "stream_output": true
prefect work-pool update "<pool-name>" --base-job-template base.json
Reference template shows these knobs:
{
  "job_configuration": {
    ...,
    "job_manifest": {
      ...,
      "spec": { ..., "ttlSecondsAfterFinished": "{{ finished_job_ttl }}" }
    },
    "stream_output": "{{ stream_output }}"
  }
}
- Ship container logs to CloudWatch (EKS best practice) so you can review logs after Pods are gone. If you’re not already using aws-for-fluent-bit/Container Insights, enable it.
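If you go the Fluent Bit route, installation is typically via the eks-charts Helm repo. Treat this as a rough sketch only; the --set keys vary across chart versions, so check helm show values eks/aws-for-fluent-bit before copying them:
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install aws-for-fluent-bit eks/aws-for-fluent-bit -n kube-system \
  --set cloudWatch.enabled=true \
  --set cloudWatch.region=<aws-region>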
For background
- Prefect’s TTL maps to the Kubernetes Job’s ttlSecondsAfterFinished; it governs Job deletion only: https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/
- Pod GC is separate and can remove finished Pods regardless of Job TTL: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection
- Prefect 2.x Kubernetes jobs docs: https://docs-2.prefect.io/latest/integrations/prefect-kubernetes/jobs/
- Work pools overview: https://docs-2.prefect.io/latest/concepts/work-pools/
If you share:
- your pool’s variables (especially finished_job_ttl and stream_output), and
- whether you have any cleanup operators running,
I can point at the exact culprit. In the meantime, enabling stream_output and sending logs to CloudWatch will save you from nocturnal Pod disappearances.
Victor Longui
08/20/2025, 1:37 PM

Marvin
08/20/2025, 1:38 PM
- The worker creates the Job from your work pool's base job template; a KubernetesJob object created inside your flow code does not affect that worker job.
- keep_pod_after_completion isn't a supported parameter for Prefect 2's Kubernetes job integration. There's no Prefect flag that forces Pod retention.
- Even with a large finished_job_ttl, EKS can garbage-collect terminated Pods independently of the Job TTL.
What to do instead:
- Capture logs before the Pod disappears
- Enable log streaming in your pool so Prefect stores the container logs:
- In the pool variables set stream_output: true, or edit your pool's base job template to include:
{
  "job_configuration": {
    ...,
    "stream_output": true
  }
}
- Also ship container logs to CloudWatch (aws-for-fluent-bit/Container Insights) so you can inspect them after Pods are gone.
- Keep the Job around (optional, for metadata/diagnostics)
- Keep or increase finished_job_ttl in the pool (e.g., 604800 for 7 days), or remove ttlSecondsAfterFinished if you truly want no auto-deletion of Jobs:
prefect work-pool get-default-base-job-template --type kubernetes > base.json
# edit: bump "finished_job_ttl" and ensure spec.ttlSecondsAfterFinished matches
prefect work-pool update "<pool-name>" --base-job-template base.json
- This retains the Job object, but not necessarily the Pod, on EKS.
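A quick way to see what actually survived is to list the Job object versus its Pods, using the job-name label the Job controller applies (a sketch with <namespace> and <job-name> as placeholders):
kubectl -n <namespace> get job <job-name>
kubectl -n <namespace> get pods -l job-name=<job-name>
The first command keeps returning the Job for as long as the TTL allows; the second may come back empty once EKS has collected the terminated Pods.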
If you absolutely must keep Pods
- That’s a cluster policy problem, not a Prefect toggle. EKS’s control plane can garbage‑collect terminated Pods; you can’t turn that down from Prefect. The practical approach is to stream/export logs, not to try to preserve failed Pods.
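For context, the knob behind this behavior is the controller manager's --terminated-pod-gc-threshold flag. On a self-managed control plane you could at least inspect it (a sketch; on EKS the controller manager isn't visible, which is rather the point):
kubectl -n kube-system get pods -l component=kube-controller-manager -o yaml \
  | grep terminated-pod-gc-threshold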
References (because reality is comforting):
- Prefect 2 Kubernetes jobs: https://docs-2.prefect.io/latest/integrations/prefect-kubernetes/jobs/
- Work pools: https://docs-2.prefect.io/latest/concepts/work-pools/
- K8s Job TTL: https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/
- Pod GC: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection
If you share your pool's current variables (especially stream_output and finished_job_ttl) I'll give you the minimal JSON patch to make logs stick around in Prefect.