<@ULVA73B9P> I have a deployment where some flows ...
# ask-marvin
v
@Marvin I have a deployment where some flows crash during the night. The problem is that when I check in the morning, the POD created during the flow execution has already been deleted, which prevents me from understanding the reason for the crash. How can I extend the time the POD remains available so I can review the logs? I'm running Prefect 2.19. In the work pool settings, I noticed the following options:
- Finished Job TTL: 172800
- Job Watch Timeout Seconds: 21600
- Pod Watch Timeout Seconds: 600
m
thought for 356.4 seconds
Short version: the two “watch timeout” settings don’t control retention. The knob you care about is the Kubernetes Job TTL.

What those settings mean (Prefect 2.19, Kubernetes work pool):
- Finished Job TTL: mapped to the Job’s `spec.ttlSecondsAfterFinished`. When this expires, Kubernetes garbage-collects the Job and its Pod(s).
- Job Watch Timeout Seconds: how long the Prefect worker watches the Job before it gives up watching. Does not delete the Job/Pod.
- Pod Watch Timeout Seconds: how long the worker waits for the Pod to appear/start. Also not a retention setting.

How to keep Pods around longer

1) Increase or disable the Job TTL
- In your work pool (UI): bump “Finished Job TTL” to the retention you want (e.g., 604800 for 7 days). Setting it blank/null removes the TTL entirely so Jobs/Pods remain until you delete them. Use sparingly unless you like clutter.
- Or via CLI:
  - Inspect your pool to confirm current template/variables:
```bash
prefect work-pool inspect "<pool-name>"
```
  - Get a baseline template, edit, and update:
```bash
prefect work-pool get-default-base-job-template --type kubernetes > base.json
# edit base.json: set a large value for finished_job_ttl or remove ttlSecondsAfterFinished
prefect work-pool update "<pool-name>" --base-job-template base.json
```
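For orientation, the piece of base.json you would touch looks roughly like this (an abridged sketch, assuming the default Prefect 2 Kubernetes worker template, where the `finished_job_ttl` variable feeds the Job manifest; the 604800 default is just an example value):

```json
{
  "variables": {
    "properties": {
      "finished_job_ttl": {
        "default": 604800
      }
    }
  },
  "job_configuration": {
    "job_manifest": {
      "spec": {
        "ttlSecondsAfterFinished": "{{ finished_job_ttl }}"
      }
    }
  }
}
```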
2) Verify the TTL on created Jobs (to ensure your change took effect)
```bash
kubectl -n <namespace> get job <job-name> -o jsonpath='{.spec.ttlSecondsAfterFinished}{"\n"}'
```
If you see a small number (or 0), Kubernetes will clean up quickly. If the field is absent, Kubernetes won’t auto-delete the finished Job.

3) Make sure logs are captured even if the Pod disappears

Set `stream_output` to true in your pool/deployment so the worker streams container logs to Prefect while the run is active. That way Prefect keeps the logs even if the Pod is later deleted.

Why your Pods might still be vanishing quickly
- A lower TTL is actually applied by your base job template or deployment overrides (a quick way to check is sketched at the end of this message).
- Something else in the cluster is deleting Jobs/Pods (a cleanup controller or script). If the Job is being deleted, Pods go with it due to owner references.
- Rare/legacy path: if you’re still using the old KubernetesJob infrastructure anywhere, make sure any “delete on completion” option is off. Work pools generally rely on the K8s TTL instead.

Links, because proof beats promises:
- Kubernetes Jobs in Prefect (2.x): https://docs-2.prefect.io/latest/integrations/prefect-kubernetes/jobs/
- Work pools (2.x): https://docs-2.prefect.io/latest/concepts/work-pools/

If you want, share your pool name and a `kubectl get job -o yaml` from one of the finished runs and I’ll sanity-check which TTL actually landed.
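A rough way to pull that together, plus the deployment-override check mentioned above (deployment, namespace, and job names are placeholders; `prefect deployment inspect` prints any infrastructure overrides set on the deployment):

```bash
# Check whether a deployment-level override (e.g., a smaller finished_job_ttl) is in play
prefect deployment inspect "<flow-name>/<deployment-name>"

# Dump the Job spec from a finished run so we can see which TTL actually landed
kubectl -n <namespace> get job <job-name> -o yaml > finished-job.yaml
```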
v
@Marvin The "Finished Job TTL" option was at 172800, that is, the job would stay alive for 48 hours, but this does not cognize that the POD will also stay alive, only the job. I did the tests and got a flow that failed, the job continues, but the POD does not. Can it be some configuration in my EKS cluster?
m
thought for 82.0 seconds
You’re right: Finished Job TTL only governs the Job object. It does not guarantee Pod retention. On many clusters (including EKS), terminated Pods (Succeeded/Failed) are garbage-collected independently of Job TTL.

What’s likely happening
- Kubernetes’ terminated Pod GC is deleting finished Pods when a cluster-wide threshold is exceeded. This is controlled by the kube-controller-manager (flag: terminated-pod-gc-threshold). On EKS’s managed control plane you can’t tune this flag, so finished Pods can disappear while the Job remains.
- You might also have a cleanup operator in the cluster (janitor, kube-cleanup-operator, etc.) pruning finished Pods.

How to confirm
- Check how many terminated pods you have in the namespace; large counts across the cluster trigger GC:
```bash
kubectl -n <ns> get pods --field-selector=status.phase!=Running | wc -l
```
- Look for any cleanup controllers in your cluster that delete finished pods (a quick grep is sketched below, after the TTL check).
- Verify the Job still has a TTL set (so you know it’s not the Job deleting itself):
```bash
kubectl -n <ns> get job <job-name> -o jsonpath='{.spec.ttlSecondsAfterFinished}{"\n"}'
```
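For the cleanup-controller check, a rough sketch (the grep pattern is just a guess at common names; adjust it to whatever actually runs in your cluster):

```bash
# Look for workloads whose names suggest they prune finished Jobs/Pods
kubectl get deploy,cronjob -A | grep -iE 'janitor|cleanup|sweeper' || echo "no obvious cleanup operator found"
```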
What to do instead (so you still get logs)
- Stream logs to Prefect while the run is active. Set `stream_output=true` in your Kubernetes work pool so logs are captured in Prefect even if the Pod later vanishes.
  - In the UI: Work Pools > your pool > Variables > set `stream_output: true`.
  - Or via template edit:
```bash
prefect work-pool get-default-base-job-template --type kubernetes > base.json
# Edit base.json:
# - keep or increase "finished_job_ttl"
# - set "stream_output": true
prefect work-pool update "<pool-name>" --base-job-template base.json
```
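A quick way to sanity-check that the update landed (rough sketch; `prefect work-pool inspect` prints the pool’s full base job template, so a grep over its output is usually enough):

```bash
prefect work-pool inspect "<pool-name>" | grep -iE 'finished_job_ttl|stream_output'
```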
Reference template shows these knobs:
```json
{
  "job_configuration": {
    ...
    "spec": { ... "ttlSecondsAfterFinished": "{{ finished_job_ttl }}" },
    ...
    "stream_output": "{{ stream_output }}"
  }
}
```
- Ship container logs to CloudWatch (EKS best practice) so you can review logs after Pods are gone. If you’re not already using aws-for-fluent-bit/Container Insights, enable it (a quick way to check is sketched at the end of this message).

For background
- Prefect’s TTL maps to the Kubernetes Job’s ttlSecondsAfterFinished; it governs Job deletion only: https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/
- Pod GC is separate and can remove finished Pods regardless of Job TTL: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection
- Prefect 2.x Kubernetes jobs docs: https://docs-2.prefect.io/latest/integrations/prefect-kubernetes/jobs/
- Work pools overview: https://docs-2.prefect.io/latest/concepts/work-pools/

If you share:
- your pool’s variables (especially finished_job_ttl and stream_output), and
- whether you have any cleanup operators running,
I can point at the exact culprit. In the meantime, enabling stream_output and sending logs to CloudWatch will save you from nocturnal Pod disappearances.
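For that CloudWatch check, a rough sketch (the `amazon-cloudwatch` namespace is what the Container Insights quick start uses; yours may differ):

```bash
# Is a Fluent Bit daemonset already shipping container logs?
kubectl get daemonset -n amazon-cloudwatch 2>/dev/null
kubectl get daemonset -A | grep -i fluent
```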
v
@Marvin Will this work to keep my pods alive after a crash?

```python
from prefect import flow
from prefect.kubernetes import KubernetesJob

@flow
def my_flow():
    job = KubernetesJob(
        finished_job_ttl=172800,  # 48 hours in seconds
        keep_pod_after_completion=True,  # keep the pod after completion
        pod_spec_override={
            "spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "my-container",
                    # your settings
                }]
            }
        }
    )
    # your code here
```
m
thought for 45.1 seconds
Short answer: no—that snippet won’t keep Pods alive, and in your setup it likely won’t do anything at all.

Why:
- If you’re running via a Kubernetes work pool (which you are), the worker creates the Job/Pod from the pool’s base job template and variables. A `KubernetesJob` object created inside your flow code does not affect that worker job.
- `keep_pod_after_completion` isn’t a supported parameter for Prefect 2’s Kubernetes job integration. There’s no Prefect flag that forces Pod retention.
- Even with a large `finished_job_ttl`, EKS can garbage‑collect terminated Pods independently of the Job TTL.

What to do instead:
- Capture logs before the Pod disappears
  - Enable log streaming in your pool so Prefect stores the container logs: in the pool variables set `stream_output: true`, or edit your pool’s base job template to include:
```json
{
  "job_configuration": {
    ...,
    "stream_output": true
  }
}
```
  - Also ship container logs to CloudWatch (aws-for-fluent-bit/Container Insights) so you can inspect them after Pods are gone.
- Keep the Job around (optional, for metadata/diagnostics)
  - Keep or increase `finished_job_ttl` in the pool (e.g., 604800 for 7 days), or remove `ttlSecondsAfterFinished` if you truly want no auto-deletion of Jobs:
```bash
prefect work-pool get-default-base-job-template --type kubernetes > base.json
# edit: bump "finished_job_ttl" and ensure spec.ttlSecondsAfterFinished matches
prefect work-pool update "<pool-name>" --base-job-template base.json
```
  - This retains the Job object, but not necessarily the Pod, on EKS.

If you absolutely must keep Pods
- That’s a cluster policy problem, not a Prefect toggle. EKS’s control plane can garbage‑collect terminated Pods; you can’t turn that down from Prefect. The practical approach is to stream/export logs, not to try to preserve failed Pods.

References (because reality is comforting):
- Prefect 2 Kubernetes jobs: https://docs-2.prefect.io/latest/integrations/prefect-kubernetes/jobs/
- Work pools: https://docs-2.prefect.io/latest/concepts/work-pools/
- K8s Job TTL: https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/
- Pod GC: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection

If you share your pool’s current variables (especially `stream_output` and `finished_job_ttl`) I’ll give you the minimal JSON patch to make logs stick around in Prefect.
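And if you’d rather set this per deployment from Python (closer to what you were trying), the override belongs on the deployment, not inside the flow. A rough, untested sketch assuming Prefect 2.19’s `flow.deploy()`, an existing Kubernetes work pool, and an already-built image (pool name and image are placeholders):

```python
from prefect import flow

@flow(log_prints=True)
def my_flow():
    print("doing nightly work")  # captured by Prefect when output streaming is on

if __name__ == "__main__":
    my_flow.deploy(
        name="nightly",
        work_pool_name="k8s-pool",            # placeholder pool name
        image="my-registry/my-image:latest",  # placeholder, assumed to exist already
        build=False,                          # don't build the image here
        push=False,                           # don't push the image here
        # job_variables override the work pool's base job template variables per deployment
        job_variables={
            "stream_output": True,       # stream container logs to Prefect
            "finished_job_ttl": 604800,  # keep the Job object for 7 days
        },
    )
```

The keys in `job_variables` have to match variables your pool’s template actually exposes; the default Kubernetes template exposes both of these.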