# ask-marvin
v
@Marvin I'm on Prefect 2.19.8 and I'm running into a problem with some deployments: some flows end up crashing with the following error:
Reported flow run 'e2c32ac6-10e7-4dc2-bcc6-84d2242f67ee' as crashed: Flow run infrastructure exited with non-zero status code -1.
prefect.flow_runs.worker Job 'load-databricks-20250826113252283070-rr95x': No pods found for job.
prefect.flow_runs.worker Job 'load-databricks-20250826113252283070-rr95x': Job reached backoff limit.
The Finished Job TTL (Optional) is set to 48h and stream_output is true, so the Job stays around for 48h, but the Pod doesn't, so I can't find out why it crashed. I'm on an EKS cluster. Could there be some EKS setting to keep the Pod around after the crash?
m
thought for 563.5 seconds
Thanks for the details, and for confirming you're on Prefect 2.19.8.

What the logs indicate

- "No pods found for job.": the worker tried to look up the Job's Pods (via the `job-name=<job_name>` label) to stream logs, but found none. This can happen when:
  - the Pod was created and finished very quickly, and Kubernetes GC already cleaned it up;
  - the Job failed before it even started a Pod (e.g., an admission webhook error, quota, invalid spec);
  - the worker's RBAC doesn't allow listing/reading Pods/logs (rarer; you'd normally see a 403 error).
- "Job reached backoff limit.": the Job hit the Kubernetes `backoffLimit` (the number of failed Pod attempts) and was marked "Failed".

Why the Job lasts 48h but the Pod disappears

- Prefect's "Finished Job TTL" maps to `spec.ttlSecondsAfterFinished` on the Job. That controls when the Job is removed after it finishes, but it doesn't stop other cluster mechanisms from removing terminated Pods before then.
- On managed clusters like EKS, the kube-controller-manager garbage-collects terminated Pods once their count exceeds an internal threshold. In that situation, old "Succeeded/Failed" Pods can be removed quickly even while the Job is still present. On EKS you can't adjust control-plane flags (such as `--terminated-pod-gc-threshold`).
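To confirm what the worker actually rendered, you can read the field straight off a live Job; the job name below is just the one from your logs:
```
# Show the TTL the worker set on a finished Job (empty output = field not set)
kubectl get job load-databricks-20250826113252283070-rr95x -n <namespace> \
  -o jsonpath='{.spec.ttlSecondsAfterFinished}'
```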
How to retain better evidence for diagnosis

1) Adjust the Job template (work pool)

- Make sure the Job isn't deleted automatically (i.e., a high TTL or `null`), and raise the failure tolerance and the termination grace period so logs have time to drain:
```
{
  "job": {
    "spec": {
      "ttlSecondsAfterFinished": null,
      "backoffLimit": 6,
      "template": {
        "spec": {
          "restartPolicy": "Never",
          "terminationGracePeriodSeconds": 60,
          "containers": [
            {
              "name": "prefect-job",
              "env": [
                {
                  "name": "PREFECT_LOGGING_LEVEL",
                  "value": "DEBUG"
                }
              ]
            }
          ]
        }
      }
    }
  }
}
```
- Apply this template as the "base job template" on your Kubernetes work pool:
```
prefect work-pool create "meu-pool-k8s" --type kubernetes --base-job-template ./base-job-template.json --overwrite
```
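If you'd rather start from the stock Kubernetes template and edit it than write the JSON by hand, recent 2.x CLIs can dump it for you; this sketch assumes your 2.x version ships the subcommand (check `prefect work-pool --help`):
```
# Dump Prefect's default Kubernetes base job template to a file for editing
prefect work-pool get-default-base-job-template --type kubernetes > base-job-template.json
```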
Reference: Prefect 2.x – Kubernetes Jobs

Notes:
- `ttlSecondsAfterFinished: null` prevents automatic deletion via the TTL. Even so, the cluster's GC can still delete terminated Pods if the threshold is reached.
- A higher `backoffLimit` keeps more attempts (more Pods) around for investigating the failure pattern, but it also adds cost and noise; use it sparingly.

2) Get logs out of the Pod's lifecycle

Even with a high TTL, GC can remove Pods. The most reliable way to avoid losing logs is to ship them to a logging backend:
- Enable log collection to CloudWatch on EKS (fluent-bit/CloudWatch Logs).
- Alternatives: Datadog, ELK, OpenSearch, etc.
This guarantees the logs remain available even if the Pod is cleaned up.

3) Quick environment checks

- RBAC: the worker's ServiceAccount needs `get/list/watch` on Pods and `get` on `pods/log` in the namespace. If there are restrictions, Prefect may fail to locate Pods or stream their logs (a minimal Role sketch follows the commands below).
- K8s events: at the moment of the failure, run:
```
kubectl describe job <name>
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -n 100
```
This usually surfaces reasons such as ImagePullBackOff, OOMKilled, admission denials, quota limits, etc.
- Watch in real time when you try to reproduce:
```
kubectl get pods -n <namespace> -l job-name=<job_name> --watch
```
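As promised in the RBAC check above, here is a minimal Role sketch for the worker's ServiceAccount. The name and namespace are illustrative, and the verbs on `jobs` assume the worker both creates and cleans up Jobs; adjust to your actual setup:
```
# Illustrative Role; bind it to the worker's ServiceAccount with a RoleBinding
# in the same namespace. Names/namespace below are assumptions, not your config.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prefect-worker
  namespace: prod
rules:
  - apiGroups: [""]    # core API group
    resources: ["pods", "pods/log", "pods/status"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs", "jobs/status"]
    verbs: ["get", "list", "watch", "create", "delete"]
```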
4) On "keeping the Pod alive"

Technically there's no way to keep a container alive after it has already exited with an error; what you can do is:
- keep the Pod object from being deleted quickly (a high TTL on the Job, and avoiding a pile-up of terminated Pods in the cluster);
- persist logs outside the Pod (the most recommended approach on EKS).
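Purely as a temporary debugging aid (nothing above recommends it for production): you can prevent the container from terminating at all by wrapping its command so it sleeps after a failure, which keeps the Pod around for `kubectl exec`. A sketch in Pod-spec terms, assuming the image has `/bin/sh` and that `python -m prefect.engine` is the command your worker renders (verify against a rendered Job before relying on it):
```
# Debug-only container command override: run the flow-run process and,
# if it fails, keep the container alive for an hour for inspection.
command: ["/bin/sh", "-c"]
args: ["python -m prefect.engine || sleep 3600"]
```
In the base job template this would take the place of the generated `args` on the `prefect-job` container; remember to remove it afterwards.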
Useful links
- Prefect 2.x – Kubernetes Jobs: docs
- Infrastructure concepts in Prefect 2.x: docs
- TTL for Kubernetes Jobs: TTL controller
- GC of terminated Pods: Pod garbage collection
- EKS managed control plane (controller-manager flags can't be customized): EKS control plane

Questions to help you further
- Which Kubernetes version is your EKS on?
- Is the Prefect worker running inside the cluster (as a Deployment) or outside (kubeconfig)? And which ServiceAccount/RoleBinding does it use?
- Do you already have centralized logging (CloudWatch/Fluent Bit, etc.) enabled on the cluster?
- Can you run `kubectl describe job <name>` right after a failure to see the events? Sharing a snippet of the events would go a long way toward identifying the root cause.
v
@Marvin this is my current base job template. What should I change?
```
{ "variables": { "type": "object", "properties": { "env": { "type": "object", "title": "Environment Variables", "default": [ { "name": "CURRENT_POD_ID", "valueFrom": { "fieldRef": { "fieldPath": "metadata.name" } } }, { "name": "PREFECT_API_URL", "value": "https://prefect.bi.tokstok.net.br/api" }, { "name": "PREFECT_API_TLS_INSECURE_SKIP_VERIFY", "value": "True" } ], "description": "Environment variables to set when starting a flow run.", "additionalProperties": { "type": "string" } }, "name": { "type": "string", "title": "Name", "description": "Name given to infrastructure created by a worker." }, "image": { "type": "string", "title": "Image", "default": "prefecthq/prefect:2.19.8-python3.11", "example": "docker.io/prefecthq/prefect:2-latest", "description": "The image reference of a container image to use for created jobs. If not set, the latest Prefect image will be used." }, "labels": { "type": "object", "title": "Labels", "description": "Labels applied to infrastructure created by a worker.", "additionalProperties": { "type": "string" } }, "command": { "type": "string", "title": "Command", "description": "The command to use when starting a flow run. In most cases, this should be left blank and the command will be automatically generated by the worker." }, "cpu_limit": { "type": "string", "title": "CPU Limit", "description": "The CPU allocation to limit for this pod." }, "namespace": { "type": "string", "title": "Namespace", "default": "prod", "description": "The Kubernetes namespace to create jobs within." }, "annotations": { "type": "object", "title": "Annotations", "default": { "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" }, "description": "Annotations applied to jobs created by a worker.", "additionalProperties": { "type": "string" } }, "cpu_request": { "type": "string", "title": "CPU Request", "description": "The CPU allocation to request for this pod." }, "memory_limit": { "type": "string", "title": "Memory Limit", "description": "The Memory allocation to limit for this pod." }, "stream_output": { "type": "boolean", "title": "Stream Output", "default": true, "description": "If set, output will be streamed from the job to local standard output." }, "cluster_config": { "allOf": [ { "$ref": "#/definitions/KubernetesClusterConfig" } ], "title": "Cluster Config", "description": "The Kubernetes cluster config to use for job creation." }, "memory_request": { "type": "string", "title": "Memory Request", "description": "The Memory allocation to request for this pod." }, "finished_job_ttl": { "type": "integer", "title": "Finished Job TTL", "default": 172800, "description": "The number of seconds to retain jobs after completion. If set, finished jobs will be cleaned up by Kubernetes after the given delay. If not set, jobs will be retained indefinitely." }, "image_pull_policy": { "enum": [ "IfNotPresent", "Always", "Never" ], "type": "string", "title": "Image Pull Policy", "default": "Always", "description": "The Kubernetes image pull policy to use for job containers." }, "service_account_name": { "type": "string", "title": "Service Account Name", "default": "prefect-worker", "description": "The Kubernetes service account to use for job creation." }, "job_watch_timeout_seconds": { "type": "integer", "title": "Job Watch Timeout Seconds", "default": 21600, "description": "Number of seconds to wait for each event emitted by a job before timing out. If not set, the worker will wait for each event indefinitely." }, "pod_watch_timeout_seconds": { "type": "integer", "title": "Pod Watch Timeout Seconds", "default": 600, "description": "Number of seconds to watch for pod creation before timing out." } },
"definitions": { "KubernetesClusterConfig": { "type": "object", "title": "KubernetesClusterConfig", "required": [ "config", "context_name" ], "properties": { "config": { "type": "object", "title": "Config", "description": "The entire contents of a kubectl config file." }, "context_name": { "type": "string", "title": "Context Name", "description": "The name of the kubectl context to use." } }, "description": "Stores configuration for interaction with Kubernetes clusters.\n\nSee from_file for creation.", "secret_fields": [], "block_type_slug": "kubernetes-cluster-config", "block_schema_references": {} } },
"description": "Default variables for the Kubernetes worker.\n\nThe schema for this class is used to populate the variables section of the default\nbase job template." },
"job_configuration": { "env": "{{ env }}", "name": "{{ name }}", "labels": "{{ labels }}", "command": "{{ command }}", "namespace": "{{ namespace }}", "job_manifest": { "kind": "Job", "spec": { "template": { "spec": { "volumes": [ { "name": "volume-to-avoid-eviction", "emptyDir": {} } ], "containers": [ { "env": "{{ env }}", "args": "{{ command }}", "name": "prefect-job", "image": "{{ image }}", "resources": { "limits": { "cpu": "{{ cpu_limit }}", "memory": "{{ memory_limit }}" }, "requests": { "cpu": "{{ cpu_request }}", "memory": "{{ memory_request }}" } }, "volumeMounts": [ { "name": "volume-to-avoid-eviction", "mountPath": "/tmp" } ], "imagePullPolicy": "{{ image_pull_policy }}" } ], "completions": 1, "parallelism": 1, "restartPolicy": "Never", "serviceAccountName": "{{ service_account_name }}" } }, "backoffLimit": 0, "ttlSecondsAfterFinished": "{{ finished_job_ttl }}" }, "metadata": { "labels": "{{ labels }}", "namespace": "{{ namespace }}", "annotations": "{{ annotations }}", "generateName": "{{ name }}-" }, "apiVersion": "batch/v1" }, "stream_output": "{{ stream_output }}", "cluster_config": "{{ cluster_config }}", "job_watch_timeout_seconds": "{{ job_watch_timeout_seconds }}", "pod_watch_timeout_seconds": "{{ pod_watch_timeout_seconds }}" } }
```