Tom Klein
04/13/2022, 3:31 PM
RunNamespacedJob example (from https://github.com/anna-geller/packaging-prefect-flows/blob/master/flows_task_library/s3_kubernetes_run_RunNamespacedJob_and_get_logs.py ) --- we implemented it and got it to work, but it seems that it's now failing on:
VALIDATIONFAIL signal raised: VALIDATIONFAIL('More than one dummy pod')
because there seem to be many pod "residues" of previous runs:
['prefect-agent-7745fb9694-6fwk4', 'prefect-job-47d072a8-4pbsf', 'seg-pred-test-cm54l', 'seg-pred-test-doron', 'seg-pred-test-l2j5l', 'seg-pred-test-zvwld']
so wouldn't k8s keep the pods around given that we gave delete_job_after_completion = False? and even if the job is deleted successfully, wouldn't it keep the pods around? or are the pods supposed to be deleted automatically if the job is deleted…?
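(For context, a rough sketch of the pattern the linked example follows, as described in this thread: run the job with delete_job_after_completion=False, list pods whose names start with the job name, and raise VALIDATIONFAIL if more than one matches. The task classes are from the Prefect 1.x task library; the job body, names, and wiring here are assumptions, not the exact code from the repo.)

from prefect import Flow, task
from prefect.engine.signals import VALIDATIONFAIL
from prefect.tasks.kubernetes import ListNamespacedPod, RunNamespacedJob

JOB_NAME = "seg-pred-test"  # assumed job name, matching the pods listed below
job_body = {"apiVersion": "batch/v1", "kind": "Job", "metadata": {"name": JOB_NAME}}  # spec elided

# delete_job_after_completion only controls the Job object, not its pods
run_job = RunNamespacedJob(body=job_body, namespace="default", delete_job_after_completion=False)
list_pods = ListNamespacedPod(namespace="default")

@task
def validate_and_get_pod_name(pod_list):
    # keep only pods whose name starts with the job name; insist on exactly one
    names = [p.metadata.name for p in pod_list.items if p.metadata.name.startswith(JOB_NAME)]
    if len(names) > 1:
        raise VALIDATIONFAIL("More than one dummy pod")
    return names[0]

with Flow("run-namespaced-job-sketch") as flow:
    job = run_job()
    pods = list_pods(upstream_tasks=[job])
    pod_name = validate_and_get_pod_name(pods)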
Kevin Kho
04/13/2022, 3:48 PM
Anna Geller
04/13/2022, 6:01 PM
delete_job_after_completion is for the Kubernetes job, not for a pod. A single job can result in many pods afaik
In general, it's all configurable, you need to dig deeper into those Kubernetes tasks, Tom 🙂 but happy to help you if you have trouble understanding those.
Can you share the flow code for those Kubernetes tasks that seem confusing to you?
You can use DeleteNamespacedPod to clean those up
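(A minimal sketch of what cleaning up one of those leftover pods with DeleteNamespacedPod from the Prefect 1.x task library could look like; the parameter names are assumptions:)

from prefect import Flow
from prefect.tasks.kubernetes import DeleteNamespacedPod

delete_pod = DeleteNamespacedPod(namespace="default")

with Flow("delete-leftover-pod-sketch") as flow:
    # one of the leftover pods from the list above
    delete_pod(pod_name="seg-pred-test-cm54l")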
What's the end goal you're trying to achieve here? Do you want to keep those pods or delete them?
Tom Klein
04/13/2022, 9:35 PM
Anna Geller
04/14/2022, 12:30 AM
Tom Klein
04/14/2022, 10:58 AM
but if the flow fails before the delete, you will find yourself in the current state we are in, right? where you have multiple pods that have not been deleted, and the validation (that checks if there are other pods that start with the same prefix) would fail if you try to run it again … right?
i'm basically wondering what's the right way to cope with this situation - we can't just omit the validation (since then the logic of returning the "first" pod that is found would make no sense, no one guarantees it's necessarily the one that was just now created) - and i couldn't really find other ways to get the pod-name for the job that was just created (it's not returned by the ReadNamespacedJob task, for example, nor by the RunNamespacedJob itself)
Anna Geller
04/14/2022, 11:03 AM
so if RunNamespacedJob fails, the DeleteNamespacedJob doesn't run, leaving a zombie pod undeleted, correct?
Tom Klein
04/14/2022, 11:06 AM
there's no ReadNamespacedJob (in your example), there's a ListNamespacedPod to list all the pods and then filter for the ones that have a name starting with the name of our job --
we tried the ReadNamespacedJob as an alternative way of maybe getting the pod name directly… 🙂
but yes - what you wrote is correct and seems to be what actually happened - for whatever reason (doesn't even really matter why) - there was some initial zombie pod, after which the process didn't stop generating more of them (cause each time - even though it begins with a "delete_if_exists" for the job - that doesn't actually remove the pod. I'm not a k8s expert but it seems to be possible for the job to not exist anymore even though the pod does.) - and when the validation fails, the pod that was just created for the newly created job --- still exists
• initial delete step (DeleteNamespacedJob) fails since the job doesn't exist
• create and run new job
• list pods
• fail validation since now there are two pods with that name
• end delete step (of DeleteNamespacedPod) is not reached so a new zombie is created --- and even if it was reached, it would only take care of this current run, not of the other zombie --- but it can't even do that since it's unclear which of the two pods is "our" pod (this run's pod, that is)
and now the process begins again with 2 zombies instead of 1, and so on..
maybe we need like an initial cleanup step that also tries to delete all zombie pods with that name? my worry is that it's kind of a low-level fiddling with other runs… what if someone legitimately ran this flow more than once in parallel?
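(A sketch of that initial-cleanup idea: list all pods, keep the ones whose name starts with the job prefix, and delete them with a mapped task. As the concern above notes, this would also touch pods belonging to other parallel runs of the flow. Task classes are from the Prefect 1.x task library; the filter task is hypothetical:)

from prefect import Flow, task
from prefect.tasks.kubernetes import DeleteNamespacedPod, ListNamespacedPod

list_pods = ListNamespacedPod(namespace="default")
delete_pod = DeleteNamespacedPod(namespace="default")

@task
def zombie_pod_names(pod_list, prefix="seg-pred-test"):
    # every pod left over from previous runs, matched by name prefix
    return [p.metadata.name for p in pod_list.items if p.metadata.name.startswith(prefix)]

with Flow("initial-cleanup-sketch") as flow:
    pods = list_pods()
    # mapped delete: one DeleteNamespacedPod run per leftover pod
    delete_pod.map(pod_name=zombie_pod_names(pods))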
all i want is to have a single job act as an "atomic" operation (that cleans only after itself) - i don't mind if there are multiple such jobs running simultaneously etc.
Anna Geller
04/14/2022, 11:11 AM
"what you wrote is correct and seems to be what actually happened"
you can solve it using triggers - adding trigger=all_finished should ensure that the pod will get deleted even if RunNamespacedJob fails - add the same line to the delete task
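(A sketch of the trigger suggestion with the Prefect 1.x functional API: all_finished makes the delete task run once its upstream tasks have finished, whether they succeeded or failed. The wiring and pod name here are placeholders:)

from prefect import Flow
from prefect.triggers import all_finished
from prefect.tasks.kubernetes import DeleteNamespacedPod, RunNamespacedJob

job_body = {"apiVersion": "batch/v1", "kind": "Job", "metadata": {"name": "seg-pred-test"}}  # spec elided

run_job = RunNamespacedJob(body=job_body, namespace="default", delete_job_after_completion=False)
# all_finished: run even if the upstream RunNamespacedJob failed
delete_pod = DeleteNamespacedPod(namespace="default", trigger=all_finished)

with Flow("run-job-then-always-clean-up") as flow:
    job = run_job()
    delete_pod(pod_name="seg-pred-test-xxxxx", upstream_tasks=[job])  # placeholder pod name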
Tom Klein
04/14/2022, 11:13 AM
Anna Geller
04/14/2022, 11:14 AM
Tom Klein
04/14/2022, 11:14 AM
Anna Geller
04/14/2022, 11:15 AM
Tom Klein
04/14/2022, 11:16 AM
say you have seg-pred-123 and seg-pred-456 as pods when you start to run. which one do you delete?
i'm asking how to get the name of the pod created by RunNamespacedJob - without relying on there being only a single pod with that name prefix.
in kubectl this is achieved with describe job apparently, or something. doesn't seem possible via Prefect (or maybe it is and i'm missing something. that's what i'm asking)
Anna Geller
04/14/2022, 11:30 AM
"if we get to that line and there were two pods with that name to begin with, we wouldn't know which one to delete, right?"
all_finished is the most reliable and cleanest approach I can recommend at this time
Tom Klein
04/14/2022, 11:32 AM
all_finished will make sure the step runs, but if there's more than one pod with that name we wouldn't know which of the two (or three, or four) is "our" pod (the one that was generated by this specific prefect flow run)
we definitely want to allow for more than one instance of the job to run in parallel.
maybe the solution is to give the job a unique name per run…? (that way there's a 1:1 relation between "jobs with that name" and "corresponding pods")
the case of multiple pods can definitely happen if more than one person runs an instance of this flow
Anna Geller
04/14/2022, 11:34 AM
"if there's more than one pod with that name"
why would it be?
Tom Klein
04/14/2022, 11:34 AM
Anna Geller
04/14/2022, 11:35 AM
Tom Klein
04/14/2022, 11:35 AM
giving the job a unique name (e.g. my-cool-job-467fdfg5a) solves that problem (since there would only ever be one pod that matches that unique name), i just don't know if that's a "best practice". Seems to me like it's more of a workaround.
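(A sketch of the unique-name-per-run idea: build the Job manifest inside a task and append a random suffix to its name, so any pod it creates can be matched unambiguously by that prefix. The manifest keys are standard Kubernetes Job fields; the image and names are placeholders:)

import uuid
from prefect import Flow, task
from prefect.tasks.kubernetes import RunNamespacedJob

run_job = RunNamespacedJob(namespace="default", delete_job_after_completion=False)

@task
def unique_job_body(base_name="seg-pred-test"):
    # e.g. seg-pred-test-1a2b3c4d; this run's pods will start with that unique prefix
    name = f"{base_name}-{uuid.uuid4().hex[:8]}"
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "containers": [{"name": base_name, "image": "busybox"}],  # placeholder image
                    "restartPolicy": "Never",
                }
            }
        },
    }

with Flow("unique-job-name-sketch") as flow:
    run_job(body=unique_job_body())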
04/14/2022, 11:38 AM
Tom Klein
04/14/2022, 11:41 AM
it's the job that i don't know if we want to give a unique name to…
my knowledge of k8s is too limited to know if there's any important reason to want jobs of a similar "nature" (i.e. image, code, whatever) to have the same name… i'll ask our devops
anyway, it all would have been solved if we somehow could just get the name of the pod related to the job we just created via RunNamespacedJob
- i tried to use the Read task and the only thing that looked like an identifier was the controller-uid or something - but i dunno if that should/could be used as an indirect identifier for the pod
{'api_version': 'batch/v1',
'kind': 'Job',
'metadata': {'annotations': None,
'cluster_name': None,
'creation_timestamp': datetime.datetime(2022, 4, 13, 16, 1, 56, tzinfo=tzlocal()),
'deletion_grace_period_seconds': None,
'deletion_timestamp': None,
'finalizers': None,
'generate_name': None,
'generation': None,
'labels': {'controller-uid': '35ca5ffe-8583-42bb-8c98-ec1d413bf7cc',
'job-name': 'seg-pred-test'},
'managed_fields': [{'api_version': 'batch/v1',
'fields_type': 'FieldsV1',
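(Those labels are the ones the Kubernetes Job controller also puts on the pods it creates, so one way to find the pod for a specific job is a label selector such as job-name=seg-pred-test or controller-uid=<uid>. A sketch with ListNamespacedPod, assuming its kube_kwargs are passed through to the Kubernetes client's list_namespaced_pod call:)

from prefect import Flow, task
from prefect.tasks.kubernetes import ListNamespacedPod

list_pods = ListNamespacedPod(namespace="default")

@task
def pod_names(pod_list):
    return [p.metadata.name for p in pod_list.items]

with Flow("find-pods-by-job-label-sketch") as flow:
    # label_selector is a plain string in the Kubernetes API, e.g. "job-name=seg-pred-test"
    pods = list_pods(kube_kwargs={"label_selector": "job-name=seg-pred-test"})
    names = pod_names(pods)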
alternatively, we would need to give up our desire to "pull back" the logs into Prefect UI and just rely on our own internal logging mechanism
(and then we can set the job to delete resources after it's finished)..
this is what i'm trying right now:
# (flow-body snippet; create_and_run_job, read_job, print_job, list_pods,
# get_pod_ids, get_our_pod_name and delete_pod are tasks defined elsewhere)
#del_job = delete_if_exists()
k8s_job = create_and_run_job()   # RunNamespacedJob
#del_job.set_downstream(k8s_job)
v1job = read_job()               # ReadNamespacedJob, returns the job object
k8s_job.set_downstream(v1job)    # only read the job after it has been created
print_job_output = print_job(v1job)
# grab the controller-uid label from the job metadata...
controller_uid = v1job['metadata']['labels']['controller-uid']
# ...and use it as a label selector to find the pod(s) created for this job
# (note: the kubernetes client may expect label_selector as a string like "controller-uid=<uid>")
pods = list_pods(kube_kwargs={"label_selector": {"controller-uid": controller_uid}})
list_of_pods = get_pod_ids(pods)
pod_name = get_our_pod_name(pods)
delete_pod(pod_name)
Anna Geller
04/14/2022, 12:04 PM
Marvin
04/14/2022, 12:05 PM
Tom Klein
04/14/2022, 12:06 PM