Marko Jamedzija
07/20/2021, 3:33 PM
My flow gets stuck in a Running state even though the underlying k8s RunNamespacedJob task completed successfully. I'm using prefect 0.15.1 and LocalDaskExecutor. This happens almost always for the longer-running tasks of this kind. Any suggestion how to resolve this? Thanks!
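For context, the setup being described would look roughly like this. This is a minimal sketch, not Marko's actual flow: the flow name, job manifest, and image are placeholders, and only the body and namespace arguments are passed to RunNamespacedJob.

```python
from prefect import Flow
from prefect.executors import LocalDaskExecutor
from prefect.tasks.kubernetes import RunNamespacedJob

# Placeholder k8s Job manifest; the real jobs in this thread are long-running.
job_body = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "example-job"},
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {"name": "main", "image": "busybox", "command": ["sleep", "3600"]}
                ],
                "restartPolicy": "Never",
            }
        }
    },
}

run_job = RunNamespacedJob(body=job_body, namespace="default")

with Flow("k8s-jobs") as flow:
    run_job()

# LocalDaskExecutor defaults to a threaded scheduler.
flow.executor = LocalDaskExecutor()
```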
Kevin Kho
07/20/2021, 3:34 PM
With the LocalDaskExecutor, I've seen some people get around this by using processes instead of threads. Are you using processes already?
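A sketch of the suggested switch, continuing the example above; LocalDaskExecutor accepts a scheduler argument:

```python
from prefect.executors import LocalDaskExecutor

# Default: tasks run on a thread pool inside the flow-runner process.
flow.executor = LocalDaskExecutor(scheduler="threads")

# Suggested workaround: run each worker in its own process instead.
flow.executor = LocalDaskExecutor(scheduler="processes")
```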
Marko Jamedzija
07/20/2021, 3:40 PM
I'm using threads. I tried using processes and indeed this didn't happen, but I'm still evaluating how much it's going to affect the resource usage. What is your advice here for setting num_workers to more than the number of cores? These are RunNamespacedJob tasks, which should be pretty lightweight.
Kevin Kho
07/20/2021, 3:46 PM
I don't think it helps if num_workers is more than n_cores with processes, because the resources are already exhausted. If the RunNamespacedJob is lightweight, the worker should just pick up the next task once that is kicked off.
Marko Jamedzija
07/20/2021, 3:48 PM
I used KubernetesRun with cpu_limit=2 and LocalDaskExecutor(scheduler="processes", num_workers=4). It did run 4 RunNamespacedJob tasks in parallel successfully, but again got stuck in running the longest one. Do you have any other suggestions how to deal w/ this? Thanks!
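For reference, the combination just described would look something like this, again as a sketch against the flow object from above:

```python
from prefect.executors import LocalDaskExecutor
from prefect.run_configs import KubernetesRun

# Flow-run pod capped at 2 CPUs, but 4 worker processes requested.
flow.run_config = KubernetesRun(cpu_limit=2)
flow.executor = LocalDaskExecutor(scheduler="processes", num_workers=4)
```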
Kevin Kho
07/20/2021, 4:13 PM

Marko Jamedzija
07/20/2021, 4:15 PM
Kevin Kho
07/20/2021, 4:19 PM
Have you tried increasing the cpu_limit or reducing num_workers so that they are 1:1?
Marko Jamedzija
07/20/2021, 4:21 PM

Kevin Kho
07/20/2021, 4:22 PM
Marko Jamedzija
07/20/2021, 4:36 PM
> reducing num_workers so that they are 1:1
This works. I reduced to 2 workers and it's working. However, I still think this is an issue that needs fixing. I'll inspect the resource usage in the cluster more tomorrow to be sure, but from what I saw so far it shouldn't have been the reason for this behaviour 🙂
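The configuration that resolved it, sketched out: worker processes kept 1:1 with the pod's CPU limit.

```python
from prefect.executors import LocalDaskExecutor
from prefect.run_configs import KubernetesRun

# Workers 1:1 with the pod's CPU limit.
flow.run_config = KubernetesRun(cpu_limit=2)
flow.executor = LocalDaskExecutor(scheduler="processes", num_workers=2)
```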
Kevin Kho
07/20/2021, 4:38 PM

Marko Jamedzija
07/21/2021, 2:42 PM
cpu_limit is just used to stop the job if it "outgrows" this resource requirement, and the LocalDaskExecutor will use the pod's available CPU count (which is independent of this value) to create the number of processes (unless it's overridden w/ num_workers). From what I managed to conclude, it's the number of cores of the underlying node. So for now I will just use nodes with a higher CPU count to achieve higher parallelism, but the problem of num_workers > CPUs remains 🙂 Thanks for the help in any case!
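Marko's observation lines up with how the default worker count tends to be derived. When num_workers is not set, the local Dask scheduler falls back to a CPU count, and the stdlib call such defaults are typically based on is not cgroup-aware, so inside a pod it reports the node's cores rather than the pod's cpu_limit. A minimal illustration; the exact default can vary with the Dask version:

```python
import multiprocessing

# Not cgroup-aware: inside a pod with cpu_limit=2 on an 8-core node,
# this still prints 8 (the node's logical core count).
print(multiprocessing.cpu_count())
```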