Marko Jamedzija 07/20/2021, 3:33 PM
Hi! I have tasks getting stuck in a `Running` state even though the underlying k8s job completed successfully. I'm using Prefect's `RunNamespacedJob`. This happens almost always for the longer-running tasks of this kind. Any suggestion how to resolve this? Thanks!
Kevin Kho 07/20/2021, 3:34 PM
Hey @Marko, I've seen some people get around this by using processes instead of threads. Are you using processes already?
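Something like this, for reference (a minimal sketch assuming Prefect 1.x and the `LocalDaskExecutor`; the flow name is made up):
```
from prefect import Flow
from prefect.executors import LocalDaskExecutor

with Flow("k8s-jobs") as flow:
    ...  # the k8s job tasks go here

# Default is the threaded scheduler:
# flow.executor = LocalDaskExecutor(scheduler="threads")

# Switching to processes gives each Dask worker its own process:
flow.executor = LocalDaskExecutor(scheduler="processes")
```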
Marko Jamedzija 07/20/2021, 3:40 PM
Not yet. I tried using processes and indeed this didn't happen, but I'm still evaluating how much it's going to affect the resource usage. What is your advice here for setting `num_workers` higher than the number of cores? They're mostly `RunNamespacedJob` tasks, which should be pretty lightweight.
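Roughly this setup (a sketch, assuming Prefect 1.x; the flow name, job manifest, and worker count are placeholders):
```
from prefect import Flow
from prefect.executors import LocalDaskExecutor
from prefect.tasks.kubernetes import RunNamespacedJob

# Placeholder k8s Job manifests -- the real ones are built elsewhere.
job_specs = [{"apiVersion": "batch/v1", "kind": "Job"}]

run_job = RunNamespacedJob(namespace="default")

with Flow("many-k8s-jobs") as flow:
    for spec in job_specs:
        run_job(body=spec)

# The question: more workers than cores, since these tasks mostly
# just wait on the k8s API?
flow.executor = LocalDaskExecutor(scheduler="processes", num_workers=8)
```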
Kevin Kho 07/20/2021, 3:46 PM
I don't think you gain much if `num_workers` is more than the number of cores with processes, because the resources are already exhausted. If the task is lightweight, the worker should just pick up the next task once that one is kicked off.
Marko Jamedzija 07/20/2021, 3:48 PM
It did run 4 `RunNamespacedJob` tasks in parallel successfully, but again got stuck running the longest one. Do you have any other suggestions how to deal w/ this? Thanks!
Kevin Kho 07/20/2021, 4:13 PM
Marko Jamedzija 07/20/2021, 4:15 PM
Kevin Kho 07/20/2021, 4:19 PM
Could you try reducing `num_workers` so that they are 1:1?
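i.e. something like this (a sketch; it assumes `multiprocessing.cpu_count()` reflects the cores the pod can actually use):
```
import multiprocessing

from prefect import Flow
from prefect.executors import LocalDaskExecutor

with Flow("k8s-jobs") as flow:
    ...  # same tasks as before

# One worker process per core -- 1:1
flow.executor = LocalDaskExecutor(
    scheduler="processes",
    num_workers=multiprocessing.cpu_count(),
)
```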
Marko Jamedzija 07/20/2021, 4:21 PM
Kevin Kho 07/20/2021, 4:22 PM
Marko Jamedzija 07/20/2021, 4:36 PM
This works. I reduced it to 2 workers and it's working now. However, I still think this is an issue that needs fixing. I'll inspect the resource usage in the cluster more tomorrow to be sure, but from what I've seen so far it shouldn't have been the reason for this behaviour 🙂
Kevin Kho 07/20/2021, 4:38 PM
Marko Jamedzija 07/21/2021, 2:42 PM
The k8s CPU limit is just used to stop the job if it "outgrows" this resource requirement, and the `LocalDaskExecutor` will use the pod's available CPU count (which is independent of this value) to decide how many processes to create (unless it's overridden w/ `num_workers`). From what I managed to conclude, that count is the number of cores of the underlying node. So for now I will just use nodes with a higher CPU count to achieve higher parallelism, but the problem of `num_workers` > CPUs remains 🙂 Thanks for the help in any case!
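In config terms, the takeaway is roughly this (a sketch assuming Prefect 1.x's `KubernetesRun` run config; the flow name and the numbers are made up):
```
from prefect import Flow
from prefect.executors import LocalDaskExecutor
from prefect.run_configs import KubernetesRun

with Flow("k8s-jobs") as flow:
    ...

# The pod's CPU request/limit -- the limit only kills the job if it
# outgrows it; it does not drive the executor's worker count.
flow.run_config = KubernetesRun(cpu_request="2", cpu_limit="2")

# Without an explicit num_workers, LocalDaskExecutor falls back to the
# CPU count it can see (the underlying node's cores), so set it
# explicitly to match the pod's request.
flow.executor = LocalDaskExecutor(scheduler="processes", num_workers=2)
```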