
Marko Jamedzija

07/20/2021, 3:33 PM
Hello 🙂 I’m experiencing a problem with some tasks getting stuck in the `Running` state even though the underlying k8s `RunNamespacedJob` task completed successfully. I’m using prefect `0.15.1` and `LocalDaskExecutor`. This happens almost always for the longer-running tasks of this kind. Any suggestions on how to resolve this? Thanks!
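For context, a minimal sketch of the kind of setup being described (Prefect 0.15.x, `LocalDaskExecutor` with the threads scheduler). The image, namespace, and Job manifest below are placeholders, not details from this thread:
```python
from prefect import Flow
from prefect.executors import LocalDaskExecutor
from prefect.tasks.kubernetes.job import RunNamespacedJob

def job_body(name: str) -> dict:
    """Placeholder k8s Job manifest; the real jobs run SQL queries."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": "main", "image": "my-image:latest"}  # placeholder image
                    ],
                    "restartPolicy": "Never",
                }
            }
        },
    }

# Creates the Job, polls it until it completes, then deletes it.
run_job = RunNamespacedJob(namespace="default", delete_job_after_completion=True)

with Flow("k8s-jobs") as flow:
    run_job(body=job_body("sql-job-1"))
    run_job(body=job_body("sql-job-2"))

# The threads scheduler is where tasks were seen hanging in Running.
flow.executor = LocalDaskExecutor(scheduler="threads")
```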

Kevin Kho

07/20/2021, 3:34 PM
Hey @Marko Jamedzija, for the `LocalDaskExecutor`, I’ve seen some people get around this by using processes instead of threads. Are you using processes already?
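The suggested switch is roughly a one-line change on the flow from the sketch above:
```python
from prefect.executors import LocalDaskExecutor

# Run tasks in separate worker processes rather than threads.
flow.executor = LocalDaskExecutor(scheduler="processes")
```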

Marko Jamedzija

07/20/2021, 3:40 PM
Thanks @Kevin Kho! This happened when using `threads`. I tried using `processes` and indeed it didn’t happen, but I’m still evaluating how much it will affect resource usage. What is your advice here on setting `num_workers` higher than the number of cores?
I’m only running `RunNamespacedJob` tasks, which should be pretty lightweight.

Kevin Kho

07/20/2021, 3:46 PM
I wouldn’t know as much, so don’t take my word for it, but I assume you don’t get any gains if `num_workers` is more than `n_cores` with processes, because the resources are already exhausted. If `RunNamespacedJob` is lightweight, the worker should just pick up the next task once that one is kicked off.

Marko Jamedzija

07/20/2021, 3:48 PM
Thanks for the explanation Kevin! I’ll do some more tests to see 🙂
So, I started a flow on k8s using `KubernetesRun` with `cpu_limit=2` and `LocalDaskExecutor(scheduler="processes", num_workers=4)`. It did run 4 `RunNamespacedJob` tasks in parallel successfully, but the longest-running one again got stuck in the `Running` state. Do you have any other suggestions on how to deal with this? Thanks!
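For reference, a sketch of roughly that configuration (the image name is a placeholder): a flow pod capped at 2 CPUs driving 4 worker processes.
```python
from prefect.executors import LocalDaskExecutor
from prefect.run_configs import KubernetesRun

flow.run_config = KubernetesRun(
    image="my-registry/my-flow:latest",  # placeholder image
    cpu_limit=2,                         # flow pod capped at 2 CPUs
)
flow.executor = LocalDaskExecutor(scheduler="processes", num_workers=4)
```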

Kevin Kho

07/20/2021, 4:13 PM
So I think this can also happen when the Kubernetes pods are unable to get resources. Do you think that the 4 parallel jobs have enough resources to execute in parallel?

Marko Jamedzija

07/20/2021, 4:15 PM
They do. All of them end. Even the one whose task is problematic ends (its k8s job gets executed and deleted). I can confirm this because they are executing SQL queries and I see them completed in the db.

Kevin Kho

07/20/2021, 4:19 PM
Gotcha. At this point, I think the only thing to do is to try bumping up `cpu_limit` or reducing `num_workers` so that they are 1:1?
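The 1:1 workaround being floated here would look something like this on the same flow (a sketch, with a placeholder image):
```python
from prefect.executors import LocalDaskExecutor
from prefect.run_configs import KubernetesRun

# Keep worker processes and the pod's CPU limit 1:1.
flow.run_config = KubernetesRun(image="my-registry/my-flow:latest", cpu_limit=2)
flow.executor = LocalDaskExecutor(scheduler="processes", num_workers=2)
```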

Marko Jamedzija

07/20/2021, 4:21 PM
I’m trying to do that now 🙂 But even if it works, it’s a bit of resource overkill imo to use one core per task, when all it has to do is start a k8s job and monitor it until completion.

Kevin Kho

07/20/2021, 4:22 PM
I agree, but I think something is potentially going wrong with the monitoring.

Marko Jamedzija

07/20/2021, 4:36 PM
> reducing `num_workers` so that they are 1:1
This works. I reduced it to 2 workers and it’s working. However, I still think this is an issue that needs fixing. I’ll inspect the resource usage in the cluster more tomorrow to be sure, but from what I’ve seen so far it shouldn’t have been the reason for this behaviour 🙂
👍 1

Kevin Kho

07/20/2021, 4:38 PM
Sounds good, at least we have something functioning for now
🙂 1

Marko Jamedzija

07/21/2021, 2:42 PM
Hey Kevin, just to let you know that I inspected the resources and there was no problem there. My flow’s k8s job was using at most 5% of the CPU, and there were enough resources in the cluster for everything to run. Also, `cpu_limit` is just used to stop the job if it “outgrows” this resource requirement, and the `LocalDaskExecutor` will use the CPU count available to the pod (which is independent of this value) to decide the number of processes (unless it’s overridden with `num_workers`). From what I managed to conclude, that’s the number of cores of the underlying node. So for now I will just use nodes with a higher CPU count to achieve higher parallelism, but the problem of `num_workers` > CPUs remains 🙂 Thanks for the help in any case!
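A quick way to see the default being described, from inside the flow pod. This is only an illustration of the observation above (not Prefect internals), using the standard-library CPU count as a proxy for what the executor falls back to when `num_workers` is unset:
```python
import multiprocessing

# Kubernetes CPU limits throttle usage via cgroups but do not change the
# CPU count the pod reports, so this typically prints the node's core count.
print(multiprocessing.cpu_count())
```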
👍 1