# prefect-server
m
Hello 🙂 I’m experiencing a problem with some tasks getting stuck in the `Running` state even though the underlying k8s `RunNamespacedJob` task completed successfully. I’m using prefect `0.15.1` and `LocalDaskExecutor`. This happens almost always for the longer-running tasks of this kind. Any suggestion how to resolve this? Thanks!
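A rough sketch of the setup being described, assuming the Prefect 0.15.x API; the job body, image, and namespace below are placeholders:

```python
# Rough sketch only: job spec, image, and namespace are placeholders.
from prefect import Flow
from prefect.executors import LocalDaskExecutor
from prefect.tasks.kubernetes import RunNamespacedJob

job_body = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "example-sql-job"},
    "spec": {
        "template": {
            "spec": {
                "containers": [{"name": "main", "image": "example/sql-runner:latest"}],
                "restartPolicy": "Never",
            }
        }
    },
}

run_job = RunNamespacedJob(
    body=job_body,
    namespace="default",
    delete_job_after_completion=True,  # the k8s job is removed once it finishes
)

with Flow("run-namespaced-job-example") as flow:
    run_job()

# Threads were in use when the tasks got stuck in the Running state.
flow.executor = LocalDaskExecutor(scheduler="threads")
```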
k
Hey @Marko Jamedzija, for the `LocalDaskExecutor`, I’ve seen some people get around this by using processes instead of threads. Are you using processes already?
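Something like this, as a minimal sketch (the flow here is a stand-in for the real one):

```python
from prefect import Flow
from prefect.executors import LocalDaskExecutor

with Flow("processes-example") as flow:
    ...  # RunNamespacedJob tasks omitted for brevity

# Run the local Dask scheduler with processes instead of threads.
flow.executor = LocalDaskExecutor(scheduler="processes")
```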
m
Thanks @Kevin Kho! This happened when using `threads`. I tried using `processes` and indeed this didn’t happen, but I’m still evaluating how much it’s going to affect the resource usage. What is your advice here for setting `num_workers` higher than the number of cores?
I’m only running `RunNamespacedJob` tasks, which should be pretty lightweight.
k
I wouldn’t know as much, so don’t take my word for it, but I assume you don’t get any gains if `num_workers` is more than `n_cores` with processes because the resources are already exhausted. If the `RunNamespacedJob` is lightweight, the worker should just pick up the next task once that one is kicked off.
m
Thanks for the explanation Kevin! I’ll do some more tests to see 🙂
So, I started a flow on k8s using `KubernetesRun` with `cpu_limit=2` and used `LocalDaskExecutor(scheduler="processes", num_workers=4)`. It did run 4 `RunNamespacedJob` tasks in parallel successfully, but again got stuck in running the longest one. Do you have any other suggestions how to deal w/ this? Thanks!
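As a sketch, the configuration being described here (the flow body is omitted):

```python
from prefect import Flow
from prefect.executors import LocalDaskExecutor
from prefect.run_configs import KubernetesRun

with Flow("parallel-namespaced-jobs") as flow:
    ...  # RunNamespacedJob tasks omitted for brevity

# Flow-run pod limited to 2 CPUs, but 4 worker processes requested.
flow.run_config = KubernetesRun(cpu_limit=2)
flow.executor = LocalDaskExecutor(scheduler="processes", num_workers=4)
```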
k
So I think this can also happen when the Kubernetes pods are unable to get resources. Do you think that the 4 parallel jobs have enough resources to execute in parallel?
m
They do. All of them end. Even the one whose task is problematic ends (its k8s job gets executed and deleted). I can confirm this because they are executing SQL queries and I see them completed in the db.
k
Gotcha. At this point, I think the only thing to do is to try bumping up `cpu_limit` or reducing `num_workers` so that they are 1:1?
m
I’m trying to do that now 🙂 But even if it works, it’s a bit of resource overkill imo to use one core per task, when all it has to do is start a k8s job and monitor it until it’s completed.
k
I agree but I think something is potentially going off with the monitoring.
m
> reducing `num_workers` so that they are 1:1
This works. I reduced to 2 workers and it’s working. However, I still think this is an issue that needs fixing. I’ll inspect the resource usage in the cluster more tomorrow to be sure, but from what I saw so far it shouldn’t have been the reason for this behaviour 🙂
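For reference, a sketch of the combination that ended up working, again with the flow body omitted:

```python
from prefect import Flow
from prefect.executors import LocalDaskExecutor
from prefect.run_configs import KubernetesRun

with Flow("parallel-namespaced-jobs") as flow:
    ...  # RunNamespacedJob tasks omitted for brevity

# Worker processes matched 1:1 to the pod's CPU limit.
flow.run_config = KubernetesRun(cpu_limit=2)
flow.executor = LocalDaskExecutor(scheduler="processes", num_workers=2)
```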
k
Sounds good, at least we have something functioning for now
m
Hey Kevin, just to let you know that I inspected the resources and there was no problem there. My flow’s k8s job was using at most 5% of the CPU and there were enough resources for everything to run in the cluster. Also, `cpu_limit` is just used to stop the job if it “outgrows” this resource requirement, and the `LocalDaskExecutor` will use the pod’s available CPU count (which is independent of this value) to create the number of processes (unless it’s overridden w/ `num_workers`). From what I managed to conclude, that count is the number of cores of the underlying node. So for now I will just use nodes with a higher CPU count to achieve higher parallelism, but the problem of `num_workers` > CPUs remains 🙂 Thanks for the help in any case!
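A small illustration of that point; the exact way the default worker count is picked is an assumption here, but `multiprocessing.cpu_count()` inside a pod generally reports the node’s cores rather than the pod’s `cpu_limit`:

```python
import multiprocessing

from prefect.executors import LocalDaskExecutor

# Inside a pod, cpu_count() typically reports the node's cores rather than
# the pod's cpu_limit, so on an 8-core node this prints 8 even when the
# flow-run pod has cpu_limit=2 (numbers are illustrative).
print(multiprocessing.cpu_count())

# Pinning num_workers explicitly avoids relying on that default.
executor = LocalDaskExecutor(scheduler="processes", num_workers=2)
```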