Thread
#prefect-server

    Marko Jamedzija

    1 year ago
    Hello 🙂 I’m experiencing a problem with some tasks getting stuck in the Running state even though the underlying k8s RunNamespacedJob task completed successfully. I’m using prefect 0.15.1 and LocalDaskExecutor. This happens almost always for the longer-running tasks of this kind. Any suggestion how to resolve this? Thanks!
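
    For reference, a minimal sketch of the setup described above (Prefect 0.15.x style API); the Job manifest, names, and image below are hypothetical placeholders, not taken from the thread:

        # Minimal sketch, assuming Prefect 0.15.x and a placeholder Job manifest.
        from prefect import Flow
        from prefect.executors import LocalDaskExecutor
        from prefect.tasks.kubernetes.job import RunNamespacedJob

        # Hypothetical Kubernetes Job body; a real flow would build this per job.
        job_body = {
            "apiVersion": "batch/v1",
            "kind": "Job",
            "metadata": {"name": "example-job"},
            "spec": {
                "template": {
                    "spec": {
                        "containers": [
                            {"name": "main", "image": "alpine", "command": ["sleep", "300"]}
                        ],
                        "restartPolicy": "Never",
                    }
                }
            },
        }

        run_job = RunNamespacedJob(body=job_body, namespace="default")

        with Flow("namespaced-jobs") as flow:
            run_job()

        # LocalDaskExecutor defaults to a threaded scheduler, which is where the
        # stuck-in-Running behaviour was observed.
        flow.executor = LocalDaskExecutor()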

    Kevin Kho

    1 year ago
    Hey @Marko Jamedzija, for the LocalDaskExecutor, I’ve seen some people get around this by using processes instead of threads. Are you using processes already?
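
    If it helps, the suggested change is roughly a one-liner (a sketch, assuming the flow object from the earlier example):

        from prefect.executors import LocalDaskExecutor

        # Use separate worker processes instead of the default threaded scheduler.
        flow.executor = LocalDaskExecutor(scheduler="processes")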

    Marko Jamedzija

    1 year ago
    Thanks @Kevin Kho! This happened when using threads. I tried using processes and indeed this didn’t happen, but I’m still evaluating how much it’s going to affect resource usage. What is your advice here for setting num_workers higher than the number of cores?
    I’m only running RunNamespacedJob tasks, which should be pretty lightweight.

    Kevin Kho

    1 year ago
    I’m not sure on this one, so take it with a grain of salt, but I assume you don’t get any gains if num_workers is more than n_cores with processes because the resources are already exhausted. If the RunNamespacedJob task is lightweight, the worker should just pick up the next task once the job is kicked off.

    Marko Jamedzija

    1 year ago
    Thanks for the explanation, Kevin! I’ll do some more tests to see 🙂
    So, I started a flow on k8s using KubernetesRun with cpu_limit=2 and used LocalDaskExecutor(scheduler="processes", num_workers=4). It did run 4 RunNamespacedJob tasks in parallel successfully, but again got stuck on the longest-running one. Do you have any other suggestions on how to deal w/ this? Thanks!

    Kevin Kho

    1 year ago
    So I think this can also happen when the Kubernetes pods are unable to get resources. Do you think that the 4 parallel jobs have enough resources to execute in parallel?

    Marko Jamedzija

    1 year ago
    They do. All of them end. Even the one whose task is problematic ends (its k8s job gets executed and deleted). I can confirm this because they are executing SQL queries and I see them completed in the db.

    Kevin Kho

    1 year ago
    Gotcha. At this point, I think the only thing to do is to try bumping up cpu_limit or reducing num_workers so that they are 1:1?
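
    In other words, something like the following (a sketch of the two options, reusing the same flow object):

        from prefect.executors import LocalDaskExecutor
        from prefect.run_configs import KubernetesRun

        # Option 1: raise the pod's CPU limit to match the worker count.
        flow.run_config = KubernetesRun(cpu_limit=4)
        flow.executor = LocalDaskExecutor(scheduler="processes", num_workers=4)

        # Option 2: reduce the worker count to match the 2-CPU limit.
        flow.run_config = KubernetesRun(cpu_limit=2)
        flow.executor = LocalDaskExecutor(scheduler="processes", num_workers=2)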

    Marko Jamedzija

    1 year ago
    I’m trying to do that now 🙂 But even if it works, it’s a bit of resource overkill imo to use one core per task, when all it has to do is to start a k8s job and monitor until it’s completed.

    Kevin Kho

    1 year ago
    I agree, but I think something is potentially going wrong with the monitoring.

    Marko Jamedzija

    1 year ago
    “reducing num_workers so that they are 1:1”
    This works. I reduced to 2 workers and it’s working. However, I still think this is an issue that needs fixing. I’ll inspect the resource usage in the cluster more tomorrow to be sure, but from what I’ve seen so far it shouldn’t have been the reason for this behaviour 🙂

    Kevin Kho

    1 year ago
    Sounds good, at least we have something functioning for now

    Marko Jamedzija

    1 year ago
    Hey Kevin, just to let you know that I inspected the resources and there was no problem there. My flow’s k8s job was using at most 5% of the CPU and there were enough resources for everything to run in the cluster. Also, cpu_limit is just used to stop the job if it “outgrows” this resource requirement, and the LocalDaskExecutor will use the CPU count available to the pod (which is independent of this value) to decide the number of processes (unless it’s overridden w/ num_workers). From what I managed to conclude, that’s the number of cores of the underlying node. So for now I will just use nodes with a higher CPU count to achieve higher parallelism, but the problem with num_workers > CPUs remains 🙂 Thanks for the help in any case!
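
    A small diagnostic along these lines can confirm what the executor sees; this is an illustrative sketch rather than something from the thread (the task name and logging are made up):

        import multiprocessing

        import prefect
        from prefect import Flow, task

        @task
        def report_cpu_count():
            logger = prefect.context.get("logger")
            # A Kubernetes cpu_limit throttles the pod via cgroups but does not
            # change what multiprocessing reports, so this is typically the
            # underlying node's core count.
            logger.info("CPUs visible to the pod: %s", multiprocessing.cpu_count())

        with Flow("cpu-count-check") as flow:
            report_cpu_count()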