Thread
#prefect-server

    Marko Jamedzija

    1 year ago
    Hello 🙂 I’m experiencing a problem with some tasks getting stuck in the Running state even though the underlying k8s RunNamespacedJob task completed successfully. I’m using prefect 0.15.1 and LocalDaskExecutor. This happens almost always for the longer-running tasks of this kind. Any suggestion how to resolve this? Thanks!
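
    For reference, a minimal sketch of the setup described above (Prefect 0.15.x style API); the Job manifest, names, and image below are hypothetical placeholders, not taken from the thread:

        # Minimal sketch, assuming Prefect 0.15.x and a placeholder Job manifest.
        from prefect import Flow
        from prefect.executors import LocalDaskExecutor
        from prefect.tasks.kubernetes.job import RunNamespacedJob

        # Hypothetical Kubernetes Job body; a real flow would build this per job.
        job_body = {
            "apiVersion": "batch/v1",
            "kind": "Job",
            "metadata": {"name": "example-job"},
            "spec": {
                "template": {
                    "spec": {
                        "containers": [
                            {"name": "main", "image": "alpine", "command": ["sleep", "300"]}
                        ],
                        "restartPolicy": "Never",
                    }
                }
            },
        }

        run_job = RunNamespacedJob(body=job_body, namespace="default")

        with Flow("namespaced-jobs") as flow:
            run_job()

        # LocalDaskExecutor defaults to a threaded scheduler, which is where the
        # stuck-in-Running behaviour was observed.
        flow.executor = LocalDaskExecutor()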

    Kevin Kho

    1 year ago
    Hey @Marko Jamedzija, for the LocalDaskExecutor, I’ve seen some people get around this by using processes instead of threads. Are you using processes already?
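
    If it helps, the suggested change is roughly a one-liner (a sketch, assuming the flow object from the earlier example):

        from prefect.executors import LocalDaskExecutor

        # Use separate worker processes instead of the default threaded scheduler.
        flow.executor = LocalDaskExecutor(scheduler="processes")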

    Marko Jamedzija

    1 year ago
    Thanks @Kevin Kho! This happened when using threads. I tried using processes and indeed this didn’t happen, but I’m still evaluating how much it’s going to affect resource usage. What is your advice here for setting num_workers higher than the number of cores?
    I’m only running RunNamespacedJob tasks, which should be pretty lightweight.

    Kevin Kho

    1 year ago
    I’m not sure on this one, so take it with a grain of salt, but I assume you don’t get any gains if num_workers is more than n_cores with processes because the resources are already exhausted. If the RunNamespacedJob task is lightweight, the worker should just pick up the next task once the job is kicked off.

    Marko Jamedzija

    1 year ago
    Thanks for the explanation, Kevin! I’ll do some more tests to see 🙂
    So, I started a flow on k8s using KubernetesRun with cpu_limit=2 and used LocalDaskExecutor(scheduler="processes", num_workers=4). It did run 4 RunNamespacedJob tasks in parallel successfully, but again got stuck on the longest-running one. Do you have any other suggestions on how to deal w/ this? Thanks!

    Kevin Kho

    1 year ago
    So I think this can also happen when the Kubernetes pods are unable to get resources. Do you think that the 4 parallel jobs have enough resources to execute in parallel?

    Marko Jamedzija

    1 year ago
    They do. All of them end. Even the one whose task is problematic ends (its k8s job gets executed and deleted). I can confirm this because they are executing SQL queries and I see them completed in the db.

    Kevin Kho

    1 year ago
    Gotcha. At this point, I think the only thing to do is to try bumping up cpu_limit or reducing num_workers so that they are 1:1?
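
    In other words, something like the following (a sketch of the two options, reusing the same flow object):

        from prefect.executors import LocalDaskExecutor
        from prefect.run_configs import KubernetesRun

        # Option 1: raise the pod's CPU limit to match the worker count.
        flow.run_config = KubernetesRun(cpu_limit=4)
        flow.executor = LocalDaskExecutor(scheduler="processes", num_workers=4)

        # Option 2: reduce the worker count to match the 2-CPU limit.
        flow.run_config = KubernetesRun(cpu_limit=2)
        flow.executor = LocalDaskExecutor(scheduler="processes", num_workers=2)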

    Marko Jamedzija

    1 year ago
    I’m trying to do that now 🙂 But even if it works, it’s a bit of resource overkill imo to use one core per task, when all it has to do is to start a k8s job and monitor until it’s completed.

    Kevin Kho

    1 year ago
    I agree, but I think something is potentially going wrong with the monitoring.

    Marko Jamedzija

    1 year ago
    “reducing num_workers so that they are 1:1”
    This works. I reduced to 2 workers and it’s working. However, I still think this is an issue that needs fixing. I’ll inspect the resource usage in the cluster more tomorrow to be sure, but from what I’ve seen so far it shouldn’t have been the reason for this behaviour 🙂

    Kevin Kho

    1 year ago
    Sounds good, at least we have something functioning for now

    Marko Jamedzija

    1 year ago
    Hey Kevin, just to let you know that I inspected the resources and there was no problem there. My flow’s k8s job was using at most 5% of the CPU and there were enough resources for everything to run in the cluster. Also, cpu_limit is just used to stop the job if it “outgrows” this resource requirement, and the LocalDaskExecutor will use the CPU count available to the pod (which is independent of this value) to decide the number of processes (unless it’s overridden w/ num_workers). From what I managed to conclude, that’s the number of cores of the underlying node. So for now I will just use nodes with a higher CPU count to achieve higher parallelism, but the problem with num_workers > CPUs remains 🙂 Thanks for the help in any case!
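
    A small diagnostic along these lines can confirm what the executor sees; this is an illustrative sketch rather than something from the thread (the task name and logging are made up):

        import multiprocessing

        import prefect
        from prefect import Flow, task

        @task
        def report_cpu_count():
            logger = prefect.context.get("logger")
            # A Kubernetes cpu_limit throttles the pod via cgroups but does not
            # change what multiprocessing reports, so this is typically the
            # underlying node's core count.
            logger.info("CPUs visible to the pod: %s", multiprocessing.cpu_count())

        with Flow("cpu-count-check") as flow:
            report_cpu_count()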