
    Anders Segerberg

    7 months ago
    I have encountered a curious situation where some mapped tasks appear to be "zombies": they run for tens of minutes on end when the average task takes <1 min. The UI shows the task status as "Running", with the run duration increasing. Of 40 Dask worker vCPUs, maybe 15 or so at any one point will be occupied by these "zombies" as Flow runtime increases. What's strange is that it is possible to "jolt" these tasks out of this zombified Running state by opening the task in the UI. Once this happens, without fail, the task status resolves to "Success" after a minute or so, and the Dask worker running it is freed up for a new task. (Usually I find that opening the Logs tab is the quickest way to do this.) It seems clear that this really does "jolt" the task awake and force it to finish its state transitions; the effect is observable as increased workflow throughput from the newly freed Dask worker vCPU. So I don't think this is simply a lag in the UI. Is this a known issue?

    Kevin Kho

    7 months ago
    This is not known. Looks like we got Schrodinger’s task here. I can’t imagine why viewing the UI for the task would “jolt” it because it will query the API which gets the info from the database. So the only thing I can potentially see is the API itself is hanging? Maybe you can try scaling the API container/pod?
    Also, as a gut check, this only happens at “large” scale, right?

    Anders Segerberg

    7 months ago
    I suppose it is for large scale workflows, I've only seen it in that context. We map a task over 76k runs; the average mapped task will take ~45s to run. We use the Dask executor and end up with 40 concurrent workers. Memory usage of the machine running the Prefect services via docker-compose is fine, not approaching any dangerous thresholds, and vCPU usage is also reasonable.
    I think what I'll do is run this Flow and look for some of these Schrodinger tasks, and collect info on them via gQL. However, I haven't found the fully documented parameters you can access via gQL -- could you point me to that section of the docs, please?
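A minimal sketch of how the stalled task runs could be collected over GraphQL, using only the Python standard library. The endpoint URL (`http://localhost:4200/graphql`), the `task_run` field names (`map_index`, `start_time`, `heartbeat`), and the 10-minute threshold are assumptions to check against the schema shown in the Interactive API tab; the helper names are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumption: Prefect Server's GraphQL endpoint at a local default address.
GRAPHQL_URL = "http://localhost:4200/graphql"


def build_query(flow_run_id):
    """Build a query for task runs of one flow run that are still Running.

    The field names are assumptions; verify them in the Interactive API schema.
    """
    return """
    query {
      task_run(where: {flow_run_id: {_eq: "%s"}, state: {_eq: "Running"}}) {
        id
        map_index
        state
        start_time
        heartbeat
      }
    }
    """ % flow_run_id


def fetch_running(flow_run_id, url=GRAPHQL_URL):
    """POST the query and return the list of task_run records."""
    payload = json.dumps({"query": build_query(flow_run_id)}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.load(response)
    return body["data"]["task_run"]


def parse_ts(value):
    """Parse an ISO-8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def stale_runs(task_runs, now, max_age=timedelta(minutes=10)):
    """Return runs that started more than max_age ago; skip runs without a start_time."""
    return [
        run
        for run in task_runs
        if run.get("start_time") and now - parse_ts(run["start_time"]) > max_age
    ]
```

Something like `stale_runs(fetch_running("<flow run id>"), datetime.now(timezone.utc))` would then list the candidates, and comparing each run's `heartbeat` with its `start_time` might show whether a stalled task is still reporting in.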

    Kevin Kho

    7 months ago
    The documentation is attached in the Interactive API tab on the right side. We don’t have any specific page.
    I think it would specifically be memory usage of the API container right?

    Anders Segerberg

    7 months ago
    I guess, yeah
    Can you manually send a heartbeat?

    Kevin Kho

    7 months ago
    Heartbeats are sent automatically by Prefect Flows

    Anders Segerberg

    7 months ago
    I can definitely confirm that opening up the Task information (clicking on it / going to logs) frees up the worker. I let the workflow run for some 14 hours overnight without inspecting it; this morning, it had only gotten through 13k mapped tasks (out of 76k). I had the Prefect tab already loaded in my browser, but it was suspended. After reloading the window, without clicking on any UI components, I could see in the main task window (the larger box to the right of the "Activity" column) several tasks in RUNNING with a duration of ~11 hours, alongside other tasks that were running properly (so not zombies), running for a minute and then transitioning state.
    Without my doing anything, after a few seconds the duration of the stalled tasks changed to something on the order of 5 minutes, 8 minutes, etc., down from ~11 hours. (These smaller durations are expected for certain runtime failure cases, based on the timeouts I have set in various places.) But the duration keeps ticking up from there. I don't understand why Prefect goes from showing 11 hours to the more reasonable time upon loading; something in the internal state / timing / browser cache must be responsible.
    Finally, opening up each of these tasks will, after taking a minute or two to load, show the completed task execution graph. Most of them are failures from exceptions related to my database, which are expected FAIL cases. A few of them are SUCCESSes. Now we've freed some 11 workers. Checking back two hours later, 56k mapped tasks have completed, compared to 11k completed over the previous 14 hours. Obviously, jolting the tasks increased throughput. Resource usage appears unchanged. And last night, a few hours into the flow, vCPU usage was around 30%, and memory around 25%, for the Prefect machine. So, no problems there.
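For a rough sense of scale, the overnight figure can be set against what 40 fully busy workers would manage. This is only back-of-the-envelope arithmetic on the numbers quoted in this thread (40 workers, ~45 s per task, 76k tasks, 13k done in 14 hours); it assumes every task takes the average time and ignores scheduling overhead.

```python
def ideal_rate_per_hour(workers, avg_task_seconds):
    """Tasks per hour if every worker stays busy with average-length tasks."""
    return workers * 3600 / avg_task_seconds


ideal = ideal_rate_per_hour(workers=40, avg_task_seconds=45)  # tasks/hour
observed = 13_000 / 14                # tasks/hour during the overnight run
hours_for_full_run = 76_000 / ideal   # ideal wall-clock time for all 76k tasks

print(f"ideal:    {ideal:.0f} tasks/hour")
print(f"observed: {observed:.0f} tasks/hour ({observed / ideal:.0%} of ideal)")
print(f"ideal duration for 76k tasks: {hours_for_full_run:.2f} hours")
```

Under those assumptions the overnight run was well under a third of the ideal rate, which fits a sizeable share of the workers being held by stalled tasks.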
    Two thoughts: 1. Though the Flow settings have Zombie tasks and Lazarus enabled, is it possible that some Flow kwargs might result in overriding these? (We don't set any kwargs related to Lazarus or Zombie tasks.)

    Kevin Kho

    7 months ago
    Will ask the team about this. I don’t think you can remove those with Flow kwargs

    Anders Segerberg

    7 months ago
    We are on Prefect 0.14.22
    2. The majority of the tasks that stall are Failures related to UNCAUGHT database (I'm using Mongo) exceptions.
    So maybe it has something to do with recovering from an uncaught exception within the mapped task. Finally, under seemingly normal usage, we are using about 50% of the disk space available in the directory Prefect is installed in (the home directory). We also let some of our database queries spill to disk if they need it. So it is possible, I suppose, that we are exhausting disk space. I don't know how this would manifest or how likely it is.
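One way to rule the disk-space theory in or out would be to log usage of the home directory while the flow runs. A minimal sketch using the standard library's `shutil.disk_usage`; the 90% threshold and the helper names are hypothetical choices, not anything Prefect provides.

```python
import shutil
from pathlib import Path


def disk_usage_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def is_low_on_disk(path, threshold=0.9):
    """True when the used fraction has reached the (hypothetical) threshold."""
    return disk_usage_fraction(path) >= threshold


home = Path.home()
print(f"{home}: {disk_usage_fraction(home):.0%} used, low on disk: {is_low_on_disk(home)}")
```

Calling this periodically, or at the start of each mapped task, would show whether usage climbs toward 100% around the time tasks begin to stall.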
    Also, I think the "11 hours run duration" thing shown in the UI is a red herring. When I navigate away from the UI and then back to it, it seems to show the duration of a running task relative to the last time I visited the UI. For example, leaving the UI to type the above messages and then going back shows a run duration of 9 minutes 45 seconds, but it then quickly reverts to 45 seconds, the actual run duration.

    Kevin Kho

    7 months ago
    I asked the team, but no one has any immediate ideas. Yes though, the UI duration is a red herring. The duration is calculated on the fly by the UI rather than tracked by Prefect during Flow execution. So in that case, the end time is not updating properly. I believe exhausting disk space should lead to errors that are caught, because other people have run into that. You can also turn off results if you don’t need them, or specify a filename so that the latest file consistently overwrites the result file. The Mongo failures are interesting, but I’m not seeing why a “jolt” helps with that either 😅. I totally believe you, I just can’t wrap my head around it. 0.15.5, I think, introduced some things to raise exceptions, but I’m not super sure it would raise this.

    Anders Segerberg

    7 months ago
    Thanks for checking. For now I am just going to prune out the problematic data so those tasks don't get run; I'd like to investigate this at some point, and will post back to this thread when I do.