# ask-community
b
curious if anyone has any thoughts on how to avoid this kind of looping behavior when the dask worker is repeatedly killed after running out of memory...is this what version locking is for? I can't remember exactly how that works and it doesn't seem to really be documented anywhere
k
Hey @Brett Naul, the restarts you're seeing are part of the Lazarus process - you can turn this off in the Flow settings for this flow.
b
this is all the same flow run, but multiple task runs...lazarus is at the flow run level I thought?
k
Hmm, so I thought Lazarus sees running flow runs without running task runs and kicks in to resubmit those tasks, but you may be right. Do you see any intervention in the logs of this FlowRun alongside each of the "Starting task run..." messages?
b
just a whole lot of
Task '<task_name>': Starting task run...
over and over. the flow run status just stays "Running" throughout
k
Well, I do think version locking can help here, as it will ensure your tasks only run once and hopefully expedite the final state of this FlowRun without these nondescript TaskRun starts. It is disabled by default. You can also send me the FlowRunID and I can take a look from this side for more information; these timestamps do seem strange if this is all one TaskRun. 😄
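For reference, version locking can be toggled in the UI under the flow's settings, or through the GraphQL API. A rough sketch from Python is below; the Client.graphql call is standard Prefect 1.x, but the mutation name and its argument are from memory, so double-check them against the API schema:
from prefect import Client

client = Client()
# NOTE: the mutation name below is an assumption - confirm it in the Cloud GraphQL schema
client.graphql(
    """
    mutation {
      enable_flow_version_lock(input: {flow_id: "<your-flow-id>"}) {
        success
      }
    }
    """
)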
b
0fd1e166-ca54-4b56-999f-6da19479f79b
is the flow run ID but this happens all the time for us (because we OOM a lot). I think the
dask-worker
process is getting killed and dask is retrying the task over and over
k
I can definitely see that being an issue - taking a look
Hmm, I'm seeing a SKIP signal being raised with this TaskRun after it starts, and then this exact task is resubmitted for execution. Does that sound familiar at all?
b
there are other tasks that should SKIP, but this specific one should run (once)
k
Can you confirm the TaskRunID for this one? Feel free to DM me if you prefer
b
c47ec4c4-7a99-4365-9d25-915bd2fc2aa0
k
Thanks for the info
Looking deeper - I'm not sure there's a graceful way to handle OOM from the Prefect side if version locking isn't meeting your needs, nor am I sure how deep version locking reaches for TaskRun retries/resubmissions. So it may be best to add a conditional state handler that ends the task run based on the number of previous runs for troublesome tasks like these:
import prefect
from prefect.engine import signals

def too_many_times(task, old_state, new_state):
    # Fail the task if it has already been attempted once, instead of letting it loop
    if prefect.context.task_run_count > 1:
        raise signals.FAIL()
    return new_state
If you’re going this route, I’m fairly certain that’s the context object we’d want here.
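Attaching it would look something like this (troublesome_task is just a placeholder name here; state_handlers is the standard Task argument):
from prefect import task

@task(state_handlers=[too_many_times])
def troublesome_task():
    ...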
a
@Marwan Sarieddine and I have run into this problem a lot at various points. We don't have anything to add on the solution side other than that the only thing that ever worked was to try like hell to keep the Python memory from going beyond the limits.
we have version locking in place, but we don't actually know what it's doing or how it's helped. There's nothing really saying "you hit an OOM and this task wasn't restarted because of that"
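For reference, the "keep memory under the limit" knobs on the Dask side are the worker memory fractions (in addition to the per-worker --memory-limit itself). A rough sketch with illustrative values, not recommendations:
import dask

# Fractions of the per-worker memory limit at which dask reacts
dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling to disk
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively
    "distributed.worker.memory.pause": 0.80,      # pause accepting new tasks
    "distributed.worker.memory.terminate": 0.95,  # nanny terminates the worker
})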
b
@Alex Papanicolaou took a while, but I figured out what was causing this for me: a long time ago (pre-prefect) we had set
distributed.scheduler.allowed_failures=1000000
in our dask config and used other logic to identify stalled tasks; when we moved to prefect and ditched our old hand-rolled build system, we lost that logic, so the underlying dask task would just re-run over and over because of that config value. After reverting it to 3 we see a
KilledWorker
after a few retries, like we wanted. Probably not a very common situation, but figured I'd follow up anyway
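For anyone who hits the same thing, reverting that setting looks roughly like this (3 is the distributed default, after which the scheduler gives up and the task errors with KilledWorker):
import dask

# Limit how many times the scheduler will resubmit a task whose worker died;
# beyond this the task fails with KilledWorker instead of looping forever.
dask.config.set({"distributed.scheduler.allowed-failures": 3})
Since this is a scheduler setting, it needs to be in effect where the scheduler starts, e.g. via the config YAML or the DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES environment variable.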