# ask-community
b
curious if anyone has any thoughts on how to avoid this kind of looping behavior when the dask worker is repeatedly killed after running out of memory...is this what version locking is for? I can't remember exactly how that works and it doesn't seem to really be documented anywhere
k
Hey @Brett Naul, the restarts you're seeing are part of the Lazarus process - you can turn this off in the Flow settings for this flow.
b
this is all the same flow run, but multiple task runs...lazarus is at the flow run level I thought?
k
Hmm, so I thought Lazarus sees running flow runs without running task runs and kicks in to resubmit those tasks, but you may be right. Do you see any intervention in the logs of this FlowRun alongside each of the "Starting task run..." messages?
b
just a whole lot of
Task '<task_name>': Starting task run...
over and over. the flow run status just stays "Running" throughout
k
Well, I do think version locking can help here, as it will ensure your tasks only run once and hopefully expedite the final state of this FlowRun without these nondescript TaskRun starts. It is disabled by default. You can also send me the FlowRunID and I can take a look from this side for more information; these timestamps do seem strange if this is all one TaskRun. 😄
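For reference, version locking can be toggled in the UI under the flow's settings, or through the GraphQL API. A rough sketch from Python is below; the Client.graphql call is standard Prefect 1.x, but the mutation name and its argument are from memory, so double-check them against the API schema:
from prefect import Client

client = Client()
# NOTE: the mutation name below is an assumption - confirm it in the Cloud GraphQL schema
client.graphql(
    """
    mutation {
      enable_flow_version_lock(input: {flow_id: "<your-flow-id>"}) {
        success
      }
    }
    """
)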
b
0fd1e166-ca54-4b56-999f-6da19479f79b
is the flow run ID but this happens all the time for us (because we OOM a lot). I think the
dask-worker
process is getting killed and dask is retrying the task over and over
k
I can definitely see that being an issue - taking a look
Hmm, I'm seeing a SKIP signal being raised with this TaskRun after it starts, and then this exact task is resubmitted for execution. Does that sound familiar at all?
b
there are other tasks that should SKIP, but this specific one should run (once)
k
Can you confirm the TaskRunID for this one? Feel free to DM me if you prefer
b
c47ec4c4-7a99-4365-9d25-915bd2fc2aa0
k
Thanks for the info
Looking deeper - I'm not sure there's a graceful way to handle OOM from the Prefect side if version locking isn't meeting your needs, nor am I sure how deep version locking reaches for TaskRun retries/resubmissions. So it may be best to add a conditional state handler that ends the task run based on the number of previous runs for troublesome tasks like these:
import prefect
from prefect.engine import signals

def too_many_times(task, old_state, new_state):
    # Fail the task if it has already been attempted once, instead of letting it loop
    if prefect.context.task_run_count > 1:
        raise signals.FAIL()
    return new_state
If you’re going this route, I’m fairly certain that’s the context object we’d want here.
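Attaching it would look something like this (troublesome_task is just a placeholder name here; state_handlers is the standard Task argument):
from prefect import task

@task(state_handlers=[too_many_times])
def troublesome_task():
    ...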
a
@Marwan Sarieddine and I have run into this problem a lot at various points. We don't have anything to add on the solution side other than that the only thing that ever worked was to try like hell to keep the Python memory from going beyond the limits.
we have version locking in place, but we don't actually know what it's doing or how it's helped. There's nothing really saying "you hit an OOM and this task wasn't restarted because of that"
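For reference, the "keep memory under the limit" knobs on the Dask side are the worker memory fractions (in addition to the per-worker --memory-limit itself). A rough sketch with illustrative values, not recommendations:
import dask

# Fractions of the per-worker memory limit at which dask reacts
dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling to disk
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively
    "distributed.worker.memory.pause": 0.80,      # pause accepting new tasks
    "distributed.worker.memory.terminate": 0.95,  # nanny terminates the worker
})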
b
@Alex Papanicolaou took a while, but I figured out what was causing this for me: a long time ago (pre-prefect) we had set
distributed.scheduler.allowed_failures=1000000
in our dask config and used other logic to identify stalled tasks; when we moved to prefect and ditched our old hand-rolled build system, we lost that logic, so the underlying dask task would just re-run over and over because of that config value. After reverting it to 3 we see a
KilledWorker
after a few retries, like we wanted. Probably not a very common situation, but figured I'd follow up anyway
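For anyone who hits the same thing, reverting that setting looks roughly like this (3 is the distributed default, after which the scheduler gives up and the task errors with KilledWorker):
import dask

# Limit how many times the scheduler will resubmit a task whose worker died;
# beyond this the task fails with KilledWorker instead of looping forever.
dask.config.set({"distributed.scheduler.allowed-failures": 3})
Since this is a scheduler setting, it needs to be in effect where the scheduler starts, e.g. via the config YAML or the DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES environment variable.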