Brett Naul [04/05/2021, 6:12 PM]

Brett Naul [04/05/2021, 6:19 PM]

Kyle Moon-Wright [04/05/2021, 6:25 PM]

Brett Naul [04/05/2021, 6:31 PM]
Task '<task_name>': Starting task run...
over and over. The flow run status just stays "Running" throughout.

Kyle Moon-Wright [04/05/2021, 6:46 PM]

Brett Naul [04/05/2021, 6:49 PM]
0fd1e166-ca54-4b56-999f-6da19479f79b is the flow run ID, but this happens all the time for us (because we OOM a lot). I think the dask-worker process is getting killed and dask is retrying the task over and over.

Kyle Moon-Wright [04/05/2021, 6:55 PM]

Kyle Moon-Wright [04/05/2021, 7:06 PM]

Brett Naul [04/05/2021, 7:07 PM]

Kyle Moon-Wright [04/05/2021, 7:13 PM]

Brett Naul [04/05/2021, 7:14 PM]
c47ec4c4-7a99-4365-9d25-915bd2fc2aa0

Kyle Moon-Wright [04/05/2021, 7:33 PM]
Kyle Moon-Wright [04/05/2021, 7:35 PM]
```python
import prefect
from prefect.engine import signals

def too_many_times(task, old_state, new_state):
    if prefect.context.task_run_count > 1:
        raise signals.FAIL()
    return new_state
```
If you're going this route, I'm fairly certain that's the context object we'd want here.
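[Editor's note] The handler above caps retries by failing the task once `task_run_count` exceeds 1. A minimal sketch of that logic that runs without a Prefect installation: the `SimpleNamespace` and `FAIL` class here are stand-ins for `prefect.context` and `prefect.engine.signals.FAIL`, not the real objects.

```python
from types import SimpleNamespace

class FAIL(Exception):
    """Stand-in for prefect.engine.signals.FAIL."""

# Stand-in for prefect.context; in a real flow Prefect increments
# task_run_count each time the task is (re)run.
context = SimpleNamespace(task_run_count=1)

def too_many_times(task, old_state, new_state):
    # Fail hard once the task has already run once, instead of
    # letting dask resubmit it indefinitely after worker deaths.
    if context.task_run_count > 1:
        raise FAIL()
    return new_state

# First run: the handler passes the new state through unchanged.
assert too_many_times(None, "Pending", "Running") == "Running"

# Second run (e.g. after a killed dask worker): fail fast.
context.task_run_count = 2
try:
    too_many_times(None, "Running", "Retrying")
except FAIL:
    print("retry limit hit, failing the task")
```

In an actual Prefect 1.x flow the handler would be attached via the task's `state_handlers` argument (e.g. `@task(state_handlers=[too_many_times])`), with the real `prefect.context` and `signals.FAIL` in place of the stand-ins.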
Alex Papanicolaou [04/05/2021, 7:54 PM]

Alex Papanicolaou [04/05/2021, 7:55 PM]

Brett Naul [05/05/2021, 12:10 PM]
We had `distributed.scheduler.allowed_failures=1000000` in our dask config and used other logic to identify stalled tasks; when we moved to Prefect and ditched our old hand-rolled build system, we lost that logic, so the underlying dask task would just re-run over and over because of that config value. After reverting it to 3, we see a `KilledWorker` after a few retries like we wanted. Probably not a very common situation, but figured I'd follow up anyway.
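[Editor's note] For reference, `allowed_failures` controls how many times a task may be involved in a worker death before the Dask scheduler gives up and raises `KilledWorker`; a sketch of the setting Brett describes reverting, assuming the standard distributed YAML config location:

```yaml
# ~/.config/dask/distributed.yaml (location may vary by deployment)
distributed:
  scheduler:
    # Number of worker deaths a task may be involved in before the
    # scheduler marks it as erred with KilledWorker. 3 is the library
    # default; 1000000 effectively disables the limit, which is why
    # the OOM-killed task was resubmitted indefinitely.
    allowed_failures: 3
```

The same value can be set programmatically with `dask.config.set({"distributed.scheduler.allowed_failures": 3})` before the scheduler starts.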