# ask-community
r
Getting some odd behaviour with a task: it runs fine and finishes successfully - but is then started again? The task's duration timer stops but it is still printing to logs. Using DaskExecutor. It is a long-running task (57 mins).
Looking at my dask worker logs, after this task had finished and the flow had moved on to other tasks, the worker ran out of memory. The worker was then restarted and it seems to have re-run all of the tasks in the flow.
m
@Robert Hales FWIW - we have also experienced the same thing when tasks run out of memory and then occasionally restart. We thought at first that enabling Prefect Cloud's task version locking would prevent the task from running again, but that wasn't always the case. The workaround we ended up adopting for critical tasks is to save the task run parameters to a table and then query that table at the start of the task to check if the task has already been run - if it has, we short-circuit the task run.
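A minimal sketch of that short-circuit pattern, assuming a hypothetical sqlite tracking table; the table name and the already_ran / record_run helpers are placeholders, not Prefect APIs:

import sqlite3

import prefect
from prefect import task

DB = "task_runs.db"  # hypothetical tracking store


def already_ran(task_name, params):
    # True if this (task, params) combination was recorded as completed before.
    conn = sqlite3.connect(DB)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS task_runs (task_name TEXT, params TEXT)"
        )
        row = conn.execute(
            "SELECT 1 FROM task_runs WHERE task_name = ? AND params = ?",
            (task_name, repr(params)),
        ).fetchone()
        return row is not None
    finally:
        conn.close()


def record_run(task_name, params):
    # Record a completed (task, params) combination.
    conn = sqlite3.connect(DB)
    try:
        conn.execute(
            "INSERT INTO task_runs (task_name, params) VALUES (?, ?)",
            (task_name, repr(params)),
        )
        conn.commit()
    finally:
        conn.close()


@task
def critical_task(params):
    logger = prefect.context.get("logger")
    if already_ran("critical_task", params):
        logger.info("Already ran with these parameters, short-circuiting")
        return None
    # ... the expensive work would happen here ...
    record_run("critical_task", params)
    return params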
r
Hmmmm interesting, thanks for the info. Is this the expected behaviour from prefect or is it a limitation on the dask side?
m
good question - I defer to the prefect team to answer this because I am not entirely sure
I guess for some additional info - a worker restart is mentioned in dask's memory management docs (https://distributed.dask.org/en/latest/worker.html#memory-management):
At 95% of memory load (as reported by the OS), terminate and restart the worker
There is an option when configuring dask to turn off memory management - but I wonder if that is recommended by prefect (this is done by setting memory_limit=0)
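For reference, a rough sketch of what that could look like when the flow spins up its own cluster through Prefect's DaskExecutor (the flow, task, and worker counts here are illustrative); memory_limit=0 disables dask's per-worker memory management entirely, so the nanny never kills and restarts a worker for memory:

from prefect import Flow, task
from prefect.executors import DaskExecutor


@task
def hello():
    return "hi"


with Flow("no-memory-limit") as flow:
    hello()

# memory_limit=0 turns off the per-worker memory limit, so workers are not
# terminated at the 95% threshold mentioned in the Dask docs above.
flow.executor = DaskExecutor(
    cluster_class="dask.distributed.LocalCluster",
    cluster_kwargs={"n_workers": 2, "threads_per_worker": 1, "memory_limit": 0},
)

For a standalone worker like the one started later in this thread, the equivalent is the dask-worker --memory-limit 0 flag.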
d
Can you check whether the heartbeat is the problem? I had a problem like that and disabling the heartbeat on the flow fixed it
r
I like the worker restart behaviour as we use libpostal, which eats a tonne of memory that doesn't seem to be freed. However, I would expect prefect to pick up where it left off.
There was no mention of heartbeat in the logs, and there were these three lines in the dask worker logs:
2021-08-24T11:19:08.209Z	distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting

2021-08-24T11:19:08.320Z	distributed.nanny - INFO - Worker process 166 was killed by signal 15

2021-08-24T11:19:08.427Z	distributed.nanny - WARNING - Restarting worker
k
Hey @Robert Hales, what do your Prefect logs look like? Anything with Lazarus?
r
Hey @Kevin Kho, no, nothing with Lazarus - all the expected logging output from my tasks.
k
When you say all the tasks re-ran, do you see other evidence besides logs (files produced? memory used? etc.)? Does it take the same amount of time?
r
Yep, dask worker is clearly working on the same task again - same memory usage, same length of time
The task ended up running for a third time
k
Ok will ask the team about this
In the meantime though, maybe explicit caching would help you?
r
Yes, have been looking into that, however I really want to re-run the tasks whenever the flow is re-run. My understanding is that caching would persist across flow runs?
k
Yes but you can invalidate it or use a lower duration
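A small sketch of that kind of caching, assuming Prefect's cache_for / cache_validator task options (the one-hour duration and the all_inputs validator are just illustrative choices):

import datetime

from prefect import task
from prefect.engine.cache_validators import all_inputs


# Cache the output for one hour across flow runs; the cached state is
# treated as invalid early if the task's inputs change.
@task(cache_for=datetime.timedelta(hours=1), cache_validator=all_inputs)
def expensive_task(x):
    return x * 2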
r
Okay, cheers will keep looking into it
k
Hey, so this is expected behavior because Prefect submits the tasks to Dask and Dask builds the computation graph. If something crashes, upstream tasks will have to be recomputed from Dask's point of view. In order to run only once, you can try turning on Version Locking for the flow in the Flow settings so that a result will be loaded.
m
@Robert Hales apologies for the selfish request, but I would appreciate it if you gave version locking a try, because I had tried it a while back and I am curious whether it is more reliable by now …
r
Interesting, okay - as this is expected, I think it could be a little clearer what is going on; it kind of looks like a bug, with tasks running but durations stopped etc.
If version locking is Cloud-only I can't give it a try unfortunately @Marwan Sarieddine
m
I see - thanks for letting me know …
r
@Kevin Kho Is this behaviour the same on the LocalDaskExecutor?
k
I am not sure, but I suspect that if you crash on LocalDaskExecutor due to memory, there is no nanny to help you restart the worker (because all the workers live in the same place). The flow will likely fail and then have to be restarted, but it should restart at the right place.
r
Interesting, will continue to look into this and let you know if anything else comes up. Cheers as always @Kevin Kho!
Hey @Kevin Kho, trying to understand if checkpointing can help me here? I set checkpoint=True, result=PrefectResult() on my task, and the result was populated in the UI, but when the downstream task caused a worker reboot (sys.exit()) the checkpointed task reran?
k
Hey, so there is a difference between checkpointing and caching (and I think both can help). Checkpointing is about persisting the results of a task: when the task runs, the output is saved to the Result, so when you restart a flow, it will load the results of the tasks that already succeeded if they are needed. Caching, on the other hand, applies to future flow runs. For example, if you cache a task for 24 hours, it won't run again across other flow runs for 24 hours. In your case, I think you just want to restart the same flow run, so you would just checkpoint. Upon flow restart, it will fetch the results of already SUCCESSFUL tasks. Are you seeing it running again entirely?
r
Okay, thanks, that's what I interpreted as the expected behaviour too. I was seeing Running -> Successful -> Running -> Successful. Will try to knock up an MRE for you
import sys
import time

import prefect
from prefect import Flow, task
from prefect.engine.results import PrefectResult


@task(checkpoint=True, result=PrefectResult())
def task_to_checkpoint():
    logger = prefect.context.get("logger")
    logger.info("I should be checkpointed!")
    time.sleep(10)
    return [1, 2, 3]


@task
def bad_task(a):
    sys.exit()


with Flow("checkpointing") as flow:
    bad_task(task_to_checkpoint())
the worker being run with
dask-worker 192.168.0.46:8786 --nprocs 1 --nthreads 1
I would expect to only see "I should be checkpointed!" once and bad_task to be retried, however this is not the behaviour I see.
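For completeness, a sketch of how the MRE above might be pointed at that scheduler, assuming it is listening on 192.168.0.46:8786 as implied by the dask-worker command:

from prefect.executors import DaskExecutor

# Run the flow defined above against the already-running scheduler that the
# dask-worker process is connected to.
flow.run(executor=DaskExecutor(address="tcp://192.168.0.46:8786"))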
k
Oh I see. So Prefect in general does not play nicely with sys.exit() calls because there is some exit logic that needs to happen. I would suggest you raise FAIL instead to fail that task.
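A minimal sketch of what Kevin is suggesting, using Prefect's FAIL signal in place of the process exit:

from prefect import task
from prefect.engine.signals import FAIL


@task
def bad_task(a):
    # Raising FAIL marks the task as Failed through Prefect's normal exit
    # logic instead of killing the worker process outright.
    raise FAIL("simulated failure")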
r
The sys.exit is to emulate the worker being killed by the nanny for memory reasons, so in that case I can't raise a FAIL.
The state is correctly updated to Success and the result populated before the sys.exit, so prefect should be able to recover, despite the exit logic not running?
k
Oh I see what you’re going for. I think it should be, but I’ll give this a test in a bit
r
Cheers! Clocking off time this side of the pond, will catch up tomorrow
k
Did this task fail for you? It doesn’t even fail for me. Just running indefinitely.
Are you restarting the same flow run or creating a new flow run?
r
Yeah, it failed eventually with Unexpected error: KilledWorker('bad_task-2c4dec7ec8e142589c415e5d673b6843', <WorkerState '<tcp://10.130.46.68:49660>', name: <tcp://10.130.46.68:49660>, memory: 0, processing: 2>), but again this is just emulating that memory leak. So in the real tasks, the worker would have high unmanaged memory -> gets killed by the nanny -> then the flow is rerun with more available memory and succeeds. However, during this re-run all of the successful tasks are run again - even if they have checkpointing on (like in the example).
Obviously the best-case scenario would be not having memory leaks, but that's down to some libs we use.
I am not restarting anything through the UI, this is all done by dask/prefect in the background on worker loss
@Kevin Kho any ideas on this???
k
Hey sorry @Robert Hales, this slipped me. Will test now
Hey, I'm not sure there is anything we can do for you on the Prefect side because the successful tasks are restarted due to Dask's computation graph. From Dask's point of view, it needs to recompute all the upstream dependencies. This is why we have version locking on Cloud to address this. I am not seeing a way to get this to exit more gracefully either. If it could, then I believe the upstream tasks would be respected. Maybe you've seen this already, but the best suggestion I have would be to potentially reduce unmanaged memory with this. There is an env variable he shows. I have also seen people suggest upgrading to 2021.6.0 for better memory management.
I chatted with the team and maybe you can use a state_handler where the old_state is Success and the new_state is Running. Short-circuit this task from running by directly returning the Success state.
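A rough sketch of that state handler, assuming the Prefect 1.x handler signature of (task, old_state, new_state) and that a state returned from the handler is used as the new state; the handler and task names here are made up:

from prefect import task


def skip_rerun_if_successful(task, old_state, new_state):
    # If an already-successful task is asked to move back to Running
    # (e.g. after a worker restart), hand back the existing Success state
    # so the task body is not executed again.
    if old_state.is_successful() and new_state.is_running():
        return old_state
    return new_state


@task(state_handlers=[skip_rerun_if_successful])
def long_running_task():
    ...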
r
Thanks for this, will look at the provided link and the state handler. Cheers!