haf
11/24/2021, 11:34 PM
haf
11/24/2021, 11:38 PM
Jake Kaplan
11/25/2021, 12:16 AM
Jake Kaplan
11/25/2021, 2:35 AM
dask_distributed.worker_client from within tasks.
I cannot guarantee this is recommended but it appears possible. Here is an example from another thread where someone seems to be doing this: https://prefect-community.slack.com/archives/CL09KU1K7/p1634664171072100?thread_ts=1634556225.467000&cid=CL09KU1K7
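For reference, a minimal sketch of what calling worker_client from inside a task can look like, using plain Dask rather than Prefect. It assumes the dask.distributed package is installed; the import path `dask.distributed.worker_client` is the one Dask documents. The names `square`, `sum_of_squares` and `run_demo` are made up for illustration, and the in-process cluster is only there to make the example self-contained. How Prefect checkpointing and retries interact with this is not shown here.

```python
from dask.distributed import Client, worker_client


def square(x):
    # Leaf work item; the scheduler may place it on any worker.
    return x * x


def sum_of_squares(values):
    # This function itself runs as a task on a worker. worker_client()
    # returns a client connected to the same scheduler and, by default,
    # secedes this task from the worker's thread pool while it waits.
    with worker_client() as client:
        futures = client.map(square, values)
        results = client.gather(futures)
    return sum(results)


def run_demo():
    # Assumption: a throwaway in-process cluster, just for this demo.
    with Client(processes=False, dashboard_address=None) as client:
        return client.submit(sum_of_squares, [1, 2, 3, 4]).result()


if __name__ == "__main__":
    print(run_demo())
```

In a Prefect flow the body of `sum_of_squares` would sit inside a task running on a Dask executor, as in the linked thread.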
I will forward this question to see if I can get you a better answer.
Kevin Kho
haf
11/25/2021, 1:04 PM
A -> B(checkpoint=False) -> C(checkpoint=True) -> D
A is config
B creates an ML model in a very specific format that crashes the serialiser if I checkpoint it
C is a computation that takes about 1½ hours but that may be interrupted
D is a save-to-storage task that is happily retried and is idempotent
How does all of this work together? I'm specifically looking at crash-resume semantics (not grey failure here). If A crashes: rerun from the start. If B crashes: rerun from cached A. If C crashes: rerun from cached A, then B, because B is not checkpointed. If D crashes: retry with the cached result from C?
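The expectation described above can be written down as a small toy model. This is only a sketch of the mental model in the question, not a statement of what Prefect actually does on a crash. `Step`, `steps_to_run` and `CHAIN` are hypothetical names, and A and D are assumed to be checkpointed because the question speaks of "cached A".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    checkpoint: bool  # True if the step's result is persisted


def steps_to_run(chain, crashed):
    """Return the steps that have to run again after `crashed` fails.

    Walk upstream from the crashed step: an upstream step has to be
    recomputed as long as its result was not checkpointed.
    """
    names = [step.name for step in chain]
    idx = names.index(crashed)
    start = idx
    while start > 0 and not chain[start - 1].checkpoint:
        start -= 1
    return names[start:idx + 1]


# The chain from the message; checkpoint flags for A and D are assumptions.
CHAIN = [
    Step("A", checkpoint=True),
    Step("B", checkpoint=False),
    Step("C", checkpoint=True),
    Step("D", checkpoint=True),
]

if __name__ == "__main__":
    for step in CHAIN:
        print(step.name, "crashes ->", steps_to_run(CHAIN, step.name))
```

Under this model a crash in C is the only case that forces an upstream step (B) to run again, which matches the expectation stated in the question.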
Now if I add worker_client into the mix, am I right to assume that if the parent node, say (a), crashes while work has also been distributed across (b) and (c), the flow goes like the previous paragraph, but if (b) crashes, Dask will transparently rerun it?
haf
11/25/2021, 1:05 PM
Kevin Kho
KilledWorker error I think
haf
11/26/2021, 1:22 PM
Kevin Kho