Tom Klein
07/14/2022, 11:35 AM
Suppose I have a task X that spawns several "children" tasks Y1 ... Yn, and some of them are stuck on Running, never reaching a finished state - but I want to guarantee that the next phase in the flow executes correctly and ignores the few "hung" tasks.
What's the (default) behavior if I set those tasks (e.g. Y3, Y17 and Y91) to Skipped? Would the next task that depends on them still get executed (even if it has the default all_successful trigger)?
The reason I'm asking about Skipped is because I want to avoid a None response flowing downstream from these tasks.

Anna Geller
07/14/2022, 11:37 AM

Tom Klein
07/14/2022, 11:37 AM
results = X.map(inputs)
final_output = Z(results)
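A minimal Prefect 1.x sketch of the flow shape being discussed; the any_successful trigger and skip_on_upstream_skip=False on Z are illustrative assumptions (the knobs that come up later in the thread), not the flow's actual configuration:

from prefect import Flow, task
from prefect.triggers import any_successful

@task
def X(item):
    # one "child" task run (Y1 ... Yn) per mapped input
    return item

# illustrative settings: any_successful relaxes the default all_successful trigger,
# and skip_on_upstream_skip=False lets Z run even if some upstream runs end up Skipped
@task(trigger=any_successful, skip_on_upstream_skip=False)
def Z(results):
    # drop values from children that produced nothing usable (e.g. None)
    return [r for r in results if r is not None]

with Flow("parent-flow") as flow:
    results = X.map([1, 2, 3])
    final_output = Z(results)

With the defaults (all_successful trigger and skip_on_upstream_skip=True), Z instead ends up TriggerFailed if any child fails, or Skipped if any child is skipped.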
Anna Geller
07/14/2022, 2:11 PM

Tom Klein
07/14/2022, 2:22 PM
Skipped - would they still run?

Anna Geller
07/14/2022, 2:22 PM

Tom Klein
07/14/2022, 7:17 PM
I set them to Skipped and then restarted the flow --- and then the downstream task was skipped as well, and I got:
Upstream task was skipped; if this was not the intended behavior, consider changing `skip_on_upstream_skip=False` for this task.
on the downstream task.
Is there any other state I can put them in that would force the downstream task to run for this existing flow run? (Running the entire flow takes 24h, so I don't want to redo it.) Or is there another way to force the downstream task to run?
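One option for forcing things along on an existing flow run (a rough sketch, not necessarily what was recommended in the thread) is to set the hung task runs to a finished state from outside the flow via the Prefect 1.x Client; the task run IDs below are placeholders to look up in the UI or via GraphQL:

from prefect import Client
from prefect.engine.state import Success

client = Client()

# placeholder task run IDs for the hung children (e.g. Y3, Y17, Y91)
hung_task_runs = ["<task-run-id-1>", "<task-run-id-2>", "<task-run-id-3>"]

for task_run_id in hung_task_runs:
    # depending on the Prefect 1.x version, a `version` argument may also be expected here
    client.set_task_run_state(
        task_run_id=task_run_id,
        state=Success(message="manually marked as finished"),
    )

Note that forcing a Success state only changes orchestration metadata; it does not produce a usable checkpointed result, which is why (as noted further down in the thread) the downstream task would still see None for those children.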
Anna Geller
07/14/2022, 8:01 PM

Tom Klein
07/14/2022, 8:05 PM
None
I looked into the GraphQL docs and there doesn't seem to be a way to change a task's trigger on the fly? (If I could just change it to any_successful, I guess it would be enough?)

import cloudpickle
# load the checkpointed (pickled) results pulled down from S3
with open('salvage_me', 'rb') as f:
    zoom = cloudpickle.load(f)

# overwrite the three bad entries with a valid payload
mylines = '<some valid input>'
zoom[577] = (zoom[577][0], mylines)
zoom[578] = (zoom[578][0], mylines)
zoom[579] = (zoom[579][0], mylines)

# write the patched results back out
with open('salvaged', 'wb') as f:
    cloudpickle.dump(zoom, f)
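For context on the salvage attempt: the checkpointed results here live in S3 and are pickled (Prefect 1.x's default PickleSerializer uses cloudpickle), so after patching, the file would be uploaded back over the original result object. A minimal sketch with a made-up bucket and key:

import boto3

# hypothetical names - the real bucket/key are whatever result location this flow run used
s3 = boto3.client("s3")
s3.upload_file(
    Filename="salvaged",
    Bucket="my-prefect-results-bucket",
    Key="prefect-results/<flow-run-id>/<task-result-key>",
)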
Prefect didn't like it, and my flow is now stuck on this:
negative engineering 😞

Anna Geller
07/14/2022, 10:27 PM

Tom Klein
07/14/2022, 11:33 PM
• Regarding skip_on_upstream_skip=False: if the goal is to prevent negative engineering, why not give users more power over their existing flow run? Changing the code necessarily obligates a new flow run, which can be costly (both in time and in money or other resources).
• Ultimately, these 3 tasks failed due to various "weaknesses" in my own code and/or in Prefect - namely:
◦ my shell script did not create an output when the input was empty (this then led the task to fail when it tried to read the output file and provide it as the result of the task) - see the sketch after this message
◦ their input was empty because I had a small bug in my code, in the upstream task which splits the CSVs
◦ even if I wanted to force the failed tasks to be marked (artificially) as Successful, this would not have been enough, because the downstream task assumed all its inputs are valid (e.g. no None)
◦ the downstream task did not have skip_on_upstream_skip=False set on it, so I could not skip the upstream tasks
◦ the trigger for the downstream task was all_successful (which is the default), so it did not tolerate even 3 failures out of 580
• I started to write here various ideas for how Prefect could be more "fool-proof", but honestly I'm not sure I have a clear and universally good idea at the moment - most of the ideas I had basically involved users themselves taking part in some defensive engineering (which you like to call negative engineering) -- which is exactly the kind of thing Prefect aims to eliminate.
• But there must be some ways for Prefect to be more tolerant of issues like the ones I described, and to make it easier to recover from catastrophes without repeating work and wasting resources 🤔 If users have to write 100% correct Prefect flows ahead of time (a single wrong attribute or trigger can make the entire thing fail after tons of time & resources have been spent running the flow), then this just pushes the negative engineering problem one level "upwards", from the business logic to the orchestrating code, without eliminating it: instead of making sure my own code can deal with failures, I'm making sure Prefect can deal with failures, since a single misconfiguration could spell disaster (in this case, I only managed to salvage it with some "hacking", i.e. artificially manipulating checkpointed data on S3).
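Tying back to the first sub-bullet above: one way to avoid the empty-input failure mode in Prefect 1.x is to end the child run as Skipped explicitly via the SKIP signal, which then combines with the skip_on_upstream_skip and trigger settings sketched earlier in the thread. The task name and guard below are illustrative:

from prefect import task
from prefect.engine.signals import SKIP

@task
def run_shell_step(csv_chunk):
    # guard: finish this child run as Skipped instead of crashing on an empty chunk
    if not csv_chunk:
        raise SKIP("empty input chunk - nothing to process")
    # placeholder for the real work: run the shell script and read its output file
    return f"processed {len(csv_chunk)} lines"

Whether the downstream task then runs or is skipped depends on its skip_on_upstream_skip and trigger settings, as in the earlier sketch.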
Anna Geller
07/15/2022, 1:28 AM