Avi A

04/30/2020, 2:21 PM
on a different note, what’s the best practice for knowing that a task was already executed and to skip its execution if that is the case? e.g. suppose if a specific data partition was already loaded, skip the extract+load phases related to it
👀 1

Dylan

04/30/2020, 2:28 PM
Hi @Avi A, that depends on how you’re able to make that determination. We do support caching results: https://docs.prefect.io/core/concepts/persistence.html Since you’re storing everything on the local filesystem and your goal is to check whether a specific partition was loaded, I would have your task check your output files. If you follow a specific naming convention, your task can check whether a file already exists before it attempts the extract step. Then your task can raise a skip signal, which indicates that everything went as expected but there was no need for the task to run: https://docs.prefect.io/api/latest/engine/signals.html#skip
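A minimal sketch of the check-then-skip idea described above, kept to the standard library so it runs without Prefect installed. `SkipSignal` is only a stand-in for the skip signal linked above, and the `partition_path` naming convention, the `extract` body, and the CSV layout are assumptions made for illustration, not anything from this thread.

```python
import tempfile
from pathlib import Path


class SkipSignal(Exception):
    """Stand-in for Prefect's skip signal; NOT the real Prefect class."""


def partition_path(output_dir: Path, partition: str) -> Path:
    # Hypothetical naming convention: one output file per partition.
    return output_dir / f"partition={partition}.csv"


def extract(output_dir: Path, partition: str) -> Path:
    target = partition_path(output_dir, partition)
    if target.exists():
        # The output is already there: signal "nothing to do" rather than fail.
        raise SkipSignal(f"partition {partition} already loaded")
    # Placeholder for the real extract work.
    target.write_text("id,value\n1,42\n")
    return target


def run_demo() -> list:
    """Run extract twice for the same partition and record what happened."""
    outcomes = []
    with tempfile.TemporaryDirectory() as tmp:
        for _ in range(2):
            try:
                extract(Path(tmp), "2020-04-30")
                outcomes.append("ran")
            except SkipSignal:
                outcomes.append("skipped")
    return outcomes
```

Inside an actual Prefect task, the `raise SkipSignal(...)` line would presumably be replaced by raising the signal documented at the link above; the file check itself stays the same.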

Avi A

04/30/2020, 2:32 PM
Thanks. Caching sounds nice but I don’t need the whole data cached (since I persist it), only the indication that it went well… I can check in the load phase but since the extract and load tasks are separate and I’d like to keep them independent, it makes little sense for the extract task to check if it exists, which means that the extract phase will occur and only the load phase will be skipped. WDYT?
I used to work with Luigi. In Luigi it works quite similarly to how a Makefile works. You specify the end goal (in our case, partition loaded into DB) and work your way upstream to see which tasks need to run to fulfill it, and each task has a trigger that says whether it needs to run at all
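A rough sketch of the Makefile-style resolution described above: start from the end goal, skip anything whose output already exists, and otherwise build its upstream steps first. The `Step` class, `build`, and `demo` are hypothetical names for illustration only; this is neither Luigi's nor Prefect's API, and the in-memory `done` dict stands in for real checks against files or a database.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    name: str
    is_done: Callable[[], bool]   # "does my output already exist?"
    run: Callable[[], None]       # the actual work
    requires: List["Step"] = field(default_factory=list)


def build(step: Step, log: List[str]) -> None:
    """Start from the end goal and walk upstream, running only what is missing."""
    if step.is_done():
        # Goal already satisfied: nothing upstream needs to be examined.
        log.append(f"skip {step.name}")
        return
    for upstream in step.requires:
        build(upstream, log)
    step.run()
    log.append(f"run {step.name}")


def demo(already_loaded: bool) -> List[str]:
    # In-memory stand-in for "output exists" checks.
    done = {"extract": False, "load": already_loaded}
    extract = Step("extract", lambda: done["extract"],
                   lambda: done.update(extract=True))
    load = Step("load", lambda: done["load"],
                lambda: done.update(load=True), [extract])
    log: List[str] = []
    build(load, log)
    return log
```

Because the check starts at the load step, a partition that is already loaded never triggers the extract step, and the extract step does not need to know anything about its consumers.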

Dylan

04/30/2020, 2:39 PM
That all depends on the world your flow lives in. If you’re trying to keep your flow from putting too much strain on the source data systems, then maybe it makes sense to check in the extract phase. If preventing duplicate information in the destination data system is a higher priority, then maybe you can check somewhere else
Personally, I don’t think there’s anything wrong with adding the check in the extract task. In general, I support patterns where tasks verify that they need to run (or even having upstream tasks that only make sure downstream tasks need to execute)
👍 1
especially when you have several mapped tasks in succession. The earlier you skip, the faster your flow should execute
👍 1
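One way to picture "the earlier you skip, the faster your flow" is an upstream step that filters the work list before any mapped step sees it. The sketch below uses plain Python list comprehensions as stand-ins for mapped tasks; `pending_partitions`, `extract`, `load`, and `run_flow` are hypothetical names, and the `loaded` set stands in for whatever check (files, a DB query) tells you a partition is already done.

```python
from typing import Iterable, List, Set


def pending_partitions(requested: Iterable[str], loaded: Set[str]) -> List[str]:
    """Upstream check: keep only the partitions that still need work."""
    return [p for p in requested if p not in loaded]


def extract(partition: str) -> str:
    # Placeholder for the real extract.
    return f"rows for {partition}"


def load(rows: str) -> str:
    # Placeholder for the real load.
    return f"loaded {rows}"


def run_flow(requested: List[str], loaded: Set[str]) -> List[str]:
    todo = pending_partitions(requested, loaded)
    extracted = [extract(p) for p in todo]   # stands in for a mapped extract task
    return [load(r) for r in extracted]      # stands in for a mapped load task
```

With this shape, partitions that are already loaded never reach the extract step at all, so neither the source system nor the later mapped steps spend time on them.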

Avi A

04/30/2020, 2:48 PM
I agree, but with a caveat about the dependencies. Suppose I have a flow that takes data from one source and pours it into another. All good and that. Now imagine I want to add another destination for the data. Instead of just writing the new load task and wiring it to the output of the extraction task, I also need to modify the extraction task to check whether the outputs of both load processes exist
that’s not so good engineering-wise

Dylan

04/30/2020, 2:51 PM
Makes sense! 😄

Avi A

05/03/2020, 6:28 AM
I ended up just doing the extraction+load in the same task and skipping at the beginning of it. Not the best practice, but since there’s only one input and one output I guess it’s not that bad
@Dylan do you think this PIN is related to my problem? https://docs.prefect.io/core/PINs/PIN-16-Results-and-Targets.html