Sergey Gerasimov

12/08/2020, 6:58 PM
Hi guys, I have a task I'd like to automate. There are a lot of CREATE TABLE SQL scripts with dependencies (in FROM / JOIN) between the tables to be created and other long-living tables. These scripts are partially reused from project to project: in one project we use one part of them, in another project another part. In each project we have some existing table that is used as input for these scripts (in the FROM / JOIN part), and this codebase keeps growing. It looks like this codebase could be represented automatically (via a SQL parser) as a DAG, and in each project we could work with a sub-DAG parametrized by the initial table. Is it a good idea to use prefect as the backend for modeling and executing such DAGs? We have already implemented a prototype on Makefiles and actively use GNU make's ability to track dependencies between a script's modification time and the table creation time. This makes it possible to do only the necessary work when some SQL scripts are modified. As I understand it, such a time-based execution model is not implemented in prefect. Am I right? Is it a good idea to implement it manually? Does prefect have any support for Oracle database for results? Or maybe I should look at other solutions?
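To make the idea of deriving a DAG from the scripts concrete, here is a minimal sketch using only the Python standard library (`graphlib` needs Python 3.9+). It is not a real SQL parser: it assumes a naive regex that picks up the identifier after FROM or JOIN, and the script names, table names, and helper functions (`build_graph`, `sub_dag`, `build_order`) are all hypothetical.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical scripts: name of the created table -> SQL text.
SCRIPTS = {
    "orders_clean": "CREATE TABLE orders_clean AS SELECT * FROM raw_orders",
    "customers_clean": "CREATE TABLE customers_clean AS SELECT * FROM raw_customers",
    "report": (
        "CREATE TABLE report AS SELECT * FROM orders_clean o "
        "JOIN customers_clean c ON o.customer_id = c.id"
    ),
}

# Naive assumption: a source table is the identifier right after FROM or JOIN.
# A real SQL parser would be needed for subqueries, CTEs, quoting, etc.
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def sources(sql):
    """Return the set of table names the script reads from."""
    return {name.lower() for name in SOURCE_RE.findall(sql)}


def build_graph(scripts):
    """Map each created table to the tables it depends on."""
    return {table: sources(sql) for table, sql in scripts.items()}


def sub_dag(graph, target):
    """Keep only the target and everything it transitively depends on."""
    keep, stack = {}, [target]
    while stack:
        node = stack.pop()
        if node in keep:
            continue
        keep[node] = graph.get(node, set())
        stack.extend(keep[node])
    return keep


def build_order(graph, scripts):
    """Topological order, restricted to tables that have a script."""
    return [t for t in TopologicalSorter(graph).static_order() if t in scripts]
```

For example, `build_order(sub_dag(build_graph(SCRIPTS), "report"), SCRIPTS)` lists the two `*_clean` tables before `report`; each entry could then become one task in whichever workflow tool executes the DAG.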
Jim Crist-Harif

12/08/2020, 8:11 PM
Hi Sergey, I'm not sure I understand what you're asking, but a few points:
• Prefect works fine with make-like execution results, where tasks are only recomputed as needed. See https://docs.prefect.io/core/concepts/persistence.html, in particular around task targets. This would only detect if the expected output isn't present (and would skip the task if it already exists); it wouldn't check whether the code needed to produce that output changed (you'd need to manage that on your own).
• Prefect doesn't have a DB-based result type (we could add one), but I'm not sure it makes sense. If you want prefect to check whether a DB table/row exists and do something if it doesn't, you'd need to implement that inside your prefect task, rather than relying on prefect `Result` types to manage that for you.
The general answer is: prefect is a general workflow DAG system. We have primitives around tasks, collections of tasks (flows), and task outputs (results), so anything you can do in one workflow tool should be expressible in prefect. You might have to build some custom functionality to detect changes/skip tasks how you want, but there's nothing about prefect that should make this impossible.
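The "check inside the task" approach could look roughly like the sketch below. It is plain Python rather than Prefect-specific code, and it uses the standard library's `sqlite3` as a stand-in for Oracle (an assumption made only so the example runs anywhere; on Oracle the existence check would query the data dictionary instead of `sqlite_master`). The function names are hypothetical; the body of `create_if_missing` is what would go inside a task.

```python
import sqlite3


def table_exists(conn, table):
    """Return True if the table is present (SQLite-specific catalog query)."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row is not None


def create_if_missing(conn, table, ddl):
    """Run the DDL only when the table is absent; return True if it ran."""
    if table_exists(conn, table):
        return False  # output already there, nothing to do
    conn.execute(ddl)
    conn.commit()
    return True


# In-memory database as a stand-in for the real one.
conn = sqlite3.connect(":memory:")
ddl = "CREATE TABLE report (id INTEGER)"
first = create_if_missing(conn, "report", ddl)   # table is created
second = create_if_missing(conn, "report", ddl)  # skipped, table exists
```

Like task targets, this only detects a missing output; it does not notice that the script producing the table has changed.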
Sergey Gerasimov

12/10/2020, 11:54 AM
Hi Jim, thanks a lot for the reply! I wonder why most (but surely not all) DAG projects try to be general and do not add specific (or perhaps not specific, just less common) useful features such as advanced dependency tracking like in GNU make, or even content-based hashing of inputs or code. I now see a lot of very nice projects like prefect.io, but they provide more or less the same functionality. I think more specific but useful features like the ones I describe could be an advantage over competitors. I saw some discussion in kedro: https://discourse.kedro.community/t/speeding-up-pipeline-processing-with-change-detection/90 But it is just a discussion for now. Anyway, I really like your general model but also vote for working on advanced functionality 😉
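As an illustration of what content-based hashing of inputs and code could mean here, the following is a minimal sketch, not a feature of prefect or kedro. It assumes each table gets a fingerprint built from its script text plus the fingerprints of its upstream tables, and that the last built fingerprint per table is kept in a plain dict (how that dict is persisted is left open). All names are hypothetical.

```python
import hashlib


def fingerprint(sql, upstream):
    """Hash the script text together with the fingerprints of its inputs."""
    h = hashlib.sha256()
    h.update(sql.encode("utf-8"))
    # Sort so the result does not depend on the order of the inputs.
    for fp in sorted(upstream):
        h.update(fp.encode("utf-8"))
    return h.hexdigest()


def needs_rebuild(state, table, fp):
    """A table is stale if it was never built or its fingerprint changed."""
    return state.get(table) != fp


# state maps table name -> fingerprint recorded at the last successful build.
state = {}
fp_v1 = fingerprint("CREATE TABLE a AS SELECT 1 AS x", [])
stale_before = needs_rebuild(state, "a", fp_v1)  # never built
state["a"] = fp_v1                               # record after building
stale_after = needs_rebuild(state, "a", fp_v1)   # unchanged script

# Editing the script changes the fingerprint, so the table is stale again.
fp_v2 = fingerprint("CREATE TABLE a AS SELECT 2 AS x", [])
stale_edited = needs_rebuild(state, "a", fp_v2)
```

Because upstream fingerprints feed into downstream ones, a change to one script marks every table that depends on it as stale, which is roughly what make achieves with modification times.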