First time on here, with an utterly Prefect-naive question. I have an overwrought, buggy 'workflow' framework (called gendarme, never released publicly) that I developed to support my computational biology research -- I started it about 15 years ago without knowledge of any workflow system other than Make. One of my needs was (and is) to fully define the output of every task in terms of the specification values used by the task itself, plus the specifications of all upstream tasks. By specifications I mean the parameters or variables that, together with the code, fully define the output -- i.e. not every parameter to a function is a defining specification (of course intentional or unintentional stochasticity creates caveats to the notion of being fully defined). A single task tends to be a long-running, computationally expensive, often memory-intensive, sometimes parallelized, fairly complex procedure, often defined by multiple classes. I might need several dozen or more of these tasks to reach a meaningful goal, but the final goal has usually been a moving goalpost, as yet another layer of analysis seems like a good idea at the time. A full DAG may well be years in the building, though by that point some tasks (nodes; I called them cells) will likely have been swapped out. Unlike Make (and, as far as I can tell, Luigi and similar systems), the gendarme DAG is discovered at runtime rather than predefined. That was convenient, but also, because the DAG is built over instantiated tasks identified by their parameters, I never found much point in predefining it externally.
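To make "defining specifications" concrete, here is a toy sketch (not gendarme's actual code; all names are made up for illustration): only a declared subset of a cell's parameters participates in the identity of its output, while operational parameters like thread counts are deliberately excluded.

```python
class Cell:
    # Names of the parameters that, together with the code, fully
    # define this cell's output. Other parameters (e.g. n_threads,
    # verbosity) are deliberately non-defining.
    defining_specs: tuple = ()

    def __init__(self, **params):
        self.params = params

    def own_spec(self) -> dict:
        # The subset of parameters that defines the output.
        return {k: self.params[k] for k in self.defining_specs}

class AlignReads(Cell):  # hypothetical example task
    defining_specs = ("genome_build", "aligner_version", "min_quality")

task = AlignReads(genome_build="GRCh38", aligner_version="2.1",
                  min_quality=30, n_threads=16)  # n_threads is non-defining
print(task.own_spec())
# {'genome_build': 'GRCh38', 'aligner_version': '2.1', 'min_quality': 30}
```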
I am at a major decision point: either do a complete overhaul of gendarme, or abandon it and adopt some other workflow system, modifying it to my needs. Those needs at this point distill to the ability to store intermediate results (they may have been final results until a downstream task was written that requires them) together with the specification metadata that fully defines them, so that they can be retrieved when needed rather than rebuilt unnecessarily. Time completed might be one of many specs, but typically a task has half a dozen critical specifications that must match the specification state of the calling downstream task for the existing output to be congruent and usable. Part of the challenge is that a downstream task knows its own specs, but not the subset of specs that defines its upstream dependencies. One can't simply query the entire spec space, since the DAG may have hundreds of parameters, yet only those used in the requested subgraph are relevant to defining the outputs. Gendarme therefore passes the spec space upstream to search for existing output (building anew if none exists), then back downstream to the calling task to capture the full description of each output, which is stored as metadata in an external db for fast querying. I will need something like that again, both for efficiency and as a matter of scientific integrity -- fully knowing what one's results represent (part of which is the ability to truly isolate training-data influences on ever-receding final test sets).
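The lookup I have in mind works roughly like this hedged sketch (again invented names, not gendarme's API): a cell's full specification is its own defining specs merged with those gathered from its upstream subgraph, canonicalized and hashed into a key that is checked against the external metadata store before anything is rebuilt.

```python
import hashlib
import json

def full_spec(own: dict, upstream_specs: list) -> dict:
    # Merge upstream specs (upstream-first), then this cell's own specs.
    merged: dict = {}
    for spec in upstream_specs:
        merged.update(spec)
    merged.update(own)
    return merged

def spec_key(spec: dict) -> str:
    # Canonical JSON so the same spec always hashes identically,
    # regardless of insertion order.
    blob = json.dumps(spec, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

metadata_db: dict = {}  # stand-in for the external metadata db

upstream = [{"genome_build": "GRCh38"}, {"min_quality": 30}]
spec = full_spec({"model": "glm", "alpha": 0.05}, upstream)
key = spec_key(spec)
if key not in metadata_db:
    # No congruent output exists: build anew and record where it lives.
    metadata_db[key] = {"spec": spec, "path": "/path/to/output"}  # placeholder path
```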
From a high-level perspective, each task (gendarme calls them cells) calls other cells as required. Where the DAG has a breadth-first portion, one can explicitly run those cells concurrently for efficiency; otherwise the graph runs depth-first as it is discovered.
Sorry for the long-winded request. If that made any sense, my question is: should I pursue Prefect for these needs? Clearly mine is a fairly extreme use case on the complex-and-slow end. My datasets have been big historically, though not by today's standards -- it is the complexity, and the need for incremental building of fully defined outputs, that seems more unusual.
Prefect looks inspired regardless, and eventually I'll want to learn more. But I would rather not get side-tracked right now studying its guts if there is a clear logical incompatibility.