# prefect-community
r
First time on here with an utterly Prefect-naive question. I have an overwrought, buggy 'workflow' framework (called gendarme, never released publicly) that I developed to assist my computational biology research -- I started it about 15 years ago without knowledge of any workflow system other than Make. One of my needs was (and is) to fully define the output of every task in terms of the specification values used by the task itself and the specifications of all upstream tasks. By specifications, I mean parameters or variables that, along with the code, fully define the output -- i.e., not all parameters to a function are defining specifications (of course, intentional or unintentional stochasticity creates caveats to the notion of being fully defined). A single task tends to be a long-running, computationally expensive, often memory-intensive, perhaps parallelized, fairly complex procedure, often defined by multiple classes. I might have several dozen or more of these tasks to approach a meaningful goal, but the final goal has usually been a moving goalpost, as yet another layer of analysis seems like a good idea at the time. A full DAG may well be years in the building up of analyses -- though by that point tasks (nodes; I called them cells) have likely been swapped out.

Unlike Make (and, as far as I can tell, Luigi and similar workflows), the gendarme DAG is discovered at runtime rather than predefined. That was convenient, but also, because the DAG is defined over instantiated tasks defined by their parameters, I never found much point in predefining the DAG externally. I am at a major decision point of either doing a complete overhaul of gendarme or abandoning it and adopting some other workflow system and modifying it to my needs. I think those needs at this point distill to the ability to store intermediate results (they may have been final results until a downstream task was written that requires them) with the specification metadata that fully defines them, so that they can be retrieved when needed and not rebuilt without need. Time completed might be one of many specs, but typically a task might have a half dozen critical specifications that would need to match the specification state of the calling downstream task in order for the existing output to be congruent and useful. Part of the challenge has been that the downstream task knows its own specs, but it doesn't know the subset of specs that define its upstream dependencies. One can't simply query the entire spec space, as the DAG may have hundreds of parameters, but only those used in the requested subgraph are relevant to defining the outputs. Gendarme passes the spec space upstream to search for existing output (or build it anew if it doesn't exist), then back downstream to the calling task to capture the full description of each output, which is then stored as metadata in an external DB for fast querying. I will need to do something like that again, both for efficiency and from a scientific-integrity position of fully knowing what one's results represent (part of this is the ability to truly isolate training-data influences on ever-receding final test sets).

From a high-level perspective, each task (gendarme calls them cells) calls other cells when required. If there is a breadth-first part of the DAG, one can explicitly run those cells concurrently for efficiency. Otherwise the graph runs as it is discovered, depth first. Sorry for the long-winded request.
If that made any sense, my question to you is -- should I pursue Prefect for these needs? Clearly I have a fairly extreme use case on the side of complex and slow. My datasets have been big historically, but not by today's standards -- it is the complexity and the need for incremental building of fully defined outputs that appear to be more unusual. Prefect looks inspired regardless, and eventually I'll want to learn more. But I would rather not get side-tracked at this moment studying its guts if there is a clear logical incompatibility.
j
Hi @Richard, welcome to Prefect! Certainly the most important thing in selecting (or building!) a workflow system is ensuring its philosophy aligns with your way of thinking. Based on your description, Prefect will match some but not all of your expectations, so I want to try to outline what those are:
The nature of your tasks (long running, dependent on inputs and producing outputs, computationally expensive) is no issue. Prefect will run any function and is only limited by the compute resources available. The discovery of the DAG at runtime, however, may be an issue. Prefect expects the DAG topology to be defined ahead of time, then executed. During execution, there is the ability to add certain dynamic elements like parallelism via `map` operators, but all tasks (and their dependencies) are expected to be defined in advance of execution. However, when you say "the DAG is defined over instantiated tasks defined by their parameters", that is how a Prefect DAG is built, so perhaps there's a nuance here that I'm missing.
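To make that concrete, a minimal sketch of that idiom (toy task names, not your actual workload) looks something like this -- the topology is declared up front, and `map` adds fan-out over whatever inputs exist at run time:

```python
from prefect import task, Flow

@task
def process(sample):
    # stand-in for an expensive per-sample computation
    return sample * 2

@task
def summarize(results):
    return sum(results)

# The task/dependency structure is fixed before execution;
# map() fans out over however many inputs are supplied.
with Flow("example") as flow:
    processed = process.map([1, 2, 3])
    total = summarize(processed)

state = flow.run()
```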
If the dynamic discovery of downstream tasks is important, then we might encourage you to take a look at Dask, which is the task scheduling engine we recommend running Prefect on top of.
Prefect adds a higher-level API that can be used on top of Dask, but by accessing Dask directly you could do things like generate downstream tasks while the DAG is running. You would, however, need to add caching mechanisms and other things like that yourself. Since you are already considering a custom framework, I'd recommend Dask as the underlying engine.
For example, the Dask delayed interface could be a good place to start.
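For instance, a toy sketch with `dask.delayed` (placeholder functions, not your actual tasks) might look like this -- the graph is assembled lazily, and you can keep attaching downstream nodes before anything runs:

```python
from dask import delayed

@delayed
def build_y(a, b, c):
    # placeholder for an expensive upstream task
    return a + b + c

@delayed
def build_z(y_out, x, y):
    # placeholder for a downstream task consuming Y's output
    return (y_out, x, y)

# Nothing executes yet; a task graph is built up, and further
# downstream nodes can be attached before compute() is called.
y_node = build_y(a=99, b=1, c=2)
z_node = build_z(y_node, x=11, y=22)
print(z_node.compute())
```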
r
Thanks Jeremiah -- I appreciate the pointers. I have just started looking into Dask as well and will continue with that. Yes, my description wasn't great -- gendarme's DAG is defined at runtime and thus over 'instantiated' outputs. So for instance (no pun intended), if the 'external' predefined graph looked like Y(a,b,c) -> Z(x,y,a) (i.e., I have some task Z that uses variables x, y, a to define its output, and it needs the output of Y, which uses a, b, c to define its output), then the DAG would be created when Z with, say, x=11, y=22, a=99 calls Y (the downstream variables are handled by the framework to avoid parameter explosion), which gets b & c from some config file as 1 & 2; the edge is then created between Y(a=99, b=1, c=2) -> Z(x=11, y=22, a=99, b=1, c=2), or some such thing. Then if some new downstream T also needs Y, and its calling variables / config state, etc. match (a=99, b=1, c=2), it can just retrieve the existing output (and another edge of the DAG is created, Y -> T). Otherwise Y is built according to T's specs, and a new node Y (according to Y and T) is added to the DAG. That's kind of what I mean by the DAG describing runtime-instantiated dependencies rather than a predefined external graph over non-instantiated nodes.
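In rough pseudocode (made-up names, not gendarme's actual internals), the lookup I'm describing is something like:

```python
# An output is identified by the task name plus the subset of specs
# that define it; reuse it if it exists, otherwise build it.
results = {}  # stand-in for the external metadata/result store

def get_or_build(name, specs, build_fn):
    key = (name, tuple(sorted(specs.items())))
    if key not in results:
        results[key] = build_fn(**specs)
    return results[key]

def build_Y(a, b, c):
    return f"Y built with a={a}, b={b}, c={c}"

# Z(x=11, y=22, a=99) asks for Y; b and c come from config as 1 and 2.
y_for_z = get_or_build("Y", {"a": 99, "b": 1, "c": 2}, build_Y)

# Later, T needs Y with the same defining specs, so the existing output
# is retrieved rather than rebuilt, and only the edge Y -> T is recorded.
y_for_t = get_or_build("Y", {"a": 99, "b": 1, "c": 2}, build_Y)
assert y_for_t is y_for_z
```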
At any rate, even if I want to try to use Prefect for this in some way, it looks like grokking Dask is a necessary prerequisite. Thanks again.
j
Thank you, I think I better understand now. This is an interesting approach that might be possible in Prefect by manipulating the built-in caching mechanisms, but my fear is that the idiom would be stretched too far for the framework to effectively ensure all of your goals. I think implementing custom caching logic over Dask might be the best place to start.
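For reference, the built-in caching I'm referring to looks roughly like this (the `cache_for` / `cache_validator` knobs on a task, shown with a toy function) -- the stretch would be expressing your full upstream spec-congruence through these rather than just a task's own inputs:

```python
from datetime import timedelta
from prefect import task, Flow
from prefect.engine.cache_validators import all_inputs

# Reuse a task's output for a window of time as long as its own
# inputs match on a later run.
@task(cache_for=timedelta(days=7), cache_validator=all_inputs)
def build_y(a, b, c):
    return a + b + c  # placeholder for an expensive computation

with Flow("cached-example") as flow:
    y_out = build_y(99, 1, 2)

flow.run()
```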
d
This might not be directly relevant, but maybe it is. A couple of folks at my company are volunteering to try to make a simple text-based way of specifying a DAG that Dask can then run. We are doing this to help some computational biologists run the kinds of workflows you are talking about more easily on top of Dask and/or HPC systems for COVID-19 analyses. https://github.com/Quansight-Labs/dask-jobqueue/issues/1