Arsenii

04/16/2020, 4:32 AM
Hi all! I've been trying to write a "dynamic" flow and wondered if anyone has any comments. The flow downloads a list of objects A and then maps 5 tasks over that list. Pretty straightforward, until the part where objects A include information about what other objects A' (A dash) they depend on... And those A' objects have to be processed and mapped over the same tasks as A, before A. This can go several layers deep, with A->A'->A'' dependencies that need to be taken care of dynamically. The most naive solution is just to insert the dependencies at the beginning of the original list, making a kind of priority queue, and map the tasks over it. However, this would not work with a DaskExecutor, since everything runs in parallel. What I guess I need here is "sub-flows" that can be mapped over a list of lists. There seems to be some discussion of this at https://github.com/PrefectHQ/prefect/issues/1745 , but since it's far from release, do y'all think something similar can be hacked together now? Thanks!!

Jeremiah

04/16/2020, 11:52 AM
Hi @Arsenii, based on what you’ve described I would recommend adopting Prefect in the following way:
1. A first task returns a list of objects A.
2. A second task maps over that list and, for each item in A, retrieves and returns all other items A, A', A'', ... that are required to resolve it. This could be done recursively or with Prefect’s LOOP operator. The result of this map operation is a list of lists, where each sub-list contains all dependencies for one of the original objects.
3. A third task maps over the output of the second task. Its input is a list of [A, A', A'', ...] dependencies and it returns a final result.
4. The output of the third task is therefore a list of processed items, including data from dynamically discovered dependencies.
In general, Prefect will work best when you know the logical graph structure in advance. Therefore, a broad strategy for effective flows with dynamic dependencies is to include tasks in your graph that load the dynamic data, but which allow you to know the graph structure ahead of time.
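The four steps above can be sketched roughly as follows. This is plain Python rather than Prefect code: a thread pool stands in for Prefect's task mapping, and the names `DEPS`, `list_objects`, `resolve`, and `process_chain` are hypothetical, invented only for illustration. In a real flow each function would presumably become a task and each `pool.map` call a mapped task.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical dependency table; a real flow would read this from the
# downloaded objects themselves.
DEPS = {"A": ["A'"], "A'": ["A''"], "A''": [], "B": ["B'"], "B'": []}


def list_objects():
    """Step 1: return the top-level objects."""
    return ["A", "B"]


def resolve(obj):
    """Step 2: return obj followed by everything it transitively depends on."""
    chain, stack, seen = [], [obj], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        chain.append(current)
        # Push dependencies so the first one listed is visited first.
        stack.extend(reversed(DEPS[current]))
    return chain


def process_chain(chain):
    """Step 3: process one sub-list, deepest dependency first.

    Reversing the chain is only a valid order for chain- or tree-shaped
    dependencies, which is what this sketch assumes.
    """
    return [f"processed {name}" for name in reversed(chain)]


def run():
    objects = list_objects()
    with ThreadPoolExecutor() as pool:
        chains = list(pool.map(resolve, objects))      # list of lists
        return list(pool.map(process_chain, chains))   # step 4: final results


results = run()
```

The graph shape (list, map, map) is fixed ahead of time; only the contents of each sub-list are discovered at run time.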

Arsenii

04/17/2020, 2:01 AM
Hi @Jeremiah, it seems like your approach is what I meant by a "priority queue", but there's an important caveat: A' must be processed before A (and A'' before A'). Basically I need to create a tree of dependencies and traverse it in a depth-first postorder fashion. The "one list with all objects" approach does not play well with parallelism, since I need to complete some tasks (always the same) on one level before going to the next... And that's where a sub-flow with those tasks, or a task-looping mechanism, would have helped.
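For reference, a depth-first postorder traversal of a dependency tree can be sketched as below. The function name `postorder` and the dictionary layout are assumptions made for illustration, and the sketch assumes the dependencies contain no cycles.

```python
def postorder(root, deps):
    """Return root and its transitive dependencies, dependencies first.

    `deps` maps an object name to the names it depends on (hypothetical
    layout). Cycles are not detected in this sketch.
    """
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in deps.get(node, []):
            visit(dep)          # descend into dependencies first
        order.append(node)      # emit the node only after all its dependencies

    visit(root)
    return order


chain = postorder("A", {"A": ["A'"], "A'": ["A''"]})
shared = postorder("A", {"A": ["C", "B"], "B": ["C"]})
```

Because a node is appended only after everything it depends on, the resulting list can be processed front to back, and a dependency shared by several objects appears once.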

Jeremiah

04/17/2020, 3:35 AM
My suggestion is slightly different from your priority queue, since it loads all dependencies into a list of lists (the output of the second task), then processes each sub-list in parallel. For example, the first element of the list is [A, A', A''] and the second element is [B, B', B''], and you could process those in any order, dynamically, by mapping over the parent list. However, your primary motivation of depth-first runtime dependency discovery is one that we don’t have first-class support for; I suspect that in order to use Prefect you may have to move some dependency resolution logic into your tasks themselves.
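One possible reading of "moving dependency resolution into the tasks themselves" is sketched below, again in plain Python with a thread pool standing in for a mapped task. Each unit of work resolves and processes its own dependencies in order, so ordering is enforced inside the unit while separate top-level objects still run in parallel. `DEPS`, `STEPS`, `download`, `store`, and `process_with_dependencies` are all hypothetical names, and the two steps are stand-ins for the five tasks mentioned earlier.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical dependency table for two independent top-level objects.
DEPS = {"A": ["A'"], "A'": ["A''"], "B": ["B'"]}


def download(name):
    return name + ":downloaded"


def store(value):
    return value + ":stored"


# Stand-ins for the same sequence of tasks applied to every object.
STEPS = [download, store]


def process_with_dependencies(root):
    """Process root's dependencies depth-first, then root itself."""
    done = {}

    def visit(name):
        if name in done:
            return
        for dep in DEPS.get(name, []):
            visit(dep)              # dependencies are fully processed first
        value = name
        for step in STEPS:
            value = step(value)
        done[name] = value

    visit(root)
    return done


with ThreadPoolExecutor() as pool:
    # Top-level objects are independent, so they can run in parallel.
    results = list(pool.map(process_with_dependencies, ["A", "B"]))
```

The trade-off is that the per-object work is opaque to the scheduler: parallelism exists only across top-level objects, not across levels within one dependency tree.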