# prefect-community
j
hi there! i'm trying to understand what is happening with subflows, infrastructure-wise. in particular, where they're going to be executing. more detail in thread
let's say this is the scenario:
import asyncio
from prefect import flow

@flow
async def my_subflow(num):
    intense_calculation(num)

@flow
async def my_parent_flow():
    subflows = [my_subflow(i) for i in range(100000)]
    await asyncio.gather(*subflows)
so `my_parent_flow` is triggered on instance_1. does this mean each subflow is executed concurrently on the same instance (instance_1)? or, does each subflow execute on instance_2...instance_n as those instances become available?
if it's the former, doesn't that mean the memory requirements of the sole instance running the parent flow might need to be particularly ginormous? if it's the latter (this is ideal), how do the subflows actually get triggered? do subflow runs behind the scenes make a call to the orion server and get scheduled via a work queue? how do parameters get passed in that case? or, say i have an EKS cluster: do they somehow bypass being scheduled via orion and get passed around internally?
for context, our parent flow fans out a bit (with a bunch of subflows, which possibly call other subflows). it would be...not great...if it all was executing on the one instance that happened to be the first one that picked up the flow run off the work queue
n
Hey @Jai P - at the moment, subflows will run on the same infrastructure as the parent `@flow` that invokes them. Here's a similar discussion where I show an example of how we can currently orchestrate instances of flow runs from deployments on their own infra (using the client); the caveat is that there's no explicit link between parent and children (however, you could prefix child flow run names with the parent name if that were helpful). We're in the process of expanding our support for parallel subflows on their own infra, so stay tuned!
j
hmmm, i think i understand, but lemme make sure i truly do get the current state of the world: let's say i have `my_parent_flow` kicked off in a container. that means all subflows are kicked off within the same container, unless i explicitly make the network call to kick off other flow runs (which would then be picked up by whatever containers are available to pick those up, possibly in the same k8s pod, possibly in a separate cluster, etc.)?
building on that: does that mean the task cache is per-instance, and not per-deployment? meaning if my parent flow kicks off in instance 1 and uses the above method to kick off subflows in instance_1 and instance_2 (both of which maybe rely on the same task that makes a network call), they're both going to end up making said network call?
n
1. Yes, this is correct; you can always `create_flow_run_from_deployment` in an async task with the `client` to trigger flows on other infra.
2. I'm not exactly sure what you mean by task cache, however if tasks in the flow you're kicking off are written to make a network call, then `instance_1` and `instance_2` would each make those network calls independently, according to the parameters you passed those flows and, in turn, to the task actually making the call (e.g. running a `get_api_endpoint` task in each flow with different payloads according to the params you passed `instance_1` and `instance_2`)
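As a minimal, runnable sketch of that fan-out pattern: the `create_flow_run_from_deployment` coroutine below is a stub standing in for the real Prefect client call (in real code you'd get a client and call its method of the same name); the deployment name and parameters here are made up for illustration.

```python
import asyncio

# Stub standing in for the real Prefect client call
# `client.create_flow_run_from_deployment(...)`. The real call POSTs to
# the Orion API and returns a FlowRun; here we just echo a fake run id
# so the sketch is runnable without a server.
async def create_flow_run_from_deployment(deployment_id, parameters):
    return f"flow-run-for-{deployment_id}-{parameters['num']}"

async def fan_out(deployment_id, count):
    # Kick off `count` flow runs concurrently. Each run is scheduled onto
    # whatever infrastructure the deployment specifies, so the parent's
    # instance only pays for the cheap API calls, not the work itself.
    coros = [
        create_flow_run_from_deployment(deployment_id, parameters={"num": i})
        for i in range(count)
    ]
    return await asyncio.gather(*coros)

run_ids = asyncio.run(fan_out("my-subflow-deployment", 3))
```

The parent only awaits the scheduling calls, which is why its memory footprint stays small even when the fan-out is large.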
j
ah so maybe i can clarify my example a little bit: let's say i have:
from prefect import flow, task
from prefect.tasks import task_input_hash

@task(cache_key_fn=task_input_hash)
def some_task(input):
    return f"some_task_{input}"

@flow
def my_flow(params):
    some_task(params["input"])
and then i have one flow run of `my_flow` that executes on `instance_1`, and another flow run of `my_flow` 10 minutes later that happens to execute on `instance_2` with the exact same parameters. will `some_task` actually need to execute, or will the cached result from the earlier flow run be used? basically, where do cached results of a task get stored? in memory on the instance that ran the task, or somewhere separate so that any instance can use the result of a previous run of that task?
n
ahh yeah, that's a good question. at the moment your task results can only be cached locally (with respect to the machine calling the task); other task-cache storage options will be available down the road, though I couldn't give a timeline. so if you ran `some_task` on the same machine multiple times, it'd be cached after the first invocation and available to pull for subsequent calls on the same machine with the same params. however, if you called `some_task` on a different machine with the same params, you can't currently pull results from that previous invocation on the first machine
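To make the distinction concrete, here's a toy model (not Prefect's actual implementation) of why both machines compute the identical cache key yet only the first gets a hit: the key is a deterministic hash of the task's inputs, while the result store is modeled as a per-machine dict.

```python
import hashlib
import json

# Toy stand-in for a cache_key_fn like task_input_hash: a deterministic
# hash of the inputs, so any machine computes the same key.
def toy_input_hash(params):
    blob = json.dumps(params, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

# Per-machine result stores: these dicts model the *local* caches, which
# is why instance_2 misses even though it computes the identical key.
local_cache_instance_1 = {}
local_cache_instance_2 = {}

def run_some_task(local_cache, params):
    key = toy_input_hash(params)
    if key in local_cache:
        return local_cache[key], True   # cache hit
    result = f"some_task_{params['input']}"
    local_cache[key] = result
    return result, False                # cache miss

# instance_1 runs twice: the second call hits its local cache.
_, hit_a = run_some_task(local_cache_instance_1, {"input": 42})
_, hit_b = run_some_task(local_cache_instance_1, {"input": 42})
# instance_2 with identical params still misses: the cached result
# lives only on instance_1.
_, hit_c = run_some_task(local_cache_instance_2, {"input": 42})
```

Swapping the per-machine dicts for a shared store (e.g. object storage) is essentially what cross-machine result caching would require.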
j
👍 got it. will keep an eye out for any movement on that front, and in the meantime i think we can likely set up our use case accordingly. thanks!