j
Hello! I am using `run_deployment` to run sub-flows on their own pod/node when the parent flow is deployed on my Kubernetes work pool (in EKS). However, I'm noticing that when the cluster scales up to schedule these new pods/nodes, Prefect thinks the sub-flows crashed (even though they eventually start and complete successfully), causing the parent flow to move on when it should be blocked. I'm just looking for a solid way for sub-flows to be scheduled on their own pods/nodes (due to the heavy resources they require). Is there a preferred way to do this? I've scoured the docs/discourse/FAQ to no avail. Here's a torn-down version of a parent flow I'm working on:
import asyncio

from prefect import flow
from prefect.deployments import run_deployment


@flow(log_prints=True)
async def my_flow(config_path: str, local: bool = False):
    my_things = [ . . . ]

    if local:
        # local invocation -> process things sequentially
        for thing in my_things:
            await my_subflow(config_path, thing.id)
    else:
        # deployed invocation -> each subflow gets its own pod
        await asyncio.gather(
            *[
                run_deployment(
                    name="my-deployed-subflow",
                    parameters={"config_path": config_path, "id": thing.id},
                )
                for thing in my_things
            ]
        )
n
interesting, I think you have chosen a decent approach here
However, I'm noticing when the cluster scales up to support the scheduling of these new pods/nodes, prefect thinks these sub-flows crashed (when eventually they start and complete successfully) causing the parent flow to move on when it should be blocked.
it sounds like something is amiss here, whether in the Kubernetes worker implementation or in how resources are allocated in your cluster. Do you have any more info you can offer about what you're seeing?
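one thing i'd add: since `run_deployment` blocks by default until the child run reaches a terminal state and hands back a FlowRun you can inspect, you can check each state after the gather and refuse to proceed if anything isn't Completed. A rough sketch of that pattern (with a hypothetical stub in place of the real `run_deployment`, so the names here are illustrative, not the Prefect API):

```python
import asyncio

# Hypothetical stand-in for prefect.deployments.run_deployment, which (by
# default) waits for the created flow run to finish and returns a FlowRun
# whose state you can inspect.
async def fake_run_deployment(name: str, parameters: dict) -> dict:
    await asyncio.sleep(0)  # pretend we waited on the remote run
    return {"name": name, "parameters": parameters, "state": "COMPLETED"}

async def fan_out(config_path: str, ids: list) -> list:
    runs = await asyncio.gather(
        *[
            fake_run_deployment(
                name="my-deployed-subflow",
                parameters={"config_path": config_path, "id": i},
            )
            for i in ids
        ]
    )
    # Fail loudly if any child did not complete, instead of silently
    # letting the parent move on past a "Crashed" sub-flow.
    failed = [r for r in runs if r["state"] != "COMPLETED"]
    if failed:
        raise RuntimeError(f"{len(failed)} sub-flow(s) did not complete")
    return runs

if __name__ == "__main__":
    runs = asyncio.run(fan_out("config.yaml", [1, 2, 3]))
    print(len(runs), "sub-flows completed")  # prints "3 sub-flows completed"
```

that won't fix the Crashed state itself, but it at least stops the parent from racing ahead of its children.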
j
Absolutely! We are using a Prefect Kubernetes work pool that I just created using the Prefect UI (we host our own Prefect server, Prefect Postgres DB, and a Prefect worker that talks to that work pool in the cluster as well). It seems this "crashing" behavior isn't particular to sub-flows: even when I start a flow run from a deployment (using the UI), if the cluster scales up, the flow run's initial state is "Crashed", but it moves on to Running shortly after. I am also setting the k8s resource requests/limits on the work pool in such a way that new pods get assigned to new nodes.
Here's what the logs look like up until the flow starts running normally
I can also post the manifests for the Prefect server and Prefect worker if that's helpful...
n
hrm this looks somewhat suspicious - have you explored how this might relate?
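for context, one knob that could be at play here: if i remember right, the Kubernetes worker only watches for the pod to start for a configurable window (`pod_watch_timeout_seconds` in the work pool's job variables, 60s by default), so a scale-up that takes longer than that can get the run reported as Crashed even though the pod eventually starts. You could try raising it in the work pool's base job template, something like this sketch:

```json
{
  "pod_watch_timeout_seconds": 600
}
```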
j
As far as I know, that event fires when there isn't a viable place to put the pod. I was under the impression that this was a prerequisite event for the cluster to scale up.