j
Hello! I am using `run_deployment` to run sub-flows on their own pod/node when the parent flow is deployed on my Kubernetes work pool (in EKS). However, I'm noticing that when the cluster scales up to schedule these new pods/nodes, Prefect thinks the sub-flows crashed (even though they eventually start and complete successfully), causing the parent flow to move on when it should be blocked. I'm just looking for a solid way for sub-flows to be scheduled on their own pods/nodes (due to the heavy resources they require). Is there a preferred way to do this? I've scoured the docs/discourse/FAQ to no avail. Here's a torn-down version of a parent flow I'm working on:
import asyncio

from prefect import flow
from prefect.deployments import run_deployment


@flow(log_prints=True)
async def my_flow(config_path: str, local: bool = False):
    my_things = [ . . . ]

    if local:
        # local invocation -> process things sequentially
        for thing in my_things:
            await my_subflow(config_path, thing.id)
    else:
        # deployed invocation -> each subflow gets its own pod
        await asyncio.gather(
            *[
                run_deployment(
                    name="my-deployed-subflow",
                    parameters={"config_path": config_path, "id": thing.id},
                )
                for thing in my_things
            ]
        )
n
interesting, I think you have chosen a decent approach here
However, I'm noticing when the cluster scales up to support the scheduling of these new pods/nodes, prefect thinks these sub-flows crashed (when eventually they start and complete successfully) causing the parent flow to move on when it should be blocked.
it sounds like something is amiss here, whether in the Kubernetes worker implementation or in how resources are allocated in your cluster. Do you have any more info you can offer about what you're seeing?
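one thing i'd add: since `run_deployment` blocks by default until the child run reaches a terminal state and hands back a FlowRun you can inspect, you can check each state after the gather and refuse to proceed if anything isn't Completed. A rough sketch of that pattern (with a hypothetical stub in place of the real `run_deployment`, so the names here are illustrative, not the Prefect API):

```python
import asyncio

# Hypothetical stand-in for prefect.deployments.run_deployment, which (by
# default) waits for the created flow run to finish and returns a FlowRun
# whose state you can inspect.
async def fake_run_deployment(name: str, parameters: dict) -> dict:
    await asyncio.sleep(0)  # pretend we waited on the remote run
    return {"name": name, "parameters": parameters, "state": "COMPLETED"}

async def fan_out(config_path: str, ids: list) -> list:
    runs = await asyncio.gather(
        *[
            fake_run_deployment(
                name="my-deployed-subflow",
                parameters={"config_path": config_path, "id": i},
            )
            for i in ids
        ]
    )
    # Fail loudly if any child did not complete, instead of silently
    # letting the parent move on past a "Crashed" sub-flow.
    failed = [r for r in runs if r["state"] != "COMPLETED"]
    if failed:
        raise RuntimeError(f"{len(failed)} sub-flow(s) did not complete")
    return runs

if __name__ == "__main__":
    runs = asyncio.run(fan_out("config.yaml", [1, 2, 3]))
    print(len(runs), "sub-flows completed")  # prints "3 sub-flows completed"
```

that won't fix the Crashed state itself, but it at least stops the parent from racing ahead of its children.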
j
Absolutely! We are using a Prefect Kubernetes work pool that I just created using the Prefect UI (we host our own Prefect server, Prefect Postgres DB, and a Prefect worker that talks to that work pool in the cluster as well). It seems this "crashing" behavior isn't particular to sub-flows: even when I start a flow run from a deployment (using the UI), if the cluster scales up, the flow run's initial state is "Crashed", but it moves on to Running shortly after. I am also setting the k8s resource requests/limits on the work pool in such a way that new pods get assigned to new nodes.
Here's what the logs look like up until the flow starts running normally
I can also post the manifests for the Prefect server and Prefect worker if that's helpful...
n
hrm this looks somewhat suspicious - have you explored how this might relate?
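for context, one knob that could be at play here: if i remember right, the Kubernetes worker only watches for the pod to start for a configurable window (`pod_watch_timeout_seconds` in the work pool's job variables, 60s by default), so a scale-up that takes longer than that can get the run reported as Crashed even though the pod eventually starts. You could try raising it in the work pool's base job template, something like this sketch:

```json
{
  "pod_watch_timeout_seconds": 600
}
```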
j
As far as I know, that event fires when there isn't a viable place to put the pod. I was under the impression that this was a prerequisite event for the cluster to scale up.