Verun Rahimtoola

    1 year ago
    hi, we're using prefect server as our backend, and using a local agent on our infrastructure to execute flows. we're noticing that for some flows, only the first level of the dag (i.e., the tasks with no upstream requirements) gets executed, and then nothing happens... any clues as to what might be happening?
    the flow run is launched using the prefect UI, and i see the local agent picking up the flow for execution. it executes the topmost tasks in the dag, and then the remaining tasks simply remain stuck in the 'pending' state!
    because the flow is already set to the 'submitted' state, it doesn't show up in any future queries to the graphql backend for flow runs
    and additionally, the flow runner seems to only do one loop through the topologically sorted tasks of the flow, and then returns the flow state as 'RUNNING'. i imagine there has to be some looping going on somewhere else for the flow run, to run the remaining (downstream) tasks... such as happens within flow._run(), for example
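    To illustrate the concern above, here is a toy model in plain Python. It is not Prefect's FlowRunner code; `single_pass`, the `synchronous` flag, and the four-task `dag` are hypothetical names made up for this sketch. It shows that one walk over the topologically sorted tasks finishes the whole DAG only if each task completes before the next one is examined. If tasks are merely started during the walk, everything past the first level stays Pending until something triggers another pass.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def single_pass(upstream, synchronous):
    """Walk the tasks once in topological order and return their states.

    upstream maps each task name to the list of tasks it depends on.
    A task is started only if all of its upstream tasks are 'Success'.
    If synchronous is True the task finishes immediately ('Success');
    otherwise it is only started ('Running') and the walk moves on.
    """
    order = list(TopologicalSorter(upstream).static_order())
    states = {task: "Pending" for task in order}
    for task in order:
        if all(states[u] == "Success" for u in upstream.get(task, ())):
            states[task] = "Success" if synchronous else "Running"
    return states


# Hypothetical DAG: a and b have no upstream tasks, c needs both, d needs c.
dag = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}

print(single_pass(dag, synchronous=True))
print(single_pass(dag, synchronous=False))
```

    With `synchronous=True` every task ends as Success. With `synchronous=False`, `a` and `b` are Running while `c` and `d` remain Pending, which matches the symptom reported here, though it does not by itself identify the cause.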
    Kyle Moon-Wright

    1 year ago
    Hey @Verun Rahimtoola, whenever I’ve encountered this, it has usually indicated one of two things: resource constraints in your execution environment (if you are running a large flow or processing a lot of data), or network constraints (if communication back to your GraphQL service is blocked). I would also check that you’re running the latest version of Prefect, just in case. Normally, the Lazarus process should eventually kick in and mark these tasks as Failed if they can reach Running, so the fact that they never reach that state leads me to believe one of the issues listed above is responsible.
    Verun Rahimtoola

    1 year ago
    hi @Kyle Moon-Wright thanks for responding! so the agent has no issue talking to the graphql server and getting back flow runs, and i can see this from the agent logs.. also the flow is fairly simple and isn't doing a whole lot 😕 i need to verify with our ops guys if the lazarus service is up and running, as i think that might be the issue... will report back. thanks!
    fyi our prefect versions everywhere are 0.14.0
    i'm not sure about our deployment specifics but i'll also ask them to verify if wherever lazarus is running from can talk to the prefect GQL server
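    One way to check that a given host can talk to the GraphQL server is to send a trivial query from that host. Below is a minimal sketch using only the Python standard library. The helper names `build_probe` and `can_reach_graphql` are hypothetical, and the URL in the usage note assumes the API is exposed on port 4200 under /graphql, which may not match this deployment. `{ __typename }` is a query that any spec-compliant GraphQL server should answer.

```python
import json
import urllib.error
import urllib.request


def build_probe(url):
    """Build a POST request carrying a trivial GraphQL query."""
    payload = json.dumps({"query": "{ __typename }"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def can_reach_graphql(url, timeout=5.0):
    """Return True if the endpoint answers the probe with a 'data' field."""
    try:
        with urllib.request.urlopen(build_probe(url), timeout=timeout) as resp:
            body = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON reply.
        return False
    return isinstance(body, dict) and "data" in body
```

    Running `can_reach_graphql("http://localhost:4200/graphql")` on the machine in question (with the host and port adjusted to the real deployment) returns False when the server cannot be reached or does not reply with a GraphQL response.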