# ask-community
m
@Marvin all my flows are late, enter the Pending state, and then crash with "Flow run infrastructure exited with non-zero status code: Essential container in task exited (Error Code: 1)". I'm using Prefect Cloud's managed work pool and execution infrastructure
m
thought for 439.4 seconds
Thanks for the detail: that error string "Essential container in task exited (Error Code: 1)" is coming from AWS ECS/Fargate. Since you're using Prefect Cloud's Managed Work Pool/Managed Execution, here's how to quickly isolate the root cause and get you unblocked.

First, a couple of clarifying questions so I can tailor the fix:
- Are you using the default Prefect image or a custom image in the managed work pool?
- If custom, is the image public and does it include git and your Python dependencies?
- Is your deployment using `flow.from_source(...)` (cloning a repo at runtime) or shipping code inside a container image?
- Does your flow access AWS resources (S3, RDS, etc.) that would require credentials in env vars? If so, how are those set?

Immediate diagnostics

1) Check Provider Logs in the UI
- Open a failed flow run in the Prefect UI
- Go to the Infrastructure/Provider logs panel
- Scroll to the earliest error line; that will usually tell you exactly why the container exited 1 (e.g., ModuleNotFoundError, git: not found, permission denied, etc.)
- If present, click the "View in AWS" link to open the CloudWatch log stream
- Docs: ECS on Prefect (3.x)

2) Common causes for exit code 1 on managed ECS (and fixes)
- Missing dependencies in the image
  - Symptom: ImportError/ModuleNotFoundError in provider logs
  - Fix: Use a custom image that installs your requirements, or ensure your deployment installs them at runtime
- Missing git in the image when using from_source
  - Symptom: "git: not found" or clone failures in provider logs
  - Fix: Use a base image with git installed, e.g. extend the Prefect image and add git. Example Dockerfile:
```dockerfile
FROM prefecthq/prefect:3-python3.11
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install -r requirements.txt
```
- Private or unreachable image
  - Managed execution can pull public images. If your image is private or not reachable, the task may fail before your app starts
  - Fix: Make the image public or switch to the default Prefect image while testing
- Missing runtime environment variables
  - Symptom: Your app exits when trying to read missing creds/config
  - Fix: Add env vars on the Deployment (they propagate to the task container)
- Insufficient memory/CPU
  - Symptom: Sometimes you'll see OOM-like messages in provider logs (true OOM often exits 137; generic Killed/exceptions may exit 1)
  - Fix: In your Managed Work Pool, increase CPU/Memory defaults; also ensure the Deployment doesn't override to a lower value
  - Blog: More memory, more problems
- Incorrect command/entrypoint override
  - Symptom: "python: can't open file …" or usage messages
  - Fix: Revert to the default managed command, or verify your override runs your flow correctly

3) Why runs show Late then Pending then Crash
- Managed compute scales from zero; cold starts can cause a Late state before infra is ready
- Pending indicates we're launching the ECS task; if the container exits immediately with code 1, the run then moves to Crashed
- Once we fix the container exit cause, the Late/Pending sequence will still appear sometimes with cold starts, but the run will proceed to Running

4) Verify your Managed Work Pool settings (see the deployment sketch after this message)
- Image: default Prefect image vs custom (confirm it's public and has git/deps if using from_source)
- CPU/Memory: bump to something safe (e.g., 1 vCPU / 2–4 GB) while debugging
- Environment variables: add anything your code needs (AWS creds, API URLs, etc.)

5) If your flow uses AWS resources
- Managed compute can't assume your IAM roles; you must provide credentials via env vars or another method your code supports (e.g., access keys in env)
- Check the provider logs for AccessDenied or NoCredentials errors

6) Minimal repro test
- Create a tiny flow, deploy from source with the default Prefect image, and run it in the same work pool
```python
from prefect import flow, get_run_logger

@flow
def smoke_test():
    logger = get_run_logger()
    logger.info("Hello from managed ECS!")

if __name__ == "__main__":
    smoke_test()
```
- If this succeeds, infra is healthy and the issue is likely your image/dependencies/env
- If this also fails, share the provider-log snippet and we'll dig deeper

Helpful references
- Prefect + AWS ECS guide (3.x)
- Work pools overview blog
- Memory/CPU tuning blog

If you can share:
- A provider-log snippet from one failed run (first error)
- Your work pool's image, CPU/Memory, and whether you're using from_source
- Whether your repo is private and whether the image includes git
…I can pinpoint the fix quickly.
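A minimal sketch of points 4 and 5 above, assuming a Python SDK deployment: environment variables (e.g., AWS credentials) are passed to the managed pool's task container via `job_variables`. The repo URL, entrypoint, pool name, and env var names below are placeholders rather than details from this thread; check the work pool's base job template for the exact variables it accepts (CPU/memory can usually be adjusted there too).

```python
import os

from prefect import flow

if __name__ == "__main__":
    # Clone the flow code from a (public or credentialed) repo at run time,
    # then register a deployment against the managed work pool.
    flow.from_source(
        source="https://github.com/my-org/my-repo",  # placeholder repo URL
        entrypoint="flows/etl.py:etl",               # placeholder path:function entrypoint
    ).deploy(
        name="managed-etl",
        work_pool_name="my-managed-pool",            # placeholder pool name
        job_variables={
            # Env vars set here propagate to the task container; values are read
            # from the machine running this deploy script, never hard-coded.
            "env": {
                "AWS_ACCESS_KEY_ID": os.environ["AWS_ACCESS_KEY_ID"],
                "AWS_SECRET_ACCESS_KEY": os.environ["AWS_SECRET_ACCESS_KEY"],
            },
        },
    )
```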
m
I could use some human help if anyone can shed some light here 🙇 thanks in advance
j
Hey, what's your account id?
m
0d373dc9-4403-4748-abdb-a57870ce39cb
thanks Jake!
It's probably something stupid, but since I cannot see anything from the logs it's difficult to guess
I tried to mimic the behaviour by running a Docker container with the same Prefect image, then doing my pip install, which targets a private GitHub repo using a PAT, but I see no errors or warnings
j
I think I see the issue. We're actually looking to deprecate the `pip_packages` field because install errors aren't surfaced into your Prefect logs.
```
ERROR: Invalid requirement: '@'
```
It looks like you may have some spaces in a pip package you're specifying from git? I'd definitely recommend using pull steps instead; you should have full logging visibility there
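A hedged guess at the mechanism behind that error (package, org, and token below are placeholders): a git requirement written in PEP 508 form contains spaces around the `@`, so any field that splits entries on whitespace turns one requirement into three, and pip then rejects the bare `@` token.

```python
# Illustrative only: "mypkg", "my-org", and TOKEN are placeholders.
requirement = "mypkg @ git+https://TOKEN@github.com/my-org/mypkg.git"

# Valid as a single PEP 508 requirement, but if the field splits entries on whitespace...
entries = requirement.split()
print(entries)  # ['mypkg', '@', 'git+https://TOKEN@github.com/my-org/mypkg.git']

# ...the bare "@" gets handed to pip as its own requirement and fails with:
#   ERROR: Invalid requirement: '@'
```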
m
I had tried both with the space and without any spaces
I just hit another run, perhaps now the error is different
My deployment logic is written with the Python SDK; do pull steps apply here too?
let me see if the error is different
m
Ah after changing that I now see an error
hadn't got this far earlier!
j
If I'm understanding right you're pip installing your code from github (as opposed to pulling the code down directly?)
You'll likely need to change your entrypoint from script -> module syntax
m
yes, I used to do this in another (paid) account, but running a custom image that already had the repo cloned, though
j
The pull step route I linked above would:
• pull your code down from GitHub
• install your dependencies
• run your file as a script
But if you're installing your code as a package, you'll need to change your deployment's entrypoint to a module: https://docs.prefect.io/v3/api-ref/python/prefect-types-entrypoint#entrypoint
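For the module-entrypoint route, here is a rough sketch with the Python SDK, assuming the code is pip-installed into the runtime image rather than pulled at run time. The package, flow, and pool names are placeholders, and the `EntrypointType` import path follows the API ref linked above but may differ between Prefect versions.

```python
from prefect.types.entrypoint import EntrypointType  # path per the linked API ref; may vary by version

# Import the flow object from the pip-installed package (placeholder names).
from dataflows.flows.etl import etl_flow

if __name__ == "__main__":
    etl_flow.deploy(
        name="etl-managed",
        work_pool_name="my-managed-pool",  # placeholder pool name
        # MODULE_PATH records the entrypoint as "dataflows.flows.etl:etl_flow",
        # so at run time Prefect imports the module instead of opening a file path.
        # This assumes the package is installed in the runtime environment
        # (custom image or the pool's pip install step).
        entrypoint_type=EntrypointType.MODULE_PATH,
    )
```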
m
Mmmm, I believe none of this existed when I first set up this logic, so I'd need to revisit again... I currently have a GitHub Action that checks for changed flow/deployment files and deploys them using a Python script:
```python
import os
import logging
import argparse
from dataflows.deployments._constants import (
    FLOW_DEPLOYMENT_DICT,
    DEPLOYMENT_FILE_FUNC_DICT,
)


def get_file_name_file_extension(file_path: str) -> tuple:
    """Function to extract the file name and extension from a file URI
    Args:
        file_uri (str): URL pointing to the file
    Returns:
        tuple: file name and extension
    """
    file_name_with_ext = os.path.basename(file_path)
    file_name, file_ext = os.path.splitext(file_name_with_ext)
    return (file_name, file_ext)


def get_deploy_function(file_name: str) -> callable:
    """Function to get the deployment function for a flow

    Args:
        file_name (str): flow file name without extension

    Returns:
        callable: deployment function
    """
    deploy_function = FLOW_DEPLOYMENT_DICT.get(file_name)

    if not deploy_function:
        deploy_function = DEPLOYMENT_FILE_FUNC_DICT.get(file_name)
    return deploy_function


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter
    )
    parser.add_argument(
        "-f",
        "--file_path",
        help="File path",
        required=True,
    )

    args = parser.parse_args()
    logging.basicConfig(level=logging.INFO)  # add a handler so INFO messages actually print
    logger = logging.getLogger()

    file_name, _ = get_file_name_file_extension(args.file_path)

    deploy_function = get_deploy_function(file_name)

    if deploy_function:
        <http://logger.info|logger.info>(f"Deploying {file_name} using {deploy_function.__name__}")
        deploy_function()
        <http://logger.info|logger.info>(f"{file_name} has been deployed!")
    else:
        <http://logger.info|logger.info>(
            f"No deployment function found for {file_name},"
            "skipping..."
        )
```
this is old, but it works fine in our startup account, so I thought I'd use the same strategy for a personal project of mine - this is why I'd stick to the Python SDK for deploying flows instead of the YAML approach
if it's just a matter of changing the entrypoint format and that's straightforward, I'd stick to that - not sure how/where this is specified though
ah indeed, using `entrypoint_type=EntrypointType.MODULE_PATH` got me to the next step; now it's having a hard time finding a custom block that I had created
```
Finished in state Failed('Flow run encountered an exception. ValueError: Unable to find block document named gofit-credentials for block type secret')
```
j
nice! Not sure how/where you're loading the block, but the error suggests it's expecting a Secret block with that slug, while you have it saved as a custom block type
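For reference, a small hedged sketch of aligning the block types (the block name comes from the error above; the value is a placeholder): if the flow loads the document through the built-in `Secret` class, the document has to be saved under that same block type, otherwise the lookup fails exactly as shown.

```python
from prefect.blocks.system import Secret

# One-time setup: save the credential under the built-in "secret" block type
# so the name and type match what the flow asks for (value is a placeholder).
Secret(value="replace-me").save("gofit-credentials", overwrite=True)

# In the flow, the same name now resolves because the block types agree.
token = Secret.load("gofit-credentials").get()
```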
m
After switching the custom blocks for standard ones (not ideal, but I can live with that) it worked just fine; the key was to change the entrypoint as you pointed out, Jake, thanks a lot!!
🙌 1