<@ULVA73B9P> Flow run reaches a finished state. Ho...
# ask-marvin
a
@Marvin Flow run reaches a finished state. However, ECS container running the flow run does not get killed.
m
thought for 414.0 seconds
Got it — the flow run shows Finished in Prefect, but the ECS task (container) keeps running. This usually comes down to one of these categories:

1) The container command/entrypoint keeps the process alive
- If your image sets an ENTRYPOINT or a final CMD like `tail -f /dev/null`, `sleep infinity`, or starts a long-running process (e.g., a worker), the container will not exit even after the Prefect engine finishes.
- In Prefect 3.x with the ECS worker (prefect-aws), the task's container must run the Prefect engine command and exit. The worker injects a command that executes the flow run (e.g., `prefect flow-run execute ...`). If your image ENTRYPOINT wraps that and keeps running, ECS will keep the task alive.
- Fixes:
  - Prefer Prefect's base images, or ensure your Dockerfile ENTRYPOINT/CMD do not keep the container alive after the command finishes.
  - Verify the running task's "Command" and "Entry point" in the ECS console for the launched task. It should be a one-shot Prefect command that exits. If you see a long-running command (e.g., `prefect worker start`, a web server, `tail -f`), that's the cause.

2) A sidecar or multiple containers keep the task running
- ECS will keep a task RUNNING as long as an essential container is running. If your main Prefect container exits but another essential sidecar (e.g., Datadog/Fluent Bit/CloudWatch agent) is still up, the task stays alive.
- Fixes:
  - Mark only your main Prefect container as essential; non-essential sidecars won't keep the task running after the main container exits.
  - Alternatively, ensure the sidecar is configured to exit when the main container finishes.
- AWS docs on task container behavior: ECS task definitions (container behavior)

3) Background threads, async tasks, or child processes in your flow code
- Non-daemon threads, open executors, or subprocesses can keep Python alive after your flow function returns. Prefect can report the flow as Finished (because the engine reached its end) while the interpreter is still waiting on non-daemon work.
- Common culprits and fixes (a short executor example follows the diagnostic below):
  - ThreadPoolExecutor / ProcessPoolExecutor: call `executor.shutdown(wait=True)` before returning.
  - Multiprocessing pools: `pool.close(); pool.join()`.
  - Subprocesses: `p.wait()`, or gracefully `p.terminate()` then `p.wait()`.
  - Dask: `client.close()` (and `cluster.close()` if you created it).
  - Ray: `ray.shutdown()`.
  - Async tasks: ensure all background tasks are awaited or cancelled before exit.
- Simple diagnostic: at the end of your flow, log live threads and processes:
import atexit, threading, os

@atexit.register
def report_lingers():
    # Log any non-daemon threads still alive when the interpreter shuts down
    print("Alive threads:", [t.name for t in threading.enumerate() if t.is_alive()])
    try:
        import psutil  # optional; only used if installed
        p = psutil.Process(os.getpid())
        print("Child procs:", [c.cmdline() for c in p.children(recursive=True)])
    except Exception as e:
        print("psutil not available:", e)
4) Prefect 2.x ECSTask (if you're not on 3.x)
- If you're on Prefect 2 with the ECSTask infrastructure block, similar principles apply: the container must run a one-shot command and exit. Misconfigured ENTRYPOINT/CMD or essential sidecars are the most common reasons for lingering tasks.
- If possible, consider upgrading to Prefect 3 and the ECS worker. Docs: prefect-aws collection (see ECS Worker section).

Quick triage checklist
- In the AWS ECS console, open the stuck task (or script the same checks with boto3; see the sketch at the end of this message):
  - Check the "Containers" tab: Are there multiple containers? Which are marked Essential? Did the Prefect container exit with code 0 while another essential container is still running?
  - Check "Command" and "Entry point" for the Prefect container: Is it a one-shot Prefect command, or something long-running?
  - Look at the container logs at the end: do you see Prefect logs like "Completed flow run…" followed by nothing? That suggests the engine finished but the process didn't exit.
- If you're using the Prefect ECS worker (recommended in 3.x):
  - Ensure you're on current versions of Prefect and prefect-aws.
  - Avoid custom ENTRYPOINTs that never exit. Let the worker override the command for the flow-run container.
- Run a sanity test with your image:
  - Launch the same task definition but override the command to `python -c "print('ok')"` (or `sh -c "echo ok"`). If the task still doesn't stop, your task definition (ENTRYPOINT/sidecar/essential flags) is the issue.

Next steps / info I need
- Prefect version (2.x or 3.x) and versions of prefect/prefect-aws.
- Are you using the Prefect 3 ECS worker or a Prefect 2 ECSTask block?
- Task definition details:
  - Container list with Essential flags
  - Entry point and Command for the main container
  - Any sidecars (Datadog/Fluent Bit/etc.)
- A snippet of the flow code around any threads, subprocesses, Dask/Ray usage.

With those details, I can pinpoint the exact cause and give you a targeted fix. In the meantime, the most common fix is to remove/override any long-running ENTRYPOINT/CMD and ensure only the main Prefect container is essential so the task stops when it exits.
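If you prefer to script the triage instead of clicking through the console, a rough boto3 sketch (cluster name and task ARN are placeholders you'd fill in) that prints the essential flags, entry point, command, and exit codes:

import boto3

ecs = boto3.client("ecs")

CLUSTER = "my-ecs-cluster"          # placeholder: your cluster name
STUCK_TASK_ARN = "arn:aws:ecs:..."  # placeholder: ARN of the lingering task

task = ecs.describe_tasks(cluster=CLUSTER, tasks=[STUCK_TASK_ARN])["tasks"][0]
task_def = ecs.describe_task_definition(
    taskDefinition=task["taskDefinitionArn"]
)["taskDefinition"]

# Entry point / command / essential flag per container definition
for cd in task_def["containerDefinitions"]:
    print(cd["name"], "essential:", cd.get("essential"),
          "entryPoint:", cd.get("entryPoint"), "command:", cd.get("command"))

# Live status / exit code per container in the running task
for c in task["containers"]:
    print(c["name"], "lastStatus:", c.get("lastStatus"), "exitCode:", c.get("exitCode"))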
a
@Marvin I am using prefect and prefect-aws as here:
> prefect[shell,docker,aws]==3.4.5
I am using the ECS worker with an ECS work pool. The Dockerfile does not have any ENTRYPOINT or CMD, and I am using python3.11 as the base image. I am not using sidecars. I am deploying subflows using "run_deployment" as follows:
> anatomy_threshold_subflow = "anatomy-threshold-flow-part/anatomy-threshold-deployment"
> anatomy_subflows = []
> for wu in work_units_data["work_units"]:
>     for preset in presets:
>         anatomy_subflows.append(
>             run_deployment(
>                 name=anatomy_threshold_subflow,
>                 parameters=dict(
>                     work_unit=json.dumps(wu),
>                     preset=preset,
>                 ),
>                 timeout=0
>             ).id
>         )
m
thought for 181.6 seconds
Thanks for the details — that helps narrow it down.

Given:
- Prefect 3.4.5 with prefect-aws, ECS worker + ECS work pool
- Image based on python:3.11, no custom ENTRYPOINT or CMD
- No sidecars
- Parent flow triggers many child runs via run_deployment(..., timeout=0)

Key notes about run_deployment in Prefect 3.x
- By default `as_subflow=True`, which links the child run to the parent in the UI but does not inherently keep the parent process alive.
- `timeout=0` means "do not wait"; it returns immediately after scheduling the child run. It should not block the container.

Given that, the most likely causes are:
1) The container process isn't actually exiting (e.g., a lingering thread/process in your flow code)
2) An unexpected command/entrypoint override in the work pool/job template that keeps the container alive

Targeted checks
- In the ECS console for a stuck task:
  - Open the task and look at the "Containers" tab. Confirm there is exactly one container and it's Essential.
  - Inspect the "Command" and "Entry point" for the running task. For Prefect's ECS worker, this should be a one-shot Prefect engine command (e.g., `python -m prefect ...` / `prefect flow-run execute ...`). If you see anything that keeps the process alive (`bash -lc "… && tail -f /dev/null"` or `sleep infinity`), that's the cause.
- Confirm which run's container is lingering:
  - Is it the parent flow's ECS task or a child flow's task? This will tell us if the issue is in the parent flow's code or in the child deployments' code/templates.
- Enable debug logs on the flow container:
  - Set `PREFECT_LOGGING_LEVEL=DEBUG` so you can see shutdown messages (one way to set this via deployment job variables is sketched after the diagnostic below). If you see the "Completed flow run" log but the container keeps running, something is keeping the Python interpreter alive.

Common code-level causes that keep Python alive
- Background executors/processes:
  - ThreadPoolExecutor/ProcessPoolExecutor: call `shutdown(wait=True)` before the flow returns
  - Multiprocessing Pools: `pool.close(); pool.join()`
  - Subprocesses: ensure you `wait()` (or `terminate()` then `wait()`)
- Dask: `client.close()` (and `cluster.close()` if you created it)
- Ray: `ray.shutdown()`
- Async tasks: cancel/await any background tasks before returning
- Quick diagnostic snippet at the end of your parent flow to reveal lingerers:
import atexit, threading, os
try:
    import psutil
except Exception:
    psutil = None

@atexit.register
def report_lingers():
    print("Alive threads at exit:", [t.name for t in threading.enumerate() if t.is_alive()])
    if psutil:
        try:
            p = psutil.Process(os.getpid())
            print("Child processes at exit:", [c.cmdline() for c in p.children(recursive=True)])
        except Exception as e:
            print("psutil failed:", e)
Work pool / job template checks (very important)
- In the ECS work pool's job template (in the UI or when created via CLI), verify you did not set a custom `command` that wraps the Prefect engine with a long-running tail/sleep for debugging.
- If you used a template snippet from somewhere, remove any keep-alive patterns.

Version check
- Please try upgrading to the latest Prefect and prefect-aws to rule out any fixed regressions:
  - Prefect: Prefect releases
  - Prefect AWS collection: prefect-aws docs

Optional tweaks to try
- If you don't care about parent/child UI linkage, try `as_subflow=False` in `run_deployment` to see if behavior changes:
run = run_deployment(
    name=anatomy_threshold_subflow,
    parameters={...},
    timeout=0,
    as_subflow=False,
)
- Create a minimal test deployment that just logs "hello" and returns (a sketch is at the end of this message). If that task stops cleanly, the issue is likely a lingering thread/process in your real flow code.

Helpful references
- Run deployments (API): run_deployment docs
- Prefect AWS ECS worker overview: prefect-aws collection
- ECS task/container behavior (essential containers): AWS ECS task definitions

To pinpoint this quickly, could you share:
- Whether the lingering ECS task belongs to the parent or a specific child run (flow run ID helps)
- The "Command" and "Entry point" shown for that task in ECS
- The last ~50 lines of logs from that container after the flow reports Completed
- Whether your flow uses any pools/executors/subprocess/Ray/Dask/etc.

With that, I can give you a focused fix.
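For the minimal test deployment mentioned above, a flow this small is enough; deploy it to the same ECS work pool with the same image you already use (names below are placeholders):

from prefect import flow, get_run_logger

@flow
def hello_flow():
    # No threads, subprocesses, or native libraries: the process should exit
    # right after the engine reports Completed.
    get_run_logger().info("hello")

If its ECS task stops cleanly while your real child flows' tasks linger, the cause is in the flow code rather than the work pool or task definition.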
a
@Marvin Normal subflow runs look like:
> Finished in state Completed() prefect.flow_runs
> Process for flow run 'garnet-earthworm' exited cleanly. prefect.flow_runs.runner
> Completed submission of flow run '1ef27088-2930-4665-81e2-80405193ae33' prefect.flow_runs.worker
> ECS task status is DEPROVISIONING. prefect.flow_runs.worker
> ECS task status is STOPPED. prefect.flow_runs.worker
> Container 'prefect' exited successfully. prefect.flow_runs.worker
But the stuck subflow looks like this:
> Finished in state Completed() prefect.flow_runs
m
thought for 66.2 seconds
Thanks — that log difference is a strong clue.

What it means
- In the healthy subflows, you see:
  - "Finished in state Completed()" (engine finished)
  - "Process for flow run … exited cleanly." (the Python process reached shutdown)
  - ECS status logs (DEPROVISIONING/STOPPED)
- In the stuck subflow, you only see:
  - "Finished in state Completed()"

That strongly suggests the flow engine finished, but the Python interpreter never actually exited — something in the child flow is keeping the process alive (most commonly a non-daemon thread, a background process, an open executor, or a long-lived async task). Given you're on Prefect 3.x with the ECS worker/work pool, no custom ENTRYPOINT/CMD, and no sidecars, the next most common causes are in the flow code.

Typical culprits in child flows
- Executors/processes not closed:
  - ThreadPoolExecutor / ProcessPoolExecutor: call `executor.shutdown(wait=True)` before your flow returns
  - multiprocessing Pool: `pool.close(); pool.join()`
  - Subprocesses: ensure `proc.wait()` (or `proc.terminate()` then `wait()`)
- Distributed libs:
  - Dask: `client.close()` and `cluster.close()` if you created one
  - Ray: `ray.shutdown()`
- Async clients/tasks:
  - aiohttp/httpx: `await client.aclose()`; cancel/await any background tasks
- Data-dependent paths:
  - Sometimes only certain inputs leave an executor or subprocess running; that matches your "some subflows" description

Quick diagnostics to add to the CHILD flow (the stuck one)
- Add this once at module import time to dump any lingering threads/processes at interpreter exit:
import atexit, threading, os
try:
    import psutil
except Exception:
    psutil = None

@atexit.register
def report_lingers():
    print("Alive threads at exit:", [t.name for t in threading.enumerate() if t.is_alive()])
    if psutil:
        try:
            p = psutil.Process(os.getpid())
            print("Child processes at exit:", [c.cmdline() for c in p.children(recursive=True)])
        except Exception as e:
            print("psutil failed:", e)
- Optional, for deeper one-off debugging:
import atexit, sys, traceback

@atexit.register
def dump_stacks():
    # Print each thread's current stack at interpreter exit
    for tid, frame in sys._current_frames().items():
        print(f"Thread {tid} stack:\n{''.join(traceback.format_stack(frame))}")
Other checks
- Confirm the actual Command on the stuck ECS task in the AWS console. Even with no Dockerfile ENTRYPOINT/CMD, the work pool/job template supplies the command. It should be a one-shot Prefect command (e.g., `prefect flow-run execute ...`). If it's wrapped with a keep-alive (e.g., `... && tail -f /dev/null`), that will prevent exit.
- Verify whether the stuck task belongs to the parent flow or the child deployment. From your description, it sounds like the CHILD flow task is the one lingering — that points squarely at child flow code leaving something running.

Version nudge
- If possible, upgrade to the latest Prefect 3.x and prefect-aws; even if unlikely to be the root cause here, it's good to rule out fixes already shipped:
  - Prefect releases: Prefect releases
  - Prefect AWS docs: prefect-aws collection
  - run_deployment API (for reference): run_deployment docs

Next, can you share:
- Whether the lingering task is the child flow (flow run ID helps)
- The ECS task's Command from a stuck run
- Any use of ThreadPool/ProcessPool/multiprocessing/Ray/Dask/subprocess in the child flow
- The last ~50 lines of logs from the stuck container

With that, I can point to the exact line(s) to change. In most cases we find one executor/subprocess that needs an explicit close/shutdown before the flow returns.
a
@Marvin The last 50 logs are here:
> Timer 'STL Import' took 1441.21 ms
> 150841 | INFO: Starting glTF 2.0 export
> 150841 | INFO: Extracting primitive: _label_1.003
> 150841 | INFO: Primitives created: 1
> 150841 | INFO: Extracting primitive: _label_1.004
> 150841 | INFO: Primitives created: 1
> 150841 | INFO: Extracting primitive: _label_1.005
> 150841 | INFO: Primitives created: 1
> 150841 | INFO: Extracting primitive: _label_1.009
> 150841 | INFO: Primitives created: 1
> 150841 | INFO: Extracting primitive: _label_1.012
> 150841 | INFO: Primitives created: 1
> 150841 | INFO: Finished glTF 2.0 export in 0.8797309398651123 s
> Error: Not freed memory blocks: 68004, total unfreed memory 181.179909 MB
> Freeing memory after the leak detector has run. This can happen when using static variables in C++ that are defined outside of functions. To fix this error, use the 'construct on first use' idiom.
> Freeing memory after the leak detector has run. This can happen when using static variables in C++ that are defined outside of functions. To fix this error, use the 'construct on first use' idiom.
> Freeing memory after the leak detector has run. This can happen when using static variables in C++ that are defined outside of functions. To fix this error, use the 'construct on first use' idiom.
> Freeing memory after the leak detector has run. This can happen when using static variables in C++ that are defined outside of functions. To fix this error, use the 'construct on first use' idiom.
> Freeing memory after the leak detector has run. This can happen when using static variables in C++ that are defined outside of functions. To fix this error, use the 'construct on first use' idiom.
> Freeing memory after the leak detector has run. This can happen when using static variables in C++ that are defined outside of functions. To fix this error, use the 'construct on first use' idiom.
> Freeing memory after the leak detector has run. This can happen when using static variables in C++ that are defined outside of functions. To fix this error, use the 'construct on first use' idiom.
> Freeing memory after the leak detector has run. This can happen when using static variables in C++ that are defined outside of functions. To fix this error, use the 'construct on first use' idiom.
Also I am using blender's bpy along with prefect. FYI.
m
thought for 75.1 seconds
thought for 73.6 seconds
Those last lines look exactly like Blender's glTF exporter + Blender's memory leak detector on shutdown:
- "INFO: Starting glTF 2.0 export / Finished glTF 2.0 export"
- "Error: Not freed memory blocks … Freeing memory after the leak detector has run…"

What this implies
- Your child flow finishes (Prefect reports Completed), but the Python interpreter does not exit because native code (Blender/bpy or a C++ lib used by it) keeps the process alive during teardown (static destructors, background threads, etc.). That's why you don't see "Process for flow run … exited cleanly" or the ECS DEPROVISIONING/STOPPED lines.

Recommended fixes (pick one)
1) Run Blender in a separate process and wait
- Don't import/use bpy in the same Python process as the Prefect engine. Instead, call Blender's CLI in a subprocess and wait for it to complete.
import subprocess, json

def run_blender_export(params: dict):
    cmd = [
        "blender", "--background", "--factory-startup",
        "--python", "/app/export_script.py", "--", json.dumps(params)
    ]
    subprocess.run(cmd, check=True)
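Called from the child flow, that could look roughly like this (flow and parameter names are hypothetical):

from prefect import flow

@flow
def export_flow(work_unit: str, preset: str):
    # All bpy/Blender work happens inside the blender subprocess; this process
    # only waits for it, so it can exit cleanly once the flow returns.
    run_blender_export({"work_unit": work_unit, "preset": preset})  # helper defined above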
- Call this from a Prefect task/flow (as in the sketch above), then return. Since the heavy C++ teardown happens in the child process, your main process can exit cleanly.

2) Wrap the Blender work in a separate Python process
- If you must use Python logic around it, isolate it with multiprocessing so the child process terminates independently.
import multiprocessing as mp

def _blender_job(params):
    import subprocess
    subprocess.run([...], check=True)

def blender_export(params):
    p = mp.Process(target=_blender_job, args=(params,))
    p.start()
    p.join()
    if p.exitcode != 0:
        raise RuntimeError("Blender export failed")
3) As a last resort: a hard exit at the very end of the child flow
- Only if 1/2 are not feasible, and only in affected child flows:
from prefect import flow
import os, sys

@flow
def child_flow(...):
    try:
        # your work including Blender
        ...
    finally:
        sys.stdout.flush(); sys.stderr.flush()
        os._exit(0)  # hard kill to avoid hanging in native teardowns
- Caveat: this bypasses Python/Prefect atexit hooks. Since the Completed state is already reported before return, it's generally safe, but you may lose some last log lines.

Helpful diagnostics (to confirm)
- Add this to the child flow module to see what's alive at exit:
import atexit, threading, os
try:
    import psutil
except Exception:
    psutil = None

@atexit.register
def report_lingers():
    print("Alive threads at exit:", [t.name for t in threading.enumerate() if t.is_alive()])
    if psutil:
        try:
            p = psutil.Process(os.getpid())
            print("Child procs at exit:", [c.cmdline() for c in p.children(recursive=True)])
        except Exception as e:
            print("psutil failed:", e)
A couple quick confirmations that would help me tailor this precisely:
- Are you invoking Blender (bpy or `blender --background`) for the STL→glTF export?
- If so, are you importing bpy in-process or shelling out to `blender`?
- If you try option 1 (CLI subprocess) or 2 (multiprocessing wrapper), does the ECS task now stop cleanly?

If you share how you're currently calling Blender, I can give an exact snippet adapted to your setup.
Thanks — those Blender/bpy logs are the smoking gun.

What's happening
- Your flow finishes and Prefect marks it Completed, but the Python interpreter never exits because bpy leaves non-daemon threads/resources running at shutdown.
- The "Not freed memory blocks…" and repeated "Freeing memory after the leak detector…" messages occur at Blender teardown. In practice, some bpy builds/plugins leave background threads (IO, image cache, exporters, etc.) alive — which prevents the process from exiting — so the ECS task never stops.

Most reliable fixes (in order of robustness)
1) Run all bpy work in a separate subprocess and let the parent flow kill/wait on it
- This isolates any lingering threads so the main Prefect process can exit cleanly. Example pattern in your child flow:
from prefect import flow
import subprocess, sys, os, textwrap

def run_bpy_snippet(snippet: str, timeout: int = 600):
    # Use a plain template and splice the (dedented) snippet in afterwards so the
    # generated module keeps valid top-level indentation regardless of how the
    # snippet string is indented.
    code = textwrap.dedent("""
        import sys, os
        import bpy

        # Your bpy logic
        {snippet}

        # Try to aggressively clean up before exit
        try:
            # Clear handlers
            for handler_list_name in dir(bpy.app.handlers):
                handler_list = getattr(bpy.app.handlers, handler_list_name, None)
                if isinstance(handler_list, list):
                    for h in handler_list[:]:
                        try:
                            handler_list.remove(h)
                        except Exception:
                            pass

            # Reset to factory settings to drop data-blocks
            bpy.ops.wm.read_factory_settings(use_empty=True)
        except Exception as e:
            print("bpy cleanup error:", e)

        # As we are in a subprocess, it's safe to hard-exit to avoid lingering threads
        import os
        os._exit(0)
    """)

    # Launch a separate Python process that imports bpy
    proc = subprocess.Popen(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        env=os.environ.copy(),
    )
    try:
        for line in proc.stdout:
            print(line, end="")
        rc = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        raise RuntimeError("bpy subprocess timed out and was killed")
    if rc != 0:
        raise RuntimeError(f"bpy subprocess exited with code {rc}")

@flow
def child_flow(...):
    run_bpy_snippet("""
# Example: your glTF export goes here
# e.g., bpy.ops.export_scene.gltf(filepath="/tmp/out.gltf", ...)
""")
- Key idea: bpy runs in a child process; after export we force termination with `os._exit(0)` so any native threads can't keep it alive.
- If you have the `blender` binary available in the image, an alternative is:
subprocess.run(["blender", "-b", "-P", "/path/to/script.py"], check=True)
but the pip `bpy`-as-a-module approach above avoids needing the binary.

2) Try to explicitly tear down bpy in-process (may or may not work)
- Before your flow returns, do:
  - `bpy.ops.wm.read_factory_settings(use_empty=True)` to clear data
  - Unregister all `bpy.app.handlers.*`
  - Cancel timers if you registered any via `bpy.app.timers`
  - Then call `bpy.ops.wm.quit_blender()` as a last step
- In practice, with bpy-as-a-module, lingering native threads may still prevent exit — which is why the subprocess isolation above is the most reliable.

3) As a last resort only
- If you cannot isolate bpy into a subprocess, you can hard-exit the entire process at the very end of the child flow with `os._exit(0)`. This will end logs abruptly and can interfere with Prefect writing final state; only use this if you've confirmed the state is set and you accept the trade-offs. Prefer the subprocess approach where you only hard-exit the bpy worker process.

Why run_deployment(timeout=0) is not the cause
- `timeout=0` just means the parent doesn't wait on the child run; it does not keep the container alive. Your logs confirm the engine finished but the interpreter didn't exit — consistent with bpy leaving threads running.

Next steps I recommend
- Wrap your bpy usage in a subprocess as shown and redeploy the child flow.
- If you want a quick confirmation, make a minimal child flow that only imports bpy and exits (sketched below); if the ECS task lingers, it's 100% the embedded Blender runtime keeping threads alive.

References
- run_deployment API: API docs
- Prefect AWS ECS worker: prefect-aws docs

If you can share your Blender/bpy version and a minimal snippet of the bpy export code, I can tailor the cleanup block further (handlers/timers specific to your exporters).
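For that quick confirmation, a sketch of the minimal child flow (deploy it to the same ECS work pool with your existing image):

from prefect import flow, get_run_logger

@flow
def bpy_import_test_flow():
    import bpy  # import only; no scene or export work
    get_run_logger().info("bpy imported, Blender version: %s", bpy.app.version_string)

If the ECS task for this run still lingers after "Finished in state Completed()", the embedded Blender runtime alone is keeping the interpreter alive, and the subprocess isolation above is the way to go.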