Anish S
09/08/2025, 1:27 PM
Marvin
09/08/2025, 1:27 PM
Marvin
09/08/2025, 1:34 PM
1) The image ENTRYPOINT/CMD keeps the container alive
- If your ENTRYPOINT/CMD runs tail -f /dev/null, sleep infinity, or starts a long-running process (e.g., a worker), the container will not exit even after the Prefect engine finishes.
- In Prefect 3.x with the ECS worker (prefect-aws), the task's container must run the Prefect engine command and exit when it finishes. The worker injects a command that executes the flow run (e.g., prefect flow-run execute ...). If your image ENTRYPOINT wraps that command and keeps running, ECS will keep the task alive.
- Fixes:
- Prefer Prefect’s base images or ensure your Dockerfile ENTRYPOINT/CMD do not keep the container alive after the command finishes.
- Verify the running task’s “Command” and “Entry point” in the ECS console for the launched task. It should be a one-shot Prefect command that exits. If you see a long-running command (e.g., prefect worker start, a web server, tail -f), that’s the cause.
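If you prefer to script that check rather than click through the console, here is a minimal sketch with boto3; the cluster name and task ARN are placeholders, and the fields shown are the standard DescribeTasks/DescribeTaskDefinition response keys:
import boto3

ecs = boto3.client("ecs")

# Placeholder identifiers: substitute the stuck task's cluster and ARN
task = ecs.describe_tasks(cluster="my-cluster", tasks=["<stuck-task-arn>"])["tasks"][0]

# Which containers are still RUNNING, and which already exited (and with what code)?
for c in task["containers"]:
    print(c["name"], c["lastStatus"], c.get("exitCode"))

# The command injected at run time, if any
print("Overrides:", task.get("overrides", {}).get("containerOverrides"))

# Entry point / command / essential flags from the task definition itself
td = ecs.describe_task_definition(taskDefinition=task["taskDefinitionArn"])["taskDefinition"]
for cd in td["containerDefinitions"]:
    print(cd["name"], "essential:", cd.get("essential"),
          "entryPoint:", cd.get("entryPoint"), "command:", cd.get("command"))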
2) A sidecar or multiple containers keep the task running
- ECS will keep a task RUNNING as long as an essential container is running. If your main Prefect container exits but another essential sidecar (e.g., Datadog/Fluent Bit/CloudWatch agent) is still up, the task stays alive.
- Fixes:
- Mark only your main Prefect container as essential; non-essential sidecars won’t keep the task running after the main container exits.
- Alternatively, ensure the sidecar is configured to exit when the main container finishes.
- AWS docs on task container behavior: ECS task definitions (container behavior)
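To make the essential-flag layout concrete, here is a hypothetical fragment of a task definition registered with boto3; the family, container names, and images are placeholders, not taken from your setup:
import boto3

ecs = boto3.client("ecs")
ecs.register_task_definition(
    family="prefect-flow-run",  # placeholder family name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="1024",
    memory="2048",
    containerDefinitions=[
        {
            "name": "prefect",  # main container running the one-shot Prefect command
            "image": "my-registry/my-flow-image:latest",  # placeholder image
            "essential": True,  # the task stops when this container exits
        },
        {
            "name": "log-router",  # e.g. a Fluent Bit sidecar
            "image": "amazon/aws-for-fluent-bit:stable",
            "essential": False,  # must not keep the task alive after the main container exits
        },
    ],
)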
3) Background threads, async tasks, or child processes in your flow code
- Non-daemon threads, open executors, or subprocesses can keep Python alive after your flow function returns. Prefect can report the flow as Finished (because the engine reached its end), while the interpreter is still waiting on non-daemon work.
- Common culprits and fixes:
- ThreadPoolExecutor / ProcessPoolExecutor: call executor.shutdown(wait=True) before returning (a cleanup sketch follows the diagnostic snippet below).
- Multiprocessing pools: pool.close(); pool.join().
- Subprocesses: p.wait(), or gracefully p.terminate() then p.wait().
- Dask: client.close() (and cluster.close() if you created it).
- Ray: ray.shutdown().
- Async tasks: ensure all background tasks are awaited or cancelled before exit.
- Simple diagnostic: at the end of your flow, log live threads and processes:
import atexit, threading, os

@atexit.register
def report_lingers():
    print("Alive threads:", [t.name for t in threading.enumerate() if t.is_alive()])
    try:
        import psutil
        p = psutil.Process(os.getpid())
        print("Child procs:", [c.cmdline() for c in p.children(recursive=True)])
    except Exception as e:
        print("psutil not available:", e)
4) Prefect 2.x ECSTask (if you’re not on 3.x)
- If you’re on Prefect 2 with the ECSTask infrastructure block, similar principles apply: the container must run a one-shot command and exit. Misconfigured ENTRYPOINT/CMD or essential sidecars are the most common reasons for lingering tasks.
- If possible, consider upgrading to Prefect 3 and the ECS worker. Docs: prefect-aws collection (see ECS Worker section).
Quick triage checklist
- In the AWS ECS console, open the stuck task:
- Check the “Containers” tab: Are there multiple containers? Which are marked Essential? Did the Prefect container exit with code 0 while another essential container is still running?
- Check “Command” and “Entry point” for the Prefect container: Is it a one-shot Prefect command, or something long-running?
Marvin
09/08/2025, 1:34 PM
- Isolation test: run the same task definition with the command overridden to something trivial, e.g. python -c "print('ok')" (or sh -c "echo ok"). If the task still doesn’t stop, your task definition (ENTRYPOINT/sidecar/essential flags) is the issue.
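One way to run that isolation test from code, sketched with boto3; the cluster, task definition, container name, and networking values are placeholders for your own:
import boto3

ecs = boto3.client("ecs")
ecs.run_task(
    cluster="my-cluster",                 # placeholder cluster
    taskDefinition="prefect-flow-run:1",  # placeholder task definition
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],     # placeholder subnet
            "securityGroups": ["sg-0123456789abcdef0"],  # placeholder security group
            "assignPublicIp": "ENABLED",
        }
    },
    overrides={
        "containerOverrides": [
            # Override the main container's command with a trivial one-shot command;
            # if the task still stays RUNNING, the task definition is the problem.
            {"name": "prefect", "command": ["python", "-c", "print('ok')"]}
        ]
    },
)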
Next steps / info I need
- Prefect version (2.x or 3.x) and versions of prefect/prefect-aws.
- Are you using the Prefect 3 ECS worker or a Prefect 2 ECSTask block?
- Task definition details:
- Container list with Essential flags
- Entry point and Command for the main container
- Any sidecars (Datadog/Fluent Bit/etc.)
- A snippet of the flow code around any threads, subprocesses, Dask/Ray usage.
With those details, I can pinpoint the exact cause and give you a targeted fix. In the meantime, the most common fix is to remove/override any long-running ENTRYPOINT/CMD and ensure only the main Prefect container is essential so the task stops when it exits.
Anish S
09/08/2025, 2:11 PM
Marvin
09/08/2025, 2:11 PM
Marvin
09/08/2025, 2:14 PM
- run_deployment defaults to as_subflow=True, which links the child run to the parent in the UI, but does not inherently keep the parent process alive.
- timeout=0 means “do not wait”; it returns immediately after scheduling the child run. It should not block the container.
Given that, the most likely causes are:
1) The container process isn’t actually exiting (e.g., a lingering thread/process in your flow code)
2) An unexpected command/entrypoint override in the work pool/job template that keeps the container alive
Targeted checks
- In the ECS console for a stuck task:
- Open the task and look at the “Containers” tab. Confirm there is exactly one container and it’s Essential.
- Inspect the “Command” and “Entry point” for the running task. For Prefect’s ECS worker, this should be a one-shot Prefect engine command (e.g., a python -m prefect …/prefect flow-run execute …). If you see anything that keeps the process alive (bash -lc "… && tail -f /dev/null" or sleep infinity), that’s the cause.
- Confirm which run’s container is lingering:
- Is it the parent flow’s ECS task or a child flow’s task? This will tell us if the issue is in the parent flow’s code or in the child deployments’ code/templates.
- Enable debug logs on the flow container:
- Set PREFECT_LOGGING_LEVEL=DEBUG so you can see shutdown messages. If you see the “Completed flow run” log but the container keeps running, something is keeping the Python interpreter alive.
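As a sketch of one way to set that variable on a deployment that targets an ECS work pool, assuming the pool's job template exposes the standard env variable (the deployment name, pool name, and image below are placeholders):
from prefect import flow

@flow(log_prints=True)
def my_flow():
    print("hello")

if __name__ == "__main__":
    my_flow.deploy(
        name="debug-logging",                      # placeholder deployment name
        work_pool_name="my-ecs-pool",              # placeholder ECS work pool
        image="my-registry/my-flow-image:latest",  # placeholder, already-pushed image
        build=False,
        push=False,
        job_variables={"env": {"PREFECT_LOGGING_LEVEL": "DEBUG"}},
    )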
Common code-level causes that keep Python alive
- Background executors/processes:
- ThreadPoolExecutor/ProcessPoolExecutor: call shutdown(wait=True) before the flow returns
- Multiprocessing Pools: pool.close(); pool.join()
- Subprocesses: ensure you wait() (or terminate() then wait())
- Dask: client.close() (and cluster.close() if you created it)
- Ray: ray.shutdown()
- Async tasks: cancel/await any background tasks before returning
- Quick diagnostic snippet at the end of your parent flow to reveal lingerers:
import atexit, threading, os
try:
    import psutil
except Exception:
    psutil = None

@atexit.register
def report_lingers():
    print("Alive threads at exit:", [t.name for t in threading.enumerate() if t.is_alive()])
    if psutil:
        try:
            p = psutil.Process(os.getpid())
            print("Child processes at exit:", [c.cmdline() for c in p.children(recursive=True)])
        except Exception as e:
            print("psutil failed:", e)
Work pool / job template checks (very important)
- In the ECS work pool’s job template (in the UI or when created via the CLI), verify you did not set a custom command that wraps the Prefect engine with a long-running tail/sleep for debugging.
- If you used a template snippet from somewhere, remove any keep-alive patterns.
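A quick way to dump the pool's base job template for inspection, sketched with the Prefect 3.x client (the pool name is a placeholder):
import asyncio, json
from prefect.client.orchestration import get_client

async def show_job_template(pool_name: str = "my-ecs-pool"):  # placeholder pool name
    async with get_client() as client:
        pool = await client.read_work_pool(pool_name)
        # Look for a "command" override that wraps the engine in a keep-alive
        print(json.dumps(pool.base_job_template, indent=2, default=str))

if __name__ == "__main__":
    asyncio.run(show_job_template())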
Version check
- Please try upgrading to the latest Prefect and prefect-aws to rule out any fixed regressions:
- Prefect: Prefect releases
- Prefect AWS collection: prefect-aws docs
Optional tweaks to try
- If you don’t care about parent/child UI linkage, try as_subflow=False in run_deployment to see if behavior changes:
from prefect.deployments import run_deployment

run = run_deployment(
    name=anatomy_threshold_subflow,  # your "<flow-name>/<deployment-name>" string
    parameters={...},
    timeout=0,
    as_subflow=False,
)
- Create a minimal test deployment that just logs “hello” and returns. If that task stops cleanly, the issue is likely a lingering thread/process in your real flow code.
Helpful references
Marvin
09/08/2025, 2:14 PM
Anish S
09/08/2025, 9:55 PM
Marvin
09/08/2025, 9:55 PM
Marvin
09/08/2025, 9:56 PM
- ThreadPoolExecutor / ProcessPoolExecutor: call executor.shutdown(wait=True) before your flow returns
- multiprocessing Pool: pool.close(); pool.join()
- Subprocesses: ensure proc.wait() (or proc.terminate() then wait())
- Distributed libs:
- Dask: client.close() and cluster.close() if you created one
- Ray: ray.shutdown()
- Async clients/tasks:
- httpx: `await client.aclose()`; aiohttp: `await session.close()`; cancel/await any background tasks
- Data-dependent paths:
- Sometimes only certain inputs leave an executor or subprocess running; that matches your “some subflows” description
Quick diagnostics to add to the CHILD flow (the stuck one)
- Add this once at module import time to dump any lingering threads/processes at interpreter exit:
import atexit, threading, os
try:
    import psutil
except Exception:
    psutil = None

@atexit.register
def report_lingers():
    print("Alive threads at exit:", [t.name for t in threading.enumerate() if t.is_alive()])
    if psutil:
        try:
            p = psutil.Process(os.getpid())
            print("Child processes at exit:", [c.cmdline() for c in p.children(recursive=True)])
        except Exception as e:
            print("psutil failed:", e)
- Optional, for deeper one-off debugging:
import atexit, sys, traceback

@atexit.register
def dump_stacks():
    for tid, frame in sys._current_frames().items():
        print(f"Thread {tid} stack:\n{''.join(traceback.format_stack(frame))}")
Other checks
- Confirm the actual Command on the stuck ECS task in the AWS console. Even with no Dockerfile ENTRYPOINT/CMD, the work pool/job template supplies the command. It should be a one-shot Prefect command (e.g., prefect flow-run execute ...). If it’s wrapped with a keep-alive (e.g., ... && tail -f /dev/null), that will prevent exit.
- Verify whether the stuck task belongs to the parent flow or the child deployment. From your description, it sounds like the CHILD flow task is the one lingering — that points squarely at child flow code leaving something running.
Version nudge
- If possible, upgrade to the latest Prefect 3.x and prefect-aws; even if unlikely to be the root cause here, it’s good to rule out fixes already shipped:
- Prefect releases: Prefect releases
- Prefect AWS docs: prefect-aws collection
- run_deployment API (for reference): run_deployment docs
Next, can you share:
- Whether the lingering task is the child flow (flow run ID helps)
- The ECS task’s Command from a stuck run
- Any use of ThreadPool/ProcessPool/multiprocessing/Ray/Dask/subprocess in the child flow
- The last ~50 lines of logs from the stuck container
With that, I can point to the exact line(s) to change. In most cases we find one executor/subprocess that needs an explicit close/shutdown before the flow returns.
Anish S
09/09/2025, 1:41 PM
Marvin
09/09/2025, 1:41 PM
Marvin
09/09/2025, 1:41 PM
Marvin
09/09/2025, 1:42 PM
1) Shell out to the Blender CLI in a subprocess (most reliable)
import subprocess, json
def run_blender_export(params: dict):
    cmd = [
        "blender", "--background", "--factory-startup",
        "--python", "/app/export_script.py", "--", json.dumps(params),
    ]
    subprocess.run(cmd, check=True)
- Call this from a Prefect task/flow, then return. Since the heavy C++ teardown happens in the child process, your main process can exit cleanly.
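For example, a minimal way to wire that helper into a flow might look like this (the task and flow names are placeholders, not your actual code):
from prefect import flow, task

@task
def export_task(params: dict):
    # Heavy Blender work (and its C++ teardown) happens in the child process
    run_blender_export(params)

@flow
def stl_to_gltf_flow(params: dict):
    export_task(params)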
2) Wrap the Blender work in a separate Python process
- If you must use Python logic around it, isolate it with multiprocessing so the child process terminates independently.
import multiprocessing as mp

def _blender_job(params):
    import subprocess
    subprocess.run([...], check=True)  # the Blender command, as in run_blender_export above

def blender_export(params):
    p = mp.Process(target=_blender_job, args=(params,))
    p.start()
    p.join()
    if p.exitcode != 0:
        raise RuntimeError("Blender export failed")
3) As a last resort: a hard exit at the very end of the child flow
- Only if 1/2 are not feasible, and only in affected child flows:
from prefect import flow
import os, sys

@flow
def child_flow(...):
    try:
        # your work including Blender
        ...
    finally:
        sys.stdout.flush(); sys.stderr.flush()
        os._exit(0)  # hard kill to avoid hanging in native teardowns
- Caveat: this bypasses Python/Prefect atexit hooks. Since the Completed state is already reported before return, it’s generally safe but you may lose some last log lines.
Helpful diagnostics (to confirm)
- Add this to the child flow module to see what’s alive at exit:
import atexit, threading, os
try:
    import psutil
except Exception:
    psutil = None

@atexit.register
def report_lingers():
    print("Alive threads at exit:", [t.name for t in threading.enumerate() if t.is_alive()])
    if psutil:
        try:
            p = psutil.Process(os.getpid())
            print("Child procs at exit:", [c.cmdline() for c in p.children(recursive=True)])
        except Exception as e:
            print("psutil failed:", e)
A couple quick confirmations that would help me tailor this precisely:
- Are you invoking Blender (bpy or blender --background) for the STL→glTF export?
- If so, are you importing bpy in-process or shelling out to blender?
- If you try option 1 (CLI subprocess) or 2 (multiprocessing wrapper), does the ECS task now stop cleanly?
If you share how you’re currently calling Blender, I can give an exact snippet adapted to your setup.Marvin
09/09/2025, 1:42 PMfrom prefect import flow
import subprocess, sys, os, textwrap
def run_bpy_snippet(snippet: str, timeout: int = 600):
    code = textwrap.dedent(f"""
        import sys, os
        import bpy

        # Your bpy logic
        {snippet}

        # Try to aggressively clean up before exit
        try:
            # Clear handlers
            for handler_list_name in dir(bpy.app.handlers):
                handler_list = getattr(bpy.app.handlers, handler_list_name, None)
                if isinstance(handler_list, list):
                    for h in handler_list[:]:
                        try:
                            handler_list.remove(h)
                        except Exception:
                            pass
            # Reset to factory settings to drop data-blocks
            bpy.ops.wm.read_factory_settings(use_empty=True)
        except Exception as e:
            print("bpy cleanup error:", e)

        # As we are in a subprocess, it's safe to hard-exit to avoid lingering threads
        import os
        os._exit(0)
    """)
    # Launch a separate Python process that imports bpy
    proc = subprocess.Popen(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        env=os.environ.copy(),
    )
    try:
        for line in proc.stdout:
            print(line, end="")
        rc = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        raise RuntimeError("bpy subprocess timed out and was killed")
    if rc != 0:
        raise RuntimeError(f"bpy subprocess exited with code {rc}")

@flow
def child_flow(...):
    run_bpy_snippet("""
        # Example: your glTF export goes here
        # e.g., bpy.ops.export_scene.gltf(filepath="/tmp/out.gltf", ...)
    """)
- Key idea: bpy runs in a child process; after export we force termination with os._exit(0) so any native threads can’t keep it alive.
- If you have the blender binary available in the image, an alternative is subprocess.run(["blender", "-b", "-P", "/path/to/script.py"], check=True), but the pip bpy-as-a-module approach above avoids needing the binary.
2) Try to explicitly tear down bpy in-process (may or may not work)
- Before your flow returns, do:
- bpy.ops.wm.read_factory_settings(use_empty=True) to clear data
- Unregister all bpy.app.handlers.*
- Cancel timers if you registered any via bpy.app.timers
- Then call bpy.ops.wm.quit_blender() as a last step
- In practice, with bpy-as-a-module, lingering native threads may still prevent exit — which is why the subprocess isolation above is the most reliable.
3) As a last resort only
- If you cannot isolate bpy into a subprocess, you can hard-exit the entire process at the very end of the child flow with os._exit(0). This will end logs abruptly and can interfere with Prefect writing final state; only use this if you’ve confirmed the state is set and you accept the trade-offs. Prefer the subprocess approach where you only hard-exit the bpy worker process.
Why run_deployment(timeout=0) is not the cause
Marvin
09/09/2025, 1:42 PM
- timeout=0 just means the parent doesn’t wait on the child run; it does not keep the container alive. Your logs confirm the engine finished but the interpreter didn’t exit, which is consistent with bpy leaving threads running.
Next steps I recommend
- Wrap your bpy usage in a subprocess as shown and redeploy the child flow.
- If you want a quick confirmation, make a minimal child flow that only imports bpy and exits; if the ECS task lingers, it’s 100% the embedded Blender runtime keeping threads alive.
References
- run_deployment API: API docs
- Prefect AWS ECS worker: prefect-aws docs
If you can share your Blender/bpy version and a minimal snippet of the bpy export code, I can tailor the cleanup block further (handlers/timers specific to your exporters).