# ask-marvin
s
@Marvin this was the last message you sent me: Nice — thanks for the details. Two key points based on your output:
- The head’s raylet PID is 7004 (from your ps output). But the failure happens on the worker at 172.31.25.199, so we need to inspect the raylet environment on the worker.
- Even though docker works interactively on the worker, Ray is invoking the container runtime from the worker’s raylet/runtime_env agent process. If that process’s environment cannot find docker on PATH at launch time, it’ll try podman and you see “podman: not found”.
Let’s verify and fix, step by step.
A) Inspect the worker’s raylet environment (172.31.25.199)
Run these on the worker:
Copy code
# 1) Confirm raylet PID (on the worker)
pgrep -f raylet

# 2) Show the PATH visible to the raylet process (replace <PID> with output above)
sudo tr '\0' '\n' </proc/<PID>/environ | grep ^PATH=

# 3) Sanity check your current shell
echo $PATH
which docker
id -nG

# 4) Check runtime env agent logs for clues
ls -1 /tmp/ray/session_*/logs | grep runtime_env
grep -i -E 'podman|docker' /tmp/ray/session_*/logs/runtime_env* 2>/dev/null | tail -n 100
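If you want the PID lookup and environment dump in one step, a combined form (a sketch, assuming a single raylet process per node):
Copy code
# One-shot: dump PATH (and any RAY_* vars) from the raylet's environment
sudo tr '\0' '\n' </proc/$(pgrep -f raylet | head -n 1)/environ | grep -E '^(PATH|RAY_)'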
If PATH for the raylet process does not include /usr/bin, or `shutil.which("docker")` returns None for the raylet, that explains why it can’t find docker at container launch time even though you can find it interactively.
B) Pin a simple “info” task to the worker (no container) to see what that worker reports
Run this from the head (or anywhere that can connect), but force the task onto 172.31.25.199:
Copy code
python - << 'PY'
import ray, json, os, shutil

ray.init(address="auto")

# Make an info task
@ray.remote
def info():
    import os, shutil, subprocess
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
    }

# Force the task to run on the target worker via node resource
target_ip = "172.31.25.199"
res = ray.get(info.options(resources={f"node:{target_ip}": 0.001}).remote())
print(json.dumps(res, indent=2))
PY
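If pinning via the `node:<ip>` custom resource feels fragile, Ray also has node-affinity scheduling; a sketch of the same pin (assumes Ray >= 2.0 and reuses the `info` task defined above):
Copy code
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

# Look up the Ray node ID for the target worker's IP
target_ip = "172.31.25.199"
node_id = next(n["NodeID"] for n in ray.nodes()
               if n["NodeManagerAddress"] == target_ip and n["Alive"])

# soft=False: fail instead of falling back to another node
res = ray.get(info.options(
    scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=node_id, soft=False)
).remote())
print(res)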
If `which_docker` comes back null here, the environment that Ray gives to workers on that node cannot see docker.
C) How to fix (common causes + remedies)
- PATH not present for Ray processes: the Ray autoscaler starts ray processes with a different environment than your interactive shell. Ensure /usr/bin is on PATH in the environment used to start ray. You can enforce PATH in your cluster YAML start commands. Example (add to worker_start_ray_commands and head_start_ray_commands):
Copy code
head_start_ray_commands:
  - 'export PATH=/usr/bin:$PATH; ray stop'
  - 'export PATH=/usr/bin:$PATH; ray start --head --dashboard-host=0.0.0.0 --port=6379'
worker_start_ray_commands:
  - 'export PATH=/usr/bin:$PATH; ray stop'
  - 'export PATH=/usr/bin:$PATH; ray start --address=$RAY_HEAD_IP:6379'
Adjust flags to match your config, and re-run `ray up` after editing.
- Group membership timing: if you add ec2-user to the docker group in setup_commands, the membership won’t apply to already-running processes. Make sure the Docker install + usermod happen before Ray starts, and start Ray only in a fresh session where `groups` shows docker. A simple way: do the docker install/usermod in setup_commands; in start_ray_commands, do not re-run usermod, just start Ray. If needed, force a login shell or use `newgrp docker` before `ray start` (a sketch of that ordering follows).
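A minimal sketch of that setup_commands ordering, assuming Amazon Linux 2 and that ec2-user is the user running Ray (adjust the package manager and user for your AMI):
Copy code
setup_commands:
  # Install and start Docker before Ray is ever launched
  - sudo amazon-linux-extras install -y docker || sudo yum install -y docker
  - sudo systemctl enable --now docker
  # Group membership only affects processes started afterwards,
  # so this must run before the start_ray_commands
  - sudo usermod -aG docker ec2-user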
- Confirm docker and ECR access for the raylet: after you fix PATH, try the same Python “with container” test again. If you hit auth issues, pre-pull the image on the worker and/or ensure the instance profile grants ECR pull permissions.
D) Re-run the container test (on the head)
Same script you used before; it should now succeed:
Copy code
python - << 'PY'
import os, shutil, ray, subprocess, json
IMAGE = "<http://971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231|971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231>"

print("Driver PATH:", os.environ.get("PATH"))
print("Driver which(docker):", shutil.which("docker"))
ray.init(address="auto")

@ray.remote
def info():
    import os, subprocess, shutil
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
        "python": subprocess.getoutput("python -V 2>&1"),
        "os_release": subprocess.getoutput("cat /etc/os-release || true"),
    }

print("\n--- Without container ---")
print(json.dumps(ray.get(info.remote()), indent=2))

print("\n--- With container runtime_env ---")
task = info.options(runtime_env={"container": {"image": IMAGE}}).remote()
print(json.dumps(ray.get(task), indent=2))
PY
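If the containerized run then fails on image pull or auth instead, pre-pulling on the worker is a quick way to isolate that (a sketch, reusing the registry and image from above):
Copy code
# On the worker node: authenticate to ECR, then pull the image once
aws ecr get-login-password --region ap-south-1 | docker login --username AWS --password-stdin 971422704224.dkr.ecr.ap-south-1.amazonaws.com
docker pull 971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231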
E) Why "podman: not found" despite docker being installed - Ray's containerized tasks feature shells out from the raylet/runtime-env agent. If the raylet's environment cannot find docker (binary not on PATH at the time the process was started), Ray may try podman. Your interactive shell is fine, but the raylet process environment may be missing PATH entries. - Ensuring PATH (and group membership) for the process that starts Ray fixes this. Once this test works, you can use the same runtime_env in Prefect-Ray's RayTaskRunner: - Global for the flow:``` @flow(task_runner=RayTaskRunner( address="ray//&lt;head ip&gt;10001", init_kwargs={ "runtime_env": {"container": {"image": "971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231"}} }, )) def my_flow(): ... ``` - Or per section with
remote_options
. Docs - Ray: Running tasks/actors in containers: Ray docs: containerized tasks/actors - Ray Client (ray://): Ray Client - ray.init reference: ray.init API - Prefect-Ray: Prefect-Ray integration If you can paste: - The PATH seen by the worker’s raylet (proc environ), and - The output of the pinned info task on 172.31.25.199, …I can advise exactly which YAML changes to make so every autoscaled worker reliably sees docker. I am trying to get prefect to run tasks in ray but in docker containers (I have an active ray cluster head already up). I wanna do it through init_kwargs but I run into the error podman not found even though docker is installed, in the path, and the user is in the group. You were suggesting that perhaps the ray executor itself does not have access to docker.
s
@Marvin I am trying to get prefect to run tasks in ray but in docker containers (I have an active ray cluster head already up). I wanna do it through init_kwargs but I run into the error podman not found even though docker is installed, in the path, and the user is in the group. You were suggesting that perhaps the ray executor itself does not have access to docker. You gave me this test to run: cat test2.py
import os, shutil, ray, subprocess, json
IMAGE = "971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231"

print("Driver PATH:", os.environ.get("PATH"))
print("Driver which(docker):", shutil.which("docker"))
ray.init(address="auto")

@ray.remote
def info():
    import os, subprocess, shutil
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
        "python": subprocess.getoutput("python -V 2>&1"),
        "os_release": subprocess.getoutput("cat /etc/os-release || true"),
        "has_nvidia_smi": subprocess.call("nvidia-smi >/dev/null 2>&1", shell=True) == 0,
    }

print("\n--- Without container ---")
print(json.dumps(ray.get(info.remote()), indent=2))

print("\n--- With container runtime_env ---")
task = info.options(
    runtime_env={
        "container": {
            "image": IMAGE,
            # "run_options": ["--gpus=all", "--ipc=host"],  # if needed
        }
    }
).remote()
print(json.dumps(ray.get(task), indent=2))

here is the output: python test2.py
Driver PATH: /home/ec2-user/.pyenv/versions/3.11.10/bin:/home/ec2-user/.pyenv/libexec:/home/ec2-user/.pyenv/plugins/python-build/bin:/home/ec2-user/.pyenv/plugins/pyenv-virtualenv/bin:/home/ec2-user/.pyenv/plugins/pyenv-update/bin:/home/ec2-user/.pyenv/plugins/pyenv-doctor/bin:/home/ec2-user/.pyenv/shims:/home/ec2-user/.pyenv/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/ec2-user/.local/bin:/home/ec2-user/bin
Driver which(docker): /usr/bin/docker
2025-10-16 17:14:52,680 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 172.31.17.173:6379...
2025-10-16 17:14:52,690 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at http://172.31.17.173:8265
--- Without container ---
(autoscaler +8s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +8s) Adding 1 node(s) of type ray.worker.default.
{
  "PATH": "/home/ec2-user/.pyenv/versions/3.11.10/bin:/home/ec2-user/.pyenv/libexec:/home/ec2-user/.pyenv/plugins/python-build/bin:/home/ec2-user/.pyenv/plugins/pyenv-virtualenv/bin:/home/ec2-user/.pyenv/plugins/pyenv-update/bin:/home/ec2-user/.pyenv/plugins/pyenv-doctor/bin:/home/ec2-user/.pyenv/shims:/home/ec2-user/.pyenv/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/ec2-user/.local/bin:/home/ec2-user/bin",
  "which_docker": "/usr/bin/docker",
  "whoami": "ec2-user",
  "python": "Python 3.11.10",
  "os_release": "NAME=\"Amazon Linux\"\nVERSION=\"2\"\nID=\"amzn\"\nID_LIKE=\"centos rhel fedora\"\nVERSION_ID=\"2\"\nPRETTY_NAME=\"Amazon Linux 2\"\nANSI_COLOR=\"0;33\"\nCPE_NAME=\"cpe:2.3:o:amazon:amazon_linux:2\"\nHOME_URL=\"https://amazonlinux.com/\"\nSUPPORT_END=\"2026-06-30\"",
  "has_nvidia_smi": false
}
--- With container runtime_env ---
(raylet, ip=172.31.18.108) bash: line 0: exec: podman: not found
(autoscaler +1m4s) Resized to 2 CPUs.
(raylet, ip=172.31.18.108) [2025-10-16 17:16:53,411 E 5879 5879] (raylet) worker_pool.cc:553: Some workers of the worker process(6005) have not registered within the timeout. The process is dead, probably it crashed during start.
(raylet, ip=172.31.18.108) bash: line 0: exec: podman: not found
(raylet, ip=172.31.18.108) [2025-10-16 17:17:53,415 E 5879 5879] (raylet) worker_pool.cc:553: Some workers of the worker process(6019) have not registered within the timeout. The process is dead, probably it crashed during start.
m
thought for 107.1 seconds
Thanks for running that test — it gives us exactly what we need.
What it shows
- Your Ray worker processes can see Docker just fine: `which_docker` is `/usr/bin/docker` and PATH looks good.
- The failure happens when Ray tries to start a containerized worker: the raylet on the worker node tries to exec `podman` and crashes: `bash: line 0: exec: podman: not found`.
Why this happens
- Ray’s “container runtime” for `runtime_env={"container": {...}}` is chosen by the Ray processes (raylet), not your task or driver.
- If Ray decides to use Podman, it will call `podman run ...` from the raylet. That choice is not controlled by Prefect or `init_kwargs`.
- The engine selection is typically controlled by the environment of the Ray processes. You can force Docker by setting `RAY_CONTAINER_RUNTIME=docker` for the Ray head and all worker nodes before they start.
What to do
1) Force Ray to use Docker on all Ray nodes
- If you start Ray manually:
Copy code
# On head node
RAY_CONTAINER_RUNTIME=docker ray start --head --port=6379 --dashboard-host=0.0.0.0

# On each worker node
RAY_CONTAINER_RUNTIME=docker ray start --address='<HEAD_IP>:6379'
- If you use ray cluster launcher (ray up YAML), put the env var in your start commands:
Copy code
head_start_ray_commands:
  - RAY_CONTAINER_RUNTIME=docker ray start --head --port=6379 --dashboard-host=0.0.0.0
worker_start_ray_commands:
  - RAY_CONTAINER_RUNTIME=docker ray start --address=$RAY_HEAD_IP:6379
- If you use systemd, add to the unit:
Copy code
[Service]
Environment=RAY_CONTAINER_RUNTIME=docker
ExecStart=/usr/bin/ray start --head --port=6379 --dashboard-host=0.0.0.0
Then `systemctl daemon-reload && systemctl restart ray`.
- If you’re on Kubernetes, set `env` on the Ray head/worker pod specs:
Copy code
- name: RAY_CONTAINER_RUNTIME
  value: docker
Alternative (not recommended): install Podman on the nodes or symlink `podman` to `docker`, but it’s better to tell Ray to use Docker explicitly.
2) Make sure nodes can pull your ECR image
Even after fixing the runtime, the pull can fail without ECR auth. On each Ray node:
Copy code
aws ecr get-login-password --region ap-south-1 | docker login --username AWS --password-stdin 971422704224.dkr.ecr.ap-south-1.amazonaws.com
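Note that ECR login tokens expire after 12 hours, so a long-lived cluster may need to refresh the login periodically; a hypothetical cron entry:
Copy code
# Refresh ECR auth every 6 hours (adjust region/registry to yours)
0 */6 * * * aws ecr get-login-password --region ap-south-1 | docker login --username AWS --password-stdin 971422704224.dkr.ecr.ap-south-1.amazonaws.com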
3) Re-run your test
After restarting the Ray cluster with `RAY_CONTAINER_RUNTIME=docker`:
Copy code
python test2.py
The “With container runtime_env” block should now run, and you’ll see the info from inside the container.
Using Prefect with Ray containers
- `init_kwargs` on Prefect’s `RayTaskRunner` only affect `ray.init(...)`; they cannot change the container runtime choice made by the raylets.
- Use `remote_options` to set the per-task runtime_env container:
Copy code
from prefect import flow, task
from prefect_ray.task_runners import RayTaskRunner
from prefect_ray.context import remote_options

IMAGE = "<http://971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231|971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231>"

@task
def info():
    import os, shutil, subprocess
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
        "python": subprocess.getoutput("python -V 2>&1"),
        "os_release": subprocess.getoutput("cat /etc/os-release || true"),
    }

@flow(task_runner=RayTaskRunner(address="auto"))
def run():
    with remote_options(runtime_env={"container": {"image": IMAGE, "run_options": ["--ipc=host"]}}):
        f = info.submit()
        return f.result()

if __name__ == "__main__":
    run()
A couple of quick checks on a failing worker node (like 172.31.18.108):
- Verify Ray’s choice:
Copy code
echo $RAY_CONTAINER_RUNTIME
which podman || echo "podman not found"
which docker || echo "docker not found"
- If `RAY_CONTAINER_RUNTIME` is empty or says `podman`, set it to `docker` and restart Ray on that node.
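Note that `echo $RAY_CONTAINER_RUNTIME` only shows your own shell’s environment; to see what the running raylet actually got, you can reuse the /proc technique from earlier (a sketch, assuming one raylet per node):
Copy code
# Dump the container-runtime setting from the raylet's own environment
sudo tr '\0' '\n' </proc/$(pgrep -f raylet | head -n 1)/environ | grep RAY_CONTAINER_RUNTIME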
Can you share how the Ray cluster is being started (ray up YAML, systemd, Kubernetes, manual shell)? I can give you the exact snippet to inject `RAY_CONTAINER_RUNTIME=docker` for your setup.
s
@Marvin I use the ray up YAML
m
thought for 110.9 seconds
Got it — since you’re using ray up YAML, here’s exactly how to force Ray to use Docker (not Podman) and make your runtime_env container tasks work.
Key points
- Ray’s container runtime is chosen by the raylet processes, not by Prefect or `init_kwargs`.
- In ray up YAML, each command runs in its own shell. Set the env var on the same line as `ray start` so it takes effect.
- Ensure Docker is running on every node and that nodes can pull from ECR.
Minimal YAML changes
Add or update these sections in your Ray cluster config:
Copy code
setup_commands:
  # Optional: ensure docker service is running (Amazon Linux 2 / Ubuntu variants)
  - sudo systemctl enable --now docker || sudo service docker start || true

  # Optional: ECR login so raylet can pull your image on all nodes
  - aws ecr get-login-password --region ap-south-1 | docker login --username AWS --password-stdin 971422704224.dkr.ecr.ap-south-1.amazonaws.com

head_start_ray_commands:
  - ray stop
  - sudo systemctl start docker || sudo service docker start || true
  # Force Docker as container runtime for raylet
  - RAY_CONTAINER_RUNTIME=docker ray start --head --port=6379 --dashboard-host=0.0.0.0

worker_start_ray_commands:
  - ray stop
  - sudo systemctl start docker || sudo service docker start || true
  # Force Docker as container runtime for raylet
  - RAY_CONTAINER_RUNTIME=docker ray start --address=$RAY_HEAD_IP:6379
Why prefix on the same line?
- ray up executes each command independently; using `export RAY_CONTAINER_RUNTIME=docker` on one line won’t persist to the next. Prefixing the env var on the `ray start` line guarantees it’s in the raylet’s environment, as the fragment below illustrates.
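Illustrated as a schematic fragment (not a full config):
Copy code
# Won't work: each list item runs in its own shell, so the export is lost
worker_start_ray_commands:
  - export RAY_CONTAINER_RUNTIME=docker
  - ray start --address=$RAY_HEAD_IP:6379

# Works: the variable and ray start share one shell
worker_start_ray_commands:
  - RAY_CONTAINER_RUNTIME=docker ray start --address=$RAY_HEAD_IP:6379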
Apply and verify
- Re-deploy: `ray up -y your-cluster.yaml`
- Optional quick check on a node:
Copy code
which docker || echo "docker not found"
which podman || echo "podman not found"
- Re-run your test script. The “With container runtime_env” block should now run without the `exec: podman: not found` error.
Using Prefect with Ray containers
- `init_kwargs` only affects `ray.init()` on the driver; it cannot change what the raylets use to start containerized tasks.
- Set the container per-task via Prefect’s `remote_options`:
Copy code
from prefect import flow, task
from prefect_ray.task_runners import RayTaskRunner
from prefect_ray.context import remote_options

IMAGE = "<http://971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231|971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231>"

@task
def info():
    import os, shutil, subprocess
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
        "python": subprocess.getoutput("python -V 2>&1"),
        "os_release": subprocess.getoutput("cat /etc/os-release || true"),
    }

@flow(task_runner=RayTaskRunner(address="auto"))
def run():
    with remote_options(runtime_env={"container": {"image": IMAGE, "run_options": ["--ipc=host"]}}):
        return info.submit().result()

if __name__ == "__main__":
    run()
If you still see Podman after this
- It likely means the raylets weren’t restarted under the env var. Make sure `ray stop` runs before `ray start` in your YAML, then `ray up -y` again.
- Very rarely, different Ray versions use different heuristics; keeping the env var on the same line as `ray start` is the safest, version-agnostic approach.
References
- Prefect + Ray integration: Prefect-Ray docs
- Ray docs: Ray documentation
If you paste your cluster YAML, I can tailor the exact edits for your distro/user and make sure Docker and ECR auth are handled correctly on both head and workers.
s
@Marvin why not put them in the setup_commands?
m
thought for 185.5 seconds
Short answer: because in Prefect 3.x there is no `setup_commands`. That field existed in some Prefect 2.x infrastructure blocks and was removed in 3.x.
A couple of clarifying questions so I can point you to the best option:
- Which Prefect version are you on (2.x or 3.x)?
- What commands are you trying to run (e.g., pip install, apt-get, custom script)?
If you’re on Prefect 3.x, use one of these patterns:
- Prefect YAML pull steps before your flow runs: `pip_install_requirements` for Python deps, `run_shell_script` for arbitrary pre-run setup. Example:
Copy code
pull:
  - prefect.deployments.steps.utility.pip_install_requirements:
      requirements_file: requirements.txt
  - prefect.deployments.steps.utility.run_shell_script:
      script: |
        ./prestart.sh
deployments:
  - name: my-deployment
    entrypoint: flows/my_flow.py:flow
    work_pool:
      name: my-pool
      job_variables:
        env:
          FOO: bar
Docs: prefect.yaml, utility steps, pull steps
- Containerized workloads: bake setup into your Dockerfile with `RUN` lines (best for speed/reliability); see the sketch at the end of this message.
- Environment variables and runtime tweaks: set via `job_variables` on the work pool/deployment. Docs: Customize job variables
- Advanced: custom worker implementations can add programmatic setup in their job preparation. Docs: Developing a custom worker
If you are on Prefect 2.x, `setup_commands` was available on infrastructure blocks (e.g., Kubernetes, Docker, Process). If that’s your situation, I can share a 2.x example, but note that 3.x removed this in favor of the approaches above. 2.x infra concepts: Prefect 2.x infrastructure
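A minimal Dockerfile sketch of the bake-it-in approach (base image tag and paths are assumptions; adjust to your project):
Copy code
# Hypothetical flow image: dependencies installed at build time, not at run time
FROM prefecthq/prefect:3-python3.11
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY flows/ /opt/prefect/flows/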
s
@Marvin I still get this:
Driver PATH: /home/ec2-user/.pyenv/versions/3.11.10/bin:/home/ec2-user/.pyenv/libexec:/home/ec2-user/.pyenv/plugins/python-build/bin:/home/ec2-user/.pyenv/plugins/pyenv-virtualenv/bin:/home/ec2-user/.pyenv/plugins/pyenv-update/bin:/home/ec2-user/.pyenv/plugins/pyenv-doctor/bin:/home/ec2-user/.pyenv/shims:/home/ec2-user/.pyenv/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/ec2-user/.local/bin:/home/ec2-user/bin
Driver which(docker): /usr/bin/docker
2025-10-16 21:58:47,815 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 172.31.19.101:6379...
2025-10-16 21:58:47,825 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at http://172.31.19.101:8265
--- Without container ---
(autoscaler +6s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +6s) Adding 1 node(s) of type ray.worker.default.
(autoscaler +27s) Resized to 2 CPUs.
{
  "PATH": "/home/ec2-user/.pyenv/versions/3.11.10/bin:/home/ec2-user/.pyenv/libexec:/home/ec2-user/.pyenv/plugins/python-build/bin:/home/ec2-user/.pyenv/plugins/pyenv-virtualenv/bin:/home/ec2-user/.pyenv/plugins/pyenv-update/bin:/home/ec2-user/.pyenv/plugins/pyenv-doctor/bin:/home/ec2-user/.pyenv/shims:/home/ec2-user/.pyenv/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/ec2-user/.local/bin:/home/ec2-user/bin",
  "which_docker": "/usr/bin/docker",
  "whoami": "ec2-user",
  "python": "Python 3.11.10",
  "os_release": "NAME=\"Amazon Linux\"\nVERSION=\"2\"\nID=\"amzn\"\nID_LIKE=\"centos rhel fedora\"\nVERSION_ID=\"2\"\nPRETTY_NAME=\"Amazon Linux 2\"\nANSI_COLOR=\"0;33\"\nCPE_NAME=\"cpe:2.3:o:amazon:amazon_linux:2\"\nHOME_URL=\"https://amazonlinux.com/\"\nSUPPORT_END=\"2026-06-30\"",
  "has_nvidia_smi": false
}
--- With container runtime_env ---
(raylet, ip=172.31.18.156) bash: line 0: exec: podman: not found
m
thought for 52.4 seconds
Thanks for the logs — the key line is:
(raylet, ip=172.31.18.156) bash: line 0: exec: podman: not found
What’s happening
- Your driver node has Docker installed (`which(docker): /usr/bin/docker`) — but the Ray worker node(s) do not.
- When you set a Ray `runtime_env` with a `container` spec, Ray tries to launch workers inside a container on each node. That requires a container runtime (Docker or Podman) on every Ray node.
- On at least one worker, Ray is attempting to use Podman and it isn’t installed, so it fails. Prefect pre-run steps won’t fix this because they run only where your flow (driver) runs, not on the Ray workers.
How to fix
Pick one of these options:
1) Use Docker for Ray containers (recommended if you already have Docker on nodes)
- Ensure Docker is installed on every Ray node (head and workers), and the Ray user can run it:
Copy code
sudo yum update -y
sudo amazon-linux-extras install docker
sudo service docker start
sudo usermod -aG docker ec2-user   # or the user running Ray
docker info
- Explicitly tell Ray to use Docker. In code:
Copy code
ray.init(
    address="auto",
    runtime_env={
        "container": {
            "image": "your-image:tag",
            "container_runtime": "docker",
            "run_options": ["--network=host"]
        }
    }
)
- Or set on each node before starting Ray:
Copy code
export RAY_CONTAINER_RUNTIME=docker
- Restart the Ray cluster after installing/configuring Docker.
2) If you intended to use Podman
- Install Podman on every Ray node and make sure it’s on PATH:
Copy code
sudo yum install -y podman
podman info
- Or set `container_runtime: podman` explicitly in your `runtime_env["container"]`.
3) Don’t use containerized runtime_env
- Remove the `runtime_env["container"]` block and run directly on the host.
- Provision dependencies on all Ray nodes by baking them into your base AMI or startup scripts (cloud-init/user data), or by using Ray’s non-container `runtime_env` (e.g., `"pip": ["pkg==x.y"]`) if that suits your needs; a sketch follows the diagnostics below.
Why not Prefect “setup” steps?
- In Prefect 3.x there’s no `setup_commands`. You can use `prefect.yaml` pull steps (e.g., `pip_install_requirements`, `run_shell_script`), but these only affect the process where the flow runs (the driver). They won’t install Docker/Podman or packages on separate Ray worker nodes. You must provision the Ray nodes themselves.
Quick diagnostics to confirm on workers
- From a worker node:
Copy code
which docker
which podman
docker info
echo $PATH
- Or from Python via Ray:
Copy code
import ray, shutil, os, platform

ray.init(address="auto")

@ray.remote
def check():
    return {
        "node": platform.node(),
        "docker": shutil.which("docker"),
        "podman": shutil.which("podman"),
        "path": os.environ.get("PATH"),
    }

print(ray.get(check.remote()))
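And the option-3 sketch mentioned above: dependencies via Ray’s pip runtime_env instead of a container (package names are placeholders):
Copy code
import ray

# Non-container alternative: Ray installs these into an isolated env for the job
ray.init(
    address="auto",
    runtime_env={"pip": ["pandas==2.2.2", "requests>=2.31"]},
)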
If you can share:
- Your Ray version
- The `ray.init(...)` snippet (especially the `runtime_env`)
- How the cluster is launched (Ray cluster YAML, managed service, or custom EC2 + `ray start`)
…I can tailor the exact steps and config.