# ask-marvin
s
@Marvin this was the last message you sent me: Nice — thanks for the details. Two key points based on your output:
- The head’s raylet PID is 7004 (from your ps output). But the failure happens on the worker at 172.31.25.199, so we need to inspect the raylet environment on the worker.
- Even though docker works interactively on the worker, Ray is invoking the container runtime from the worker’s raylet/runtime_env agent process. If that process’s environment cannot find docker on PATH at launch time, it’ll try podman and you see “podman: not found”.
Let’s verify and fix, step by step.
A) Inspect the worker’s raylet environment (172.31.25.199)
Run these on the worker:
Copy code
# 1) Confirm raylet PID (on the worker)
pgrep -f raylet

# 2) Show the PATH visible to the raylet process (replace <PID> with output above)
sudo tr '\0' '\n' </proc/<PID>/environ | grep ^PATH=

# 3) Sanity check your current shell
echo $PATH
which docker
id -nG

# 4) Check runtime env agent logs for clues
ls -1 /tmp/ray/session_*/logs | grep runtime_env
grep -i -E 'podman|docker' /tmp/ray/session_*/logs/runtime_env* 2>/dev/null | tail -n 100
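If you want the PID lookup and environment dump in one step, a combined form (a sketch, assuming a single raylet process per node):
Copy code
# One-shot: dump PATH (and any RAY_* vars) from the raylet's environment
sudo tr '\0' '\n' </proc/$(pgrep -f raylet | head -n 1)/environ | grep -E '^(PATH|RAY_)'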
If PATH for the raylet process does not include /usr/bin, or `shutil.which("docker")` returns None for the raylet, that explains why it can’t find docker at container launch time even though you can find it interactively.
B) Pin a simple “info” task to the worker (no container) to see what that worker reports
Run this from the head (or anywhere that can connect), but force the task onto 172.31.25.199:
Copy code
python - << 'PY'
import ray, json, os, shutil

ray.init(address="auto")

# Make an info task
@ray.remote
def info():
    import os, shutil, subprocess
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
    }

# Force the task to run on the target worker via node resource
target_ip = "172.31.25.199"
res = ray.get(info.options(resources={f"node:{target_ip}": 0.001}).remote())
print(json.dumps(res, indent=2))
PY
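If pinning via the `node:<ip>` custom resource feels fragile, Ray also has node-affinity scheduling; a sketch of the same pin (assumes Ray >= 2.0 and reuses the `info` task defined above):
Copy code
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

# Look up the Ray node ID for the target worker's IP
target_ip = "172.31.25.199"
node_id = next(n["NodeID"] for n in ray.nodes()
               if n["NodeManagerAddress"] == target_ip and n["Alive"])

# soft=False: fail instead of falling back to another node
res = ray.get(info.options(
    scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=node_id, soft=False)
).remote())
print(res)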
If `which_docker` comes back null here, the environment that Ray gives to workers on that node cannot see docker.
C) How to fix (common causes + remedies)
- PATH not present for Ray processes: the Ray autoscaler starts ray processes with a different environment than your interactive shell. Ensure /usr/bin is on PATH in the environment used to start ray. You can enforce PATH in your cluster YAML start commands. Example (add to worker_start_ray_commands and head_start_ray_commands):
Copy code
head_start_ray_commands:
  - 'export PATH=/usr/bin:$PATH; ray stop'
  - 'export PATH=/usr/bin:$PATH; ray start --head --dashboard-host=0.0.0.0 --port=6379'
worker_start_ray_commands:
  - 'export PATH=/usr/bin:$PATH; ray stop'
  - 'export PATH=/usr/bin:$PATH; ray start --address=$RAY_HEAD_IP:6379'
Adjust flags to match your config, and re-run `ray up` after editing.
- Group membership timing: if you add ec2-user to the docker group in setup_commands, the membership won’t apply to already-running processes. Make sure the Docker install + usermod happen before Ray starts, and start Ray only in a fresh session where `groups` shows docker. A simple way: do the docker install/usermod in setup_commands; in start_ray_commands, do not re-run usermod, just start Ray. If needed, force a login shell or use `newgrp docker` before `ray start` (a sketch of that ordering follows).
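A minimal sketch of that setup_commands ordering, assuming Amazon Linux 2 and that ec2-user is the user running Ray (adjust the package manager and user for your AMI):
Copy code
setup_commands:
  # Install and start Docker before Ray is ever launched
  - sudo amazon-linux-extras install -y docker || sudo yum install -y docker
  - sudo systemctl enable --now docker
  # Group membership only affects processes started afterwards,
  # so this must run before the start_ray_commands
  - sudo usermod -aG docker ec2-user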
- Confirm docker and ECR access for the raylet: after you fix PATH, try the same Python “with container” test again. If you hit auth issues, pre-pull the image on the worker and/or ensure the instance profile grants ECR pull permissions.
D) Re-run the container test (on the head)
Same script you used before; it should now succeed:
Copy code
python - << 'PY'
import os, shutil, ray, subprocess, json
IMAGE = "<http://971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231|971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231>"

print("Driver PATH:", os.environ.get("PATH"))
print("Driver which(docker):", shutil.which("docker"))
ray.init(address="auto")

@ray.remote
def info():
    import os, subprocess, shutil
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
        "python": subprocess.getoutput("python -V 2>&1"),
        "os_release": subprocess.getoutput("cat /etc/os-release || true"),
    }

print("\n--- Without container ---")
print(json.dumps(ray.get(info.remote()), indent=2))

print("\n--- With container runtime_env ---")
task = info.options(runtime_env={"container": {"image": IMAGE}}).remote()
print(json.dumps(ray.get(task), indent=2))
PY
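If the containerized run then fails on image pull or auth instead, pre-pulling on the worker is a quick way to isolate that (a sketch, reusing the registry and image from above):
Copy code
# On the worker node: authenticate to ECR, then pull the image once
aws ecr get-login-password --region ap-south-1 | docker login --username AWS --password-stdin 971422704224.dkr.ecr.ap-south-1.amazonaws.com
docker pull 971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231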
E) Why "podman: not found" despite docker being installed - Ray's containerized tasks feature shells out from the raylet/runtime-env agent. If the raylet's environment cannot find docker (binary not on PATH at the time the process was started), Ray may try podman. Your interactive shell is fine, but the raylet process environment may be missing PATH entries. - Ensuring PATH (and group membership) for the process that starts Ray fixes this. Once this test works, you can use the same runtime_env in Prefect-Ray's RayTaskRunner: - Global for the flow:``` @flow(task_runner=RayTaskRunner( address="ray//&lt;head ip&gt;10001", init_kwargs={ "runtime_env": {"container": {"image": "971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231"}} }, )) def my_flow(): ... ``` - Or per section with
remote_options
. Docs - Ray: Running tasks/actors in containers: Ray docs: containerized tasks/actors - Ray Client (ray://): Ray Client - ray.init reference: ray.init API - Prefect-Ray: Prefect-Ray integration If you can paste: - The PATH seen by the worker’s raylet (proc environ), and - The output of the pinned info task on 172.31.25.199, …I can advise exactly which YAML changes to make so every autoscaled worker reliably sees docker. I am trying to get prefect to run tasks in ray but in docker containers (I have an active ray cluster head already up). I wanna do it through init_kwargs but I run into the error podman not found even though docker is installed, in the path, and the user is in the group. You were suggesting that perhaps the ray executor itself does not have access to docker.
s
@Marvin I am trying to get prefect to run tasks in ray but in docker containers (I have an active ray cluster head already up). I wanna do it through init_kwargs but I run into the error podman not found even though docker is installed, in the path, and the user is in the group. You were suggesting that perhaps the ray executor itself does not have access to docker. You gave me this test to run: cat test2.py
import os, shutil, ray, subprocess, json
IMAGE = "971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231"

print("Driver PATH:", os.environ.get("PATH"))
print("Driver which(docker):", shutil.which("docker"))
ray.init(address="auto")

@ray.remote
def info():
    import os, subprocess, shutil
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
        "python": subprocess.getoutput("python -V 2>&1"),
        "os_release": subprocess.getoutput("cat /etc/os-release || true"),
        "has_nvidia_smi": subprocess.call("nvidia-smi >/dev/null 2>&1", shell=True) == 0,
    }

print("\n--- Without container ---")
print(json.dumps(ray.get(info.remote()), indent=2))

print("\n--- With container runtime_env ---")
task = info.options(
    runtime_env={
        "container": {
            "image": IMAGE,
            # "run_options": ["--gpus=all", "--ipc=host"],  # if needed
        }
    }
).remote()
print(json.dumps(ray.get(task), indent=2))

here is the output: python test2.py
Driver PATH: /home/ec2-user/.pyenv/versions/3.11.10/bin:/home/ec2-user/.pyenv/libexec:/home/ec2-user/.pyenv/plugins/python-build/bin:/home/ec2-user/.pyenv/plugins/pyenv-virtualenv/bin:/home/ec2-user/.pyenv/plugins/pyenv-update/bin:/home/ec2-user/.pyenv/plugins/pyenv-doctor/bin:/home/ec2-user/.pyenv/shims:/home/ec2-user/.pyenv/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/ec2-user/.local/bin:/home/ec2-user/bin
Driver which(docker): /usr/bin/docker
2025-10-16 17:14:52,680 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 172.31.17.173:6379...
2025-10-16 17:14:52,690 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at http://172.31.17.173:8265
--- Without container ---
(autoscaler +8s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +8s) Adding 1 node(s) of type ray.worker.default.
{
  "PATH": "/home/ec2-user/.pyenv/versions/3.11.10/bin:/home/ec2-user/.pyenv/libexec:/home/ec2-user/.pyenv/plugins/python-build/bin:/home/ec2-user/.pyenv/plugins/pyenv-virtualenv/bin:/home/ec2-user/.pyenv/plugins/pyenv-update/bin:/home/ec2-user/.pyenv/plugins/pyenv-doctor/bin:/home/ec2-user/.pyenv/shims:/home/ec2-user/.pyenv/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/ec2-user/.local/bin:/home/ec2-user/bin",
  "which_docker": "/usr/bin/docker",
  "whoami": "ec2-user",
  "python": "Python 3.11.10",
  "os_release": "NAME=\"Amazon Linux\"\nVERSION=\"2\"\nID=\"amzn\"\nID_LIKE=\"centos rhel fedora\"\nVERSION_ID=\"2\"\nPRETTY_NAME=\"Amazon Linux 2\"\nANSI_COLOR=\"0;33\"\nCPE_NAME=\"cpe:2.3:o:amazon:amazon_linux:2\"\nHOME_URL=\"https://amazonlinux.com/\"\nSUPPORT_END=\"2026-06-30\"",
  "has_nvidia_smi": false
}
--- With container runtime_env ---
(raylet, ip=172.31.18.108) bash: line 0: exec: podman: not found
(autoscaler +1m4s) Resized to 2 CPUs.
(raylet, ip=172.31.18.108) [2025-10-16 17:16:53,411 E 5879 5879] (raylet) worker_pool.cc:553: Some workers of the worker process(6005) have not registered within the timeout. The process is dead, probably it crashed during start.
(raylet, ip=172.31.18.108) bash: line 0: exec: podman: not found
(raylet, ip=172.31.18.108) [2025-10-16 17:17:53,415 E 5879 5879] (raylet) worker_pool.cc:553: Some workers of the worker process(6019) have not registered within the timeout. The process is dead, probably it crashed during start.
m
thought for 107.1 seconds
Thanks for running that test — it gives us exactly what we need.
What it shows
- Your Ray worker processes can see Docker just fine: `which_docker` is `/usr/bin/docker` and PATH looks good.
- The failure happens when Ray tries to start a containerized worker: the raylet on the worker node tries to exec `podman` and crashes: `bash: line 0: exec: podman: not found`.
Why this happens
- Ray’s “container runtime” for `runtime_env={"container": {...}}` is chosen by the Ray processes (raylet), not your task or driver.
- If Ray decides to use Podman, it will call `podman run ...` from the raylet. That choice is not controlled by Prefect or `init_kwargs`.
- The engine selection is typically controlled by the environment of the Ray processes. You can force Docker by setting `RAY_CONTAINER_RUNTIME=docker` for the Ray head and all worker nodes before they start.
What to do
1) Force Ray to use Docker on all Ray nodes
- If you start Ray manually:
Copy code
# On head node
RAY_CONTAINER_RUNTIME=docker ray start --head --port=6379 --dashboard-host=0.0.0.0

# On each worker node
RAY_CONTAINER_RUNTIME=docker ray start --address='<HEAD_IP>:6379'
- If you use ray cluster launcher (ray up YAML), put the env var in your start commands:
Copy code
head_start_ray_commands:
  - RAY_CONTAINER_RUNTIME=docker ray start --head --port=6379 --dashboard-host=0.0.0.0
worker_start_ray_commands:
  - RAY_CONTAINER_RUNTIME=docker ray start --address=$RAY_HEAD_IP:6379
- If you use systemd, add to the unit:
Copy code
[Service]
Environment=RAY_CONTAINER_RUNTIME=docker
ExecStart=/usr/bin/ray start --head --port=6379 --dashboard-host=0.0.0.0
Then `systemctl daemon-reload && systemctl restart ray`.
- If you’re on Kubernetes, set `env` on the Ray head/worker pod specs:
Copy code
- name: RAY_CONTAINER_RUNTIME
  value: docker
Alternative (not recommended): install Podman on the nodes or symlink `podman` to `docker`, but it’s better to tell Ray to use Docker explicitly.
2) Make sure nodes can pull your ECR image
Even after fixing the runtime, the pull can fail without ECR auth. On each Ray node:
Copy code
aws ecr get-login-password --region ap-south-1 | docker login --username AWS --password-stdin 971422704224.dkr.ecr.ap-south-1.amazonaws.com
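Note that ECR login tokens expire after 12 hours, so a long-lived cluster may need to refresh the login periodically; a hypothetical cron entry:
Copy code
# Refresh ECR auth every 6 hours (adjust region/registry to yours)
0 */6 * * * aws ecr get-login-password --region ap-south-1 | docker login --username AWS --password-stdin 971422704224.dkr.ecr.ap-south-1.amazonaws.com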
3) Re-run your test
After restarting the Ray cluster with `RAY_CONTAINER_RUNTIME=docker`:
Copy code
python test2.py
The “With container runtime_env” block should now run, and you’ll see the info from inside the container.
Using Prefect with Ray containers
- `init_kwargs` on Prefect’s `RayTaskRunner` only affect `ray.init(...)`; they cannot change the container runtime choice made by the raylets.
- Use `remote_options` to set the per-task runtime_env container:
Copy code
from prefect import flow, task
from prefect_ray.task_runners import RayTaskRunner
from prefect_ray.context import remote_options

IMAGE = "<http://971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231|971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231>"

@task
def info():
    import os, shutil, subprocess
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
        "python": subprocess.getoutput("python -V 2>&1"),
        "os_release": subprocess.getoutput("cat /etc/os-release || true"),
    }

@flow(task_runner=RayTaskRunner(address="auto"))
def run():
    with remote_options(runtime_env={"container": {"image": IMAGE, "run_options": ["--ipc=host"]}}):
        f = info.submit()
        return f.result()

if __name__ == "__main__":
    run()
A couple of quick checks on a failing worker node (like 172.31.18.108):
- Verify Ray’s choice:
Copy code
echo $RAY_CONTAINER_RUNTIME
which podman || echo "podman not found"
which docker || echo "docker not found"
- If `RAY_CONTAINER_RUNTIME` is empty or says `podman`, set it to `docker` and restart Ray on that node.
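Note that `echo $RAY_CONTAINER_RUNTIME` only shows your own shell’s environment; to see what the running raylet actually got, you can reuse the /proc technique from earlier (a sketch, assuming one raylet per node):
Copy code
# Dump the container-runtime setting from the raylet's own environment
sudo tr '\0' '\n' </proc/$(pgrep -f raylet | head -n 1)/environ | grep RAY_CONTAINER_RUNTIME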
Can you share how the Ray cluster is being started (ray up YAML, systemd, Kubernetes, manual shell)? I can give you the exact snippet to inject `RAY_CONTAINER_RUNTIME=docker` for your setup.
s
@Marvin I use the ray up YAML
m
thought for 110.9 seconds
Got it — since you’re using ray up YAML, here’s exactly how to force Ray to use Docker (not Podman) and make your runtime_env container tasks work.
Key points
- Ray’s container runtime is chosen by the raylet processes, not by Prefect or `init_kwargs`.
- In ray up YAML, each command runs in its own shell. Set the env var on the same line as `ray start` so it takes effect.
- Ensure Docker is running on every node and that nodes can pull from ECR.
Minimal YAML changes
Add or update these sections in your Ray cluster config:
Copy code
setup_commands:
  # Optional: ensure docker service is running (Amazon Linux 2 / Ubuntu variants)
  - sudo systemctl enable --now docker || sudo service docker start || true

  # Optional: ECR login so raylet can pull your image on all nodes
  - aws ecr get-login-password --region ap-south-1 | docker login --username AWS --password-stdin 971422704224.dkr.ecr.ap-south-1.amazonaws.com

head_start_ray_commands:
  - ray stop
  - sudo systemctl start docker || sudo service docker start || true
  # Force Docker as container runtime for raylet
  - RAY_CONTAINER_RUNTIME=docker ray start --head --port=6379 --dashboard-host=0.0.0.0

worker_start_ray_commands:
  - ray stop
  - sudo systemctl start docker || sudo service docker start || true
  # Force Docker as container runtime for raylet
  - RAY_CONTAINER_RUNTIME=docker ray start --address=$RAY_HEAD_IP:6379
Why prefix on the same line?
- ray up executes each command independently; using `export RAY_CONTAINER_RUNTIME=docker` on one line won’t persist to the next. Prefixing the env var on the `ray start` line guarantees it’s in the raylet’s environment, as the fragment below illustrates.
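Illustrated as a schematic fragment (not a full config):
Copy code
# Won't work: each list item runs in its own shell, so the export is lost
worker_start_ray_commands:
  - export RAY_CONTAINER_RUNTIME=docker
  - ray start --address=$RAY_HEAD_IP:6379

# Works: the variable and ray start share one shell
worker_start_ray_commands:
  - RAY_CONTAINER_RUNTIME=docker ray start --address=$RAY_HEAD_IP:6379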
Apply and verify
- Re-deploy: `ray up -y your-cluster.yaml`
- Optional quick check on a node:
Copy code
which docker || echo "docker not found"
which podman || echo "podman not found"
- Re-run your test script. The “With container runtime_env” block should now run without the `exec: podman: not found` error.
Using Prefect with Ray containers
- `init_kwargs` only affects `ray.init()` on the driver; it cannot change what the raylets use to start containerized tasks.
- Set the container per-task via Prefect’s `remote_options`:
Copy code
from prefect import flow, task
from prefect_ray.task_runners import RayTaskRunner
from prefect_ray.context import remote_options

IMAGE = "<http://971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231|971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231>"

@task
def info():
    import os, shutil, subprocess
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
        "python": subprocess.getoutput("python -V 2>&1"),
        "os_release": subprocess.getoutput("cat /etc/os-release || true"),
    }

@flow(task_runner=RayTaskRunner(address="auto"))
def run():
    with remote_options(runtime_env={"container": {"image": IMAGE, "run_options": ["--ipc=host"]}}):
        return info.submit().result()

if __name__ == "__main__":
    run()
If you still see Podman after this
- It likely means the raylets weren’t restarted under the env var. Make sure `ray stop` runs before `ray start` in your YAML, then `ray up -y` again.
- Very rarely, different Ray versions use different heuristics; keeping the env var on the same line as `ray start` is the safest, version-agnostic approach.
References
- Prefect + Ray integration: Prefect-Ray docs
- Ray docs: Ray documentation
If you paste your cluster YAML, I can tailor the exact edits for your distro/user and make sure Docker and ECR auth are handled correctly on both head and workers.
s
@Marvin why not put them in the setup_commands?
m
thought for 185.5 seconds
Short answer: because in Prefect 3.x there is no `setup_commands`. That field existed in some Prefect 2.x infrastructure blocks and was removed in 3.x.
A couple of clarifying questions so I can point you to the best option:
- Which Prefect version are you on (2.x or 3.x)?
- What commands are you trying to run (e.g., pip install, apt-get, custom script)?
If you’re on Prefect 3.x, use one of these patterns:
- Prefect YAML pull steps before your flow runs: `pip_install_requirements` for Python deps, `run_shell_script` for arbitrary pre-run setup. Example:
Copy code
pull:
  - prefect.deployments.steps.utility.pip_install_requirements:
      requirements_file: requirements.txt
  - prefect.deployments.steps.utility.run_shell_script:
      script: |
        ./prestart.sh
deployments:
  - name: my-deployment
    entrypoint: flows/my_flow.py:flow
    work_pool:
      name: my-pool
      job_variables:
        env:
          FOO: bar
Docs: prefect.yaml, utility steps, pull steps
- Containerized workloads: bake setup into your Dockerfile with `RUN` lines (best for speed/reliability); see the sketch at the end of this message.
- Environment variables and runtime tweaks: set via `job_variables` on the work pool/deployment. Docs: Customize job variables
- Advanced: custom worker implementations can add programmatic setup in their job preparation. Docs: Developing a custom worker
If you are on Prefect 2.x, `setup_commands` was available on infrastructure blocks (e.g., Kubernetes, Docker, Process). If that’s your situation, I can share a 2.x example, but note that 3.x removed this in favor of the approaches above. 2.x infra concepts: Prefect 2.x infrastructure
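A minimal Dockerfile sketch of the bake-it-in approach (base image tag and paths are assumptions; adjust to your project):
Copy code
# Hypothetical flow image: dependencies installed at build time, not at run time
FROM prefecthq/prefect:3-python3.11
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY flows/ /opt/prefect/flows/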
s
@Marvin I still get this:
Driver PATH: /home/ec2-user/.pyenv/versions/3.11.10/bin:/home/ec2-user/.pyenv/libexec:/home/ec2-user/.pyenv/plugins/python-build/bin:/home/ec2-user/.pyenv/plugins/pyenv-virtualenv/bin:/home/ec2-user/.pyenv/plugins/pyenv-update/bin:/home/ec2-user/.pyenv/plugins/pyenv-doctor/bin:/home/ec2-user/.pyenv/shims:/home/ec2-user/.pyenv/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/ec2-user/.local/bin:/home/ec2-user/bin
Driver which(docker): /usr/bin/docker
2025-10-16 21:58:47,815 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 172.31.19.101:6379...
2025-10-16 21:58:47,825 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at http://172.31.19.101:8265
--- Without container ---
(autoscaler +6s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +6s) Adding 1 node(s) of type ray.worker.default.
(autoscaler +27s) Resized to 2 CPUs.
{
  "PATH": "/home/ec2-user/.pyenv/versions/3.11.10/bin:/home/ec2-user/.pyenv/libexec:/home/ec2-user/.pyenv/plugins/python-build/bin:/home/ec2-user/.pyenv/plugins/pyenv-virtualenv/bin:/home/ec2-user/.pyenv/plugins/pyenv-update/bin:/home/ec2-user/.pyenv/plugins/pyenv-doctor/bin:/home/ec2-user/.pyenv/shims:/home/ec2-user/.pyenv/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/ec2-user/.local/bin:/home/ec2-user/bin",
  "which_docker": "/usr/bin/docker",
  "whoami": "ec2-user",
  "python": "Python 3.11.10",
  "os_release": "NAME=\"Amazon Linux\"\nVERSION=\"2\"\nID=\"amzn\"\nID_LIKE=\"centos rhel fedora\"\nVERSION_ID=\"2\"\nPRETTY_NAME=\"Amazon Linux 2\"\nANSI_COLOR=\"0;33\"\nCPE_NAME=\"cpe:2.3:o:amazon:amazon_linux:2\"\nHOME_URL=\"https://amazonlinux.com/\"\nSUPPORT_END=\"2026-06-30\"",
  "has_nvidia_smi": false
}
--- With container runtime_env ---
(raylet, ip=172.31.18.156) bash: line 0: exec: podman: not found
m
thought for 52.4 seconds
Thanks for the logs — the key line is:
(raylet, ip=172.31.18.156) bash: line 0: exec: podman: not found
What’s happening
- Your driver node has Docker installed (`which(docker): /usr/bin/docker`) — but the Ray worker node(s) do not.
- When you set a Ray `runtime_env` with a `container` spec, Ray tries to launch workers inside a container on each node. That requires a container runtime (Docker or Podman) on every Ray node.
- On at least one worker, Ray is attempting to use Podman and it isn’t installed, so it fails. Prefect pre-run steps won’t fix this because they run only where your flow (driver) runs, not on the Ray workers.
How to fix
Pick one of these options:
1) Use Docker for Ray containers (recommended if you already have Docker on nodes)
- Ensure Docker is installed on every Ray node (head and workers), and the Ray user can run it:
Copy code
sudo yum update -y
sudo amazon-linux-extras install docker
sudo service docker start
sudo usermod -aG docker ec2-user   # or the user running Ray
docker info
- Explicitly tell Ray to use Docker. In code:
Copy code
ray.init(
    address="auto",
    runtime_env={
        "container": {
            "image": "your-image:tag",
            "container_runtime": "docker",
            "run_options": ["--network=host"]
        }
    }
)
- Or set on each node before starting Ray:
Copy code
export RAY_CONTAINER_RUNTIME=docker
- Restart the Ray cluster after installing/configuring Docker.
2) If you intended to use Podman
- Install Podman on every Ray node and make sure it’s on PATH:
Copy code
sudo yum install -y podman
podman info
- Or set `container_runtime: podman` explicitly in your `runtime_env["container"]`.
3) Don’t use containerized runtime_env
- Remove the `runtime_env["container"]` block and run directly on the host.
- Provision dependencies on all Ray nodes by baking them into your base AMI or startup scripts (cloud-init/user data), or by using Ray’s non-container `runtime_env` (e.g., `"pip": ["pkg==x.y"]`) if that suits your needs; a sketch follows the diagnostics below.
Why not Prefect “setup” steps?
- In Prefect 3.x there’s no `setup_commands`. You can use `prefect.yaml` pull steps (e.g., `pip_install_requirements`, `run_shell_script`), but these only affect the process where the flow runs (the driver). They won’t install Docker/Podman or packages on separate Ray worker nodes. You must provision the Ray nodes themselves.
Quick diagnostics to confirm on workers
- From a worker node:
Copy code
which docker
which podman
docker info
echo $PATH
- Or from Python via Ray:
Copy code
import ray, shutil, os, platform

ray.init(address="auto")

@ray.remote
def check():
    return {
        "node": platform.node(),
        "docker": shutil.which("docker"),
        "podman": shutil.which("podman"),
        "path": os.environ.get("PATH"),
    }

print(ray.get(check.remote()))
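And the option-3 sketch mentioned above: dependencies via Ray’s pip runtime_env instead of a container (package names are placeholders):
Copy code
import ray

# Non-container alternative: Ray installs these into an isolated env for the job
ray.init(
    address="auto",
    runtime_env={"pip": ["pandas==2.2.2", "requests>=2.31"]},
)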
If you can share:
- Your Ray version
- The `ray.init(...)` snippet (especially the `runtime_env`)
- How the cluster is launched (Ray cluster YAML, managed service, or custom EC2 + `ray start`)
…I can tailor the exact steps and config.