Saad80
10/16/2025, 5:24 PM
# 1) Confirm raylet PID (on the worker)
pgrep -f raylet
# 2) Show the PATH visible to the raylet process (replace <PID> with output above)
sudo tr '\0' '\n' </proc/<PID>/environ | grep ^PATH=
# 3) Sanity check your current shell
echo $PATH
which docker
id -nG
# 4) Check runtime env agent logs for clues
ls -1 /tmp/ray/session_*/logs | grep runtime_env
grep -i -E 'podman|docker' /tmp/ray/session_*/logs/runtime_env* 2>/dev/null | tail -n 100
If PATH for the raylet process does not include /usr/bin or which docker is None for the raylet, that explains why it can’t find docker at container launch time even though you can find it interactively.
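If you prefer to do the same check from Python, here is a minimal sketch, assuming it runs on the worker node itself with permission to read the raylet's /proc entry (root or the raylet's own user); the pgrep/head PID selection is the same shortcut as step 1:
```
import shutil
import subprocess

# Same as step 1: grab the first raylet PID on this node (assumes a single raylet).
pid = subprocess.getoutput("pgrep -f raylet | head -n1").strip()

# Parse the raylet's environment from /proc (NUL-separated KEY=VALUE pairs).
blob = open(f"/proc/{pid}/environ", "rb").read().decode(errors="replace")
env = dict(kv.split("=", 1) for kv in blob.split("\0") if "=" in kv)

raylet_path = env.get("PATH", "")
print("raylet PATH:", raylet_path)
# Resolve docker/podman against the raylet's PATH, not the current shell's.
print("docker visible to raylet:", shutil.which("docker", path=raylet_path))
print("podman visible to raylet:", shutil.which("podman", path=raylet_path))
```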
B) Pin a simple “info” task to the worker (no container) to see what that worker reports
Run this from the head (or anywhere that can connect), but force the task onto 172.31.25.199:
python - << 'PY'
import ray, json, os, shutil

ray.init(address="auto")

# Make an info task
@ray.remote
def info():
    import os, shutil, subprocess
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
    }

# Force the task to run on the target worker via node resource
target_ip = "172.31.25.199"
res = ray.get(info.options(resources={f"node:{target_ip}": 0.001}).remote())
print(json.dumps(res, indent=2))
PY
If which_docker comes back null here, the environment that Ray gives to workers on that node cannot see docker.
C) How to fix (common causes + remedies)
- PATH not present for Ray processes:
- The Ray autoscaler starts ray processes with a different environment than your interactive shell. Ensure /usr/bin is on PATH in the environment used to start ray. You can enforce PATH in your cluster YAML start commands:
- Example (add to worker_start_ray_commands and head_start_ray_commands):
head_start_ray_commands:
  - 'export PATH=/usr/bin:$PATH; ray stop'
  - 'export PATH=/usr/bin:$PATH; ray start --head --dashboard-host=0.0.0.0 --port=6379'
worker_start_ray_commands:
  - 'export PATH=/usr/bin:$PATH; ray stop'
  - 'export PATH=/usr/bin:$PATH; ray start --address=$RAY_HEAD_IP:6379'
Adjust flags to match your config. Re-run ray up after editing.
- Group membership timing:
- If you add ec2-user to the docker group in setup_commands, the membership won’t apply to already-running processes. Make sure Docker install + usermod happen before Ray starts, and start Ray only in a fresh session where groups shows docker. A simple way:
- Do the docker install/usermod in setup_commands
- In start_ray_commands, do not restart usermod; just start Ray. If needed, force a login shell or use newgrp docker before ray start.
- Confirm docker and ECR access for the raylet:
- After you fix PATH, try the same Python “with container” test again. If you hit auth issues, pre-pull the image on the worker and/or ensure the instance profile gives ECR pull permissions.
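For the pre-pull, a minimal sketch using boto3 plus the docker CLI (assumes boto3 is installed on the node, the instance profile grants ECR pull permissions, and the image tag is the one from this thread):
```
import base64
import subprocess
import boto3

# Fetch a temporary ECR login for the registry in ap-south-1.
ecr = boto3.client("ecr", region_name="ap-south-1")
auth = ecr.get_authorization_token()["authorizationData"][0]
user, password = base64.b64decode(auth["authorizationToken"]).decode().split(":", 1)
registry = auth["proxyEndpoint"].removeprefix("https://")

# Log docker in and pre-pull the image so the raylet never has to pull it itself.
subprocess.run(
    ["docker", "login", "--username", user, "--password-stdin", registry],
    input=password.encode(), check=True,
)
subprocess.run(["docker", "pull", f"{registry}/prefect-flows:pd231"], check=True)
```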
D) Re-run the container test (on the head)
Same script you used before; it should now succeed:
python - << 'PY'
import os, shutil, ray, subprocess, json

IMAGE = "971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231"

print("Driver PATH:", os.environ.get("PATH"))
print("Driver which(docker):", shutil.which("docker"))

ray.init(address="auto")

@ray.remote
def info():
    import os, subprocess, shutil
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
        "python": subprocess.getoutput("python -V 2>&1"),
        "os_release": subprocess.getoutput("cat /etc/os-release || true"),
    }

print("\n--- Without container ---")
print(json.dumps(ray.get(info.remote()), indent=2))

print("\n--- With container runtime_env ---")
task = info.options(runtime_env={"container": {"image": IMAGE}}).remote()
print(json.dumps(ray.get(task), indent=2))
PY
E) Why "podman: not found" despite docker being installed
- Ray's containerized tasks feature shells out from the raylet/runtime-env agent. If the raylet's environment cannot find docker (binary not on PATH at the time the process was started), Ray may try podman. Your interactive shell is fine, but the raylet process environment may be missing PATH entries.
- Ensuring PATH (and group membership) for the process that starts Ray fixes this.
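For example, if a node's Ray processes are (re)started from a bootstrap script, a minimal sketch of forcing the PATH for that process tree (the head address is a placeholder; adjust flags to your cluster):
```
import os
import subprocess

# Build the environment the raylet will inherit, with /usr/bin guaranteed on PATH.
env = dict(os.environ)
env["PATH"] = "/usr/bin:/usr/local/bin:" + env.get("PATH", "")

# Restart Ray on this node under that environment ("<HEAD_IP>" is a placeholder).
subprocess.run(["ray", "stop"], env=env, check=False)
subprocess.run(["ray", "start", "--address=<HEAD_IP>:6379"], env=env, check=True)
```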
Once this test works, you can use the same runtime_env in Prefect-Ray's RayTaskRunner:
- Global for the flow:
```
@flow(task_runner=RayTaskRunner(
    address="ray://<head ip>:10001",
    init_kwargs={
        "runtime_env": {"container": {"image": "971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231"}}
    },
))
def my_flow():
    ...
```
- Or per section with remote_options.
Docs
- Ray: Running tasks/actors in containers: Ray docs: containerized tasks/actors
- Ray Client (ray://): Ray Client
- ray.init reference: ray.init API
- Prefect-Ray: Prefect-Ray integration
If you can paste:
- The PATH seen by the worker’s raylet (proc environ), and
- The output of the pinned info task on 172.31.25.199,
…I can advise exactly which YAML changes to make so every autoscaled worker reliably sees docker.
I am trying to get prefect to run tasks in ray but in docker containers (I have an active ray cluster head already up). I wanna do it through init_kwargs but I run into the error podman not found even though docker is installed, in the path, and the user is in the group.
You were suggesting that perhaps the ray executor itself does not have access to docker.
Marvin
10/16/2025, 5:24 PM
Saad80
10/16/2025, 5:26 PM
ray status to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +8s) Adding 1 node(s) of type ray.worker.default.
{
"PATH": "/home/ec2-user/.pyenv/versions/3.11.10/bin/home/ec2 user/.pyenv/libexec/home/ec2-user/.pyenv/plugins/python-build/bin/home/ec2 user/.pyenv/plugins/pyenv virtualenv/bin/home/ec2-user/.pyenv/plugins/pyenv-update/bin/home/ec2 user/.pyenv/plugins/pyenv doctor/bin/home/ec2-user/.pyenv/shims/home/ec2 user/.pyenv/bin/usr/local/bin/usr/bin/usr/local/sbin/usr/sbin/home/ec2-user/.local/bin:/home/ec2-user/bin",
"which_docker": "/usr/bin/docker",
"whoami": "ec2-user",
"python": "Python 3.11.10",
"os_release": "NAME=\"Amazon Linux\"\nVERSION=\"2\"\nID=\"amzn\"\nID_LIKE=\"centos rhel fedora\"\nVERSION_ID=\"2\"\nPRETTY_NAME=\"Amazon Linux 2\"\nANSI_COLOR=\"0;33\"\nCPE_NAME=\"cpe2.3oamazonamazon_linux:2\"\nHOME_URL=\"https://amazonlinux.com/\"\nSUPPORT_END=\"2026-06-30\"",
"has_nvidia_smi": false
}
--- With container runtime_env ---
(raylet, ip=172.31.18.108) bash: line 0: exec: podman: not found
(autoscaler +1m4s) Resized to 2 CPUs.
(raylet, ip=172.31.18.108) [2025-10-16 17:16:53,411 E 5879 5879] (raylet) worker_pool.cc:553: Some workers of the worker process(6005) have not registered within the timeout. The process is dead, probably it crashed during start.
(raylet, ip=172.31.18.108) bash: line 0: exec: podman: not found
(raylet, ip=172.31.18.108) [2025-10-16 17:17:53,415 E 5879 5879] (raylet) worker_pool.cc:553: Some workers of the worker process(6019) have not registered within the timeout. The process is dead, probably it crashed during start.
Marvin
10/16/2025, 5:30 PM
- which_docker is /usr/bin/docker and PATH looks good.
- The failure happens when Ray tries to start a containerized worker: the raylet on the worker node tries to exec podman and crashes: bash: line 0: exec: podman: not found.
Why this happens
- Ray’s “container runtime” for runtime_env={"container": {...}} is chosen by the Ray processes (raylet), not your task or driver.
- If Ray decides to use Podman, it will call podman run ... from the raylet. That choice is not controlled by Prefect or init_kwargs.
- The engine selection is typically controlled by the environment of the Ray processes. You can force Docker by setting RAY_CONTAINER_RUNTIME=docker for the Ray head and all worker nodes before they start.
What to do
1) Force Ray to use Docker on all Ray nodes
- If you start Ray manually:
# On head node
RAY_CONTAINER_RUNTIME=docker ray start --head --port=6379 --dashboard-host=0.0.0.0
# On each worker node
RAY_CONTAINER_RUNTIME=docker ray start --address='<HEAD_IP>:6379'
- If you use ray cluster launcher (ray up YAML), put the env var in your start commands:
head_start_ray_commands:
  - RAY_CONTAINER_RUNTIME=docker ray start --head --port=6379 --dashboard-host=0.0.0.0
worker_start_ray_commands:
  - RAY_CONTAINER_RUNTIME=docker ray start --address=$RAY_HEAD_IP:6379
- If you use systemd, add to the unit:
[Service]
Environment=RAY_CONTAINER_RUNTIME=docker
ExecStart=/usr/bin/ray start --head --port=6379 --dashboard-host=0.0.0.0
Then systemctl daemon-reload && systemctl restart ray.
- If you’re on Kubernetes, set env on the Ray head/worker pod specs:
- name: RAY_CONTAINER_RUNTIME
  value: docker
Alternative (not recommended): install Podman on the nodes or symlink podman to docker, but it’s better to tell Ray to use Docker explicitly.
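If you do go the symlink route anyway, a minimal sketch (run as root on each node; assumes /usr/local/bin is on the raylet's PATH, and relies on podman's CLI being broadly docker-compatible for the flags Ray passes):
```
import os

# Not-recommended fallback only: make "podman" resolve to the docker binary.
if not os.path.exists("/usr/local/bin/podman"):
    os.symlink("/usr/bin/docker", "/usr/local/bin/podman")
```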
2) Make sure nodes can pull your ECR image
Even after fixing the runtime, the pull can fail without ECR auth. On each Ray node:
aws ecr get-login-password --region ap-south-1 | docker login --username AWS --password-stdin 971422704224.dkr.ecr.ap-south-1.amazonaws.com
3) Re-run your test
After restarting the Ray cluster with `RAY_CONTAINER_RUNTIME=docker`:
python test2.py
The “With container runtime_env” block should now run, and you’ll see the info from inside the container.
Using Prefect with Ray containers
- init_kwargs on Prefect’s RayTaskRunner only affect `ray.init(...)`; they cannot change the container runtime choice made by the raylets.
- Use remote_options to set the per-task runtime_env container:
from prefect import flow, task
from prefect_ray.task_runners import RayTaskRunner
from prefect_ray.context import remote_options

IMAGE = "971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231"

@task
def info():
    import os, shutil, subprocess
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
        "python": subprocess.getoutput("python -V 2>&1"),
        "os_release": subprocess.getoutput("cat /etc/os-release || true"),
    }

@flow(task_runner=RayTaskRunner(address="auto"))
def run():
    with remote_options(runtime_env={"container": {"image": IMAGE, "run_options": ["--ipc=host"]}}):
        f = info.submit()
    return f.result()

if __name__ == "__main__":
    run()
A couple of quick checks on a failing worker node (like 172.31.18.108):
- Verify Ray’s choice:
echo $RAY_CONTAINER_RUNTIME
which podman || echo "podman not found"
which docker || echo "docker not found"
- If RAY_CONTAINER_RUNTIME is empty or says podman, set it to docker and restart Ray on that node.
Marvin
10/16/2025, 5:30 PM
RAY_CONTAINER_RUNTIME=docker for your setup.
Saad80
10/16/2025, 6:27 PMMarvin
10/16/2025, 6:27 PMMarvin
10/16/2025, 6:29 PMinit_kwargs.
- In ray up YAML, each command runs in its own shell. Set the env var on the same line as ray start so it takes effect.
- Ensure Docker is running on every node and that nodes can pull from ECR.
Minimal YAML changes
Add or update these sections in your Ray cluster config:
setup_commands:
  # Optional: ensure docker service is running (Amazon Linux 2 / Ubuntu variants)
  - sudo systemctl enable --now docker || sudo service docker start || true
  # Optional: ECR login so raylet can pull your image on all nodes
  - aws ecr get-login-password --region ap-south-1 | docker login --username AWS --password-stdin 971422704224.dkr.ecr.ap-south-1.amazonaws.com
head_start_ray_commands:
  - ray stop
  - sudo systemctl start docker || sudo service docker start || true
  # Force Docker as container runtime for raylet
  - RAY_CONTAINER_RUNTIME=docker ray start --head --port=6379 --dashboard-host=0.0.0.0
worker_start_ray_commands:
  - ray stop
  - sudo systemctl start docker || sudo service docker start || true
  # Force Docker as container runtime for raylet
  - RAY_CONTAINER_RUNTIME=docker ray start --address=$RAY_HEAD_IP:6379
Why prefix on the same line?
- ray up executes each command independently; using export RAY_CONTAINER_RUNTIME=docker on one line won’t persist to the next. Prefixing the env var on the ray start line guarantees it’s in the raylet’s environment.
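A quick Python illustration of that shell behavior (purely illustrative; it only shows why an export in one command does not survive into a second command, while a same-line assignment does):
```
import subprocess

# Two separate shells: the export from the first command is gone in the second.
subprocess.run(["bash", "-c", "export RAY_CONTAINER_RUNTIME=docker"], check=True)
out = subprocess.run(["bash", "-c", "printenv RAY_CONTAINER_RUNTIME"],
                     capture_output=True, text=True)
print("separate commands:", repr(out.stdout.strip()))  # '' -> variable not set

# Same line as the command: the assignment is in that command's environment.
out = subprocess.run(["bash", "-c", "RAY_CONTAINER_RUNTIME=docker printenv RAY_CONTAINER_RUNTIME"],
                     capture_output=True, text=True)
print("same line:", repr(out.stdout.strip()))  # 'docker'
```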
Apply and verify
- Re-deploy: ray up -y your-cluster.yaml
- Optional quick check on a node:
which docker || echo "docker not found"
which podman || echo "podman not found"
- Re-run your test script. The “With container runtime_env” block should run without the exec: podman: not found error.
Using Prefect with Ray containers
- init_kwargs only affects ray.init() on the driver; it cannot change what the raylets use to start containerized tasks.
- Set the container per-task via Prefect’s `remote_options`:
from prefect import flow, task
from prefect_ray.task_runners import RayTaskRunner
from prefect_ray.context import remote_options

IMAGE = "971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231"

@task
def info():
    import os, shutil, subprocess
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
        "python": subprocess.getoutput("python -V 2>&1"),
        "os_release": subprocess.getoutput("cat /etc/os-release || true"),
    }

@flow(task_runner=RayTaskRunner(address="auto"))
def run():
    with remote_options(runtime_env={"container": {"image": IMAGE, "run_options": ["--ipc=host"]}}):
        return info.submit().result()

if __name__ == "__main__":
    run()
If you still see Podman after this
- It likely means the raylets weren’t restarted under the env var. Make sure ray stop runs before ray start in your YAML, then ray up -y again.
- Very rarely, different Ray versions use different heuristics; keeping the env var on the same line as ray start is the safest, version-agnostic approach.
References
- Prefect + Ray integration: Prefect-Ray docs
- Ray docs: Ray documentation
If you paste your cluster YAML, I can tailor the exact edits for your distro/user and make sure Docker and ECR auth are handled correctly on both head and workers.
Saad80
10/16/2025, 6:59 PM
setup_commands?
Marvin
10/16/2025, 6:59 PMMarvin
10/16/2025, 7:02 PMsetup_commands. That field existed in some Prefect 2.x infrastructure blocks and was removed in 3.x.
A couple of clarifying questions so I can point you to the best option:
- Which Prefect version are you on (2.x or 3.x)?
- What commands are you trying to run (e.g., pip install, apt-get, custom script)?
If you’re on Prefect 3.x, use one of these patterns:
- Prefect YAML pull steps before your flow runs:
- pip_install_requirements for Python deps
- run_shell_script for arbitrary pre-run setup
Example:
pull:
  - prefect.deployments.steps.utility.pip_install_requirements:
      requirements_file: requirements.txt
  - prefect.deployments.steps.utility.run_shell_script:
      script: |
        ./prestart.sh
deployments:
  - name: my-deployment
    entrypoint: flows/my_flow.py:flow
    work_pool:
      name: my-pool
      job_variables:
        env:
          FOO: bar
Docs: prefect.yaml, utility steps, pull steps
- Containerized workloads: bake setup into your Dockerfile with RUN lines (best for speed/reliability).
- Environment variables and runtime tweaks: set via job_variables on the work pool/deployment
Docs: Customize job variables
- Advanced: custom worker implementations can add programmatic setup in their job preparation
Docs: Developing a custom worker
If you are on Prefect 2.x, setup_commands was available on infrastructure blocks (e.g., Kubernetes, Docker, Process). If that’s your situation, I can share a 2.x example, but note that 3.x removed this in favor of the approaches above. 2.x infra concepts: Prefect 2.x infrastructure
Saad80
10/16/2025, 10:01 PM
ray status to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +6s) Adding 1 node(s) of type ray.worker.default.
(autoscaler +27s) Resized to 2 CPUs.
{
"PATH": "/home/ec2-user/.pyenv/versions/3.11.10/bin/home/ec2 user/.pyenv/libexec/home/ec2-user/.pyenv/plugins/python-build/bin/home/ec2 user/.pyenv/plugins/pyenv virtualenv/bin/home/ec2-user/.pyenv/plugins/pyenv-update/bin/home/ec2 user/.pyenv/plugins/pyenv doctor/bin/home/ec2-user/.pyenv/shims/home/ec2 user/.pyenv/bin/usr/local/bin/usr/bin/usr/local/sbin/usr/sbin/home/ec2-user/.local/bin:/home/ec2-user/bin",
"which_docker": "/usr/bin/docker",
"whoami": "ec2-user",
"python": "Python 3.11.10",
"os_release": "NAME=\"Amazon Linux\"\nVERSION=\"2\"\nID=\"amzn\"\nID_LIKE=\"centos rhel fedora\"\nVERSION_ID=\"2\"\nPRETTY_NAME=\"Amazon Linux 2\"\nANSI_COLOR=\"0;33\"\nCPE_NAME=\"cpe2.3oamazonamazon_linux:2\"\nHOME_URL=\"https://amazonlinux.com/\"\nSUPPORT_END=\"2026-06-30\"",
"has_nvidia_smi": false
}
--- With container runtime_env ---
(raylet, ip=172.31.18.156) bash: line 0: exec: podman: not found
Marvin
10/16/2025, 10:01 PMMarvin
10/16/2025, 10:02 PM
which(docker): /usr/bin/docker — but the Ray worker node(s) do not.
- When you set a Ray runtime_env with a container spec, Ray tries to launch workers inside a container on each node. That requires a container runtime (Docker or Podman) on every Ray node.
- On at least one worker, Ray is attempting to use Podman and it isn’t installed, so it fails. Prefect pre-run steps won’t fix this because they run only where your flow (driver) runs, not on the Ray workers.
How to fix
Pick one of these options:
1) Use Docker for Ray containers (recommended if you already have Docker on nodes)
- Ensure Docker is installed on every Ray node (head and workers), and the Ray user can run it:
sudo yum update -y
sudo amazon-linux-extras install docker
sudo service docker start
sudo usermod -aG docker ec2-user # or the user running Ray
docker info
- Explicitly tell Ray to use Docker:
- In code:
ray.init(
    address="auto",
    runtime_env={
        "container": {
            "image": "your-image:tag",
            "container_runtime": "docker",
            "run_options": ["--network=host"]
        }
    }
)
- Or set on each node before starting Ray:
export RAY_CONTAINER_RUNTIME=docker
- Restart the Ray cluster after installing/configuring Docker.
2) If you intended to use Podman
- Install Podman on every Ray node and make sure it’s on PATH:
sudo yum install -y podman
podman info
- Or set container_runtime: podman explicitly in your runtime_env["container"].
3) Don’t use containerized runtime_env
- Remove the runtime_env["container"] block and run directly on the host.
- Provision dependencies on all Ray nodes by:
- Baking them into your base AMI or startup scripts (cloud-init/user data)
- Or using Ray’s non-container runtime_env (e.g., "pip": ["pkg==x.y"]) if that suits your needs
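For the last point, a minimal sketch of a non-container runtime_env (package names are illustrative; assumes the cluster is reachable):
```
import ray

ray.init(address="auto")

# Per-task pip dependencies instead of a container image.
@ray.remote(runtime_env={"pip": ["pandas==2.2.2", "pyarrow"]})
def use_pandas():
    import pandas as pd
    return pd.__version__

print(ray.get(use_pandas.remote()))
```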
Why not Prefect “setup” steps?
- In Prefect 3.x there’s no setup_commands. You can use prefect.yaml pull steps (e.g., pip_install_requirements, run_shell_script), but these only affect the process where the flow runs (the driver). They won’t install Docker/Podman or packages on separate Ray worker nodes. You must provision the Ray nodes themselves.
Quick diagnostics to confirm on workers
- From a worker node:
which docker
which podman
docker info
echo $PATH
- Or from Python via Ray:
import ray, shutil, os, platform

ray.init(address="auto")

@ray.remote
def check():
    return {
        "node": platform.node(),
        "docker": shutil.which("docker"),
        "podman": shutil.which("podman"),
        "path": os.environ.get("PATH"),
    }

print(ray.get(check.remote()))
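If you want the same check on every node rather than whichever node the scheduler happens to pick, a sketch using the node:<ip> resources Ray attaches to each node (same idea as the pinned task earlier in this thread):
```
import os
import platform
import shutil
import ray

ray.init(address="auto")

@ray.remote
def check():
    return {
        "node": platform.node(),
        "docker": shutil.which("docker"),
        "podman": shutil.which("podman"),
        "path": os.environ.get("PATH"),
    }

# Pin one check task to each alive node via its node:<ip> custom resource.
for node in ray.nodes():
    if not node["Alive"]:
        continue
    ip = node["NodeManagerAddress"]
    print(ip, ray.get(check.options(resources={f"node:{ip}": 0.001}).remote()))
```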
If you can share:
- Your Ray version
- The ray.init(...) snippet (especially the runtime_env)
- How the cluster is launched (Ray cluster YAML, managed service, or custom EC2 + ray start)
…I can tailor the exact steps and config.