Saad80
10/09/2025, 10:20 PM
Marvin
10/09/2025, 10:24 PM
Two different images are involved here:
1) The container image used to run the Prefect flow itself
- Prefect can set this per deployment and override it per run via job_variables.
Example: override the image at trigger time
from prefect.deployments import run_deployment
run_deployment(
name="my-flow/my-deployment",
parameters={"some_param": "value"},
job_variables={"image": "<http://ghcr.io/your-org/prefect-ray:py310-cu121|ghcr.io/your-org/prefect-ray:py310-cu121>"},
)
If you deploy in Python (Prefect 3.x), you can also set the image when creating the deployment:
# sketch
my_flow.from_source(
source="<https://github.com/your-org/your-repo>",
entrypoint="flows/ray_flow.py:my_flow",
).deploy(
name="ray-flow",
work_pool_name="kubernetes-pool", # or docker, ecs, etc.
image="<http://ghcr.io/your-org/prefect-ray:py310|ghcr.io/your-org/prefect-ray:py310>",
)
Docs: Customize job variables
2) The container image used by the Ray cluster to execute Ray tasks/actors
- Prefect cannot change this per flow run. The Ray cluster’s worker pods/processes define their image (e.g., in KubeRay via the RayCluster spec). The Prefect flow can connect to that cluster, but image selection for Ray workers is a Ray/Kubernetes concern.
- If you need different images for different workloads on Ray, typical approaches are:
- Run multiple Ray clusters (each with its own image) and point your flow’s RayTaskRunner(address=...) to the desired cluster.
- In Kubernetes with KubeRay, define multiple worker groups with different images and route tasks using Ray resource tags/placement (Ray-specific pattern).
- If you’re using Ray Jobs and your environment supports it, submit jobs with a runtime environment that specifies the image (Ray feature; outside of Prefect-Ray’s task runner).
Minimal Prefect + Ray example (Prefect controls the flow image; Ray runs tasks on its own cluster):
from prefect import flow
from prefect_ray.task_runners import RayTaskRunner
@flow(task_runner=RayTaskRunner(address="<ray://your-ray-head:10001>"))
def my_flow():
# your tasks here; they will run on the Ray cluster
...
# Deploy this flow with different images per deployment or override per run via job_variables
Prefect-Ray docs: Prefect-Ray integration
To make sure I point you to the exact setup:
- Where is your Ray cluster running (Kubernetes via KubeRay, VMs, etc.)?
- Which work pool type are you using for Prefect (Docker, Kubernetes, ECS)?
- Do you want to dynamically choose the Prefect flow’s container image, or the Ray cluster’s worker image?
Saad80
10/09/2025, 10:29 PM
Marvin
10/09/2025, 10:31 PM
Passing init_kwargs to RayTaskRunner works, but it won’t change the image of an existing Ray cluster’s workers. Whether your snippet does what you want depends on how you’re using Ray.
- If you’re connecting to an existing Ray cluster (e.g., address="ray://..." via KubeRay): neither
- init_kwargs={"runtime_env": {"image_uri": ...}} nor
- init_kwargs={"runtime_env": {"container": {"image": ...}}}
will switch the Kubernetes pod image. The worker image is fixed by the RayCluster spec (KubeRay) or how the cluster was started. You’ll need to:
- Run separate Ray clusters with different images and choose the cluster per flow (via RayTaskRunner(address=...)), or
- Define multiple worker groups in the RayCluster with different images and route tasks using Ray resource tags/placement.
- If you’re starting Ray locally (no address) and want Ray to run your tasks inside a container on the same host:
- init_kwargs={"runtime_env": {"container": {"image": "your-image"}}} is the Ray runtime env containerization form. This requires Docker on the host and uses Ray’s container runtime env plugin. It does not affect remote KubeRay clusters.
- If you’re using Ray Jobs (job submission API), Ray supports specifying an image at the job level via the job’s runtime env (commonly image_uri). Prefect-Ray’s RayTaskRunner does not submit Ray Jobs; it calls ray.init(...), so job-level image overrides don’t apply here.
How to pass init_kwargs with Prefect-Ray:
from prefect import flow
from prefect_ray.task_runners import RayTaskRunner
# Connect to an existing cluster (won’t change its pod image)
@flow(task_runner=RayTaskRunner(address="<ray://ray-head:10001>",
init_kwargs={"runtime_env": {"env_vars": {"FOO": "bar"}}}))
def my_flow():
...
# Start a local Ray runtime and run tasks in a Docker image (host must have Docker)
@flow(task_runner=RayTaskRunner(init_kwargs={
"runtime_env": {"container": {"image": "<http://ghcr.io/your-org/ray-runtime:py310|ghcr.io/your-org/ray-runtime:py310>"}}
}))
def local_containerized_flow():
...
If your goal is to pick the container per flow run:
- For the Prefect flow runtime container: override the deployment’s image per run using `job_variables`:
from prefect.deployments import run_deployment
run_deployment(
name="my-flow/my-deployment",
job_variables={"image": "<http://ghcr.io/your-org/prefect-flow:py310-cu121|ghcr.io/your-org/prefect-flow:py310-cu121>"},
)
Docs: Customize job variables
- For the Ray worker image: configure it in Ray/KubeRay (e.g., RayCluster spec or separate clusters) and select which cluster to use in your flow via RayTaskRunner(address=...).
If you can share:
- Are you using KubeRay?
- Do you want to switch the Prefect flow’s container image, the Ray cluster’s worker image, or both?
…I can suggest the exact config and example YAML/Python.
Saad80
10/09/2025, 10:33 PM
Marvin
10/09/2025, 10:34 PM
Those two runtime_env forms do different things:
- init_kwargs={"runtime_env": {"image_uri": ...}} — this is for Ray Jobs (job submission API). It sets the container image for the job’s runtime env when you submit a job to the Ray Job server.
- init_kwargs={"runtime_env": {"container": {"image": ...}}} — this is the Ray runtime env “container” option for ray.init(...). It can run tasks/actors in that container image if Docker is available on every node and Ray’s container runtime is supported in your setup.
Important limitation on plain EC2 clusters:
- Neither of those will “swap” the image of an already-running Ray cluster’s worker processes started by the autoscaler. They only affect the runtime env for your job/tasks. If your cluster is launched with a fixed Docker image (via Ray cluster launcher YAML) or a specific AMI, that’s the base image workers use.
Given your goals (switch the Ray worker image per flow) on EC2, here are your practical options:
Option A — Multiple Ray clusters (recommended on plain EC2)
- Stand up multiple Ray clusters, each using the desired worker image (via the Ray cluster launcher YAML’s docker section or different AMIs).
- In your Prefect flow, point RayTaskRunner(address=...) to the cluster you want per flow/deployment. A simple pattern is to read the address from an env var and set it per deployment/job:
# flow.py
import os
from prefect import flow
from prefect_ray.task_runners import RayTaskRunner
RAY_ADDRESS = os.getenv("RAY_ADDRESS") # e.g., ray://<head-ip>:10001
@flow(task_runner=RayTaskRunner(address=RAY_ADDRESS))
def my_flow():
...
# When deploying, set the env var per deployment/run
# e.g., via job_variables
# job_variables={"env": {"RAY_ADDRESS": "<ray://ec2-gpu-head:10001>"}}
- This lets each Prefect deployment (or run) select a different Ray cluster (and thus a different worker image).
Option B — Use the Ray Jobs API with per-job image
- Enable the Ray Job server on your EC2 cluster and submit jobs with runtime_env={"image_uri": "<your-image>"}. This sets the image for that job run.
- You can trigger jobs from a Prefect flow:
from ray.job_submission import JobSubmissionClient
client = JobSubmissionClient("http://<head-node-ip>:8265")
job_id = client.submit_job(
entrypoint="python -c 'import your_module; your_module.main()'",
runtime_env={"image_uri": "<http://123456789012.dkr.ecr.us-east-1.amazonaws.com/ray:gpu|123456789012.dkr.ecr.us-east-1.amazonaws.com/ray:gpu>"},
# env_vars / working_dir / pip, etc. as needed
)
- Note: Prefect-Ray’s RayTaskRunner does not submit Ray Jobs; it calls ray.init(...). If you want per-job images without multiple clusters, Jobs is the cleanest route.
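If you do go the Jobs route, you will probably want the Prefect flow to wait for the job and surface failures. A rough sketch, assuming the Ray Job server is reachable from wherever the flow runs; the dashboard URL, entrypoint, and image are placeholders:
```python
import time

from prefect import flow
from ray.job_submission import JobSubmissionClient, JobStatus

@flow
def run_ray_job(
    image_uri: str,
    entrypoint: str,
    dashboard_url: str = "http://<head-node-ip>:8265",  # placeholder
) -> str:
    client = JobSubmissionClient(dashboard_url)
    job_id = client.submit_job(
        entrypoint=entrypoint,
        runtime_env={"image_uri": image_uri},  # per-job image, as above
    )
    # Poll until the job reaches a terminal state
    while True:
        status = client.get_job_status(job_id)
        if status in {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}:
            break
        time.sleep(10)
    print(client.get_job_logs(job_id))
    if status != JobStatus.SUCCEEDED:
        raise RuntimeError(f"Ray job {job_id} ended with status {status}")
    return job_id
```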
Option C — Ray runtime_env “container” for tasks/actors
- If all cluster nodes have Docker and Ray supports containerized workers in your setup, you can try:
@flow(task_runner=RayTaskRunner(
address="ray://<head-ip>:10001",
init_kwargs={"runtime_env": {"container": {"image": "your-image:tag"}}},
))
def my_flow():
...
- This can run your tasks/actors inside that image, but it does not change the base worker processes/image that the autoscaler brings up. Operationally, this is more fragile on plain EC2 than Options A/B.
Should you use KubeRay?
- Not required, but it makes this use case much easier at scale. You can:
- Run multiple RayCluster CRs with different worker images and choose per flow via address, or
- Define multiple worker groups in one cluster with different images and route tasks via Ray resource tags/placement (see the sketch below).
- If you anticipate many variants or frequent changes, KubeRay is the smoother path.
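For the worker-group routing pattern above, the usual approach is to give each worker group a custom resource and have tasks request it. A sketch under stated assumptions: the resource names gpu_group and cpu_group are made up and would have to be declared in each worker group’s rayStartParams, and it assumes remote_options forwards resources= to Ray’s .options(...) the same way it forwards num_cpus/num_gpus.
```python
from prefect import flow, task
from prefect_ray.task_runners import RayTaskRunner
from prefect_ray.context import remote_options

@task
def gpu_step(x): ...

@task
def cpu_step(y): ...

@flow(task_runner=RayTaskRunner(address="ray://<head-ip>:10001"))
def routed_flow():
    # Lands only on workers advertising the custom resource "gpu_group"
    # (i.e., the worker group built from the GPU image).
    with remote_options(resources={"gpu_group": 1}, num_gpus=1):
        a = gpu_step.submit(1)
    # Lands only on workers advertising "cpu_group".
    with remote_options(resources={"cpu_group": 1}):
        b = cpu_step.submit(2)
    return a, b
```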
Prefect bits you’ll likely use:
- Per-run flow container image (not Ray workers) via job variables:
from prefect.deployments import run_deployment
run_deployment(
name="my-flow/my-deployment",
job_variables={"image": "<http://ghcr.io/your-org/prefect-flow:py310-cu121|ghcr.io/your-org/prefect-flow:py310-cu121>"},
)
- Docs: Customize job variables
Saad80
10/09/2025, 10:37 PM
What do you mean by "it calls ray.init(...)"? I already have an existing ray cluster. Doesn’t init mean it starts a new cluster or something?
Marvin
10/09/2025, 10:39 PM
1) ray.init() behavior:
- Without an address: starts a local Ray instance on the current machine.
- With an address (e.g., address="ray://<head-ip>:10001"): connects to an existing cluster via Ray Client and does NOT start a new cluster.
- Prefect’s RayTaskRunner calls ray.init(...) under the hood. If you provide an address, it attaches to your existing cluster; if not, it’ll start a local Ray on the worker.
- Docs:
- Ray Client/`ray://` overview: Ray Client
- ray.init API: ray.init reference
2) Option C (run tasks/actors in a container image via runtime_env) — details and how to use it
What it does
- You can ask Ray to execute tasks/actors inside a Docker container on each node by setting a runtime environment with a container spec.
- This does NOT replace the Ray worker process image or the EC2 AMI; instead, Ray launches your user code in containers on the nodes.
- You can set:
- A default container for your session/driver (via ray.init(runtime_env=...)).
- Per-task/actor overrides (via remote_options(runtime_env=...)).
- Requirements:
- Docker must be installed and usable on every Ray node.
- Nodes must be able to pull your image (ECR/GCR auth, etc.).
- Your image should include a compatible Python + dependencies for your code. In practice, include ray too.
Official Ray docs
- Runtime environments (containers): Running tasks and actors in containers
How to use with Prefect-Ray
A. Set a default container for all Prefect tasks executed by this flow run:
- Pass init_kwargs={"runtime_env": {"container": {"image": "...", ...}}} to RayTaskRunner. This applies to all Ray tasks created by the driver (your Prefect flow).
from prefect import flow
from prefect_ray.task_runners import RayTaskRunner
@flow(
task_runner=RayTaskRunner(
address="ray://<head-ip>:10001",
init_kwargs={
"runtime_env": {
"container": {
"image": "<http://123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310|123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310>",
# optional: pass Docker run flags
"run_options": [
"--gpus=all", # if you need GPUs
"--ipc=host",
# "--volume=/host/path:/container/path"
],
"env_vars": {"EXTRA_FLAG": "1"},
}
}
},
)
)
def my_flow():
# Prefect tasks in this flow will run as Ray tasks inside the container image above
...
B. Choose the image per task (or per section) at runtime
- Prefect-Ray provides a remote_options context manager that maps to Ray’s per-task .options(...). This lets you pick different images per task or per run based on a flow parameter or environment variable.
from prefect import flow, task
from prefect_ray.task_runners import RayTaskRunner
from prefect_ray.context import remote_options
@task
def heavy_gpu_work(x): ...
@task
def cpu_only_work(y): ...
@flow(task_runner=RayTaskRunner(address="ray://<head-ip>:10001"))
def my_flow(image_for_gpu: str, image_for_cpu: str):
# GPU section
with remote_options(
runtime_env={
"container": {
"image": image_for_gpu,
"run_options": ["--gpus=all"],
}
},
num_gpus=1,
):
a = heavy_gpu_work.submit(1)
# CPU section
with remote_options(
runtime_env={"container": {"image": image_for_cpu}},
num_cpus=2,
):
b = cpu_only_work.submit(2)
return a, b
Notes and gotchas
- GPUs: pass --gpus=all (or equivalent) in run_options and ensure the host has NVIDIA drivers installed.
- Network/volumes: use run_options and/or volume mounts as needed; ensure the container can reach any services your code needs.
- Precedence:
- Per-task remote_options(runtime_env=...) overrides the default ray.init(runtime_env=...).
- If you don’t set either, tasks run in the base environment of the node.
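To make the precedence concrete, here is a small sketch combining both levels (image names and the head address are placeholders): the flow-level default comes from init_kwargs, and a per-task remote_options override wins for that task only.
```python
from prefect import flow, task
from prefect_ray.task_runners import RayTaskRunner
from prefect_ray.context import remote_options

DEFAULT_IMAGE = "ghcr.io/your-org/ray-runtime:py310"        # placeholder
SPECIAL_IMAGE = "ghcr.io/your-org/ray-runtime:py310-cu121"  # placeholder

@task
def regular_step(x): ...

@task
def special_step(x): ...

@flow(task_runner=RayTaskRunner(
    address="ray://<head-ip>:10001",
    init_kwargs={"runtime_env": {"container": {"image": DEFAULT_IMAGE}}},
))
def precedence_demo():
    a = regular_step.submit(1)  # uses DEFAULT_IMAGE (flow-level default)
    with remote_options(runtime_env={"container": {"image": SPECIAL_IMAGE}}):
        b = special_step.submit(2)  # per-task override wins for this task
    return a, b
```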
When to consider other options
- If you truly need to swap the base image of workers per flow, Option A (multiple clusters with different base images) or the Ray Jobs API (per-job runtime_env) may be operationally cleaner.
- Ray Jobs API docs: Ray Jobs API
Prefect-Ray docs
- Overview and examples: Prefect-Ray integration
If you confirm:
- Docker is installed on all EC2 nodes
- Your image registry and GPU requirements (if any)
…I can help tailor the exact runtime_env and run_options you’ll need.
Saad80
10/09/2025, 10:42 PM
Marvin
10/09/2025, 10:43 PM
1) Getting Docker available to Ray on every node
- The “podman: not found” error means Ray could not find a usable container runtime on the node. Install and enable Docker on every node, e.g. on Ubuntu/Debian:
sudo apt-get update
sudo apt-get install -y docker.io
sudo usermod -aG docker $USER
# re-login or: newgrp docker
docker info # should succeed without sudo
- Make sure the docker binary is on PATH for Ray’s processes. If Ray is launched by systemd or a service, ensure that environment has PATH to docker, or set it explicitly.
- If podman is installed and Docker is not, you’ll see that error. Either install Docker and make sure it’s preferred on PATH, or remove podman.
- Private registries (ECR/GCR): configure node-level auth so containers can pull images non-interactively (e.g., Docker credential helpers or a periodic ECR login).
Ray docs on containerized tasks/actors (Option C):
- Running tasks and actors in containers: Ray: Running tasks and actors in containers
2) Option C in depth with Prefect + Ray
- What it does: Your Prefect flow (the Ray “driver”) connects to your existing cluster (ray://…). Ray then executes tasks/actors inside Docker containers on each node.
- This does NOT change the Ray worker processes’ base image or your AMIs; it just runs your user code in containers on the nodes.
- Requirements:
- Docker installed and accessible on all nodes
- Nodes can pull your image (auth if private)
- If you need GPUs in the container, you’ll need NVIDIA drivers on the host and pass GPU flags
A) Set a default container for the whole flow’s Ray session
- Pass runtime_env via RayTaskRunner.init_kwargs:
from prefect import flow
from prefect_ray.task_runners import RayTaskRunner
@flow(
task_runner=RayTaskRunner(
address="ray://<head-ip>:10001",
init_kwargs={
"runtime_env": {
"container": {
"image": "<http://123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310|123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310>",
# optional: docker run flags
"run_options": [
"--gpus=all", # if you need GPUs
"--ipc=host",
# "--volume=/host/path:/container/path"
],
# pass env to the container
"env_vars": {"EXTRA_FLAG": "1"},
}
}
},
)
)
def my_flow():
...
B) Choose different images per task/section
- Use Prefect-Ray’s remote_options context to override per task/actor:
from prefect import flow, task
from prefect_ray.task_runners import RayTaskRunner
from prefect_ray.context import remote_options
@task
def heavy_gpu_work(x): ...
@task
def cpu_only_work(y): ...
@flow(task_runner=RayTaskRunner(address="ray://<head-ip>:10001"))
def my_flow(image_for_gpu: str, image_for_cpu: str):
# GPU section
with remote_options(
runtime_env={"container": {"image": image_for_gpu, "run_options": ["--gpus=all"]}},
num_gpus=1,
):
a = heavy_gpu_work.submit(1)
# CPU section
with remote_options(
runtime_env={"container": {"image": image_for_cpu}},
num_cpus=2,
):
b = cpu_only_work.submit(2)
return a, b
A reminder about ray.init():
- ray.init() with an address (e.g., ray://...) connects to your existing cluster; it does not create a new one. Without an address, it starts a local Ray on the current machine.
- Prefect’s RayTaskRunner calls ray.init(...) under the hood. So with your existing cluster, you’ll pass address="ray://<head-ip>:10001" to attach to it.
- Docs:
- Ray Client (ray://): Ray Client
- ray.init reference: ray.init API
Operational checklist for Docker runtime_env
- On each EC2 node:
- Docker installed and running (docker ps works without sudo)
- The Ray process user belongs to docker group
- PATH includes docker for Ray launch environment
- Registry auth configured (if private images)
- If using GPUs:
- Host has NVIDIA drivers
- Use run_options like --gpus=all
- Test quickly:
- On a node, manually docker pull <your-image>
- In a simple Python shell connected via ray.init(address="ray://..."), run a trivial task with a runtime_env={"container": {"image": "<your-image>"}}
If you can share your AMI/OS and how you launch Ray (ray up YAML, systemd, etc.), I can provide exact install and PATH steps to make Docker available to Ray on those nodes.
Saad80
10/09/2025, 10:53 PM
Marvin
10/09/2025, 10:54 PM
PATH basics
- PATH is the list of directories a process searches to find executables like docker. Check what your current shell sees:
echo $PATH
which docker
- If which docker prints nothing, the current shell cannot find Docker on PATH.
- Processes started by services (like systemd) may have a different PATH than your interactive shell. That’s why something that works in your SSH session can fail for a service.
Your current setup and the two common gotchas
You’re doing:
sudo yum install -y docker
sudo systemctl enable docker
sudo systemctl start docker
sudo usermod -aG docker ec2-user
Two important follow-ups:
1) Group membership refresh: Adding ec2-user to the docker group does not take effect until you start a new login session. Without that, the user (and any processes you start) won’t have permission to talk to Docker. Fix options:
- Log out and log back in, or
- Run newgrp docker in the provisioning script before starting Ray, or
- Start Ray only after a fresh login where groups shows docker.
2) Ensure Docker is discoverable for the Ray processes:
- When you start Ray (head and workers), do it from a shell/session where which docker works.
- If you use a service/daemon to start Ray, make sure that service’s environment includes /usr/bin (where Docker typically lives on Amazon Linux) so docker is on PATH for that service.
- If Podman is installed and Docker is not visible on PATH to the Ray process, Ray may attempt Podman and you’ll see “podman not found”.
I couldn’t find an official Ray setting to “force Docker over Podman” — Ray detects the container runtime by checking what’s available on PATH. The practical fix is to ensure Docker is installed, on PATH, and accessible; or remove Podman so there’s no ambiguity.
Simple Python shell test you can run on an EC2 node
Pick any node (head is easiest). This verifies:
- The process can see Docker on PATH
- You can connect to your cluster
- A task runs in your specified container image
1) Sanity checks on the node
which docker
docker ps
groups
Expect docker found, docker ps works without sudo, and docker in the group list. If not, log out/in or newgrp docker and try again.
2) Test from Python
Replace HEAD_ADDR with your Ray head’s address and IMAGE with your image.
```
python - << 'PY'
import os, shutil, ray, subprocess, json
HEAD_ADDR = "ray//<head ip>10001"
IMAGE = "123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310"
print("Driver PATH:", os.environ.get("PATH"))
print("Driver which(docker):", shutil.which("docker"))
# Connect to your existing cluster (does not start a new one)
ray.init(address=HEAD_ADDR)
@ray.remote
def info():
import os, subprocess, shutil
return {
"PATH": os.environ.get("PATH"),
"which_docker": shutil.which("docker"),
"whoami": subprocess.getoutput("whoami"),
"python": subprocess.getoutput("python -V 2>&1"),
"os_release": subprocess.getoutput("cat /etc/os-release || true"),
"has_nvidia_smi": subprocess.call("nvidia-smi >/dev/null 2>&1", shell=True) == 0,
}
print("\n--- Without container ---")
print(json.dumps(ray.get(info.remote()), indent=2))
print("\n--- With container runtime_env ---")
task = info.options(
runtime_env={
"container": {
"image": IMAGE,
# Uncomment if you need GPUs or other flags
# "run_options": ["--gpus=all", "--ipc=host"],
# "env_vars": {"EXAMPLE": "1"},
}
}
).remote()
print(json.dumps(ray.get(task), indent=2))
PY
```
What to look for:
- The “Without container” and “With container” sections should differ in `os_release` (confirming your container image ran).
- If you get errors like “podman not found” or “docker: permission denied”:
- Make sure `which docker` returns a path in both the driver and inside the non-container task.
- Re-check group membership took effect (new login/newgrp).
- Ensure Docker is installed on every worker node, not just the head.
- Try pre-pulling the image on nodes (`docker pull <IMAGE>`) to rule out registry/auth issues.
Using this with Prefect + Ray
Once the node-level test works, apply the same runtime_env via Prefect-Ray:
- Default container for all tasks in a flow:
from prefect import flow
from prefect_ray.task_runners import RayTaskRunner
@flow(
task_runner=RayTaskRunner(
address="ray//<head ip>10001",
init_kwargs={
"runtime_env": {
"container": {
"image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310",
# "run_options": ["--gpus=all"]
}
}
},
)
)
def my_flow():
...
- Different images per section/task:
from prefect import flow, task
from prefect_ray.task_runners import RayTaskRunner
from prefect_ray.context import remote_options
@task
def work(x): ...
@flow(task_runner=RayTaskRunner(address="ray//<head ip>10001"))
def my_flow(image1: str, image2: str):
with remote_options(runtime_env={"container": {"image": image1}}):
a = work.submit(1)
with remote_options(runtime_env={"container": {"image": image2}}):
b = work.submit(2)
return a, b
Relevant docs
- Running Ray tasks/actors in containers: Ray docs: containerized tasks/actors
- Ray Client (ray://) overview: Ray Client
- ray.init API: ray.init reference
- Prefect-Ray integration: Prefect-Ray
If you share how you start Ray on EC2 (manual, autoscaler, or a service), I can suggest the best place to refresh group membership and guarantee PATH visibility so Docker is always found.
Saad80
10/09/2025, 11:02 PM
Marvin
10/09/2025, 11:03 PM
- PATH is just the list of directories a process searches for executables; if docker isn’t in any of those directories, the process won’t find it.
- Your interactive SSH session and the Ray worker processes may have different PATHs and group memberships. That’s why “docker works for me in SSH” can still fail for Ray tasks.
Step-by-step on the head node
0) Attach to the head node
- From your laptop:
ray attach ray-cluster-config-docker.yaml
This opens an SSH session to the head.
1) Verify Docker is usable without sudo
which docker
docker ps
id -nG
- If docker ps gives “permission denied” and you just added the user to the docker group, refresh group membership in this session:
newgrp docker
docker ps
- If which docker returns nothing, Docker isn’t on PATH in this shell. On Amazon Linux, Docker usually lives in /usr/bin. You can try:
export PATH="/usr/bin:$PATH"
which docker
Tip: Ray worker processes inherit env from the shell that started Ray. If Docker wasn’t on PATH or group membership wasn’t ready when Ray started, you may need to restart Ray after fixing it (see step 6).
2) Sanity check image pull (optional but recommended)
Replace IMAGE with your ECR image.
docker pull 123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310
If this fails with auth, ensure your instances have an IAM role with ECR read and that your ECR credential helper is configured.
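If you would rather script the registry login than set up a credential helper, here is a rough sketch using boto3; the region is a placeholder and it assumes the instance role can call ecr:GetAuthorizationToken and that docker is on PATH.
```python
import base64
import subprocess

import boto3

def ecr_docker_login(region: str = "us-east-1") -> None:
    """One-off docker login against ECR using the instance role's credentials."""
    ecr = boto3.client("ecr", region_name=region)
    auth = ecr.get_authorization_token()["authorizationData"][0]
    # The token decodes to "AWS:<password>"
    user, password = base64.b64decode(auth["authorizationToken"]).decode().split(":", 1)
    registry = auth["proxyEndpoint"]  # e.g. https://<account>.dkr.ecr.<region>.amazonaws.com
    subprocess.run(
        ["docker", "login", "--username", user, "--password-stdin", registry],
        input=password.encode(),
        check=True,
    )

# ecr_docker_login()  # then retry: docker pull <account>.dkr.ecr.<region>.amazonaws.com/your-image:py310
```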
3) Verify Ray is up and you can connect locally
ray status
python -c "import ray; print(ray.__version__)"
If Python can’t import Ray, use python3 or activate the env that Ray installed into (depends on your cluster YAML’s setup_commands). Often python already works.
4) Run the minimal Python test that compares “no container” vs “container” execution
Replace IMAGE with yours. This connects to the existing cluster (does not start a new one) and runs a tiny task twice: once normally, once inside your container.
python - << 'PY'
import os, shutil, ray, subprocess, json
IMAGE = "123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310"
print("Driver PATH:", os.environ.get("PATH"))
print("Driver which(docker):", shutil.which("docker"))
# Connect to the local running cluster on the head
ray.init(address="auto")
@ray.remote
def info():
import os, subprocess, shutil
return {
"PATH": os.environ.get("PATH"),
"which_docker": shutil.which("docker"),
"whoami": subprocess.getoutput("whoami"),
"python": subprocess.getoutput("python -V 2>&1"),
"os_release": subprocess.getoutput("cat /etc/os-release || true"),
"has_nvidia_smi": subprocess.call("nvidia-smi >/dev/null 2>&1", shell=True) == 0,
}
print("\n--- Without container ---")
print(json.dumps(ray.get(info.remote()), indent=2))
print("\n--- With container runtime_env ---")
task = info.options(
runtime_env={
"container": {
"image": IMAGE,
# Uncomment/add if needed:
# "run_options": ["--gpus=all", "--ipc=host"],
# "env_vars": {"EXAMPLE": "1"},
}
}
).remote()
print(json.dumps(ray.get(task), indent=2))
PY
What success looks like
- The “With container runtime_env” output should show a different os_release that matches your container base. That proves tasks ran inside your image.
- which_docker in the “Without container” section should be non-null. If it’s null, the Ray worker environment doesn’t see Docker on PATH.
Common fixes if you still see “podman not found” or “docker not found/permission denied”
- Docker not found: ensure /usr/bin is on PATH and Docker is installed on ALL nodes (head + workers).
- Permission denied: make sure the user that runs Ray is in the docker group; after adding, start a new session (or use newgrp docker) and restart Ray so worker processes pick up the change.
- You may need to restart the cluster processes after fixing group/PATH:
- From the head: ray stop then re-run ray up ray-cluster-config-docker.yaml
- Or just re-run ray up from your laptop to redeploy/start Ray.
- Private ECR: make sure instances have IAM role with ECR pull perms, or that your ECR credential helper is configured under the same user that runs Ray.
Using this in Prefect after the test works
- Set the runtime_env globally for the flow:
from prefect import flow
from prefect_ray.task_runners import RayTaskRunner
@flow(task_runner=RayTaskRunner(
address="ray://<head-ip>:10001", # or address="auto" if running on head
init_kwargs={
"runtime_env": {
"container": {
"image": "<http://123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310|123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310>",
# "run_options": ["--gpus=all"]
}
}
}
))
def my_flow():
...
- Or choose different images per task section using `remote_options`:
from prefect_ray.context import remote_options
with remote_options(runtime_env={"container": {"image": "your-other-image"}}):
...
Official docs for what we’re using
- Running Ray tasks/actors in containers (runtime_env): Ray: containerized tasks/actors
- Ray Client and addresses: Ray Client
- ray.init reference: ray.init API
- Prefect-Ray integration: Prefect-Ray
If you hit an error during the test, paste the outputs for:
- which docker, docker ps, id -nG
- The JSON from both “Without container” and “With container” sections
…and I’ll help pinpoint whether it’s PATH, group membership, image pull, or something else.
Saad80
10/09/2025, 11:18 PM
ray status to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +10s) Adding 1 node(s) of type ray.worker.default.
{
"PATH": "/home/ec2-user/.pyenv/versions/3.11.10/bin/home/ec2 user/.pyenv/libexec/home/ec2-user/.pyenv/plugins/python-build/bin/home/ec2 user/.pyenv/plugins/pyenv virtualenv/bin/home/ec2-user/.pyenv/plugins/pyenv-update/bin/home/ec2 user/.pyenv/plugins/pyenv doctor/bin/home/ec2-user/.pyenv/shims/home/ec2 user/.pyenv/bin/usr/local/bin/usr/bin/usr/local/sbin/usr/sbin/home/ec2-user/.local/bin:/home/ec2-user/bin",
"which_docker": "/usr/bin/docker",
"whoami": "ec2-user",
"python": "Python 3.11.10",
"os_release": "NAME=\"Amazon Linux\"\nVERSION=\"2\"\nID=\"amzn\"\nID_LIKE=\"centos rhel fedora\"\nVERSION_ID=\"2\"\nPRETTY_NAME=\"Amazon Linux 2\"\nANSI_COLOR=\"0;33\"\nCPE_NAME=\"cpe2.3oamazonamazon_linux:2\"\nHOME_URL=\"https://amazonlinux.com/\"\nSUPPORT_END=\"2026-06-30\"",
"has_nvidia_smi": false
}
--- With container runtime_env ---
(raylet, ip=172.31.25.199) bash: line 0: exec: podman: not found
(autoscaler +40s) Resized to 2 CPUs.
(raylet, ip=172.31.25.199) [2025-10-09 23:17:59,241 E 5840 5840] (raylet) worker_pool.cc:553: Some workers of the worker process(5967) have not registered within the timeout. The process is dead, probably it crashed during start.
(raylet, ip=172.31.25.199) bash: line 0: exec: podman: not found
Marvin
10/09/2025, 11:20 PM
The head looks fine — the failure is coming from the worker at 172.31.25.199, which can’t find a container runtime. Check that worker directly:
1) SSH into the worker
ssh -i /path/to/your/key.pem ec2-user@172.31.25.199
2) Check Docker availability and permissions
which docker
docker ps
id -nG
- You want:
- which docker -> /usr/bin/docker (or similar)
- docker ps works without sudo
- your user (e.g., ec2-user) is in the docker group
3) If Docker is missing or needs setup on the worker
- Install and enable:
sudo yum install -y docker
sudo systemctl enable --now docker
sudo usermod -aG docker ec2-user
# Open a fresh login OR run:
newgrp docker
docker ps
- If you’re pulling from ECR, make sure the instance has an IAM role with ECR pull perms and that your credential helper is configured for the same user that runs Ray.
4) Make it automatic for all future workers
- Update your Ray cluster YAML so workers get Docker too, not just the head. Use setup_commands (applies to head+workers) or worker_setup_commands. Example (simplified):
# ray-cluster-config-docker.yaml (excerpt)
auth:
ssh_user: ec2-user
setup_commands:
- sudo yum install -y docker
- sudo systemctl enable --now docker
- sudo usermod -aG docker ec2-user
# ensure current shell has docker group for subsequent commands
- newgrp docker || true
- docker --version
- Alternatively, bake Docker + group membership into the AMI used for BOTH head and workers, and reference that AMI in available_node_types for workers.
5) Restart Ray to ensure new nodes inherit the correct env/groups
- After adjusting your setup (or AMI), restart the cluster so new workers are provisioned with Docker:
- Simplest: scale down to zero workers, then back up; or tear down and ray up again against the updated YAML.
Re-run the quick Python test on the head
- This time, force it to use a worker (by letting the autoscaler bring one up) and confirm the containerized task works.
For convenience, here’s the same test block you can paste on the head (replace IMAGE):
```
python - << 'PY'
import os, shutil, ray, subprocess, json
IMAGE = "123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310"
print("Driver PATH:", os.environ.get("PATH"))
print("Driver which(docker):", shutil.which("docker"))
ray.init(address="auto")
@ray.remote
def info():
import os, subprocess, shutil
return {
"PATH": os.environ.get("PATH"),
"which_docker": shutil.which("docker"),
"whoami": subprocess.getoutput("whoami"),
"python": subprocess.getoutput("python -V 2>&1"),
"os_release": subprocess.getoutput("cat /etc/os-release || true"),
"has_nvidia_smi": subprocess.call("nvidia-smi >/dev/null 2>&1", shell=True) == 0,
}
print("\n--- Without container ---")
print(json.dumps(ray.get(info.remote()), indent=2))
print("\n--- With container runtime_env ---")
task = info.options(
runtime_env={
"container": {
"image": IMAGE,
# "run_options": ["--gpus=all", "--ipc=host"], # if needed
}
}
).remote()
print(json.dumps(ray.get(task), indent=2))
PY
```
Once this test passes on the worker, use the same runtime_env in Prefect via RayTaskRunner(init_kwargs={...}).
If you share your cluster YAML (sanitized) I can point to the exact section to add the worker Docker setup so the autoscaler always brings up container-ready nodes.
Saad80
10/09/2025, 11:27 PM
Marvin
10/09/2025, 11:29 PM
One likely cause: Ray itself is running inside a Docker container on your nodes (i.e., your cluster YAML has a docker: section), in which case the raylet can’t see the host’s docker. To check, on a node:
- docker ps | grep -i ray (you should see a container for the ray head/worker)
- Or: ps -ef | grep raylet then sudo nsenter -t <raylet-pid> -m -u -i -n env | grep PATH (advanced)
- Also, jump into the Ray container and check docker availability:
- docker exec -it <ray-container-name> bash -lc 'which docker || echo no-docker; ls -l /var/run/docker.sock || echo no-sock'
Why your Python test fails
- The containerized tasks feature (runtime_env["container"]) shells out to a container runtime from the raylet. Since raylet is inside a container without docker, it can’t call docker on the host unless you:
- install a docker CLI inside the Ray container, and
- mount the host’s Docker socket into the Ray container, and
- ensure the user in the Ray container can access the socket.
Two ways to fix it
Option 1 — Run Ray directly on the host (simplest)
- Remove the docker: section from your cluster YAML so Ray doesn’t run inside a container.
- Recreate/scale your cluster.
- Your existing runtime_env container test should start working immediately because raylet will now see /usr/bin/docker on the host.
Option 2 — Keep Ray in Docker, but give it access to Docker
- Update the cluster YAML’s docker section to:
- Use a Ray image that has the docker CLI installed, or extend it yourself:
- Example Dockerfile:
FROM rayproject/ray:2.48.0-py310
RUN apt-get update && apt-get install -y docker.io && rm -rf /var/lib/apt/lists/*
- Mount the Docker socket and any needed privileges:
- In your YAML docker.run_options, add:
- --volume=/var/run/docker.sock:/var/run/docker.sock
- --ipc=host
- Optional: --privileged (or granular capabilities if needed)
- Ensure the container user can talk to the socket:
- The Docker socket is typically owned by group “docker” on the host. You can:
- Run the Ray container as root (quick and dirty), or
- Create a matching group inside the container with the host’s docker group GID and add the container user to it.
- After these changes, redeploy/scale your cluster so new Ray containers have docker and the socket.
After applying either option, re-run the Python test on the head
- Same test you used; it should now run the “With container runtime_env” section successfully and you should see a different os_release (from your image).
If you want to verify the raylet’s environment right now
- On the worker:
- Find the raylet PID: pgrep -f raylet
- Inspect its env PATH: sudo tr '\0' '\n' </proc/$(pgrep -f raylet | head -n1)/environ | grep PATH
- If Ray is running in Docker, that PATH will be the container’s PATH and won’t include docker unless you install it inside that container.
Docs for what we’re using
- Running tasks and actors in containers (runtime_env): Ray: containerized tasks/actors
- Ray Client (connecting to an existing cluster): Ray Client
- ray.init reference: ray.init API
- Prefect-Ray integration: Prefect-Ray integration
Saad80
10/09/2025, 11:31 PM
Marvin
10/09/2025, 11:33 PM
A) Check what the raylet process on the worker actually sees
# 1) Confirm raylet PID (on the worker)
pgrep -f raylet
# 2) Show the PATH visible to the raylet process (replace <PID> with output above)
sudo tr '\0' '\n' </proc/<PID>/environ | grep ^PATH=
# 3) Sanity check your current shell
echo $PATH
which docker
id -nG
# 4) Check runtime env agent logs for clues
ls -1 /tmp/ray/session_*/logs | grep runtime_env
grep -i -E 'podman|docker' /tmp/ray/session_*/logs/runtime_env* 2>/dev/null | tail -n 100
If PATH for the raylet process does not include /usr/bin or which docker is None for the raylet, that explains why it can’t find docker at container launch time even though you can find it interactively.
B) Pin a simple “info” task to the worker (no container) to see what that worker reports
Run this from the head (or anywhere that can connect), but force the task onto 172.31.25.199:
python - << 'PY'
import ray, json, os, shutil
ray.init(address="auto")
# Make an info task
@ray.remote
def info():
import os, shutil, subprocess
return {
"PATH": os.environ.get("PATH"),
"which_docker": shutil.which("docker"),
"whoami": subprocess.getoutput("whoami"),
}
# Force the task to run on the target worker via node resource
target_ip = "172.31.25.199"
res = ray.get(info.options(resources={f"node:{target_ip}": 0.001}).remote())
print(json.dumps(res, indent=2))
PY
If which_docker comes back null here, the environment that Ray gives to workers on that node cannot see docker.
C) How to fix (common causes + remedies)
- PATH not present for Ray processes:
- The Ray autoscaler starts ray processes with a different environment than your interactive shell. Ensure /usr/bin is on PATH in the environment used to start ray. You can enforce PATH in your cluster YAML start commands:
- Example (add to worker_start_ray_commands and head_start_ray_commands):
head_start_ray_commands:
- 'export PATH=/usr/bin:$PATH; ray stop'
- 'export PATH=/usr/bin:$PATH; ray start --head --dashboard-host=0.0.0.0 --port=6379'
worker_start_ray_commands:
- 'export PATH=/usr/bin:$PATH; ray stop'
- 'export PATH=/usr/bin:$PATH; ray start --address=$RAY_HEAD_IP:6379'
Adjust flags to match your config. Re-run ray up after editing.
- Group membership timing:
- If you add ec2-user to the docker group in setup_commands, the membership won’t apply to already-running processes. Make sure Docker install + usermod happen before Ray starts, and start Ray only in a fresh session where groups shows docker. A simple way:
- Do the docker install/usermod in setup_commands
- In start_ray_commands, do not restart usermod; just start Ray. If needed, force a login shell or use newgrp docker before ray start.
- Confirm docker and ECR access for the raylet:
- After you fix PATH, try the same Python “with container” test again. If you hit auth issues, pre-pull the image on the worker and/or ensure the instance profile gives ECR pull permissions.
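To pre-pull the image on every node without SSHing into each one, one option is a small Ray task pinned to each node via its node:<ip> resource. A sketch only: it shells out to docker, so it assumes docker is on PATH and usable for the Ray workers; the image name is the one from your test.
```python
import subprocess

import ray

IMAGE = "971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231"

@ray.remote(num_cpus=0)
def pull_image(image: str) -> str:
    # Runs `docker pull` on whichever node this task is pinned to.
    proc = subprocess.run(["docker", "pull", image], capture_output=True, text=True)
    return proc.stdout + proc.stderr

ray.init(address="auto")
futures = []
for node in ray.nodes():
    if not node["Alive"]:
        continue
    ip = node["NodeManagerAddress"]
    # Pin one pull task to each alive node using its node:<ip> resource.
    futures.append(pull_image.options(resources={f"node:{ip}": 0.001}).remote(IMAGE))
for out in ray.get(futures):
    print(out)
```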
D) Re-run the container test (on the head)
Same script you used before; it should now succeed:
```
python - << 'PY'
import os, shutil, ray, subprocess, json
IMAGE = "971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231"
print("Driver PATH:", os.environ.get("PATH"))
print("Driver which(docker):", shutil.which("docker"))
ray.init(address="auto")Marvin
10/09/2025, 11:33 PME) Why “podman: not found” despite docker being installed
- Ray’s containerized tasks feature shells out from the raylet/runtime-env agent. If the raylet’s environment cannot find docker (binary not on PATH at the time the process was started), Ray may try podman. Your interactive shell is fine, but the raylet process environment may be missing PATH entries.
- Ensuring PATH (and group membership) for the process that starts Ray fixes this.
Once this test works, you can use the same runtime_env in Prefect-Ray’s RayTaskRunner:
- Global for the flow:
@flow(task_runner=RayTaskRunner(
address="ray//<head ip>10001",
init_kwargs={
"runtime_env": {"container": {"image": "971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231"}}
},
))
def my_flow():
...
- Or per section with remote_options.
Docs
- Ray: Running tasks/actors in containers: Ray docs: containerized tasks/actors
- Ray Client (ray://): Ray Client
- ray.init reference: ray.init API
- Prefect-Ray: Prefect-Ray integration
If you can paste:
- The PATH seen by the worker’s raylet (proc environ), and
- The output of the pinned info task on 172.31.25.199,
…I can advise exactly which YAML changes to make so every autoscaled worker reliably sees docker.