# ask-marvin
s
@Marvin I need to get my ray cluster to pick up what docker container to use depending on what I define in my flow. How do I do that?
m
thought for 272.3 seconds
Great question — there are two "images" at play when you combine Prefect and Ray. Which one do you want to switch per flow?

1) The container image that runs your Prefect flow run (i.e., what your Prefect worker launches)
- You can set this per-deployment or even per-run. This is usually what you want if your flow container needs different Python/system deps before it connects to a Ray cluster.
- Options:
  - At deploy time: set the image on the deployment.
  - Per run: override with `job_variables`.

Example: override the image at trigger time
from prefect.deployments import run_deployment

run_deployment(
    name="my-flow/my-deployment",
    parameters={"some_param": "value"},
    job_variables={"image": "<http://ghcr.io/your-org/prefect-ray:py310-cu121|ghcr.io/your-org/prefect-ray:py310-cu121>"},
)
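A hedged sketch (not from the thread) tying this to the original question: choosing the image from something you define in your own flow code. The image map and deployment name are placeholders.

from prefect import flow
from prefect.deployments import run_deployment

# Hypothetical mapping from a flow-level choice to a container image
IMAGES = {
    "cpu": "ghcr.io/your-org/prefect-ray:py310",
    "gpu": "ghcr.io/your-org/prefect-ray:py310-cu121",
}

@flow
def orchestrator(profile: str = "cpu"):
    # Kick off the deployed flow with whichever image matches this run's profile
    run_deployment(
        name="my-flow/my-deployment",
        parameters={"some_param": "value"},
        job_variables={"image": IMAGES[profile]},
    )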
If you deploy in Python (Prefect 3.x), you can also set the image when creating the deployment:
# sketch
my_flow.from_source(
    source="<https://github.com/your-org/your-repo>",
    entrypoint="flows/ray_flow.py:my_flow",
).deploy(
    name="ray-flow",
    work_pool_name="kubernetes-pool",  # or docker, ecs, etc.
    image="<http://ghcr.io/your-org/prefect-ray:py310|ghcr.io/your-org/prefect-ray:py310>",
)
Docs: Customize job variables

2) The container image used by the Ray cluster to execute Ray tasks/actors
- Prefect cannot change this per flow run. The Ray cluster's worker pods/processes define their image (e.g., in KubeRay via the RayCluster spec). The Prefect flow can connect to that cluster, but image selection for Ray workers is a Ray/Kubernetes concern.
- If you need different images for different workloads on Ray, typical approaches are:
  - Run multiple Ray clusters (each with its own image) and point your flow's `RayTaskRunner(address=...)` to the desired cluster.
  - In Kubernetes with KubeRay, define multiple worker groups with different images and route tasks using Ray resource tags/placement (a Ray-specific pattern; see the sketch after the example below).
  - If you're using Ray Jobs and your environment supports it, submit jobs with a runtime environment that specifies the image (a Ray feature, outside of Prefect-Ray's task runner).

Minimal Prefect + Ray example (Prefect controls the flow image; Ray runs tasks on its own cluster):
from prefect import flow
from prefect_ray.task_runners import RayTaskRunner

@flow(task_runner=RayTaskRunner(address="<ray://your-ray-head:10001>"))
def my_flow():
    # your tasks here; they will run on the Ray cluster
    ...

# Deploy this flow with different images per deployment or override per run via job_variables
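A hedged sketch of the worker-group routing idea mentioned above, assuming the Ray worker groups advertise a custom resource (the name "gpu_image_group" here is made up) in their Ray start parameters; `remote_options` passes these through to Ray's per-task `.options(...)`:

from prefect import flow, task
from prefect_ray.context import remote_options
from prefect_ray.task_runners import RayTaskRunner

@task
def gpu_step(x):
    ...

@flow(task_runner=RayTaskRunner(address="ray://your-ray-head:10001"))
def routed_flow():
    # Only nodes that advertise the custom resource "gpu_image_group" (an assumption
    # about how the cluster/worker group was configured) are eligible for this task.
    with remote_options(resources={"gpu_image_group": 1}):
        gpu_step.submit(1)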
Prefect-Ray docs: Prefect-Ray integration

To make sure I point you to the exact setup:
- Where is your Ray cluster running (Kubernetes via KubeRay, VMs, etc.)?
- Which work pool type are you using for Prefect (Docker, Kubernetes, ECS)?
- Do you want to dynamically choose the Prefect flow's container image, or the Ray cluster's worker image?
s
@Marvin what about using # init_kwargs={"runtime_env": {"image_uri": get_ray_image_uri()}} # for ray 2.48.0 # init_kwargs={"runtime_env": {"container": {"image": get_ray_image_uri()}}}?
m
thought for 114.2 seconds
Short answer: Passing `init_kwargs` to `RayTaskRunner` works, but it won't change the image of an existing Ray cluster's workers. Whether your snippet does what you want depends on how you're using Ray.

- If you're connecting to an existing Ray cluster (e.g., `address="ray://..."` via KubeRay): neither
  - `init_kwargs={"runtime_env": {"image_uri": ...}}` nor
  - `init_kwargs={"runtime_env": {"container": {"image": ...}}}`
  will switch the Kubernetes pod image. The worker image is fixed by the RayCluster spec (KubeRay) or how the cluster was started. You'll need to:
  - Run separate Ray clusters with different images and choose the cluster per flow (via `RayTaskRunner(address=...)`), or
  - Define multiple worker groups in the RayCluster with different images and route tasks using Ray resource tags/placement.
- If you're starting Ray locally (no `address`) and want Ray to run your tasks inside a container on the same host: `init_kwargs={"runtime_env": {"container": {"image": "your-image"}}}` is the Ray runtime env containerization form. This requires Docker on the host and uses Ray's container runtime env plugin. It does not affect remote KubeRay clusters.
- If you're using Ray Jobs (job submission API), Ray supports specifying an image at the job level via the job's runtime env (commonly `image_uri`). Prefect-Ray's `RayTaskRunner` does not submit Ray Jobs; it calls `ray.init(...)`, so job-level image overrides don't apply here.

How to pass init_kwargs with Prefect-Ray:
from prefect import flow
from prefect_ray.task_runners import RayTaskRunner

# Connect to an existing cluster (won’t change its pod image)
@flow(task_runner=RayTaskRunner(address="<ray://ray-head:10001>",
                                init_kwargs={"runtime_env": {"env_vars": {"FOO": "bar"}}}))
def my_flow():
    ...

# Start a local Ray runtime and run tasks in a Docker image (host must have Docker)
@flow(task_runner=RayTaskRunner(init_kwargs={
    "runtime_env": {"container": {"image": "<http://ghcr.io/your-org/ray-runtime:py310|ghcr.io/your-org/ray-runtime:py310>"}}
}))
def local_containerized_flow():
    ...
If your goal is to pick the container per flow run:
- For the Prefect flow runtime container: override the deployment's image per run using `job_variables`:
from prefect.deployments import run_deployment

run_deployment(
    name="my-flow/my-deployment",
    job_variables={"image": "<http://ghcr.io/your-org/prefect-flow:py310-cu121|ghcr.io/your-org/prefect-flow:py310-cu121>"},
)
Docs: Customize job variables
- For the Ray worker image: configure it in Ray/KubeRay (e.g., RayCluster spec or separate clusters) and select which cluster to use in your flow via `RayTaskRunner(address=...)`.

If you can share:
- Are you using KubeRay?
- Do you want to switch the Prefect flow's container image, the Ray cluster's worker image, or both?
…I can suggest the exact config and example YAML/Python.
s
@Marvin • using plain ray cluster deployed via ec2 not kuberay. Should I be using KubeRay? • I want to switch the ray cluster worker image depending on the flow
m
thought for 54.9 seconds
Thanks — this helps a lot.

Directly answering your snippets:
- `init_kwargs={"runtime_env": {"image_uri": ...}}` — this is for Ray Jobs (job submission API). It sets the container image for the job's runtime env when you submit a job to the Ray Job server.
- `init_kwargs={"runtime_env": {"container": {"image": ...}}}` — this is the Ray runtime env "container" option for `ray.init(...)`. It can run tasks/actors in that container image if Docker is available on every node and Ray's container runtime is supported in your setup.

Important limitation on plain EC2 clusters:
- Neither of those will "swap" the image of an already-running Ray cluster's worker processes started by the autoscaler. They only affect the runtime env for your job/tasks. If your cluster is launched with a fixed Docker image (via Ray cluster launcher YAML) or a specific AMI, that's the base image workers use.

Given your goals (switch the Ray worker image per flow) on EC2, here are your practical options:

Option A — Multiple Ray clusters (recommended on plain EC2)
- Stand up multiple Ray clusters, each using the desired worker image (via the Ray cluster launcher YAML's `docker` section or different AMIs).
- In your Prefect flow, point `RayTaskRunner(address=...)` to the cluster you want per flow/deployment. A simple pattern is to read the address from an env var and set it per deployment/job:
# flow.py
import os
from prefect import flow
from prefect_ray.task_runners import RayTaskRunner

RAY_ADDRESS = os.getenv("RAY_ADDRESS")  # e.g., ray://<head-ip>:10001

@flow(task_runner=RayTaskRunner(address=RAY_ADDRESS))
def my_flow():
    ...

# When deploying, set the env var per deployment/run
# e.g., via job_variables
# job_variables={"env": {"RAY_ADDRESS": "<ray://ec2-gpu-head:10001>"}}
- This lets each Prefect deployment (or run) select a different Ray cluster (and thus a different worker image).

Option B — Use the Ray Jobs API with per-job image
- Enable the Ray Job server on your EC2 cluster and submit jobs with `runtime_env={"image_uri": "<your-image>"}`. This sets the image for that job run.
- You can trigger jobs from a Prefect flow:
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://<head-node-ip>:8265")
job_id = client.submit_job(
    entrypoint="python -c 'import your_module; your_module.main()'",
    runtime_env={"image_uri": "<http://123456789012.dkr.ecr.us-east-1.amazonaws.com/ray:gpu|123456789012.dkr.ecr.us-east-1.amazonaws.com/ray:gpu>"},
    # env_vars / working_dir / pip, etc. as needed
)
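A hedged sketch (not from the thread) of wrapping that submission in a Prefect task that waits for completion. It relies on JobSubmissionClient.get_job_status and the JobStatus enum from Ray's job submission API; the dashboard URL and entrypoint are placeholders you would supply:

import time

from prefect import task
from ray.job_submission import JobStatus, JobSubmissionClient

@task
def run_ray_job(entrypoint: str, image_uri: str, dashboard_url: str) -> str:
    client = JobSubmissionClient(dashboard_url)  # e.g. "http://<head-node-ip>:8265"
    job_id = client.submit_job(
        entrypoint=entrypoint,
        runtime_env={"image_uri": image_uri},
    )
    # Poll until the job reaches a terminal state
    while (status := client.get_job_status(job_id)) not in {
        JobStatus.SUCCEEDED,
        JobStatus.FAILED,
        JobStatus.STOPPED,
    }:
        time.sleep(5)
    if status != JobStatus.SUCCEEDED:
        raise RuntimeError(f"Ray job {job_id} ended with status {status}")
    return job_id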
- Note: Prefect-Ray's `RayTaskRunner` does not submit Ray Jobs; it calls `ray.init(...)`. If you want per-job images without multiple clusters, Jobs is the cleanest route.

Option C — Ray runtime_env "container" for tasks/actors
- If all cluster nodes have Docker and Ray supports containerized workers in your setup, you can try:
@flow(task_runner=RayTaskRunner(
    address="ray://<head-ip>:10001",
    init_kwargs={"runtime_env": {"container": {"image": "your-image:tag"}}},
))
def my_flow():
    ...
- This can run your tasks/actors inside that image, but it does not change the base worker processes/image that the autoscaler brings up. Operationally, this is more fragile on plain EC2 than Options A/B.

Should you use KubeRay?
- Not required, but it makes this use case much easier at scale. You can:
  - Run multiple RayCluster CRs with different worker images and choose per flow via `address`, or
  - Define multiple worker groups in one cluster with different images and route tasks via Ray resource tags/placement.
- If you anticipate many variants or frequent changes, KubeRay is the smoother path.

Prefect bits you'll likely use:
- Per-run flow container image (not Ray workers) via job variables:
from prefect.deployments import run_deployment

run_deployment(
    name="my-flow/my-deployment",
    job_variables={"image": "<http://ghcr.io/your-org/prefect-flow:py310-cu121|ghcr.io/your-org/prefect-flow:py310-cu121>"},
)
- Docs:
  - Prefect-Ray integration: Prefect-Ray
  - Prefect job variables: Customize job variables
  - Ray Jobs API: Ray Jobs API
  - Ray runtime_env and containers: Ray runtime_env containers

If you can share how you start your EC2 cluster (Ray cluster launcher YAML? Docker section? AMIs?) and whether you can enable the Job server, I can propose a concrete path and example configs.
s
@Marvin I want to try Option C. Can you explain it more in depth and give some documentation/links to support what you are saying? Also, wym "`RayTaskRunner` does not submit Ray Jobs; it calls `ray.init(...)`."? I already have an existing ray cluster. Doesnt init mean it starts a new cluster or something?
m
thought for 73.8 seconds
Great questions — let me break down both parts.

1) What ray.init does (and what RayTaskRunner does with it)
- `ray.init()` behavior:
  - Without an address: starts a local Ray instance on the current machine.
  - With an address (e.g., `address="ray://<head-ip>:10001"`): connects to an existing cluster via Ray Client and does NOT start a new cluster.
- Prefect's `RayTaskRunner` calls `ray.init(...)` under the hood. If you provide an address, it attaches to your existing cluster; if not, it'll start a local Ray on the worker.
- Docs:
  - Ray Client/`ray://` overview: Ray Client
  - `ray.init` API: ray.init reference

2) Option C (run tasks/actors in a container image via runtime_env) — details and how to use it

What it does
- You can ask Ray to execute tasks/actors inside a Docker container on each node by setting a runtime environment with a container spec.
- This does NOT replace the Ray worker process image or the EC2 AMI; instead, Ray launches your user code in containers on the nodes.
- You can set:
  - A default container for your session/driver (via `ray.init(runtime_env=...)`).
  - Per-task/actor overrides (via `remote_options(runtime_env=...)`).
- Requirements:
  - Docker must be installed and usable on every Ray node.
  - Nodes must be able to pull your image (ECR/GCR auth, etc.).
  - Your image should include a compatible Python + dependencies for your code. In practice, include `ray` too.

Official Ray docs
- Runtime environments (containers): Running tasks and actors in containers

How to use with Prefect-Ray

A. Set a default container for all Prefect tasks executed by this flow run
- Pass `init_kwargs={"runtime_env": {"container": {"image": "...", ...}}}` to `RayTaskRunner`. This applies to all Ray tasks created by the driver (your Prefect flow).
from prefect import flow
from prefect_ray.task_runners import RayTaskRunner

@flow(
    task_runner=RayTaskRunner(
        address="ray://<head-ip>:10001",
        init_kwargs={
            "runtime_env": {
                "container": {
                    "image": "<http://123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310|123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310>",
                    # optional: pass Docker run flags
                    "run_options": [
                        "--gpus=all",           # if you need GPUs
                        "--ipc=host",
                        # "--volume=/host/path:/container/path"
                    ],
                    "env_vars": {"EXTRA_FLAG": "1"},
                }
            }
        },
    )
)
def my_flow():
    # Prefect tasks in this flow will run as Ray tasks inside the container image above
    ...
B. Choose the image per task (or per section) at runtime
- Prefect-Ray provides a `remote_options` context manager that maps to Ray's per-task `.options(...)`. This lets you pick different images per task or per run based on a flow parameter or environment variable.
from prefect import flow, task
from prefect_ray.task_runners import RayTaskRunner
from prefect_ray.context import remote_options

@task
def heavy_gpu_work(x): ...
@task
def cpu_only_work(y): ...

@flow(task_runner=RayTaskRunner(address="ray://<head-ip>:10001"))
def my_flow(image_for_gpu: str, image_for_cpu: str):
    # GPU section
    with remote_options(
        runtime_env={
            "container": {
                "image": image_for_gpu,
                "run_options": ["--gpus=all"],
            }
        },
        num_gpus=1,
    ):
        a = heavy_gpu_work.submit(1)

    # CPU section
    with remote_options(
        runtime_env={"container": {"image": image_for_cpu}},
        num_cpus=2,
    ):
        b = cpu_only_work.submit(2)

    return a, b
Notes and gotchas
- This does not change the underlying Ray worker processes or autoscaler node images; it runs your tasks in containers.
- All nodes need Docker access and credentials to pull the image (for private registries).
- If you need GPUs inside the container, add `--gpus=all` (or equivalent) to `run_options` and ensure the host has NVIDIA drivers installed.
- Network/volumes: use `run_options` and/or volume mounts as needed; ensure the container can reach any services your code needs.
- Precedence:
  - Per-task `remote_options(runtime_env=...)` overrides the default `ray.init(runtime_env=...)`.
  - If you don't set either, tasks run in the base environment of the node.

When to consider other options
- If you truly need to swap the base image of workers per flow, Option A (multiple clusters with different base images) or the Ray Jobs API (per-job `runtime_env`) may be operationally cleaner.
- Ray Jobs API docs: Ray Jobs API

Prefect-Ray docs
- Overview and examples: Prefect-Ray integration

If you confirm:
- Docker is installed on all EC2 nodes
- Your image registry and GPU requirements (if any)
…I can help tailor the exact `runtime_env` and `run_options` you'll need.
s
@Marvin Oh right I have tried option C, and guess what? It throws the error 'podman not found'. I want it to run via Docker. Not Podman. Also I am totally happy with this "- This does NOT replace the Ray worker process image or the EC2 AMI; instead, Ray launches your user code in containers on the nodes."
m
thought for 63.5 seconds
Great — thanks for confirming you're on plain EC2 and that Option C's behavior (containers for user code, not swapping the base AMI) is acceptable. Two parts below:
1) Why you saw "podman not found" and how to force Docker
2) How Option C works with Prefect + Ray, with examples and docs

1) "podman not found" and forcing Docker
- What's happening: Ray's runtime_env "container" support shells out to a container runtime on each node. If Docker isn't available to the Ray processes (or not on PATH), Ray may try podman and you'll get "podman not found".
- The fix is to ensure Docker is installed and available to the user running Ray on every node:
  - Install Docker Engine on every Ray node.
  - Add the Ray process user to the docker group, then re-login or restart Ray services. Ubuntu example:

sudo apt-get update
sudo apt-get install -y docker.io
sudo usermod -aG docker $USER
# re-login or: newgrp docker
docker info  # should succeed without sudo

- Make sure the docker binary is on PATH for Ray's processes. If Ray is launched by systemd or a service, ensure that environment has PATH to docker, or set it explicitly.
- If podman is installed and Docker is not, you'll see that error. Either install Docker and make sure it's preferred on PATH, or remove podman.
- Private registries (ECR/GCR): configure node-level auth so containers can pull images non-interactively (e.g., Docker credential helpers or a periodic ECR login).

Ray docs on containerized tasks/actors (Option C):
- Running tasks and actors in containers: Ray: Running tasks and actors in containers

2) Option C in depth with Prefect + Ray
- What it does: Your Prefect flow (the Ray "driver") connects to your existing cluster (ray://…). Ray then executes tasks/actors inside Docker containers on each node.
- This does NOT change the Ray worker processes' base image or your AMIs; it just containers your user code.
- Requirements:
  - Docker installed and accessible on all nodes
  - Nodes can pull your image (auth if private)
  - If you need GPUs in the container, you'll need NVIDIA drivers on the host and pass GPU flags

A) Set a default container for the whole flow's Ray session
- Pass runtime_env via RayTaskRunner.init_kwargs:
from prefect import flow
from prefect_ray.task_runners import RayTaskRunner

@flow(
    task_runner=RayTaskRunner(
        address="ray://<head-ip>:10001",
        init_kwargs={
            "runtime_env": {
                "container": {
                    "image": "<http://123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310|123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310>",
                    # optional: docker run flags
                    "run_options": [
                        "--gpus=all",     # if you need GPUs
                        "--ipc=host",
                        # "--volume=/host/path:/container/path"
                    ],
                    # pass env to the container
                    "env_vars": {"EXTRA_FLAG": "1"},
                }
            }
        },
    )
)
def my_flow():
    ...
B) Choose different images per task/section
- Use Prefect-Ray's remote_options context to override per task/actor:
from prefect import flow, task
from prefect_ray.task_runners import RayTaskRunner
from prefect_ray.context import remote_options

@task
def heavy_gpu_work(x): ...
@task
def cpu_only_work(y): ...

@flow(task_runner=RayTaskRunner(address="ray://<head-ip>:10001"))
def my_flow(image_for_gpu: str, image_for_cpu: str):
    # GPU section
    with remote_options(
        runtime_env={"container": {"image": image_for_gpu, "run_options": ["--gpus=all"]}},
        num_gpus=1,
    ):
        a = heavy_gpu_work.submit(1)

    # CPU section
    with remote_options(
        runtime_env={"container": {"image": image_for_cpu}},
        num_cpus=2,
    ):
        b = cpu_only_work.submit(2)

    return a, b
How ray.init fits in
- `ray.init()` with an address (e.g., `ray://...`) connects to your existing cluster; it does not create a new one. Without an address, it starts a local Ray on the current machine.
- Prefect's RayTaskRunner calls `ray.init(...)` under the hood. So with your existing cluster, you'll pass `address="ray://<head-ip>:10001"` to attach to it.
- Docs:
  - Ray Client (ray://): Ray Client
  - ray.init reference: ray.init API

Operational checklist for Docker runtime_env
- On each EC2 node:
  - Docker installed and running (docker ps works without sudo)
  - The Ray process user belongs to the docker group
  - PATH includes docker for the Ray launch environment
  - Registry auth configured (if private images)
- If using GPUs:
  - Host has NVIDIA drivers
  - Use run_options like `--gpus=all`
- Test quickly:
  - On a node, manually `docker pull <your-image>`
  - In a simple Python shell connected via `ray.init(address="ray://...")`, run a trivial task with a `runtime_env={"container": {"image": "<your-image>"}}`

If you can share your AMI/OS and how you launch Ray (ray up YAML, systemd, etc.), I can provide exact install and PATH steps to make Docker available to Ray on those nodes.
s
@Marvin I dont understand what you mean by this "- Make sure the docker binary is on PATH for Ray's processes. If Ray is launched by systemd or a service, ensure that environment has PATH to docker, or set it explicitly.". I think there is a fundamental gap in my knowledge around what PATH means. Also, I do install docker and setup ecr credential helper in the base ami that I use for all my ray tasks. I just dont re-login or whatever.

# Docker installation and setup
exec_remote_cmd "sudo yum install -y docker"
exec_remote_cmd "sudo systemctl enable docker"
exec_remote_cmd "sudo systemctl start docker"
exec_remote_cmd "sudo usermod -aG docker ec2-user"

Also, tell me more on how I can do the simple python shell test? I would love to do that for quicker debugging.
m
thought for 54.2 seconds
Totally fair questions — let's clear up PATH, why Podman showed up, how to make Ray use Docker, and how to do a quick "containerized task" test from a Python shell.

What PATH means (and why it matters here)
- PATH is an environment variable that lists directories the OS searches for executables when you run a command without a full path.
- Example: if PATH includes /usr/bin, running docker will execute /usr/bin/docker.
- If PATH does not include the directory that contains docker, then a process calling docker will fail to find it even if Docker is installed.

Quick ways to inspect it:

echo $PATH
which docker

- If `which docker` prints nothing, the current shell cannot find Docker on PATH.
- Processes started by services (like systemd) may have a different PATH than your interactive shell. That's why something that works in your SSH session can fail for a service.

Your current setup and the two common gotchas
You're doing:
sudo yum install -y docker
sudo systemctl enable docker
sudo systemctl start docker
sudo usermod -aG docker ec2-user
Two important follow-ups:

1) Group membership refresh: Adding `ec2-user` to the `docker` group does not take effect until you start a new login session. Without that, the user (and any processes you start) won't have permission to talk to Docker. Fix options:
- Log out and log back in, or
- Run `newgrp docker` in the provisioning script before starting Ray, or
- Start Ray only after a fresh login where `groups` shows `docker`.

2) Ensure Docker is discoverable for the Ray processes:
- When you start Ray (head and workers), do it from a shell/session where `which docker` works.
- If you use a service/daemon to start Ray, make sure that service's environment includes `/usr/bin` (where Docker typically lives on Amazon Linux) so `docker` is on PATH for that service.
- If Podman is installed and Docker is not visible on PATH to the Ray process, Ray may attempt Podman and you'll see "podman not found". I couldn't find an official Ray setting to "force Docker over Podman" — Ray detects the container runtime by checking what's available on PATH. The practical fix is to ensure Docker is installed, on PATH, and accessible; or remove Podman so there's no ambiguity.

Simple Python shell test you can run on an EC2 node
Pick any node (head is easiest). This verifies:
- The process can see Docker on PATH
- You can connect to your cluster
- A task runs in your specified container image

1) Sanity checks on the node
which docker
docker ps
groups
Expect `docker` found, `docker ps` works without sudo, and `docker` in the group list. If not, log out/in or `newgrp docker` and try again.

2) Test from Python
Replace HEAD_ADDR with your Ray head's address and IMAGE with your image.

python - << 'PY'
import os, shutil, ray, subprocess, json

HEAD_ADDR = "ray://<head-ip>:10001"
IMAGE = "123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310"

print("Driver PATH:", os.environ.get("PATH"))
print("Driver which(docker):", shutil.which("docker"))

# Connect to your existing cluster (does not start a new one)
ray.init(address=HEAD_ADDR)

@ray.remote
def info():
    import os, subprocess, shutil
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
        "python": subprocess.getoutput("python -V 2>&1"),
        "os_release": subprocess.getoutput("cat /etc/os-release || true"),
        "has_nvidia_smi": subprocess.call("nvidia-smi >/dev/null 2>&1", shell=True) == 0,
    }

print("\n--- Without container ---")
print(json.dumps(ray.get(info.remote()), indent=2))

print("\n--- With container runtime_env ---")
task = info.options(
    runtime_env={
        "container": {
            "image": IMAGE,
            # Uncomment if you need GPUs or other flags
            # "run_options": ["--gpus=all", "--ipc=host"],
            # "env_vars": {"EXAMPLE": "1"},
        }
    }
).remote()
print(json.dumps(ray.get(task), indent=2))
PY
What to look for:
- The “Without container” and “With container” sections should differ in `os_release` (confirming your container image ran).
- If you get errors like “podman not found” or “docker: permission denied”:
  - Make sure `which docker` returns a path in both the driver and inside the non-container task.
  - Re-check group membership took effect (new login/newgrp).
  - Ensure Docker is installed on every worker node, not just the head.
  - Try pre-pulling the image on nodes (`docker pull <IMAGE>`) to rule out registry/auth issues.

Using this with Prefect + Ray
Once the node-level test works, apply the same runtime_env via Prefect-Ray:

- Default container for all tasks in a flow:

from prefect import flow
from prefect_ray.task_runners import RayTaskRunner

@flow(
    task_runner=RayTaskRunner(
        address="ray://<head-ip>:10001",
        init_kwargs={
            "runtime_env": {
                "container": {
                    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310",
                    # "run_options": ["--gpus=all"]
                }
            }
        },
    )
)
def my_flow():
    ...

- Different images per section/task:

from prefect import flow, task
from prefect_ray.task_runners import RayTaskRunner
from prefect_ray.context import remote_options

@task
def work(x): ...

@flow(task_runner=RayTaskRunner(address="ray://<head-ip>:10001"))
def my_flow(image1: str, image2: str):
    with remote_options(runtime_env={"container": {"image": image1}}):
        a = work.submit(1)
    with remote_options(runtime_env={"container": {"image": image2}}):
        b = work.submit(2)
    return a, b

Relevant docs
- Running Ray tasks/actors in containers: Ray docs: containerized tasks/actors
- Ray Client (ray://) overview: Ray Client
- ray.init API: ray.init reference
- Prefect-Ray integration: Prefect-Ray

If you share how you start Ray on EC2 (manual, autoscaler, or a service), I can suggest the best place to refresh group membership and guarantee PATH visibility so Docker is always found.
s
@Marvin I start ray using the config YAML. I call it like ray up ray-cluster-config-docker.yaml from my local machine. I have ray cli installed and aws logged in. So that starts a ray head node on ec2. Can you tell me step by step how to do the python/docker test on the head node? I have one up right now
m
thought for 59.3 seconds
Awesome — since you're using the Ray cluster launcher (ray up), here's a concrete, step-by-step "Docker + Python" test you can run right now on the head node to validate Option C and debug PATH/permissions.

Quick note on PATH (since you asked)
- PATH is an environment variable listing directories where the system looks for executables. If `docker` isn't in any of those directories, the process won't find it.
- Your interactive SSH session and the Ray worker processes may have different PATHs and group memberships. That's why "docker works for me in SSH" can still fail for Ray tasks.

Step-by-step on the head node

0) Attach to the head node
- From your laptop:
ray attach ray-cluster-config-docker.yaml
This opens an SSH session to the head.

1) Verify Docker is usable without sudo
which docker
docker ps
id -nG
- If `docker ps` gives "permission denied" and you just added the user to the `docker` group, refresh group membership in this session:

newgrp docker
docker ps

- If `which docker` returns nothing, Docker isn't on PATH in this shell. On Amazon Linux, Docker usually lives in `/usr/bin`. You can try:

export PATH="/usr/bin:$PATH"
which docker

Tip: Ray worker processes inherit env from the shell that started Ray. If Docker wasn't on PATH or group membership wasn't ready when Ray started, you may need to restart Ray after fixing it (see the restart note under "Common fixes" below).

2) Sanity check image pull (optional but recommended)
Replace IMAGE with your ECR image.
docker pull 123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310
If this fails with auth, ensure your instances have an IAM role with ECR read and that your ECR credential helper is configured.

3) Verify Ray is up and you can connect locally
ray status
python -c "import ray; print(ray.__version__)"
If Python can't import Ray, use `python3` or activate the env that Ray installed into (depends on your cluster YAML's setup_commands). Often `python` already works.

4) Run the minimal Python test that compares "no container" vs "container" execution
Replace IMAGE with yours. This connects to the existing cluster (does not start a new one) and runs a tiny task twice: once normally, once inside your container.
python - << 'PY'
import os, shutil, ray, subprocess, json

IMAGE = "123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310"

print("Driver PATH:", os.environ.get("PATH"))
print("Driver which(docker):", shutil.which("docker"))

# Connect to the local running cluster on the head
ray.init(address="auto")

@ray.remote
def info():
    import os, subprocess, shutil
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
        "python": subprocess.getoutput("python -V 2>&1"),
        "os_release": subprocess.getoutput("cat /etc/os-release || true"),
        "has_nvidia_smi": subprocess.call("nvidia-smi >/dev/null 2>&1", shell=True) == 0,
    }

print("\n--- Without container ---")
print(json.dumps(ray.get(info.remote()), indent=2))

print("\n--- With container runtime_env ---")
task = info.options(
    runtime_env={
        "container": {
            "image": IMAGE,
            # Uncomment/add if needed:
            # "run_options": ["--gpus=all", "--ipc=host"],
            # "env_vars": {"EXAMPLE": "1"},
        }
    }
).remote()

print(json.dumps(ray.get(task), indent=2))
PY
What success looks like
- The "With container runtime_env" output should show a different `os_release` that matches your container base. That proves tasks ran inside your image.
- `which_docker` in the "Without container" section should be non-null. If it's null, the Ray worker environment doesn't see Docker on PATH.

Common fixes if you still see "podman not found" or "docker not found/permission denied"
- Docker not found: ensure `/usr/bin` is on PATH and Docker is installed on ALL nodes (head + workers).
- Permission denied: ensure the Ray process user is in the `docker` group; after adding, start a new session (or use `newgrp docker`) and restart Ray so worker processes pick up the change.
- You may need to restart the cluster processes after fixing group/PATH:
  - From the head: `ray stop` then re-run `ray up ray-cluster-config-docker.yaml`
  - Or just re-run `ray up` from your laptop to redeploy/start Ray.
- Private ECR: make sure instances have an IAM role with ECR pull perms, or that your ECR credential helper is configured under the same user that runs Ray.

Using this in Prefect after the test works
- Set the runtime_env globally for the flow:
from prefect import flow
from prefect_ray.task_runners import RayTaskRunner

@flow(task_runner=RayTaskRunner(
    address="ray://<head-ip>:10001",  # or address="auto" if running on head
    init_kwargs={
        "runtime_env": {
            "container": {
                "image": "<http://123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310|123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310>",
                # "run_options": ["--gpus=all"]
            }
        }
    }
))
def my_flow():
    ...
- Or choose different images per task section using `remote_options`:
from prefect_ray.context import remote_options
with remote_options(runtime_env={"container": {"image": "your-other-image"}}):
    ...
Official docs for what we're using
- Running Ray tasks/actors in containers (runtime_env): Ray: containerized tasks/actors
- Ray Client and addresses: Ray Client
- ray.init reference: ray.init API
- Prefect-Ray integration: Prefect-Ray

If you hit an error during the test, paste the outputs for:
- `which docker`, `docker ps`, `id -nG`
- The JSON from both "Without container" and "With container" sections
…and I'll help pinpoint whether it's PATH, group membership, image pull, or something else.
s
@Marvin So everything was perfect (docker had the path, ec2-user was in the group etc) but here is the output of the python test script: python test.py Driver PATH: /home/ec2-user/.pyenv/versions/3.11.10/bin/home/ec2 user/.pyenv/libexec/home/ec2-user/.pyenv/plugins/python-build/bin/home/ec2 user/.pyenv/plugins/pyenv virtualenv/bin/home/ec2-user/.pyenv/plugins/pyenv-update/bin/home/ec2 user/.pyenv/plugins/pyenv doctor/bin/home/ec2-user/.pyenv/shims/home/ec2 user/.pyenv/bin/usr/local/bin/usr/bin/usr/local/sbin/usr/sbin/home/ec2-user/.local/bin:/home/ec2-user/bin Driver which(docker): /usr/bin/docker 2025-10-09 231621,786 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 172.31.17.173:6379... 2025-10-09 231621,803 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at http://172.31.17.173:8265 --- Without container --- (autoscaler +10s) Tip: use
ray status
to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0. (autoscaler +10s) Adding 1 node(s) of type ray.worker.default. { "PATH": "/home/ec2-user/.pyenv/versions/3.11.10/bin/home/ec2 user/.pyenv/libexec/home/ec2-user/.pyenv/plugins/python-build/bin/home/ec2 user/.pyenv/plugins/pyenv virtualenv/bin/home/ec2-user/.pyenv/plugins/pyenv-update/bin/home/ec2 user/.pyenv/plugins/pyenv doctor/bin/home/ec2-user/.pyenv/shims/home/ec2 user/.pyenv/bin/usr/local/bin/usr/bin/usr/local/sbin/usr/sbin/home/ec2-user/.local/bin:/home/ec2-user/bin", "which_docker": "/usr/bin/docker", "whoami": "ec2-user", "python": "Python 3.11.10", "os_release": "NAME=\"Amazon Linux\"\nVERSION=\"2\"\nID=\"amzn\"\nID_LIKE=\"centos rhel fedora\"\nVERSION_ID=\"2\"\nPRETTY_NAME=\"Amazon Linux 2\"\nANSI_COLOR=\"0;33\"\nCPE_NAME=\"cpe2.3oamazonamazon_linux:2\"\nHOME_URL=\"https://amazonlinux.com/\"\nSUPPORT_END=\"2026-06-30\"", "has_nvidia_smi": false } --- With container runtime_env --- (raylet, ip=172.31.25.199) bash: line 0: exec: podman: not found (autoscaler +40s) Resized to 2 CPUs. (raylet, ip=172.31.25.199) [2025-10-09 231759,241 E 5840 5840] (raylet) worker_pool.cc:553: Some workers of the worker process(5967) have not registered within the timeout. The process is dead, probably it crashed during start. (raylet, ip=172.31.25.199) bash: line 0: exec: podman: not found
m
thought for 96.1 seconds
Thanks for the detailed output — that helps a lot.

What your test shows
- The "without container" task ran fine.
- The "with container" task was scheduled on a new worker (ip=172.31.25.199) and failed with: bash: exec: podman: not found
- This strongly suggests that on that worker node, Ray's container runtime detection didn't find Docker (or Docker wasn't usable from the raylet's environment), so it tried Podman and… Podman isn't installed either.

Why this happens with ray up on EC2
- The autoscaler can bring up new worker nodes after you connect. If your worker AMI does not have Docker installed/configured, or your cluster YAML only installs Docker on the head, then any tasks that require the container runtime_env will fail on those workers.
- The head showed Docker on PATH, but the failing node is a different worker. We need to verify Docker on workers too.

Step-by-step: validate and fix on the worker node

1) SSH to the worker (the failing IP from your logs is 172.31.25.199)
- From your workstation, use the same key and user as your ray cluster config:

ssh -i /path/to/your/key.pem ec2-user@172.31.25.199

2) Check Docker availability and permissions

which docker
docker ps
id -nG

- You want:
  - which docker -> /usr/bin/docker (or similar)
  - docker ps works without sudo
  - your user (e.g., ec2-user) is in the docker group

3) If Docker is missing or needs setup on the worker
- Install and enable:
sudo yum install -y docker
sudo systemctl enable --now docker
sudo usermod -aG docker ec2-user
# Open a fresh login OR run:
newgrp docker
docker ps
- If you're pulling from ECR, make sure the instance has an IAM role with ECR pull perms and that your credential helper is configured for the same user that runs Ray.

4) Make it automatic for all future workers
- Update your Ray cluster YAML so workers get Docker too, not just the head. Use setup_commands (applies to head+workers) or worker_setup_commands. Example (simplified):
# ray-cluster-config-docker.yaml (excerpt)
auth:
  ssh_user: ec2-user

setup_commands:
  - sudo yum install -y docker
  - sudo systemctl enable --now docker
  - sudo usermod -aG docker ec2-user
  # ensure current shell has docker group for subsequent commands
  - newgrp docker || true
  - docker --version
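If you prefer the AMI route mentioned next, here is a hedged sketch of pointing workers at a container-ready AMI via available_node_types (AWS cluster launcher schema; the node type name matches your autoscaler logs, but the AMI ID and instance type are placeholders):

# ray-cluster-config-docker.yaml (excerpt, placeholders as noted above)
available_node_types:
  ray.worker.default:
    min_workers: 0
    max_workers: 4
    resources: {}
    node_config:
      InstanceType: m5.xlarge          # placeholder
      ImageId: ami-0123456789abcdef0   # AMI baked with Docker + docker group for ec2-user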
- Alternatively, bake Docker + group membership into the AMI used for BOTH head and workers, and reference that AMI in available_node_types for workers (as sketched above).

5) Restart Ray to ensure new nodes inherit the correct env/groups
- After adjusting your setup (or AMI), restart the cluster so new workers are provisioned with Docker:
  - Simplest: scale down to zero workers, then back up; or tear down and `ray up` again against the updated YAML.

Re-run the quick Python test on the head
- This time, force it to use a worker (by letting the autoscaler bring one up) and confirm the containerized task works.

For convenience, here's the same test block you can paste on the head (replace IMAGE):

python - << 'PY'
import os, shutil, ray, subprocess, json

IMAGE = "123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310"

print("Driver PATH:", os.environ.get("PATH"))
print("Driver which(docker):", shutil.which("docker"))

ray.init(address="auto")

@ray.remote
def info():
    import os, subprocess, shutil
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
        "python": subprocess.getoutput("python -V 2>&1"),
        "os_release": subprocess.getoutput("cat /etc/os-release || true"),
        "has_nvidia_smi": subprocess.call("nvidia-smi >/dev/null 2>&1", shell=True) == 0,
    }

print("\n--- Without container ---")
print(json.dumps(ray.get(info.remote()), indent=2))

print("\n--- With container runtime_env ---")
task = info.options(
    runtime_env={
        "container": {
            "image": IMAGE,
            # "run_options": ["--gpus=all", "--ipc=host"],  # if needed
        }
    }
).remote()
print(json.dumps(ray.get(task), indent=2))
PY

Notes on PATH and why Podman appeared
- Ray chooses a container runtime by probing what's available to the raylet/worker process on each node. If Docker isn't installed, not on PATH, or not usable, Ray may attempt Podman. In your case, the worker didn't have a usable Docker context, so it tried Podman and failed.
- Each node is independent: your head node having Docker on PATH doesn't guarantee workers do.

Docs to back this up
- Running tasks/actors in containers (Ray runtime_env): Ray: containerized tasks/actors
- Ray Client (connecting to existing clusters, ray://): Ray Client
- ray.init reference: ray.init API
- Prefect-Ray integration: Prefect-Ray integration

To summarize your next moves
- Verify Docker on the failing worker (172.31.25.199). Fix/install if needed.
- Ensure your cluster YAML provisions Docker for ALL workers (setup_commands or AMI).
- Restart/scale the cluster and re-run the Python test.
- Once it works in Ray, use the same runtime_env in your Prefect flow's `RayTaskRunner(init_kwargs={...})`.

If you share your cluster YAML (sanitized) I can point to the exact section to add the worker Docker setup so the autoscaler always brings up container-ready nodes.
s
@Marvin so I logged into my worker and tried all the docker commands. They work perfectly: [ec2-user@ip-172-31-25-199 ~]$ which docker /usr/bin/docker [ec2-user@ip-172-31-25-199 ~]$ docker ps CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES [ec2-user@ip-172-31-25-199 ~]$ id -nG ec2-user adm wheel systemd-journal docker [ec2-user@ip-172-31-25-199 ~]$ docker pull 971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231 pd231: Pulling from prefect-flows bc0965b23a04: Pull complete c58b365c3bcb: Pull complete 3bc3f7229179: Pull complete 42dacf6a4601: Pull complete 170afa08c0a6: Pull complete 14e285c56f13: Pull complete 6ac45919b651: Pull complete bfe622ecd2e2: Pull complete 64d7dbf1fd50: Pull complete Digest: sha256:87730a4abd607ef4774f8ea840de2e1d2f909ef0b7e9b1aecce8c584b445a6b9 Status: Downloaded newer image for 971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231 971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231 but the test on the head is still not working: python test2.py Driver PATH: /home/ec2-user/.pyenv/versions/3.11.10/bin/home/ec2 user/.pyenv/libexec/home/ec2-user/.pyenv/plugins/python-build/bin/home/ec2 user/.pyenv/plugins/pyenv virtualenv/bin/home/ec2-user/.pyenv/plugins/pyenv-update/bin/home/ec2 user/.pyenv/plugins/pyenv doctor/bin/home/ec2-user/.pyenv/shims/home/ec2 user/.pyenv/bin/usr/local/bin/usr/bin/usr/local/sbin/usr/sbin/home/ec2-user/.local/bin:/home/ec2-user/bin Driver which(docker): /usr/bin/docker 2025-10-09 232648,286 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 172.31.17.173:6379... 2025-10-09 232648,293 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at http://172.31.17.173:8265 --- Without container --- { "PATH": "/home/ec2-user/.pyenv/versions/3.11.10/bin/home/ec2 user/.pyenv/libexec/home/ec2-user/.pyenv/plugins/python-build/bin/home/ec2 user/.pyenv/plugins/pyenv virtualenv/bin/home/ec2-user/.pyenv/plugins/pyenv-update/bin/home/ec2 user/.pyenv/plugins/pyenv doctor/bin/home/ec2-user/.pyenv/shims/home/ec2 user/.pyenv/bin/usr/local/bin/usr/bin/usr/local/sbin/usr/sbin/home/ec2-user/.local/bin:/home/ec2-user/bin", "which_docker": "/usr/bin/docker", "whoami": "ec2-user", "python": "Python 3.11.10", "os_release": "NAME=\"Amazon Linux\"\nVERSION=\"2\"\nID=\"amzn\"\nID_LIKE=\"centos rhel fedora\"\nVERSION_ID=\"2\"\nPRETTY_NAME=\"Amazon Linux 2\"\nANSI_COLOR=\"0;33\"\nCPE_NAME=\"cpe2.3oamazonamazon_linux:2\"\nHOME_URL=\"https://amazonlinux.com/\"\nSUPPORT_END=\"2026-06-30\"", "has_nvidia_smi": false } --- With container runtime_env --- (raylet, ip=172.31.25.199) bash: line 0: exec: podman: not found
m
thought for 96.3 seconds
This is super helpful. Given your config file is named ray-cluster-config-docker.yaml, I'm 99% sure Ray is running inside a Docker container on each node. That explains the symptom:
- On the host, docker works fine (as you showed).
- But the raylet process is inside a Ray container that likely does not have the docker CLI, nor the Docker socket mounted. Inside that container, Ray tries to find a container runtime. It doesn't see docker, falls back to podman, and you get "podman: not found".

Quick check on the worker to confirm
- See if Ray is running in a Docker container: `docker ps | grep -i ray` (you should see a container for ray head/worker)
- Or: `ps -ef | grep raylet` then `sudo nsenter -t <raylet-pid> -m -u -i -n env | grep PATH` (advanced)
- Also, jump into the Ray container and check docker availability: `docker exec -it <ray-container-name> bash -lc 'which docker || echo no-docker; ls -l /var/run/docker.sock || echo no-sock'`

Why your Python test fails
- The containerized tasks feature (runtime_env["container"]) shells out to a container runtime from the raylet. Since raylet is inside a container without docker, it can't call docker on the host unless you:
  - install a docker CLI inside the Ray container, and
  - mount the host's Docker socket into the Ray container, and
  - ensure the user in the Ray container can access the socket.

Two ways to fix it

Option 1 — Run Ray directly on the host (simplest)
- Remove the docker: section from your cluster YAML so Ray doesn't run inside a container.
- Recreate/scale your cluster.
- Your existing runtime_env container test should start working immediately because raylet will now see /usr/bin/docker on the host.

Option 2 — Keep Ray in Docker, but give it access to Docker
- Update the cluster YAML's docker section to:
  - Use a Ray image that has the docker CLI installed, or extend it yourself. Example Dockerfile:
FROM rayproject/ray:2.48.0-py310
RUN apt-get update && apt-get install -y docker.io && rm -rf /var/lib/apt/lists/*
  - Mount the Docker socket and any needed privileges. In your YAML docker.run_options, add:
    - `--volume=/var/run/docker.sock:/var/run/docker.sock`
    - `--ipc=host`
    - Optional: `--privileged` (or granular capabilities if needed)
  - Ensure the container user can talk to the socket:
    - The Docker socket is typically owned by group "docker" on the host. You can:
      - Run the Ray container as root (quick and dirty), or
      - Create a matching group inside the container with the host's docker group GID and add the container user to it.
  - After these changes, redeploy/scale your cluster so new Ray containers have docker and the socket. (A YAML sketch of such a docker: section follows at the end of this message.)

After applying either option, re-run the Python test on the head
- Same test you used; it should now run the "With container runtime_env" section successfully and you should see a different os_release (from your image).

If you want to verify the raylet's environment right now
- On the worker:
  - Find the raylet PID: `pgrep -f raylet`
  - Inspect its env PATH: `sudo tr '\0' '\n' </proc/$(pgrep -f raylet | head -n1)/environ | grep PATH`
- If Ray is running in Docker, that PATH will be the container's PATH and won't include docker unless you install it inside that container.

Docs for what we're using
- Running tasks and actors in containers (runtime_env): Ray: containerized tasks/actors
- Ray Client (connecting to an existing cluster): Ray Client
- ray.init reference: ray.init API
- Prefect-Ray integration: Prefect-Ray integration
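A hedged sketch of an Option-2 docker: section for the cluster launcher. The image name is a placeholder for a Ray image extended with the docker CLI, and if your setup does not shell-expand the $(...) below, substitute the numeric GID of the host's docker group:

docker:
  image: your-registry/ray-with-docker-cli:2.48.0
  container_name: ray_container
  pull_before_run: true
  run_options:
    - --volume=/var/run/docker.sock:/var/run/docker.sock
    - --group-add=$(stat -c %g /var/run/docker.sock)
    - --ipc=host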
If you share the docker: section of your ray-cluster-config-docker.yaml (sanitized), I can suggest exact run_options and a minimal Dockerfile tweak so Ray-in-Docker can call Docker-on-host.
s
@Marvin whats the pid? docker ps | grep -i ray [ec2-user@ip-172-31-17-173 ~]$ ps -ef | grep raylet ec2-user 7004 1 0 23:09 ? 000008 /home/ec2-user/.pyenv/versions/3.11.10/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/sockets/raylet --store_socket_name=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/sockets/plasma_store --object_manager_port=8076 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=172.31.17.173 --maximum_startup_concurrency=1 --static_resource_list=node172.31.17.173,1.0,node__internal_head__,1.0,memory,301434471,object_store_memory,150717235 --python_worker_command=/home/ec2-user/.pyenv/versions/3.11.10/bin/python3.11 /home/ec2-user/.pyenv/versions/3.11.10/lib/python3.11/site-packages/ray/_private/workers/setup_worker.py /home/ec2-user/.pyenv/versions/3.11.10/lib/python3.11/site-packages/ray/_private/workers/default_worker.py --node-ip-address=172.31.17.173 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/sockets/plasma_store --raylet-name=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/sockets/raylet --redis-address=None --temp-dir=/tmp/ray --metrics-agent-port=56338 --runtime-env-agent-port=54436 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --runtime-env-agent-port=54436 --gcs-address=172.31.17.173:6379 --session-name=session_2025-10-09_23-08-47_338911_6867 --temp-dir=/tmp/ray --webui=172.31.17.173:8265 --cluster-id=967dc0aa760cbb1b48760385fd50882011274a5bcded820f04f5add5 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --java_worker_command= --cpp_worker_command= --native_library_path=/home/ec2-user/.pyenv/versions/3.11.10/lib/python3.11/site-packages/ray/cpp/lib --temp_dir=/tmp/ray --session_dir=/tmp/ray/session_2025-10-09_23-08-47_338911_6867 --log_dir=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/logs --resource_dir=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/runtime_resources --metrics-agent-port=56338 --metrics_export_port=51597 --runtime_env_agent_port=54436 --object_store_memory=150717235 --plasma_directory=/dev/shm --ray-debugger-external=0 --gcs-address=172.31.17.173:6379 --session-name=session_2025-10-09_23-08-47_338911_6867 --labels= --cluster-id=967dc0aa760cbb1b48760385fd50882011274a5bcded820f04f5add5 --head --num_prestart_python_workers=0 --dashboard_agent_command=/home/ec2-user/.pyenv/versions/3.11.10/bin/python3.11 -u /home/ec2-user/.pyenv/versions/3.11.10/lib/python3.11/site-packages/ray/dashboard/agent.py --node-ip-address=172.31.17.173 --metrics-export-port=51597 --dashboard-agent-port=56338 --listen-port=52365 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/sockets/plasma_store --raylet-name=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/sockets/raylet --temp-dir=/tmp/ray --session-dir=/tmp/ray/session_2025-10-09_23-08-47_338911_6867 --log-dir=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2025-10-09_23-08-47_338911_6867 --gcs-address=172.31.17.173:6379 --runtime_env_agent_command=/home/ec2-user/.pyenv/versions/3.11.10/bin/python3.11 -u /home/ec2-user/.pyenv/versions/3.11.10/lib/python3.11/site-packages/ray/_private/runtime_env/agent/main.py --node-ip-address=172.31.17.173 --runtime-env-agent-port=54436 --gcs-address=172.31.17.173:6379 
--runtime-env-dir=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/runtime_resources --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --log-dir=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/logs --temp-dir=/tmp/ray ec2-user 7041 7004 1 23:09 ? 000018 /home/ec2-user/.pyenv/versions/3.11.10/bin/python3.11 -u /home/ec2-user/.pyenv/versions/3.11.10/lib/python3.11/site-packages/ray/dashboard/agent.py --node-ip-address=172.31.17.173 --metrics-export-port=51597 --dashboard-agent-port=56338 --listen-port=52365 --node-manager-port=46559 --object-store-name=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/sockets/plasma_store --raylet-name=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/sockets/raylet --temp-dir=/tmp/ray --session-dir=/tmp/ray/session_2025-10-09_23-08-47_338911_6867 --log-dir=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2025-10-09_23-08-47_338911_6867 --gcs-address=172.31.17.173:6379 --agent-id 424238335 ec2-user 8159 7438 0 23:31 pts/1 000000 grep --color=auto raylet
m
thought for 91.4 seconds
Nice — thanks for the details. Two key points based on your output:
- The head's raylet PID is 7004 (from your ps output). But the failure happens on the worker at 172.31.25.199, so we need to inspect the raylet environment on the worker.
- Even though docker works interactively on the worker, Ray is invoking the container runtime from the worker's raylet/runtime-env agent process. If that process's environment cannot find docker on PATH at launch time, it'll try podman and you see "podman: not found".

Let's verify and fix, step by step.

A) Inspect the worker's raylet environment (172.31.25.199)
Run these on the worker:
# 1) Confirm raylet PID (on the worker)
pgrep -f raylet

# 2) Show the PATH visible to the raylet process (replace <PID> with output above)
sudo tr '\0' '\n' </proc/<PID>/environ | grep ^PATH=

# 3) Sanity check your current shell
echo $PATH
which docker
id -nG

# 4) Check runtime env agent logs for clues
ls -1 /tmp/ray/session_*/logs | grep runtime_env
grep -i -E 'podman|docker' /tmp/ray/session_*/logs/runtime_env* 2>/dev/null | tail -n 100
If PATH for the raylet process does not include /usr/bin, or which docker is None for the raylet, that explains why it can't find docker at container launch time even though you can find it interactively.

B) Pin a simple "info" task to the worker (no container) to see what that worker reports
Run this from the head (or anywhere that can connect), but force the task onto 172.31.25.199:
python - << 'PY'
import ray, json, os, shutil

ray.init(address="auto")

# Make an info task
@ray.remote
def info():
    import os, shutil, subprocess
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
    }

# Force the task to run on the target worker via node resource
target_ip = "172.31.25.199"
res = ray.get(info.options(resources={f"node:{target_ip}": 0.001}).remote())
print(json.dumps(res, indent=2))
PY
If `which_docker` comes back null here, the environment that Ray gives to workers on that node cannot see docker.

C) How to fix (common causes + remedies)
- PATH not present for Ray processes:
  - The Ray autoscaler starts ray processes with a different environment than your interactive shell. Ensure /usr/bin is on PATH in the environment used to start ray. You can enforce PATH in your cluster YAML start commands.
  - Example (add to worker_start_ray_commands and head_start_ray_commands):
head_start_ray_commands:
  - 'export PATH=/usr/bin:$PATH; ray stop'
  - 'export PATH=/usr/bin:$PATH; ray start --head --dashboard-host=0.0.0.0 --port=6379'
worker_start_ray_commands:
  - 'export PATH=/usr/bin:$PATH; ray stop'
  - 'export PATH=/usr/bin:$PATH; ray start --address=$RAY_HEAD_IP:6379'
Adjust flags to match your config. Re-run `ray up` after editing.

- Group membership timing:
  - If you add ec2-user to the docker group in setup_commands, the membership won't apply to already-running processes. Make sure Docker install + usermod happen before Ray starts, and start Ray only in a fresh session where `groups` shows docker. A simple way:
    - Do the docker install/usermod in setup_commands
    - In start_ray_commands, do not restart usermod; just start Ray. If needed, force a login shell or use `newgrp docker` before `ray start`.
- Confirm docker and ECR access for the raylet:
  - After you fix PATH, try the same Python "with container" test again. If you hit auth issues, pre-pull the image on the worker and/or ensure the instance profile gives ECR pull permissions.

D) Re-run the container test (on the head)
Same script you used before; it should now succeed:

python - << 'PY'
import os, shutil, ray, subprocess, json

IMAGE = "971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231"

print("Driver PATH:", os.environ.get("PATH"))
print("Driver which(docker):", shutil.which("docker"))

ray.init(address="auto")

@ray.remote
def info():
    import os, subprocess, shutil
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
        "python": subprocess.getoutput("python -V 2>&1"),
        "os_release": subprocess.getoutput("cat /etc/os-release || true"),
    }

print("\n--- Without container ---")
print(json.dumps(ray.get(info.remote()), indent=2))

print("\n--- With container runtime_env ---")
task = info.options(runtime_env={"container": {"image": IMAGE}}).remote()
print(json.dumps(ray.get(task), indent=2))
PY
E) Why “podman: not found” despite docker being installed
- Ray’s containerized tasks feature shells out from the raylet/runtime-env agent. If the raylet’s environment cannot find docker (binary not on PATH at the time the process was started), Ray may try podman. Your interactive shell is fine, but the raylet process environment may be missing PATH entries.
- Ensuring PATH (and group membership) for the process that starts Ray fixes this.

Once this test works, you can use the same runtime_env in Prefect-Ray’s RayTaskRunner:
- Global for the flow:
@flow(task_runner=RayTaskRunner(
    address="ray://<head-ip>:10001",
    init_kwargs={
        "runtime_env": {"container": {"image": "971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231"}}
    },
))
def my_flow():
    ...

- Or per section with `remote_options`.

Docs
- Ray: Running tasks/actors in containers: Ray docs: containerized tasks/actors
- Ray Client (ray://): Ray Client
- ray.init reference: ray.init API
- Prefect-Ray: Prefect-Ray integration

If you can paste:
- The PATH seen by the worker's raylet (proc environ), and
- The output of the pinned info task on 172.31.25.199,
…I can advise exactly which YAML changes to make so every autoscaled worker reliably sees docker.