# ask-marvin
s
@Marvin I need to get my ray cluster to pick up what docker container to use depending on what I define in my flow. How do I do that?
m
thought for 272.3 seconds
Great question — there are two "images" at play when you combine Prefect and Ray. Which one do you want to switch per flow?

1) The container image that runs your Prefect flow run (i.e., what your Prefect worker launches)
- You can set this per-deployment or even per-run. This is usually what you want if your flow container needs different Python/system deps before it connects to a Ray cluster.
- Options:
  - At deploy time: set the image on the deployment.
  - Per run: override with `job_variables`.

Example: override the image at trigger time
from prefect.deployments import run_deployment

run_deployment(
    name="my-flow/my-deployment",
    parameters={"some_param": "value"},
    job_variables={"image": "<http://ghcr.io/your-org/prefect-ray:py310-cu121|ghcr.io/your-org/prefect-ray:py310-cu121>"},
)
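A hedged sketch (not from the thread) tying this to the original question: choosing the image from something you define in your own flow code. The image map and deployment name are placeholders.

from prefect import flow
from prefect.deployments import run_deployment

# Hypothetical mapping from a flow-level choice to a container image
IMAGES = {
    "cpu": "ghcr.io/your-org/prefect-ray:py310",
    "gpu": "ghcr.io/your-org/prefect-ray:py310-cu121",
}

@flow
def orchestrator(profile: str = "cpu"):
    # Kick off the deployed flow with whichever image matches this run's profile
    run_deployment(
        name="my-flow/my-deployment",
        parameters={"some_param": "value"},
        job_variables={"image": IMAGES[profile]},
    )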
If you deploy in Python (Prefect 3.x), you can also set the image when creating the deployment:
# sketch
my_flow.from_source(
    source="<https://github.com/your-org/your-repo>",
    entrypoint="flows/ray_flow.py:my_flow",
).deploy(
    name="ray-flow",
    work_pool_name="kubernetes-pool",  # or docker, ecs, etc.
    image="<http://ghcr.io/your-org/prefect-ray:py310|ghcr.io/your-org/prefect-ray:py310>",
)
Docs: Customize job variables

2) The container image used by the Ray cluster to execute Ray tasks/actors
- Prefect cannot change this per flow run. The Ray cluster's worker pods/processes define their image (e.g., in KubeRay via the RayCluster spec). The Prefect flow can connect to that cluster, but image selection for Ray workers is a Ray/Kubernetes concern.
- If you need different images for different workloads on Ray, typical approaches are:
  - Run multiple Ray clusters (each with its own image) and point your flow's `RayTaskRunner(address=...)` to the desired cluster.
  - In Kubernetes with KubeRay, define multiple worker groups with different images and route tasks using Ray resource tags/placement (a Ray-specific pattern; see the sketch after the example below).
  - If you're using Ray Jobs and your environment supports it, submit jobs with a runtime environment that specifies the image (a Ray feature, outside of Prefect-Ray's task runner).

Minimal Prefect + Ray example (Prefect controls the flow image; Ray runs tasks on its own cluster):
from prefect import flow
from prefect_ray.task_runners import RayTaskRunner

@flow(task_runner=RayTaskRunner(address="<ray://your-ray-head:10001>"))
def my_flow():
    # your tasks here; they will run on the Ray cluster
    ...

# Deploy this flow with different images per deployment or override per run via job_variables
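A hedged sketch of the worker-group routing idea mentioned above, assuming the Ray worker groups advertise a custom resource (the name "gpu_image_group" here is made up) in their Ray start parameters; `remote_options` passes these through to Ray's per-task `.options(...)`:

from prefect import flow, task
from prefect_ray.context import remote_options
from prefect_ray.task_runners import RayTaskRunner

@task
def gpu_step(x):
    ...

@flow(task_runner=RayTaskRunner(address="ray://your-ray-head:10001"))
def routed_flow():
    # Only nodes that advertise the custom resource "gpu_image_group" (an assumption
    # about how the cluster/worker group was configured) are eligible for this task.
    with remote_options(resources={"gpu_image_group": 1}):
        gpu_step.submit(1)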
Prefect-Ray docs: Prefect-Ray integration

To make sure I point you to the exact setup:
- Where is your Ray cluster running (Kubernetes via KubeRay, VMs, etc.)?
- Which work pool type are you using for Prefect (Docker, Kubernetes, ECS)?
- Do you want to dynamically choose the Prefect flow's container image, or the Ray cluster's worker image?
s
@Marvin what about using # init_kwargs={"runtime_env": {"image_uri": get_ray_image_uri()}} # for ray 2.48.0 # init_kwargs={"runtime_env": {"container": {"image": get_ray_image_uri()}}}?
m
thought for 114.2 seconds
Short answer: Passing `init_kwargs` to `RayTaskRunner` works, but it won't change the image of an existing Ray cluster's workers. Whether your snippet does what you want depends on how you're using Ray.

- If you're connecting to an existing Ray cluster (e.g., `address="ray://..."` via KubeRay): neither
  - `init_kwargs={"runtime_env": {"image_uri": ...}}` nor
  - `init_kwargs={"runtime_env": {"container": {"image": ...}}}`
  will switch the Kubernetes pod image. The worker image is fixed by the RayCluster spec (KubeRay) or how the cluster was started. You'll need to:
  - Run separate Ray clusters with different images and choose the cluster per flow (via `RayTaskRunner(address=...)`), or
  - Define multiple worker groups in the RayCluster with different images and route tasks using Ray resource tags/placement.
- If you're starting Ray locally (no `address`) and want Ray to run your tasks inside a container on the same host: `init_kwargs={"runtime_env": {"container": {"image": "your-image"}}}` is the Ray runtime env containerization form. This requires Docker on the host and uses Ray's container runtime env plugin. It does not affect remote KubeRay clusters.
- If you're using Ray Jobs (job submission API), Ray supports specifying an image at the job level via the job's runtime env (commonly `image_uri`). Prefect-Ray's `RayTaskRunner` does not submit Ray Jobs; it calls `ray.init(...)`, so job-level image overrides don't apply here.

How to pass init_kwargs with Prefect-Ray:
from prefect import flow
from prefect_ray.task_runners import RayTaskRunner

# Connect to an existing cluster (won’t change its pod image)
@flow(task_runner=RayTaskRunner(address="<ray://ray-head:10001>",
                                init_kwargs={"runtime_env": {"env_vars": {"FOO": "bar"}}}))
def my_flow():
    ...

# Start a local Ray runtime and run tasks in a Docker image (host must have Docker)
@flow(task_runner=RayTaskRunner(init_kwargs={
    "runtime_env": {"container": {"image": "<http://ghcr.io/your-org/ray-runtime:py310|ghcr.io/your-org/ray-runtime:py310>"}}
}))
def local_containerized_flow():
    ...
If your goal is to pick the container per flow run:
- For the Prefect flow runtime container: override the deployment's image per run using `job_variables`:
from prefect.deployments import run_deployment

run_deployment(
    name="my-flow/my-deployment",
    job_variables={"image": "<http://ghcr.io/your-org/prefect-flow:py310-cu121|ghcr.io/your-org/prefect-flow:py310-cu121>"},
)
Docs: Customize job variables
- For the Ray worker image: configure it in Ray/KubeRay (e.g., RayCluster spec or separate clusters) and select which cluster to use in your flow via `RayTaskRunner(address=...)`.

If you can share:
- Are you using KubeRay?
- Do you want to switch the Prefect flow's container image, the Ray cluster's worker image, or both?
…I can suggest the exact config and example YAML/Python.
s
@Marvin • using plain ray cluster deployed via ec2 not kuberay. Should I be using KubeRay? • I want to switch the ray cluster worker image depending on the flow
m
thought for 54.9 seconds
Thanks — this helps a lot.

Directly answering your snippets:
- `init_kwargs={"runtime_env": {"image_uri": ...}}` — this is for Ray Jobs (job submission API). It sets the container image for the job's runtime env when you submit a job to the Ray Job server.
- `init_kwargs={"runtime_env": {"container": {"image": ...}}}` — this is the Ray runtime env "container" option for `ray.init(...)`. It can run tasks/actors in that container image if Docker is available on every node and Ray's container runtime is supported in your setup.

Important limitation on plain EC2 clusters:
- Neither of those will "swap" the image of an already-running Ray cluster's worker processes started by the autoscaler. They only affect the runtime env for your job/tasks. If your cluster is launched with a fixed Docker image (via Ray cluster launcher YAML) or a specific AMI, that's the base image workers use.

Given your goals (switch the Ray worker image per flow) on EC2, here are your practical options:

Option A — Multiple Ray clusters (recommended on plain EC2)
- Stand up multiple Ray clusters, each using the desired worker image (via the Ray cluster launcher YAML's `docker` section or different AMIs).
- In your Prefect flow, point `RayTaskRunner(address=...)` to the cluster you want per flow/deployment. A simple pattern is to read the address from an env var and set it per deployment/job:
# flow.py
import os
from prefect import flow
from prefect_ray.task_runners import RayTaskRunner

RAY_ADDRESS = os.getenv("RAY_ADDRESS")  # e.g., ray://<head-ip>:10001

@flow(task_runner=RayTaskRunner(address=RAY_ADDRESS))
def my_flow():
    ...

# When deploying, set the env var per deployment/run
# e.g., via job_variables
# job_variables={"env": {"RAY_ADDRESS": "<ray://ec2-gpu-head:10001>"}}
- This lets each Prefect deployment (or run) select a different Ray cluster (and thus a different worker image).

Option B — Use the Ray Jobs API with per-job image
- Enable the Ray Job server on your EC2 cluster and submit jobs with `runtime_env={"image_uri": "<your-image>"}`. This sets the image for that job run.
- You can trigger jobs from a Prefect flow:
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://<head-node-ip>:8265")
job_id = client.submit_job(
    entrypoint="python -c 'import your_module; your_module.main()'",
    runtime_env={"image_uri": "<http://123456789012.dkr.ecr.us-east-1.amazonaws.com/ray:gpu|123456789012.dkr.ecr.us-east-1.amazonaws.com/ray:gpu>"},
    # env_vars / working_dir / pip, etc. as needed
)
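A hedged sketch (not from the thread) of wrapping that submission in a Prefect task that waits for completion. It relies on JobSubmissionClient.get_job_status and the JobStatus enum from Ray's job submission API; the dashboard URL and entrypoint are placeholders you would supply:

import time

from prefect import task
from ray.job_submission import JobStatus, JobSubmissionClient

@task
def run_ray_job(entrypoint: str, image_uri: str, dashboard_url: str) -> str:
    client = JobSubmissionClient(dashboard_url)  # e.g. "http://<head-node-ip>:8265"
    job_id = client.submit_job(
        entrypoint=entrypoint,
        runtime_env={"image_uri": image_uri},
    )
    # Poll until the job reaches a terminal state
    while (status := client.get_job_status(job_id)) not in {
        JobStatus.SUCCEEDED,
        JobStatus.FAILED,
        JobStatus.STOPPED,
    }:
        time.sleep(5)
    if status != JobStatus.SUCCEEDED:
        raise RuntimeError(f"Ray job {job_id} ended with status {status}")
    return job_id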
- Note: Prefect-Ray's `RayTaskRunner` does not submit Ray Jobs; it calls `ray.init(...)`. If you want per-job images without multiple clusters, Jobs is the cleanest route.

Option C — Ray runtime_env "container" for tasks/actors
- If all cluster nodes have Docker and Ray supports containerized workers in your setup, you can try:
@flow(task_runner=RayTaskRunner(
    address="ray://<head-ip>:10001",
    init_kwargs={"runtime_env": {"container": {"image": "your-image:tag"}}},
))
def my_flow():
    ...
- This can run your tasks/actors inside that image, but it does not change the base worker processes/image that the autoscaler brings up. Operationally, this is more fragile on plain EC2 than Options A/B.

Should you use KubeRay?
- Not required, but it makes this use case much easier at scale. You can:
  - Run multiple RayCluster CRs with different worker images and choose per flow via `address`, or
  - Define multiple worker groups in one cluster with different images and route tasks via Ray resource tags/placement.
- If you anticipate many variants or frequent changes, KubeRay is the smoother path.

Prefect bits you'll likely use:
- Per-run flow container image (not Ray workers) via job variables:
from prefect.deployments import run_deployment

run_deployment(
    name="my-flow/my-deployment",
    job_variables={"image": "<http://ghcr.io/your-org/prefect-flow:py310-cu121|ghcr.io/your-org/prefect-flow:py310-cu121>"},
)
- Docs:
  - Prefect-Ray integration: Prefect-Ray
  - Prefect job variables: Customize job variables
  - Ray Jobs API: Ray Jobs API
  - Ray runtime_env and containers: Ray runtime_env containers

If you can share how you start your EC2 cluster (Ray cluster launcher YAML? Docker section? AMIs?) and whether you can enable the Job server, I can propose a concrete path and example configs.
s
@Marvin I want to try Option C. Can you explain it more in depth and give some documentation/links to support what you are saying? Also, wym "`RayTaskRunner` does not submit Ray Jobs; it calls `ray.init(...)`."? I already have an existing ray cluster. Doesnt init mean it starts a new cluster or something?
m
thought for 73.8 seconds
Great questions — let me break down both parts.

1) What ray.init does (and what RayTaskRunner does with it)
- `ray.init()` behavior:
  - Without an address: starts a local Ray instance on the current machine.
  - With an address (e.g., `address="ray://<head-ip>:10001"`): connects to an existing cluster via Ray Client and does NOT start a new cluster.
- Prefect's `RayTaskRunner` calls `ray.init(...)` under the hood. If you provide an address, it attaches to your existing cluster; if not, it'll start a local Ray on the worker.
- Docs:
  - Ray Client/`ray://` overview: Ray Client
  - `ray.init` API: ray.init reference

2) Option C (run tasks/actors in a container image via runtime_env) — details and how to use it

What it does
- You can ask Ray to execute tasks/actors inside a Docker container on each node by setting a runtime environment with a container spec.
- This does NOT replace the Ray worker process image or the EC2 AMI; instead, Ray launches your user code in containers on the nodes.
- You can set:
  - A default container for your session/driver (via `ray.init(runtime_env=...)`).
  - Per-task/actor overrides (via `remote_options(runtime_env=...)`).
- Requirements:
  - Docker must be installed and usable on every Ray node.
  - Nodes must be able to pull your image (ECR/GCR auth, etc.).
  - Your image should include a compatible Python + dependencies for your code. In practice, include `ray` too.

Official Ray docs
- Runtime environments (containers): Running tasks and actors in containers

How to use with Prefect-Ray

A. Set a default container for all Prefect tasks executed by this flow run
- Pass `init_kwargs={"runtime_env": {"container": {"image": "...", ...}}}` to `RayTaskRunner`. This applies to all Ray tasks created by the driver (your Prefect flow).
from prefect import flow
from prefect_ray.task_runners import RayTaskRunner

@flow(
    task_runner=RayTaskRunner(
        address="ray://<head-ip>:10001",
        init_kwargs={
            "runtime_env": {
                "container": {
                    "image": "<http://123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310|123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310>",
                    # optional: pass Docker run flags
                    "run_options": [
                        "--gpus=all",           # if you need GPUs
                        "--ipc=host",
                        # "--volume=/host/path:/container/path"
                    ],
                    "env_vars": {"EXTRA_FLAG": "1"},
                }
            }
        },
    )
)
def my_flow():
    # Prefect tasks in this flow will run as Ray tasks inside the container image above
    ...
B. Choose the image per task (or per section) at runtime
- Prefect-Ray provides a `remote_options` context manager that maps to Ray's per-task `.options(...)`. This lets you pick different images per task or per run based on a flow parameter or environment variable.
from prefect import flow, task
from prefect_ray.task_runners import RayTaskRunner
from prefect_ray.context import remote_options

@task
def heavy_gpu_work(x): ...
@task
def cpu_only_work(y): ...

@flow(task_runner=RayTaskRunner(address="ray://<head-ip>:10001"))
def my_flow(image_for_gpu: str, image_for_cpu: str):
    # GPU section
    with remote_options(
        runtime_env={
            "container": {
                "image": image_for_gpu,
                "run_options": ["--gpus=all"],
            }
        },
        num_gpus=1,
    ):
        a = heavy_gpu_work.submit(1)

    # CPU section
    with remote_options(
        runtime_env={"container": {"image": image_for_cpu}},
        num_cpus=2,
    ):
        b = cpu_only_work.submit(2)

    return a, b
Notes and gotchas
- This does not change the underlying Ray worker processes or autoscaler node images; it runs your tasks in containers.
- All nodes need Docker access and credentials to pull the image (for private registries).
- If you need GPUs inside the container, add `--gpus=all` (or equivalent) to `run_options` and ensure the host has NVIDIA drivers installed.
- Network/volumes: use `run_options` and/or volume mounts as needed; ensure the container can reach any services your code needs.
- Precedence:
  - Per-task `remote_options(runtime_env=...)` overrides the default `ray.init(runtime_env=...)`.
  - If you don't set either, tasks run in the base environment of the node.

When to consider other options
- If you truly need to swap the base image of workers per flow, Option A (multiple clusters with different base images) or the Ray Jobs API (per-job `runtime_env`) may be operationally cleaner.
- Ray Jobs API docs: Ray Jobs API

Prefect-Ray docs
- Overview and examples: Prefect-Ray integration

If you confirm:
- Docker is installed on all EC2 nodes
- Your image registry and GPU requirements (if any)
…I can help tailor the exact `runtime_env` and `run_options` you'll need.
s
@Marvin Oh right I have tried option C, and guess what? It throws the error 'podman not found'. I want it to run via Docker. Not Podman. Also I am totally happy with this "- This does NOT replace the Ray worker process image or the EC2 AMI; instead, Ray launches your user code in containers on the nodes."
m
thought for 63.5 seconds
Great — thanks for confirming you're on plain EC2 and that Option C's behavior (containers for user code, not swapping the base AMI) is acceptable. Two parts below:
1) Why you saw "podman not found" and how to force Docker
2) How Option C works with Prefect + Ray, with examples and docs

1) "podman not found" and forcing Docker
- What's happening: Ray's runtime_env "container" support shells out to a container runtime on each node. If Docker isn't available to the Ray processes (or not on PATH), Ray may try podman and you'll get "podman not found".
- The fix is to ensure Docker is installed and available to the user running Ray on every node:
  - Install Docker Engine on every Ray node.
  - Add the Ray process user to the docker group, then re-login or restart Ray services. Ubuntu example:

sudo apt-get update
sudo apt-get install -y docker.io
sudo usermod -aG docker $USER
# re-login or: newgrp docker
docker info  # should succeed without sudo

- Make sure the docker binary is on PATH for Ray's processes. If Ray is launched by systemd or a service, ensure that environment has PATH to docker, or set it explicitly.
- If podman is installed and Docker is not, you'll see that error. Either install Docker and make sure it's preferred on PATH, or remove podman.
- Private registries (ECR/GCR): configure node-level auth so containers can pull images non-interactively (e.g., Docker credential helpers or a periodic ECR login).

Ray docs on containerized tasks/actors (Option C):
- Running tasks and actors in containers: Ray: Running tasks and actors in containers

2) Option C in depth with Prefect + Ray
- What it does: Your Prefect flow (the Ray "driver") connects to your existing cluster (ray://…). Ray then executes tasks/actors inside Docker containers on each node.
- This does NOT change the Ray worker processes' base image or your AMIs; it just containers your user code.
- Requirements:
  - Docker installed and accessible on all nodes
  - Nodes can pull your image (auth if private)
  - If you need GPUs in the container, you'll need NVIDIA drivers on the host and pass GPU flags

A) Set a default container for the whole flow's Ray session
- Pass runtime_env via RayTaskRunner.init_kwargs:
from prefect import flow
from prefect_ray.task_runners import RayTaskRunner

@flow(
    task_runner=RayTaskRunner(
        address="ray://<head-ip>:10001",
        init_kwargs={
            "runtime_env": {
                "container": {
                    "image": "<http://123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310|123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310>",
                    # optional: docker run flags
                    "run_options": [
                        "--gpus=all",     # if you need GPUs
                        "--ipc=host",
                        # "--volume=/host/path:/container/path"
                    ],
                    # pass env to the container
                    "env_vars": {"EXTRA_FLAG": "1"},
                }
            }
        },
    )
)
def my_flow():
    ...
B) Choose different images per task/section
- Use Prefect-Ray's remote_options context to override per task/actor:
from prefect import flow, task
from prefect_ray.task_runners import RayTaskRunner
from prefect_ray.context import remote_options

@task
def heavy_gpu_work(x): ...
@task
def cpu_only_work(y): ...

@flow(task_runner=RayTaskRunner(address="ray://<head-ip>:10001"))
def my_flow(image_for_gpu: str, image_for_cpu: str):
    # GPU section
    with remote_options(
        runtime_env={"container": {"image": image_for_gpu, "run_options": ["--gpus=all"]}},
        num_gpus=1,
    ):
        a = heavy_gpu_work.submit(1)

    # CPU section
    with remote_options(
        runtime_env={"container": {"image": image_for_cpu}},
        num_cpus=2,
    ):
        b = cpu_only_work.submit(2)

    return a, b
How ray.init fits in
- `ray.init()` with an address (e.g., `ray://...`) connects to your existing cluster; it does not create a new one. Without an address, it starts a local Ray on the current machine.
- Prefect's RayTaskRunner calls `ray.init(...)` under the hood. So with your existing cluster, you'll pass `address="ray://<head-ip>:10001"` to attach to it.
- Docs:
  - Ray Client (ray://): Ray Client
  - ray.init reference: ray.init API

Operational checklist for Docker runtime_env
- On each EC2 node:
  - Docker installed and running (docker ps works without sudo)
  - The Ray process user belongs to the docker group
  - PATH includes docker for the Ray launch environment
  - Registry auth configured (if private images)
- If using GPUs:
  - Host has NVIDIA drivers
  - Use run_options like `--gpus=all`
- Test quickly:
  - On a node, manually `docker pull <your-image>`
  - In a simple Python shell connected via `ray.init(address="ray://...")`, run a trivial task with a `runtime_env={"container": {"image": "<your-image>"}}`

If you can share your AMI/OS and how you launch Ray (ray up YAML, systemd, etc.), I can provide exact install and PATH steps to make Docker available to Ray on those nodes.
s
@Marvin I dont understand what you mean by this "- Make sure the docker binary is on PATH for Ray's processes. If Ray is launched by systemd or a service, ensure that environment has PATH to docker, or set it explicitly.". I think there is a fundamental gap in my knowledge around what PATH means. Also, I do install docker and setup ecr credential helper in the base ami that I use for all my ray tasks. I just dont re-login or whatever.

# Docker installation and setup
exec_remote_cmd "sudo yum install -y docker"
exec_remote_cmd "sudo systemctl enable docker"
exec_remote_cmd "sudo systemctl start docker"
exec_remote_cmd "sudo usermod -aG docker ec2-user"

Also, tell me more on how I can do the simple python shell test? I would love to do that for quicker debugging.
m
thought for 54.2 seconds
Totally fair questions — let's clear up PATH, why Podman showed up, how to make Ray use Docker, and how to do a quick "containerized task" test from a Python shell.

What PATH means (and why it matters here)
- PATH is an environment variable that lists directories the OS searches for executables when you run a command without a full path.
- Example: if PATH includes /usr/bin, running docker will execute /usr/bin/docker.
- If PATH does not include the directory that contains docker, then a process calling docker will fail to find it even if Docker is installed.

Quick ways to inspect it:

echo $PATH
which docker

- If `which docker` prints nothing, the current shell cannot find Docker on PATH.
- Processes started by services (like systemd) may have a different PATH than your interactive shell. That's why something that works in your SSH session can fail for a service.

Your current setup and the two common gotchas
You're doing:
sudo yum install -y docker
sudo systemctl enable docker
sudo systemctl start docker
sudo usermod -aG docker ec2-user
Two important follow-ups:

1) Group membership refresh: Adding `ec2-user` to the `docker` group does not take effect until you start a new login session. Without that, the user (and any processes you start) won't have permission to talk to Docker. Fix options:
- Log out and log back in, or
- Run `newgrp docker` in the provisioning script before starting Ray, or
- Start Ray only after a fresh login where `groups` shows `docker`.

2) Ensure Docker is discoverable for the Ray processes:
- When you start Ray (head and workers), do it from a shell/session where `which docker` works.
- If you use a service/daemon to start Ray, make sure that service's environment includes `/usr/bin` (where Docker typically lives on Amazon Linux) so `docker` is on PATH for that service.
- If Podman is installed and Docker is not visible on PATH to the Ray process, Ray may attempt Podman and you'll see "podman not found". I couldn't find an official Ray setting to "force Docker over Podman" — Ray detects the container runtime by checking what's available on PATH. The practical fix is to ensure Docker is installed, on PATH, and accessible; or remove Podman so there's no ambiguity.

Simple Python shell test you can run on an EC2 node
Pick any node (head is easiest). This verifies:
- The process can see Docker on PATH
- You can connect to your cluster
- A task runs in your specified container image

1) Sanity checks on the node
which docker
docker ps
groups
Expect `docker` found, `docker ps` works without sudo, and `docker` in the group list. If not, log out/in or `newgrp docker` and try again.

2) Test from Python
Replace HEAD_ADDR with your Ray head's address and IMAGE with your image.

python - << 'PY'
import os, shutil, ray, subprocess, json

HEAD_ADDR = "ray://<head-ip>:10001"
IMAGE = "123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310"

print("Driver PATH:", os.environ.get("PATH"))
print("Driver which(docker):", shutil.which("docker"))

# Connect to your existing cluster (does not start a new one)
ray.init(address=HEAD_ADDR)

@ray.remote
def info():
    import os, subprocess, shutil
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
        "python": subprocess.getoutput("python -V 2>&1"),
        "os_release": subprocess.getoutput("cat /etc/os-release || true"),
        "has_nvidia_smi": subprocess.call("nvidia-smi >/dev/null 2>&1", shell=True) == 0,
    }

print("\n--- Without container ---")
print(json.dumps(ray.get(info.remote()), indent=2))

print("\n--- With container runtime_env ---")
task = info.options(
    runtime_env={
        "container": {
            "image": IMAGE,
            # Uncomment if you need GPUs or other flags
            # "run_options": ["--gpus=all", "--ipc=host"],
            # "env_vars": {"EXAMPLE": "1"},
        }
    }
).remote()
print(json.dumps(ray.get(task), indent=2))
PY
What to look for:
- The “Without container” and “With container” sections should differ in `os_release` (confirming your container image ran).
- If you get errors like “podman not found” or “docker: permission denied”:
  - Make sure `which docker` returns a path in both the driver and inside the non-container task.
  - Re-check group membership took effect (new login/newgrp).
  - Ensure Docker is installed on every worker node, not just the head.
  - Try pre-pulling the image on nodes (`docker pull <IMAGE>`) to rule out registry/auth issues.

Using this with Prefect + Ray
Once the node-level test works, apply the same runtime_env via Prefect-Ray:

- Default container for all tasks in a flow:

from prefect import flow
from prefect_ray.task_runners import RayTaskRunner

@flow(
    task_runner=RayTaskRunner(
        address="ray://<head-ip>:10001",
        init_kwargs={
            "runtime_env": {
                "container": {
                    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310",
                    # "run_options": ["--gpus=all"]
                }
            }
        },
    )
)
def my_flow():
    ...

- Different images per section/task:

from prefect import flow, task
from prefect_ray.task_runners import RayTaskRunner
from prefect_ray.context import remote_options

@task
def work(x): ...

@flow(task_runner=RayTaskRunner(address="ray://<head-ip>:10001"))
def my_flow(image1: str, image2: str):
    with remote_options(runtime_env={"container": {"image": image1}}):
        a = work.submit(1)
    with remote_options(runtime_env={"container": {"image": image2}}):
        b = work.submit(2)
    return a, b

Relevant docs
- Running Ray tasks/actors in containers: Ray docs: containerized tasks/actors
- Ray Client (ray://) overview: Ray Client
- ray.init API: ray.init reference
- Prefect-Ray integration: Prefect-Ray

If you share how you start Ray on EC2 (manual, autoscaler, or a service), I can suggest the best place to refresh group membership and guarantee PATH visibility so Docker is always found.
s
@Marvin I start ray using the config YAML. I call it like ray up ray-cluster-config-docker.yaml from my local machine. I have ray cli installed and aws logged in. So that starts a ray head node on ec2. Can you tell me step by step how to do the python/docker test on the head node? I have one up right now
m
thought for 59.3 seconds
Awesome — since you're using the Ray cluster launcher (ray up), here's a concrete, step-by-step "Docker + Python" test you can run right now on the head node to validate Option C and debug PATH/permissions.

Quick note on PATH (since you asked)
- PATH is an environment variable listing directories where the system looks for executables. If `docker` isn't in any of those directories, the process won't find it.
- Your interactive SSH session and the Ray worker processes may have different PATHs and group memberships. That's why "docker works for me in SSH" can still fail for Ray tasks.

Step-by-step on the head node

0) Attach to the head node
- From your laptop:
ray attach ray-cluster-config-docker.yaml
This opens an SSH session to the head.

1) Verify Docker is usable without sudo
which docker
docker ps
id -nG
- If `docker ps` gives "permission denied" and you just added the user to the `docker` group, refresh group membership in this session:

newgrp docker
docker ps

- If `which docker` returns nothing, Docker isn't on PATH in this shell. On Amazon Linux, Docker usually lives in `/usr/bin`. You can try:

export PATH="/usr/bin:$PATH"
which docker

Tip: Ray worker processes inherit env from the shell that started Ray. If Docker wasn't on PATH or group membership wasn't ready when Ray started, you may need to restart Ray after fixing it (see the restart note under "Common fixes" below).

2) Sanity check image pull (optional but recommended)
Replace IMAGE with your ECR image.
docker pull 123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310
If this fails with auth, ensure your instances have an IAM role with ECR read and that your ECR credential helper is configured.

3) Verify Ray is up and you can connect locally
ray status
python -c "import ray; print(ray.__version__)"
If Python can't import Ray, use `python3` or activate the env that Ray installed into (depends on your cluster YAML's setup_commands). Often `python` already works.

4) Run the minimal Python test that compares "no container" vs "container" execution
Replace IMAGE with yours. This connects to the existing cluster (does not start a new one) and runs a tiny task twice: once normally, once inside your container.
python - << 'PY'
import os, shutil, ray, subprocess, json

IMAGE = "123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310"

print("Driver PATH:", os.environ.get("PATH"))
print("Driver which(docker):", shutil.which("docker"))

# Connect to the local running cluster on the head
ray.init(address="auto")

@ray.remote
def info():
    import os, subprocess, shutil
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
        "python": subprocess.getoutput("python -V 2>&1"),
        "os_release": subprocess.getoutput("cat /etc/os-release || true"),
        "has_nvidia_smi": subprocess.call("nvidia-smi >/dev/null 2>&1", shell=True) == 0,
    }

print("\n--- Without container ---")
print(json.dumps(ray.get(info.remote()), indent=2))

print("\n--- With container runtime_env ---")
task = info.options(
    runtime_env={
        "container": {
            "image": IMAGE,
            # Uncomment/add if needed:
            # "run_options": ["--gpus=all", "--ipc=host"],
            # "env_vars": {"EXAMPLE": "1"},
        }
    }
).remote()

print(json.dumps(ray.get(task), indent=2))
PY
What success looks like
- The "With container runtime_env" output should show a different `os_release` that matches your container base. That proves tasks ran inside your image.
- `which_docker` in the "Without container" section should be non-null. If it's null, the Ray worker environment doesn't see Docker on PATH.

Common fixes if you still see "podman not found" or "docker not found/permission denied"
- Docker not found: ensure `/usr/bin` is on PATH and Docker is installed on ALL nodes (head + workers).
- Permission denied: ensure the Ray process user is in the `docker` group; after adding, start a new session (or use `newgrp docker`) and restart Ray so worker processes pick up the change.
- You may need to restart the cluster processes after fixing group/PATH:
  - From the head: `ray stop` then re-run `ray up ray-cluster-config-docker.yaml`
  - Or just re-run `ray up` from your laptop to redeploy/start Ray.
- Private ECR: make sure instances have an IAM role with ECR pull perms, or that your ECR credential helper is configured under the same user that runs Ray.

Using this in Prefect after the test works
- Set the runtime_env globally for the flow:
from prefect import flow
from prefect_ray.task_runners import RayTaskRunner

@flow(task_runner=RayTaskRunner(
    address="ray://<head-ip>:10001",  # or address="auto" if running on head
    init_kwargs={
        "runtime_env": {
            "container": {
                "image": "<http://123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310|123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310>",
                # "run_options": ["--gpus=all"]
            }
        }
    }
))
def my_flow():
    ...
- Or choose different images per task section using `remote_options`:
from prefect_ray.context import remote_options
with remote_options(runtime_env={"container": {"image": "your-other-image"}}):
    ...
Official docs for what we're using
- Running Ray tasks/actors in containers (runtime_env): Ray: containerized tasks/actors
- Ray Client and addresses: Ray Client
- ray.init reference: ray.init API
- Prefect-Ray integration: Prefect-Ray

If you hit an error during the test, paste the outputs for:
- `which docker`, `docker ps`, `id -nG`
- The JSON from both "Without container" and "With container" sections
…and I'll help pinpoint whether it's PATH, group membership, image pull, or something else.
s
@Marvin So everything was perfect (docker had the path, ec2-user was in the group etc) but here is the output of the python test script: python test.py Driver PATH: /home/ec2-user/.pyenv/versions/3.11.10/bin/home/ec2 user/.pyenv/libexec/home/ec2-user/.pyenv/plugins/python-build/bin/home/ec2 user/.pyenv/plugins/pyenv virtualenv/bin/home/ec2-user/.pyenv/plugins/pyenv-update/bin/home/ec2 user/.pyenv/plugins/pyenv doctor/bin/home/ec2-user/.pyenv/shims/home/ec2 user/.pyenv/bin/usr/local/bin/usr/bin/usr/local/sbin/usr/sbin/home/ec2-user/.local/bin:/home/ec2-user/bin Driver which(docker): /usr/bin/docker 2025-10-09 231621,786 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 172.31.17.173:6379... 2025-10-09 231621,803 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at http://172.31.17.173:8265 --- Without container --- (autoscaler +10s) Tip: use
ray status
to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0. (autoscaler +10s) Adding 1 node(s) of type ray.worker.default. { "PATH": "/home/ec2-user/.pyenv/versions/3.11.10/bin/home/ec2 user/.pyenv/libexec/home/ec2-user/.pyenv/plugins/python-build/bin/home/ec2 user/.pyenv/plugins/pyenv virtualenv/bin/home/ec2-user/.pyenv/plugins/pyenv-update/bin/home/ec2 user/.pyenv/plugins/pyenv doctor/bin/home/ec2-user/.pyenv/shims/home/ec2 user/.pyenv/bin/usr/local/bin/usr/bin/usr/local/sbin/usr/sbin/home/ec2-user/.local/bin:/home/ec2-user/bin", "which_docker": "/usr/bin/docker", "whoami": "ec2-user", "python": "Python 3.11.10", "os_release": "NAME=\"Amazon Linux\"\nVERSION=\"2\"\nID=\"amzn\"\nID_LIKE=\"centos rhel fedora\"\nVERSION_ID=\"2\"\nPRETTY_NAME=\"Amazon Linux 2\"\nANSI_COLOR=\"0;33\"\nCPE_NAME=\"cpe2.3oamazonamazon_linux:2\"\nHOME_URL=\"https://amazonlinux.com/\"\nSUPPORT_END=\"2026-06-30\"", "has_nvidia_smi": false } --- With container runtime_env --- (raylet, ip=172.31.25.199) bash: line 0: exec: podman: not found (autoscaler +40s) Resized to 2 CPUs. (raylet, ip=172.31.25.199) [2025-10-09 231759,241 E 5840 5840] (raylet) worker_pool.cc:553: Some workers of the worker process(5967) have not registered within the timeout. The process is dead, probably it crashed during start. (raylet, ip=172.31.25.199) bash: line 0: exec: podman: not found
m
thought for 96.1 seconds
Thanks for the detailed output — that helps a lot.

What your test shows
- The "without container" task ran fine.
- The "with container" task was scheduled on a new worker (ip=172.31.25.199) and failed with: bash: exec: podman: not found
- This strongly suggests that on that worker node, Ray's container runtime detection didn't find Docker (or Docker wasn't usable from the raylet's environment), so it tried Podman and… Podman isn't installed either.

Why this happens with ray up on EC2
- The autoscaler can bring up new worker nodes after you connect. If your worker AMI does not have Docker installed/configured, or your cluster YAML only installs Docker on the head, then any tasks that require the container runtime_env will fail on those workers.
- The head showed Docker on PATH, but the failing node is a different worker. We need to verify Docker on workers too.

Step-by-step: validate and fix on the worker node

1) SSH to the worker (the failing IP from your logs is 172.31.25.199)
- From your workstation, use the same key and user as your ray cluster config:

ssh -i /path/to/your/key.pem ec2-user@172.31.25.199

2) Check Docker availability and permissions

which docker
docker ps
id -nG

- You want:
  - which docker -> /usr/bin/docker (or similar)
  - docker ps works without sudo
  - your user (e.g., ec2-user) is in the docker group

3) If Docker is missing or needs setup on the worker
- Install and enable:
sudo yum install -y docker
sudo systemctl enable --now docker
sudo usermod -aG docker ec2-user
# Open a fresh login OR run:
newgrp docker
docker ps
- If you're pulling from ECR, make sure the instance has an IAM role with ECR pull perms and that your credential helper is configured for the same user that runs Ray.

4) Make it automatic for all future workers
- Update your Ray cluster YAML so workers get Docker too, not just the head. Use setup_commands (applies to head+workers) or worker_setup_commands. Example (simplified):
# ray-cluster-config-docker.yaml (excerpt)
auth:
  ssh_user: ec2-user

setup_commands:
  - sudo yum install -y docker
  - sudo systemctl enable --now docker
  - sudo usermod -aG docker ec2-user
  # ensure current shell has docker group for subsequent commands
  - newgrp docker || true
  - docker --version
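If you prefer the AMI route mentioned next, here is a hedged sketch of pointing workers at a container-ready AMI via available_node_types (AWS cluster launcher schema; the node type name matches your autoscaler logs, but the AMI ID and instance type are placeholders):

# ray-cluster-config-docker.yaml (excerpt, placeholders as noted above)
available_node_types:
  ray.worker.default:
    min_workers: 0
    max_workers: 4
    resources: {}
    node_config:
      InstanceType: m5.xlarge          # placeholder
      ImageId: ami-0123456789abcdef0   # AMI baked with Docker + docker group for ec2-user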
- Alternatively, bake Docker + group membership into the AMI used for BOTH head and workers, and reference that AMI in available_node_types for workers (as sketched above).

5) Restart Ray to ensure new nodes inherit the correct env/groups
- After adjusting your setup (or AMI), restart the cluster so new workers are provisioned with Docker:
  - Simplest: scale down to zero workers, then back up; or tear down and `ray up` again against the updated YAML.

Re-run the quick Python test on the head
- This time, force it to use a worker (by letting the autoscaler bring one up) and confirm the containerized task works.

For convenience, here's the same test block you can paste on the head (replace IMAGE):

python - << 'PY'
import os, shutil, ray, subprocess, json

IMAGE = "123456789012.dkr.ecr.us-east-1.amazonaws.com/your-image:py310"

print("Driver PATH:", os.environ.get("PATH"))
print("Driver which(docker):", shutil.which("docker"))

ray.init(address="auto")

@ray.remote
def info():
    import os, subprocess, shutil
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
        "python": subprocess.getoutput("python -V 2>&1"),
        "os_release": subprocess.getoutput("cat /etc/os-release || true"),
        "has_nvidia_smi": subprocess.call("nvidia-smi >/dev/null 2>&1", shell=True) == 0,
    }

print("\n--- Without container ---")
print(json.dumps(ray.get(info.remote()), indent=2))

print("\n--- With container runtime_env ---")
task = info.options(
    runtime_env={
        "container": {
            "image": IMAGE,
            # "run_options": ["--gpus=all", "--ipc=host"],  # if needed
        }
    }
).remote()
print(json.dumps(ray.get(task), indent=2))
PY

Notes on PATH and why Podman appeared
- Ray chooses a container runtime by probing what's available to the raylet/worker process on each node. If Docker isn't installed, not on PATH, or not usable, Ray may attempt Podman. In your case, the worker didn't have a usable Docker context, so it tried Podman and failed.
- Each node is independent: your head node having Docker on PATH doesn't guarantee workers do.

Docs to back this up
- Running tasks/actors in containers (Ray runtime_env): Ray: containerized tasks/actors
- Ray Client (connecting to existing clusters, ray://): Ray Client
- ray.init reference: ray.init API
- Prefect-Ray integration: Prefect-Ray integration

To summarize your next moves
- Verify Docker on the failing worker (172.31.25.199). Fix/install if needed.
- Ensure your cluster YAML provisions Docker for ALL workers (setup_commands or AMI).
- Restart/scale the cluster and re-run the Python test.
- Once it works in Ray, use the same runtime_env in your Prefect flow's `RayTaskRunner(init_kwargs={...})`.

If you share your cluster YAML (sanitized) I can point to the exact section to add the worker Docker setup so the autoscaler always brings up container-ready nodes.
s
@Marvin so I logged into my worker and tried all the docker commands. They work perfectly: [ec2-user@ip-172-31-25-199 ~]$ which docker /usr/bin/docker [ec2-user@ip-172-31-25-199 ~]$ docker ps CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES [ec2-user@ip-172-31-25-199 ~]$ id -nG ec2-user adm wheel systemd-journal docker [ec2-user@ip-172-31-25-199 ~]$ docker pull 971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231 pd231: Pulling from prefect-flows bc0965b23a04: Pull complete c58b365c3bcb: Pull complete 3bc3f7229179: Pull complete 42dacf6a4601: Pull complete 170afa08c0a6: Pull complete 14e285c56f13: Pull complete 6ac45919b651: Pull complete bfe622ecd2e2: Pull complete 64d7dbf1fd50: Pull complete Digest: sha256:87730a4abd607ef4774f8ea840de2e1d2f909ef0b7e9b1aecce8c584b445a6b9 Status: Downloaded newer image for 971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231 971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231 but the test on the head is still not working: python test2.py Driver PATH: /home/ec2-user/.pyenv/versions/3.11.10/bin/home/ec2 user/.pyenv/libexec/home/ec2-user/.pyenv/plugins/python-build/bin/home/ec2 user/.pyenv/plugins/pyenv virtualenv/bin/home/ec2-user/.pyenv/plugins/pyenv-update/bin/home/ec2 user/.pyenv/plugins/pyenv doctor/bin/home/ec2-user/.pyenv/shims/home/ec2 user/.pyenv/bin/usr/local/bin/usr/bin/usr/local/sbin/usr/sbin/home/ec2-user/.local/bin:/home/ec2-user/bin Driver which(docker): /usr/bin/docker 2025-10-09 232648,286 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 172.31.17.173:6379... 2025-10-09 232648,293 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at http://172.31.17.173:8265 --- Without container --- { "PATH": "/home/ec2-user/.pyenv/versions/3.11.10/bin/home/ec2 user/.pyenv/libexec/home/ec2-user/.pyenv/plugins/python-build/bin/home/ec2 user/.pyenv/plugins/pyenv virtualenv/bin/home/ec2-user/.pyenv/plugins/pyenv-update/bin/home/ec2 user/.pyenv/plugins/pyenv doctor/bin/home/ec2-user/.pyenv/shims/home/ec2 user/.pyenv/bin/usr/local/bin/usr/bin/usr/local/sbin/usr/sbin/home/ec2-user/.local/bin:/home/ec2-user/bin", "which_docker": "/usr/bin/docker", "whoami": "ec2-user", "python": "Python 3.11.10", "os_release": "NAME=\"Amazon Linux\"\nVERSION=\"2\"\nID=\"amzn\"\nID_LIKE=\"centos rhel fedora\"\nVERSION_ID=\"2\"\nPRETTY_NAME=\"Amazon Linux 2\"\nANSI_COLOR=\"0;33\"\nCPE_NAME=\"cpe2.3oamazonamazon_linux:2\"\nHOME_URL=\"https://amazonlinux.com/\"\nSUPPORT_END=\"2026-06-30\"", "has_nvidia_smi": false } --- With container runtime_env --- (raylet, ip=172.31.25.199) bash: line 0: exec: podman: not found
m
thought for 96.3 seconds
This is super helpful. Given your config file is named ray-cluster-config-docker.yaml, I'm 99% sure Ray is running inside a Docker container on each node. That explains the symptom:
- On the host, docker works fine (as you showed).
- But the raylet process is inside a Ray container that likely does not have the docker CLI, nor the Docker socket mounted. Inside that container, Ray tries to find a container runtime. It doesn't see docker, falls back to podman, and you get "podman: not found".

Quick check on the worker to confirm
- See if Ray is running in a Docker container: `docker ps | grep -i ray` (you should see a container for ray head/worker)
- Or: `ps -ef | grep raylet` then `sudo nsenter -t <raylet-pid> -m -u -i -n env | grep PATH` (advanced)
- Also, jump into the Ray container and check docker availability: `docker exec -it <ray-container-name> bash -lc 'which docker || echo no-docker; ls -l /var/run/docker.sock || echo no-sock'`

Why your Python test fails
- The containerized tasks feature (runtime_env["container"]) shells out to a container runtime from the raylet. Since raylet is inside a container without docker, it can't call docker on the host unless you:
  - install a docker CLI inside the Ray container, and
  - mount the host's Docker socket into the Ray container, and
  - ensure the user in the Ray container can access the socket.

Two ways to fix it

Option 1 — Run Ray directly on the host (simplest)
- Remove the docker: section from your cluster YAML so Ray doesn't run inside a container.
- Recreate/scale your cluster.
- Your existing runtime_env container test should start working immediately because raylet will now see /usr/bin/docker on the host.

Option 2 — Keep Ray in Docker, but give it access to Docker
- Update the cluster YAML's docker section to:
  - Use a Ray image that has the docker CLI installed, or extend it yourself. Example Dockerfile:
FROM rayproject/ray:2.48.0-py310
RUN apt-get update && apt-get install -y docker.io && rm -rf /var/lib/apt/lists/*
  - Mount the Docker socket and any needed privileges. In your YAML docker.run_options, add:
    - `--volume=/var/run/docker.sock:/var/run/docker.sock`
    - `--ipc=host`
    - Optional: `--privileged` (or granular capabilities if needed)
  - Ensure the container user can talk to the socket:
    - The Docker socket is typically owned by group "docker" on the host. You can:
      - Run the Ray container as root (quick and dirty), or
      - Create a matching group inside the container with the host's docker group GID and add the container user to it.
  - After these changes, redeploy/scale your cluster so new Ray containers have docker and the socket. (A YAML sketch of such a docker: section follows at the end of this message.)

After applying either option, re-run the Python test on the head
- Same test you used; it should now run the "With container runtime_env" section successfully and you should see a different os_release (from your image).

If you want to verify the raylet's environment right now
- On the worker:
  - Find the raylet PID: `pgrep -f raylet`
  - Inspect its env PATH: `sudo tr '\0' '\n' </proc/$(pgrep -f raylet | head -n1)/environ | grep PATH`
- If Ray is running in Docker, that PATH will be the container's PATH and won't include docker unless you install it inside that container.

Docs for what we're using
- Running tasks and actors in containers (runtime_env): Ray: containerized tasks/actors
- Ray Client (connecting to an existing cluster): Ray Client
- ray.init reference: ray.init API
- Prefect-Ray integration: Prefect-Ray integration
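A hedged sketch of an Option-2 docker: section for the cluster launcher. The image name is a placeholder for a Ray image extended with the docker CLI, and if your setup does not shell-expand the $(...) below, substitute the numeric GID of the host's docker group:

docker:
  image: your-registry/ray-with-docker-cli:2.48.0
  container_name: ray_container
  pull_before_run: true
  run_options:
    - --volume=/var/run/docker.sock:/var/run/docker.sock
    - --group-add=$(stat -c %g /var/run/docker.sock)
    - --ipc=host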
If you share the docker: section of your ray-cluster-config-docker.yaml (sanitized), I can suggest exact run_options and a minimal Dockerfile tweak so Ray-in-Docker can call Docker-on-host.
s
@Marvin whats the pid? docker ps | grep -i ray [ec2-user@ip-172-31-17-173 ~]$ ps -ef | grep raylet ec2-user 7004 1 0 23:09 ? 000008 /home/ec2-user/.pyenv/versions/3.11.10/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/sockets/raylet --store_socket_name=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/sockets/plasma_store --object_manager_port=8076 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=172.31.17.173 --maximum_startup_concurrency=1 --static_resource_list=node172.31.17.173,1.0,node__internal_head__,1.0,memory,301434471,object_store_memory,150717235 --python_worker_command=/home/ec2-user/.pyenv/versions/3.11.10/bin/python3.11 /home/ec2-user/.pyenv/versions/3.11.10/lib/python3.11/site-packages/ray/_private/workers/setup_worker.py /home/ec2-user/.pyenv/versions/3.11.10/lib/python3.11/site-packages/ray/_private/workers/default_worker.py --node-ip-address=172.31.17.173 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/sockets/plasma_store --raylet-name=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/sockets/raylet --redis-address=None --temp-dir=/tmp/ray --metrics-agent-port=56338 --runtime-env-agent-port=54436 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --runtime-env-agent-port=54436 --gcs-address=172.31.17.173:6379 --session-name=session_2025-10-09_23-08-47_338911_6867 --temp-dir=/tmp/ray --webui=172.31.17.173:8265 --cluster-id=967dc0aa760cbb1b48760385fd50882011274a5bcded820f04f5add5 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --java_worker_command= --cpp_worker_command= --native_library_path=/home/ec2-user/.pyenv/versions/3.11.10/lib/python3.11/site-packages/ray/cpp/lib --temp_dir=/tmp/ray --session_dir=/tmp/ray/session_2025-10-09_23-08-47_338911_6867 --log_dir=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/logs --resource_dir=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/runtime_resources --metrics-agent-port=56338 --metrics_export_port=51597 --runtime_env_agent_port=54436 --object_store_memory=150717235 --plasma_directory=/dev/shm --ray-debugger-external=0 --gcs-address=172.31.17.173:6379 --session-name=session_2025-10-09_23-08-47_338911_6867 --labels= --cluster-id=967dc0aa760cbb1b48760385fd50882011274a5bcded820f04f5add5 --head --num_prestart_python_workers=0 --dashboard_agent_command=/home/ec2-user/.pyenv/versions/3.11.10/bin/python3.11 -u /home/ec2-user/.pyenv/versions/3.11.10/lib/python3.11/site-packages/ray/dashboard/agent.py --node-ip-address=172.31.17.173 --metrics-export-port=51597 --dashboard-agent-port=56338 --listen-port=52365 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/sockets/plasma_store --raylet-name=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/sockets/raylet --temp-dir=/tmp/ray --session-dir=/tmp/ray/session_2025-10-09_23-08-47_338911_6867 --log-dir=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2025-10-09_23-08-47_338911_6867 --gcs-address=172.31.17.173:6379 --runtime_env_agent_command=/home/ec2-user/.pyenv/versions/3.11.10/bin/python3.11 -u /home/ec2-user/.pyenv/versions/3.11.10/lib/python3.11/site-packages/ray/_private/runtime_env/agent/main.py --node-ip-address=172.31.17.173 --runtime-env-agent-port=54436 --gcs-address=172.31.17.173:6379 
--runtime-env-dir=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/runtime_resources --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --log-dir=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/logs --temp-dir=/tmp/ray ec2-user 7041 7004 1 23:09 ? 000018 /home/ec2-user/.pyenv/versions/3.11.10/bin/python3.11 -u /home/ec2-user/.pyenv/versions/3.11.10/lib/python3.11/site-packages/ray/dashboard/agent.py --node-ip-address=172.31.17.173 --metrics-export-port=51597 --dashboard-agent-port=56338 --listen-port=52365 --node-manager-port=46559 --object-store-name=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/sockets/plasma_store --raylet-name=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/sockets/raylet --temp-dir=/tmp/ray --session-dir=/tmp/ray/session_2025-10-09_23-08-47_338911_6867 --log-dir=/tmp/ray/session_2025-10-09_23-08-47_338911_6867/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2025-10-09_23-08-47_338911_6867 --gcs-address=172.31.17.173:6379 --agent-id 424238335 ec2-user 8159 7438 0 23:31 pts/1 000000 grep --color=auto raylet
m
thought for 91.4 seconds
Nice — thanks for the details. Two key points based on your output:
- The head's raylet PID is 7004 (from your ps output). But the failure happens on the worker at 172.31.25.199, so we need to inspect the raylet environment on the worker.
- Even though docker works interactively on the worker, Ray is invoking the container runtime from the worker's raylet/runtime-env agent process. If that process's environment cannot find docker on PATH at launch time, it'll try podman and you see "podman: not found".

Let's verify and fix, step by step.

A) Inspect the worker's raylet environment (172.31.25.199)
Run these on the worker:
# 1) Confirm raylet PID (on the worker)
pgrep -f raylet

# 2) Show the PATH visible to the raylet process (replace <PID> with output above)
sudo tr '\0' '\n' </proc/<PID>/environ | grep ^PATH=

# 3) Sanity check your current shell
echo $PATH
which docker
id -nG

# 4) Check runtime env agent logs for clues
ls -1 /tmp/ray/session_*/logs | grep runtime_env
grep -i -E 'podman|docker' /tmp/ray/session_*/logs/runtime_env* 2>/dev/null | tail -n 100
If PATH for the raylet process does not include /usr/bin, or which docker is None for the raylet, that explains why it can't find docker at container launch time even though you can find it interactively.

B) Pin a simple "info" task to the worker (no container) to see what that worker reports
Run this from the head (or anywhere that can connect), but force the task onto 172.31.25.199:
python - << 'PY'
import ray, json, os, shutil

ray.init(address="auto")

# Make an info task
@ray.remote
def info():
    import os, shutil, subprocess
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
    }

# Force the task to run on the target worker via node resource
target_ip = "172.31.25.199"
res = ray.get(info.options(resources={f"node:{target_ip}": 0.001}).remote())
print(json.dumps(res, indent=2))
PY
If `which_docker` comes back null here, the environment that Ray gives to workers on that node cannot see docker.

C) How to fix (common causes + remedies)
- PATH not present for Ray processes:
  - The Ray autoscaler starts ray processes with a different environment than your interactive shell. Ensure /usr/bin is on PATH in the environment used to start ray. You can enforce PATH in your cluster YAML start commands.
  - Example (add to worker_start_ray_commands and head_start_ray_commands):
head_start_ray_commands:
  - 'export PATH=/usr/bin:$PATH; ray stop'
  - 'export PATH=/usr/bin:$PATH; ray start --head --dashboard-host=0.0.0.0 --port=6379'
worker_start_ray_commands:
  - 'export PATH=/usr/bin:$PATH; ray stop'
  - 'export PATH=/usr/bin:$PATH; ray start --address=$RAY_HEAD_IP:6379'
Adjust flags to match your config. Re-run `ray up` after editing.

- Group membership timing:
  - If you add ec2-user to the docker group in setup_commands, the membership won't apply to already-running processes. Make sure Docker install + usermod happen before Ray starts, and start Ray only in a fresh session where `groups` shows docker. A simple way:
    - Do the docker install/usermod in setup_commands
    - In start_ray_commands, do not restart usermod; just start Ray. If needed, force a login shell or use `newgrp docker` before `ray start`.
- Confirm docker and ECR access for the raylet:
  - After you fix PATH, try the same Python "with container" test again. If you hit auth issues, pre-pull the image on the worker and/or ensure the instance profile gives ECR pull permissions.

D) Re-run the container test (on the head)
Same script you used before; it should now succeed:

python - << 'PY'
import os, shutil, ray, subprocess, json

IMAGE = "971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231"

print("Driver PATH:", os.environ.get("PATH"))
print("Driver which(docker):", shutil.which("docker"))

ray.init(address="auto")

@ray.remote
def info():
    import os, subprocess, shutil
    return {
        "PATH": os.environ.get("PATH"),
        "which_docker": shutil.which("docker"),
        "whoami": subprocess.getoutput("whoami"),
        "python": subprocess.getoutput("python -V 2>&1"),
        "os_release": subprocess.getoutput("cat /etc/os-release || true"),
    }

print("\n--- Without container ---")
print(json.dumps(ray.get(info.remote()), indent=2))

print("\n--- With container runtime_env ---")
task = info.options(runtime_env={"container": {"image": IMAGE}}).remote()
print(json.dumps(ray.get(task), indent=2))
PY
E) Why “podman: not found” despite docker being installed
- Ray’s containerized tasks feature shells out from the raylet/runtime-env agent. If the raylet’s environment cannot find docker (binary not on PATH at the time the process was started), Ray may try podman. Your interactive shell is fine, but the raylet process environment may be missing PATH entries.
- Ensuring PATH (and group membership) for the process that starts Ray fixes this.

Once this test works, you can use the same runtime_env in Prefect-Ray’s RayTaskRunner:
- Global for the flow:
@flow(task_runner=RayTaskRunner(
    address="ray://<head-ip>:10001",
    init_kwargs={
        "runtime_env": {"container": {"image": "971422704224.dkr.ecr.ap-south-1.amazonaws.com/prefect-flows:pd231"}}
    },
))
def my_flow():
    ...

- Or per section with `remote_options`.

Docs
- Ray: Running tasks/actors in containers: Ray docs: containerized tasks/actors
- Ray Client (ray://): Ray Client
- ray.init reference: ray.init API
- Prefect-Ray: Prefect-Ray integration

If you can paste:
- The PATH seen by the worker's raylet (proc environ), and
- The output of the pinned info task on 172.31.25.199,
…I can advise exactly which YAML changes to make so every autoscaled worker reliably sees docker.