# ask-marvin
s
@Marvin RayTaskRunner is running containers with podman by default. How do I switch it to Docker?
m
thought for 222.9 seconds
Short answer: Prefect doesn't pick Podman or Docker here; Ray does. Ray's container support uses the Docker SDK and connects to whichever "Docker-compatible" socket your environment points to. On many Fedora/RHEL setups, that's Podman by default. To switch to Docker, point Ray to Docker's socket.

What to do:
- Make sure Docker is installed and the daemon is running.
- Set DOCKER_HOST for the process that starts your flow/worker so it points to Docker's socket (and not Podman's). The common Linux default:
export DOCKER_HOST=unix:///var/run/docker.sock
- Then start your worker/flow in the same shell:
prefect worker start -p <your-pool>
- Or when running a script:
DOCKER_HOST=unix:///var/run/docker.sock python your_flow.py
- Unset or override any env vars that point to Podman (e.g., DOCKER_HOST=unix:///run/user/<uid>/podman/podman.sock).
- If your machine has podman-docker installed, or a Podman socket is taking over /var/run/docker.sock, either remove that shim or explicitly set DOCKER_HOST as above so Ray uses Docker.

If you're using Ray containers via runtime_env in Prefect:
from prefect import flow, task
from prefect_ray.task_runners import RayTaskRunner
from prefect_ray.context import remote_options

@task
def say_hi():
    return "hi"

@flow(task_runner=RayTaskRunner())
def my_flow():
    with remote_options(runtime_env={"container": {"image": "python:3.11"}}):
        return say_hi()

if __name__ == "__main__":
    # Ensure this process sees DOCKER_HOST pointing at Docker (not Podman)
    # export DOCKER_HOST=unix:///var/run/docker.sock
    my_flow()
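If it helps, you can also pin the socket from Python before the flow starts. A minimal sketch of the idea (same default socket path as in the comment above):

import os

# Run this before my_flow() so Ray's Docker client sees it;
# setdefault leaves any socket the environment already chose untouched
os.environ.setdefault("DOCKER_HOST", "unix:///var/run/docker.sock")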
Ray will use whichever "Docker" socket your environment points to; the code itself does not choose Podman vs. Docker.

Troubleshooting quick checks:
- See where you're pointed:
echo $DOCKER_HOST
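If the docker Python SDK is available (pip install docker; an assumption here, it is not part of Prefect or Ray), you can also ask the daemon on that socket to identify itself; a minimal sketch:

import docker  # pip install docker

# from_env() honors DOCKER_HOST, so this talks to the same socket Ray would
client = docker.from_env()
version = client.version()
# Docker typically reports a component named "Engine"; Podman's compat API reports "Podman Engine"
print([component.get("Name") for component in version.get("Components", [])])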
If you're still hitting Podman, stop the user-level Podman socket or remove the podman-docker shim so /var/run/docker.sock is truly Docker, or always set DOCKER_HOST explicitly when starting Prefect/Ray.

Docs: Prefect Ray integration

If you can share how you're launching Ray (local vs. cluster), your OS, and where you set runtime_env, I can tailor the exact steps.
s
@Marvin I have an ECS worker and an ECS work pool. I have a Ray head node on EC2, and my flow looks like this. I want each task to run in its own Docker container, but the flow-runner container in ECS gave me the "podman not found" logs (with IP addresses for the Ray workers).

import os
import random
import time

import pandas as pd
from dotenv import load_dotenv
from prefect import flow, task
from prefect.deployments import DeploymentImage
from prefect.tasks import task_input_hash
from prefect_ray import RayTaskRunner
from prefect_ray.context import remote_options

PANDAS_VERSION = "2.1.3"

# Load both .env files to access all environment variables
# TEAM: Make sure you have .env and .env.generated files in the ray-prefect-final directory
load_dotenv('../.env')
load_dotenv('../.env.generated')


def get_ray_image_uri():
    """Get the Docker image URI for Ray tasks."""
    print("Call to get_ray_image_uri, shall be at build time.")
    # Use environment variable or fallback to hardcoded registry
    registry = os.getenv('AWS_ECR_REGISTRY')
    uri = f"{registry}:pd213"
    print(f"Image URI: {uri}")
    return uri


@task(cache_key_fn=task_input_hash)
def test(in_arg):
    print("test task started", in_arg)
    # print(pd.DataFrame([1, 2, 3]))

    # Validate Python version
    import sys
    print(f"Python version: {sys.version}")
    print(f"Python version info: {sys.version_info}")

    # Check pandas version
    # print(f"Pandas version: {pd.__version__}")
    print(f"Pandas version is correct: {PANDAS_VERSION}")
    # if pd.__version__ == PANDAS_VERSION:
    #     print(f"Pandas version is correct: {PANDAS_VERSION}")
    # else:
    #     print(f"Pandas version mismatch! Expected: {PANDAS_VERSION}, Got: {pd.__version__}")
    #     raise Exception("Pandas version mismatch")

    time.sleep(100)
    print("test task finished", in_arg)


@flow(
    log_prints=True,
    task_runner=RayTaskRunner(
        address=f"ray://{os.getenv('RAY_HEAD_NODE_IP')}:10001",
        # this shall enable the Docker image to be used in the Ray cluster,
        # but it is not working for some reason; pandas is not available.
        # init_kwargs={"runtime_env": {"image_uri": get_ray_image_uri()}}  # for ray 2.48.0
        # init_kwargs={"runtime_env": {"container": {"image": get_ray_image_uri()}}}  # for ray 2.9.2
    )
)
def test_ray():
    args = [random.random() for _ in range(10)]
    args.append(0)  # task that shall always be cached
    print("args", args)
    # Apply runtime environment to each task to run in separate containers
    with remote_options(
        num_cpus=1,
        resources={"worker_node_1": 0.001},
        runtime_env={
            "container": {
                "image": get_ray_image_uri(),
                "pull_before_run": True
            }
        }
    ):
        # with remote_options(num_cpus=1, resources={"worker_node_1": 0.001}):
        test.map(args)


if __name__ == "__main__":
    # Check for required environment variables
    required_vars = {
        'AWS_ECR_REGISTRY': os.getenv('AWS_ECR_REGISTRY'),
        'PREFECT_DEPLOYMENT_NAME': os.getenv('PREFECT_DEPLOYMENT_NAME'),
        'PREFECT_WORKPOOL_NAME': os.getenv('PREFECT_WORKPOOL_NAME'),
    }
    print("required_vars", required_vars)
    missing_vars = [var for var, value in required_vars.items() if not value]
    if missing_vars:
        print(f"ERROR: Missing required environment variables: {', '.join(missing_vars)}")
        print("Please set these variables in your .env file or environment.")
        exit(1)

    test_ray.deploy(
        name="test_ray_1",
        work_pool_name=os.getenv('PREFECT_WORKPOOL_NAME'),
        description="Test Ray deployment",
        image=DeploymentImage(
            name=get_ray_image_uri(),
            platform="linux/amd64",
            dockerfile="Dockerfile",
            buildargs={
                "BUILD_TIME": str(int(time.time())),  # Force rebuild with timestamp
            }
        ),
        job_variables={
            "cpu": 1024,
            "memory": 1024,
            "env": {
                "AWS_DEFAULT_REGION": os.getenv("AWS_DEFAULT_REGION"),
                "AWS_DEFAULT_BUCKET": os.getenv("AWS_DEFAULT_BUCKET"),
                "AWS_ECR_REGISTRY": os.getenv("AWS_ECR_REGISTRY"),
                "RAY_HEAD_NODE_IP": os.getenv("RAY_HEAD_NODE_IP"),
            }
        }
    )
m
thought for 58.3 seconds
Thanks for the detail; this helps.

What's happening:
- Your ECS flow-runner is just the Ray driver. The "podman not found" messages you see in its logs are actually coming from the Ray nodes (EC2) when Ray tries to launch each task in its own container via runtime_env.
- Prefect doesn't pick Podman or Docker here; Ray does. Ray tries to use whatever "Docker-compatible" runtime is available on the Ray nodes. On many distros, Podman (or a podman-docker shim) is the default.

How to switch Ray to Docker (on your EC2 Ray nodes)
Do the following on every Ray node (head and any workers), because containers are launched on those nodes:

1) Ensure Docker is installed and running
- Install Docker and start/enable the daemon.
- Make sure the user that runs Ray can access Docker (add it to the docker group or run Ray under a user with access).

2) Point Ray to Docker's socket (not Podman's)
- If your environment points to a Podman socket, override it for the Ray process:
export DOCKER_HOST=unix:///var/run/docker.sock
- Start Ray (or your Ray service) in an environment where DOCKER_HOST points to Docker’s socket. For example:
DOCKER_HOST=unix:///var/run/docker.sock ray start --head --port=10001 ...
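If you script node bring-up in Python instead, the same idea looks like this; a sketch that assumes the ray CLI is on PATH:

import os
import subprocess

# Launch the Ray head with DOCKER_HOST pinned to Docker's default socket
env = {**os.environ, "DOCKER_HOST": "unix:///var/run/docker.sock"}
subprocess.run(["ray", "start", "--head", "--port=10001"], env=env, check=True)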
Do the same for any worker nodes.
- If you have a podman-docker shim or a Podman socket bound at /var/run/docker.sock, either remove/disable that shim or always export DOCKER_HOST to Docker's actual socket path so Ray doesn't resolve to Podman.

3) Authenticate Docker to ECR (so Ray nodes can pull your image)
- On each Ray node (a boto3-based sketch of the same login follows the command):
aws ecr get-login-password --region <your-region> \
    | docker login --username AWS --password-stdin <your-account-id>.dkr.ecr.<your-region>.amazonaws.com
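If you'd rather do this from Python, a minimal boto3 sketch of the same login (the region is taken from your config; adjust as needed):

import base64

import boto3
import docker

# Fetch a 12-hour ECR auth token and log the local Docker daemon in
ecr = boto3.client("ecr", region_name="ap-south-1")
auth = ecr.get_authorization_token()["authorizationData"][0]
username, password = base64.b64decode(auth["authorizationToken"]).decode().split(":")
docker.from_env().login(username=username, password=password, registry=auth["proxyEndpoint"])
print("Logged in to", auth["proxyEndpoint"])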
- Optional: pre-pull the image with docker pull <your-ecr-uri> to validate access.

4) Match the runtime_env shape to your Ray version
- For Ray 2.9.x (what your AMI pins), use the container form:
with remote_options(
    num_cpus=1,
    runtime_env={
        "container": {
            "image": "<your ECR URI>",
            "pull_before_run": True
        }
    }
):
    test.map(args)
- On newer Ray releases (your commented-out init_kwargs mentions 2.48), the container field was replaced by image_uri:
with remote_options(
    num_cpus=1,
    runtime_env={"image_uri": "<your ECR URI>"}
):
    test.map(args)
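Since both shapes appear in your code comments, a small hypothetical helper can pick one at runtime. The version cutoff below is an assumption based on Ray's release history; verify it against the docs for your version:

import ray
from packaging.version import Version  # pip install packaging

def container_runtime_env(image: str) -> dict:
    # Assumption: image_uri replaced the container field around Ray 2.31; check your Ray release notes
    if Version(ray.__version__) >= Version("2.31"):
        return {"image_uri": image}
    return {"container": {"image": image, "pull_before_run": True}}

print(container_runtime_env("<your ECR URI>"))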
Your flow code is already set up this way; just be sure the Ray version on your EC2 nodes matches the runtime_env shape you use.

5) Sanity checks on a Ray node
- Confirm what your environment is pointing to:
echo $DOCKER_HOST
which docker
which podman
docker info
- If podman is installed or a podman-docker shim is present, consider removing it, or explicitly set DOCKER_HOST before starting Ray to avoid accidental Podman resolution.

Why the logs show up in ECS
- The driver (in ECS) connects to the Ray head and streams logs, so you'll see worker-side errors ("podman not found") in the ECS task logs even though the fix is on the Ray nodes.

Reference: Prefect Ray integration

If you can share:
- the Ray version on your EC2 nodes
- the OS/distro on the Ray nodes
- echo $DOCKER_HOST, which docker, and which podman output from a Ray node
- a small log snippet around the "podman not found" line

…I can tailor exact commands (systemd drop-ins, Ray start lines, etc.) for your setup.
s
@Marvin I do all this in my custom AMI creation (which I base my head and worker nodes off of):

# Function to execute remote commands via SSH
exec_remote_cmd() {
    local command="$1"
    echo " → Executing: $command"
    ssh -i "$key_path_and_name" -o StrictHostKeyChecking=no ec2-user@"$public_ip" "$command"
}

# Basic system updates and tools
exec_remote_cmd "sudo yum update -y && sudo yum upgrade -y"
exec_remote_cmd "sudo yum groupinstall -y 'Development Tools'"

# Install necessary libraries and dependencies
exec_remote_cmd "sudo yum install -y openssl11 openssl11-devel bzip2-devel libffi-devel sqlite-devel"
exec_remote_cmd "sudo yum install -y sqlite"
exec_remote_cmd "sudo yum install -y nano"
exec_remote_cmd "sudo yum install -y wget"
exec_remote_cmd "sudo yum install -y zlib-devel"
exec_remote_cmd "sudo yum install -y ncurses-devel"
exec_remote_cmd "sudo yum install -y ncurses-compat-libs"
exec_remote_cmd "sudo yum install -y readline-devel"
exec_remote_cmd "sudo yum install -y sqlite-devel"
exec_remote_cmd "sudo yum install -y gdbm-devel"
exec_remote_cmd "sudo yum install -y expat-devel"
exec_remote_cmd "sudo yum install -y xz-devel"
exec_remote_cmd "sudo yum install -y tk-devel"
exec_remote_cmd "sudo yum install -y uuid-devel"
exec_remote_cmd "sudo yum install -y rsync"

# Setup swapfile for additional memory
exec_remote_cmd "sudo fallocate -l 2G /swapfile"
exec_remote_cmd "sudo chmod 600 /swapfile"
exec_remote_cmd "sudo mkswap /swapfile"
exec_remote_cmd "sudo swapon /swapfile"
exec_remote_cmd "sudo swapon --show"
exec_remote_cmd "echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab"

# Docker installation and setup
exec_remote_cmd "sudo yum install -y docker"
exec_remote_cmd "sudo systemctl enable docker"
exec_remote_cmd "sudo systemctl start docker"
exec_remote_cmd "sudo usermod -aG docker ec2-user"

# Amazon ECR credential helper setup
exec_remote_cmd "sudo yum install -y amazon-ecr-credential-helper"
exec_remote_cmd "mkdir -p /home/ec2-user/.docker"

# Create docker config file locally
cat > docker_config.json << EOF
{
  "credsStore": "ecr-login",
  "credHelpers": {
    "public.ecr.aws": "ecr-login",
    "${AWS_ACCOUNT_ID}.dkr.ecr.ap-south-1.amazonaws.com": "ecr-login"
  }
}
EOF

# Copy the file directly to the destination on the remote instance
scp -i "$key_path_and_name" -o StrictHostKeyChecking=no docker_config.json ec2-user@"$public_ip":/home/ec2-user/.docker/config.json

# Clean up local temp file
rm -f docker_config.json
exec_remote_cmd "sudo chown -R ec2-user:ec2-user /home/ec2-user/.docker"

# Python 3.8 installation for Amazon Linux 2 (Prefect 2.14.20 supports Python 3.8+)
exec_remote_cmd "sudo amazon-linux-extras install python3.8"
exec_remote_cmd "sudo yum install -y python38-devel git"
exec_remote_cmd "curl https://pyenv.run | bash"
exec_remote_cmd "echo 'export PYENV_ROOT=\"\$HOME/.pyenv\"' >> ~/.bashrc"
exec_remote_cmd "echo 'command -v pyenv >/dev/null || export PATH=\"\$PYENV_ROOT/bin:\$PATH\"' >> ~/.bashrc"
exec_remote_cmd "echo 'eval \"\$(pyenv init -)\"' >> ~/.bashrc"
exec_remote_cmd "source ~/.bashrc"

# Install Python 3.11.10
exec_remote_cmd "export PYENV_ROOT=\"\$HOME/.pyenv\" && export PATH=\"\$PYENV_ROOT/bin:\$PATH\" && eval \"\$(pyenv init -)\" && pyenv install 3.11.10"
exec_remote_cmd "export PYENV_ROOT=\"\$HOME/.pyenv\" && export PATH=\"\$PYENV_ROOT/bin:\$PATH\" && eval \"\$(pyenv init -)\" && pyenv global 3.11.10"
exec_remote_cmd "sudo alternatives --install /usr/bin/python3 python3 /home/ec2-user/.pyenv/versions/3.11.10/bin/python3 1"
exec_remote_cmd "sudo alternatives --install /usr/bin/pip3 pip3 /home/ec2-user/.pyenv/versions/3.11.10/bin/pip3 1"
exec_remote_cmd "sudo alternatives --install /usr/bin/python python /home/ec2-user/.pyenv/versions/3.11.10/bin/python3 1"
exec_remote_cmd "sudo alternatives --install /usr/bin/pip pip /home/ec2-user/.pyenv/versions/3.11.10/bin/pip3 1"

# Python packages installation and upgrades
exec_remote_cmd "pip3 install --upgrade --force-reinstall cffi"
exec_remote_cmd "pip3 install -U urllib3"
exec_remote_cmd "pip3 install -U botocore"
exec_remote_cmd "pip3 install -U boto3"
exec_remote_cmd "pip3 install -U 'ray==2.9.2'"
exec_remote_cmd "pip3 show ray"
exec_remote_cmd "pip3 install 'fastapi==0.108.0'"
exec_remote_cmd "pip3 install 'starlette==0.32.0'"
exec_remote_cmd "pip3 install 'prefect==2.14.20'"
exec_remote_cmd "pip3 show prefect"
exec_remote_cmd "pip3 install 'prefect-ray==0.3.4'"
exec_remote_cmd "pip3 install 'griffe<1.0.0'"
exec_remote_cmd "pip3 install 'protobuf==3.20.3'"
exec_remote_cmd "pip3 install 'requests==2.31.0'"
exec_remote_cmd "pip3 install 'prefect-docker==0.4.5'"
m
thought for 107.2 seconds
Got it, thanks for sharing your AMI setup. Given that you're running the tasks in containers via Ray runtime_env on EC2, the "podman not found" lines in the ECS flow-runner logs are almost certainly coming from the Ray nodes when Ray tries to launch containers for each task.

Key points
- Prefect doesn't choose Podman/Docker; Ray does, on the EC2 nodes.
- Those containers are launched on the Ray nodes, not in the ECS flow-runner container. Fixes must happen on the EC2 nodes (head and workers).
- Ray will try a "Docker-compatible" runtime. If Docker is inaccessible (e.g., permissions) or not found in PATH for the user running Ray, Ray may fall back to trying podman, which yields "podman: command not found".

What to adjust in your AMI/provisioning

1) Ensure Docker is usable by the Ray process user at Ray start time
- You already install and start Docker and add ec2-user to the docker group; good.
- Critical: make sure the user that runs Ray can use Docker in that same session. Group membership changes don't apply to already-running processes. If you start Ray before the group change is effective, Docker commands will fail with permission denied and Ray may try Podman next.

Suggestions:
- Start Ray only after verifying Docker access as ec2-user:
sudo -u ec2-user -g docker docker info
sudo -u ec2-user -g docker docker pull public.ecr.aws/hello-world:latest || true
- If you run Ray via systemd, include:
  - After=docker.service and Requires=docker.service
  - User=ec2-user
  - SupplementaryGroups=docker
  - Environment=PATH=/usr/local/bin:/usr/bin:/bin
  - Optionally Environment=DOCKER_HOST=unix:///var/run/docker.sock

Example Ray head unit:
sudo tee /etc/systemd/system/ray-head.service >/dev/null <<'EOF'
[Unit]
Description=Ray Head
After=docker.service
Requires=docker.service

[Service]
User=ec2-user
Group=ec2-user
SupplementaryGroups=docker
Environment=PATH=/usr/local/bin:/usr/bin:/bin
Environment=DOCKER_HOST=unix:///var/run/docker.sock
ExecStart=/home/ec2-user/.pyenv/versions/3.11.10/bin/ray start --head --port=10001 --dashboard-host=0.0.0.0
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now ray-head.service
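To confirm a running Ray process actually sees that environment, you can inspect /proc. A Linux-only sketch; it assumes a raylet process is running and that you execute it as the same user (or root):

import subprocess

# The raylet launches task workers, so it is the process whose environment matters
pid = subprocess.check_output(["pgrep", "-f", "raylet"], text=True).split()[0]
with open(f"/proc/{pid}/environ", "rb") as f:
    entries = f.read().split(b"\0")
print([e.decode() for e in entries if e.startswith(b"DOCKER_HOST")])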
Do the same for worker nodes.

2) Point Ray to Docker's socket explicitly (avoid Podman shims)
- Set DOCKER_HOST for Ray's process:
Environment=DOCKER_HOST=unix:///var/run/docker.sock
- If a podman-docker shim binds /var/run/docker.sock to Podman on your AMI, remove it or ensure DOCKER_HOST points to the real Docker socket.
- Confirm on a Ray node:
echo $DOCKER_HOST
which docker
docker info
which podman || true
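The same checks from Python, if that's handier; a minimal sketch that assumes ec2-user and the default socket path:

import grp
import os

# Resolve the socket path Ray's Docker client would use (removeprefix needs Python 3.9+)
sock = os.environ.get("DOCKER_HOST", "unix:///var/run/docker.sock").removeprefix("unix://")
print("socket exists:", os.path.exists(sock))
print("socket usable by this process:", os.access(sock, os.R_OK | os.W_OK))
# A KeyError here would mean there is no docker group at all
print("ec2-user in docker group:", "ec2-user" in grp.getgrnam("docker").gr_mem)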
3) ECR auth for the user running Ray
- You configured the ECR credential helper under /home/ec2-user/.docker/config.json; that works if Ray runs as ec2-user.
- If Ray runs as root (not recommended here), root won't see ec2-user's Docker config. Either:
  - run Ray as ec2-user (preferred), or
  - also configure /root/.docker/config.json with:
{
  "credsStore": "ecr-login"
}
- You can also validate from the Ray user:
sudo -u ec2-user -g docker docker pull <your-ecr-uri>
4) Match your runtime_env shape to your Ray version
- For Ray 2.9.x (what your AMI pins), use the container form:
with remote_options(
    num_cpus=1,
    runtime_env={
        "container": {
            "image": "<your ECR URI>",
            "pull_before_run": True
        }
    }
):
    test.map(args)
- On newer Ray releases (e.g., 2.48 per your comments), use image_uri instead:
with remote_options(
    num_cpus=1,
    runtime_env={"image_uri": "<your ECR URI>"}
):
    test.map(args)
5) Validate on a Ray node before running via Prefect
- From one Ray node:
```
python - <<'PY'
import ray

ray.init(address="ray://<HEAD_IP>:10001")

@ray.remote(num_cpus=0.1, runtime_env={"container": {"image": "<your ECR URI>", "pull_before_run": True}})
def f():
    return "ok from container"

print(ray.get(f.remote()))
PY
```
- If this fails with "podman not found", then the Ray process still can't use Docker. Check Docker permissions and DOCKER_HOST as above.

Why you see "podman not found" in ECS logs
- The ECS flow-runner is just the Ray driver and streams worker logs. The failure is on the EC2 Ray node where the container should be launched.

If you can share:
- the Ray version on the nodes
- how you start Ray (systemd unit or shell, and as which user)
- the exact "podman not found" snippet
- the output of the following on a node:
sudo -u ec2-user -g docker docker info
…I can suggest exact AMI changes (including a drop-in for your usermod timing) to make this robust.

Reference: Prefect Ray integration