Thomas Cobban
09/08/2025, 4:25 AM
Marvin
09/08/2025, 4:25 AM
Marvin
09/08/2025, 4:26 AM
You're passing container="bathymaps/prefect-coiled:v2.0" to @coiled.function(...), which tells Coiled to launch workers using a different image (v2.0).
- The error “Failed pulling software environment image, error code 1” means Coiled failed to pull the container="bathymaps/prefect-coiled:v2.0" image in the target region (ap-southeast-2).
Common causes and quick checks
1) Tag mismatch
- You’re using v1.0 for the Prefect deployment and v2.0 for Coiled. If v2.0 wasn’t pushed or is private, Coiled can’t pull it.
- Recommendation: either align both to the same, known-good tag or verify that v2.0 exists and is accessible.
2) Image accessibility and registry auth
- Is bathymaps/prefect-coiled:v2.0 public? If private, Coiled needs registry credentials configured for your account.
- Try pulling from a Linux/amd64 host outside your network to verify:
docker pull --platform linux/amd64 bathymaps/prefect-coiled:v2.0
If that fails or prompts for auth, configure Coiled with the appropriate registry credentials (Docker Hub/ECR/GCR) in your Coiled account.
3) Architecture/manifest
- Even if you built with --platform linux/amd64, confirm the image manifest includes amd64 and that there isn’t a stray arm64-only tag.
docker buildx imagetools inspect bathymaps/prefect-coiled:v2.0
Ensure linux/amd64 is listed.
4) Region and registry rate limits
- Coiled will pull the image in ap-southeast-2. If you’re hitting rate limits (Docker Hub) or experiencing slow pulls, consider hosting the image in a registry local to that region (e.g., AWS ECR in ap-southeast-2) and point container= to that fully-qualified image (see the sketch after this list).
5) Using both software/container
- You commented out software=... (good). Make sure you never pass both at once; only one should be active.
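For illustration, here’s a hedged sketch of what pointing container= at a region-local ECR image could look like; the AWS account ID and repository name are placeholders, not values from your setup:
import coiled

@coiled.function(
    name="test-pull-ecr",
    # Hypothetical fully-qualified ECR URL in ap-southeast-2; substitute your
    # real AWS account ID and repository name
    container="123456789012.dkr.ecr.ap-southeast-2.amazonaws.com/prefect-coiled:v2.0",
    region="ap-southeast-2",
    n_workers=1,
)
def ping():
    import platform
    return platform.machine()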
Minimal repro to isolate Coiled from Prefect
Run this locally (outside Prefect) to confirm Coiled can pull and run your container in ap-southeast-2:
import coiled

@coiled.function(
    name="test-pull",
    container="bathymaps/prefect-coiled:v2.0",
    region="ap-southeast-2",
    n_workers=1,
)
def ping():
    import platform, sys
    return {"machine": platform.machine(), "python": sys.version}

print(ping())
If this fails with the same error, it confirms an image access/registry/arch issue on the Coiled side.
Recommendations to proceed
- For consistency, align the images (either use v1.0 everywhere or upgrade your Prefect deployment to v2.0):
copernicus_hourly_sst_so_ssh_jaxa_flow.deploy(
    name="copernicus_hourly_sst_so_ssh_jaxa_flow",
    work_pool_name="bathymaps-coiled-pool",
    image="bathymaps/prefect-coiled:v2.0",
    build=False,
    cron="0 * * * *",
)
- Ensure the Coiled image (the one in container=) is public or that Coiled has valid credentials for your registry.
- If possible, host the image in ECR in ap-southeast-2 and reference that ECR URL in container=... to avoid Docker Hub rate limits and speed up pulls.
Could you share:
- Whether bathymaps/prefect-coiled:v2.0 is public or private and which registry it’s on
- The exact Coiled log snippet around the failure (including any “denied”/“not found” lines)
- Confirmation that docker pull --platform linux/amd64 bathymaps/prefect-coiled:v2.0 works from a Linux amd64 machine
With that info, I can help pinpoint whether it’s an auth, tag, or architecture issue.
Thomas Cobban
09/08/2025, 4:41 AM
Marvin
09/08/2025, 4:43 AM
Marvin
09/08/2025, 4:47 AM
It’s possible the failing run is using a different container= value than the code you tested locally.
Quick fixes to try
1) Align your images to a single, known-good tag (e.g., v2.0) for both the Prefect deployment and the Coiled container:
# Coiled task
@coiled.function(
    name="prefect_jaxa_hourly_sst_median",
    container="bathymaps/prefect-coiled:v2.0",
    region="ap-southeast-2",
    memory="64 GiB",
    n_workers=2,
    idle_timeout="10m",
    environ={"CPL_VSIL_USE_TEMP_FILE_FOR_RANDOM_WRITE": "YES"},
)

# Prefect deploy
copernicus_hourly_sst_so_ssh_jaxa_flow.deploy(
    name="copernicus_hourly_sst_so_ssh_jaxa_flow",
    work_pool_name="bathymaps-coiled-pool",
    image="bathymaps/prefect-coiled:v2.0",
    build=False,
    cron="0 * * * *",
)
2) Rule out rate limits
- Temporarily set n_workers=1 in the Prefect run and try again. If that works but 2 fails, it’s likely registry rate-limiting.
- If you hit limits, host the image in ECR in ap-southeast-2 (or another registry close to the region) and set container= to the fully-qualified ECR URL. This avoids Docker Hub limits and speeds up pulls.
3) Ensure the Prefect image has an up-to-date Coiled client
- Make sure both images include the same coiled version:
python -c "import coiled, sys; print('coiled:', coiled.__version__, 'python:', sys.version)"
- If your v1.0 image is older, update and re-push or switch your deployment to v2.0 as above.
4) Verify the exact image being pulled in the failing run
- From the Coiled cluster page linked in the failing Prefect run, copy the lines around the pull failure. We want to see the precise image URL/tag and whether it says “not found”, “denied”, or rate-limited.
If you can share:
- The Coiled log snippet from the failing Prefect run (especially the line with the image name and the “denied/not found” text)
- Whether the images are public or private
- The work pool type for “bathymaps-coiled-pool” (Docker, Process, etc.)
…we can pinpoint whether this is tag drift, auth, or rate limits.
References
- Prefect + Coiled integration guide
- Coiled: Using Docker images
Thomas Cobban
09/08/2025, 5:16 AM
Marvin
09/08/2025, 5:16 AM
Marvin
09/08/2025, 5:18 AM
- When you ran the test locally, you used your own Coiled credentials and coiled version, and we know your container pulls fine in ap-southeast-2.
- When you run via a Prefect deployment, the code executes inside your work pool’s worker container. That environment likely differs:
- Different or missing Coiled credentials/config (COILED_API_TOKEN, account defaults, registry auth)
- A different coiled client version than your local one
- Potentially different code or decorator values than what you just tested (if the deployment wasn’t updated)
Given your ping test succeeded, this is almost certainly about the worker environment, not the image tag itself.
Concrete fixes to try
1) Ensure the worker has Coiled credentials and defaults
- In the Prefect UI, open your work pool “bathymaps-coiled-pool” and add environment variables:
- COILED_API_TOKEN=<your token>
- COILED_DEFAULT_REGION=ap-southeast-2
- (If you use a non-default account/org) COILED_ACCOUNT=<your-account>
- Alternatively, pass env at deploy time via job variables (works for Docker/Process pools):
copernicus_hourly_sst_so_ssh_jaxa_flow.deploy(
    name="copernicus_hourly_sst_so_ssh_jaxa_flow",
    work_pool_name="bathymaps-coiled-pool",
    image="bathymaps/prefect-coiled:v2.0",
    build=False,
    cron="0 * * * *",
    job_variables={"env": {
        "COILED_API_TOKEN": "...",
        "COILED_DEFAULT_REGION": "ap-southeast-2"
    }},
)
2) Align versions and config between local and deployment
- Make sure the deployment image has the same coiled version as your local:
python -c "import coiled, sys; print('coiled:', coiled.__version__, 'python:', sys.version)"
- Keep both the Prefect deployment image and the Coiled container= tag the same (you already switched to v2.0, which is good).
3) Rule out registry rate limits and concurrency
- Temporarily set n_workers=1 in the @coiled.function. If that works but 2 fails, you’re likely hitting registry rate limits. Hosting the image in ECR in ap-southeast-2 and using its full URL in container= helps a lot.
4) Verify what the worker is actually doing
- Add a tiny debug task at the start of your flow to inspect the runtime:
from prefect import task, get_run_logger

@task
def debug_env():
    import coiled, os, platform
    logger = get_run_logger()
    logger.info(f"coiled version: {getattr(coiled, '__version__', 'unknown')}")
    logger.info(f"whoami: {getattr(coiled, 'whoami', lambda: 'n/a')()}")
    logger.info(f"COILED_DEFAULT_REGION: {os.getenv('COILED_DEFAULT_REGION')}")
    logger.info(f"machine: {platform.machine()}")
Then call debug_env() at the start of the flow and check the Prefect logs.
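For example, a minimal sketch of wiring it in (the flow body is illustrative; your real tasks go where the ellipsis is):
from prefect import flow

@flow
def copernicus_hourly_sst_so_ssh_jaxa_flow():
    debug_env()  # log the runtime environment before any Coiled work
    ...          # your existing tasks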
5) Double-check the exact image being pulled by Coiled during the failing run
- In the Coiled cluster page linked from the failing run, look for lines around the pull error. If it says “denied” or “not found”, that points to auth/tag issues. If it looks like timeout/rate-limit, that points to registry limits.
A couple of gotchas I’ve seen cause this
- The worker is using a different Coiled token/account than your local one. That account does not have registry credentials configured, so private images fail to pull. Configure registry credentials in that Coiled account or make the image public.
- The code deployed is not the same as the code you just tested (e.g., a stale decorator using a different container= or software=). Re-deploy from the updated code and confirm the deployment shows your latest commit/hash.
- Environment defaults in the worker image or profile override your container= (e.g., COILED_SOFTWARE or a default config in ~/.config/coiled). The debug task above will help reveal this, as will the standalone check sketched below.
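If you want to look for such overrides without Prefect in the loop, here’s a plain-Python sketch you could run inside the worker container:
import os, pathlib

# Print any COILED_* variables that could override your decorator settings
for key in sorted(os.environ):
    if key.startswith("COILED_"):
        print(f"{key}={os.environ[key]}")

# Show the Coiled config file if one is baked into the image
cfg = pathlib.Path.home() / ".config" / "coiled" / "config.yaml"
print(cfg.read_text() if cfg.exists() else f"{cfg} not present")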
If you can share:
- The worker type for "bathymaps-coiled-pool" (Docker vs Process)
Marvin
09/08/2025, 5:18 AM
Thomas Cobban
09/08/2025, 5:27 AM
Marvin
09/08/2025, 5:27 AM
Marvin
09/08/2025, 5:28 AM
- Inspect the work pool configuration:
prefect work-pool inspect "bathymaps-coiled-pool"
Look for an env section in the output.
- If you pass env via Flow.deploy, it will be under job_variables["env"] for that deployment.
- Check inside the running environment (most definitive)
- Add a quick debug task to your flow to see which credentials/config are in effect at runtime:
from prefect import task, get_run_logger
import os, pathlib

@task
def report_coiled_auth():
    logger = get_run_logger()
    try:
        import coiled
        who = coiled.whoami()
    except Exception as e:
        who = f"whoami failed: {e}"
    logger.info(f"COILED_API_TOKEN set: {'yes' if os.getenv('COILED_API_TOKEN') else 'no'}")
    logger.info(f"COILED_ACCOUNT: {os.getenv('COILED_ACCOUNT')}")
    logger.info(f"COILED_DEFAULT_REGION: {os.getenv('COILED_DEFAULT_REGION')}")
    logger.info(f"coiled.whoami(): {who}")
    cfg = pathlib.Path.home() / ".config" / "coiled" / "config.yaml"
    logger.info(f"~/.config/coiled/config.yaml exists: {cfg.exists()}")
Call report_coiled_auth() at the start of your flow and check the logs.
- Inspect the worker container/host directly
- If you’re using a Docker work pool and have access to the worker container:
docker exec -it <worker-container-id> env | grep COILED
docker exec -it <worker-container-id> sh -lc 'cat ~/.config/coiled/config.yaml || true'
- For a Process worker, check the shell/systemd service/env used to start prefect worker start -p bathymaps-coiled-pool.
What to set if missing/misconfigured
- Set these env vars in your pool’s base job template or as job_variables on the deployment:
- COILED_API_TOKEN
- COILED_ACCOUNT (if you use a non-default account/org)
- COILED_DEFAULT_REGION=ap-southeast-2 (optional default)
- Ensure your deployment image contains a recent coiled client, same as you tested locally.
If you tell me:
- Your work pool type (Docker, Process, Kubernetes)
- How you’re starting the worker
…I can give exact steps to view/set the env in that environment. Links: Run Prefect on Coiled
Thomas Cobban
09/08/2025, 5:36 AM
{}
}
}
},
'job_configuration': {
'arm': '{{ arm }}',
'cpu': '{{ cpu }}',
'env': '{{ env }}',
'gpu': '{{ gpu }}',
'name': '{{ name }}',
'image': '{{ image }}',
'labels': '{{ labels }}',
'memory': '{{ memory }}',
'region': '{{ region }}',
'command': '{{ command }}',
'software': '{{ software }}',
'vm_types': '{{ vm_types }}',
'workspace': '{{ workspace }}',
'credentials': '{{ credentials }}',
'job_timeout': '{{ job_timeout }}'
}
},
status=WorkPoolStatus.READY,
storage_configuration=WorkPoolStorageConfiguration(),
default_queue_id='56ae0390-cb16-41e5-99a8-2c69f11b6099'
(prefect-coiled) thomascobban@Thomass-MBP prefect-coiled %
Marvin
09/08/2025, 5:36 AM
Thomas Cobban
09/08/2025, 5:36 AM
Marvin
09/08/2025, 5:37 AM
Marvin
09/08/2025, 5:38 AM
In the output of prefect work-pool inspect "bathymaps-coiled-pool", look for these fields:
1) Work pool type
- Tells you how env is applied.
- Docker/Kubernetes/ECSTask pools: env is defined in the pool’s base job template and passed into the flow-run container.
- Process pool: env is inherited from the machine/process that starts the worker; the pool usually won’t show COILED_* here.
2) base_job_template (critical)
- image: The container your flow run executes in. Make sure it’s the expected tag (e.g., bathymaps/prefect-coiled:v2.0).
- env: The environment variables that will exist inside the flow-run container. You want to see:
- COILED_API_TOKEN (non-empty)
- COILED_ACCOUNT (if you use a non-default account)
- COILED_DEFAULT_REGION=ap-southeast-2 (optional but helpful)
- Any other settings relevant to your runtime (volumes, command, etc.)
A “good” Docker work pool snippet looks like:
"type": "docker",
"base_job_template": {
"image": "bathymaps/prefect-coiled:v2.0",
"env": {
"COILED_API_TOKEN": "****",
"COILED_ACCOUNT": "my-account",
"COILED_DEFAULT_REGION": "ap-southeast-2"
},
...
}
Also check deployment-level overrides
- Your deployment can add/override env. Inspect it with:
prefect deployment inspect "copernicus_hourly_sst_so_ssh_jaxa_flow/copernicus_hourly_sst_so_ssh_jaxa_flow"
- Look for fields like image and job_variables/env (or infrastructure/env). If env is defined here, it merges with or overrides the pool’s base job template.
Notes by pool type
- Docker/Kubernetes/ECS: You should see your COILED_* variables in base_job_template.env (or in the deployment’s env). If they’re missing, add them in the pool’s Base job template (UI) or via deployment job_variables.
- Process: You won’t typically see COILED_* in the pool inspection. Those must be set in the environment that launches the worker (the shell/systemd service running prefect worker). In this case, rely on your debug task inside the flow to verify what’s actually present at runtime.
If anything’s missing, set it
- UI: Work Pools -> your pool -> Base job template -> Environment variables
- Or update the deployment to include:
job_variables={"env": {
"COILED_API_TOKEN": "...",
"COILED_ACCOUNT": "my-account",
"COILED_DEFAULT_REGION": "ap-southeast-2"
}}
If you paste the relevant parts of your inspect output (type, image, env), I can confirm whether it’s configured correctly.
Thomas Cobban
09/08/2025, 5:46 AM
Marvin
09/08/2025, 5:47 AM
Marvin
09/08/2025, 5:47 AM
If your work pool’s base job template only fills in image, then your flow-run container probably isn’t getting any Coiled credentials. That’s why it works locally (you have creds/config) but fails when run via the work pool.
You have two good options:
A) Add COILED_* env vars to the deployment
- Update your .deploy(...) call to pass env via job_variables:
import os

copernicus_hourly_sst_so_ssh_jaxa_flow.deploy(
    name="copernicus_hourly_sst_so_ssh_jaxa_flow",
    work_pool_name="bathymaps-coiled-pool",
    image="bathymaps/prefect-coiled:v2.0",
    build=False,
    cron="0 * * * *",
    job_variables={"env": {
        "COILED_API_TOKEN": os.getenv("COILED_API_TOKEN"),  # set in your shell/CI
        "COILED_ACCOUNT": "your-account",                   # if applicable
        "COILED_DEFAULT_REGION": "ap-southeast-2",          # optional but helpful
    }},
)
- Make sure COILED_API_TOKEN is set in the environment where you call .deploy() so you don’t hardcode secrets; see the guard sketch below.
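A small optional guard you could add to your deploy script (the error message is just an example):
import os

# Fail fast if the token is missing from the deploying environment
if not os.getenv("COILED_API_TOKEN"):
    raise RuntimeError("COILED_API_TOKEN is not set; export it before deploying")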
B) Set env on the work pool base job template (UI)
- Prefect UI -> Work Pools -> bathymaps-coiled-pool -> Edit base job template -> Environment variables:
- COILED_API_TOKEN
- COILED_ACCOUNT (if you use a non-default account)
- COILED_DEFAULT_REGION=ap-southeast-2
- Save, then restart your worker if needed. New flow runs will inherit these vars.
What to verify after
- Inspect your deployment to confirm env is present:
prefect deployment inspect "copernicus_hourly_sst_so_ssh_jaxa_flow/copernicus_hourly_sst_so_ssh_jaxa_flow"
Look for job_variables.env with your COILED_* vars (masked). A Python-based check is sketched below.
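If you prefer checking from Python, here’s a sketch using the Prefect client; the job_variables field name is from recent Prefect releases (older versions exposed it as infra_overrides, hence the fallback):
import asyncio
from prefect.client.orchestration import get_client

async def show_deployment_env():
    async with get_client() as client:
        dep = await client.read_deployment_by_name(
            "copernicus_hourly_sst_so_ssh_jaxa_flow/copernicus_hourly_sst_so_ssh_jaxa_flow"
        )
        # Recent Prefect exposes job_variables; older releases used infra_overrides
        job_vars = getattr(dep, "job_variables", None) or getattr(dep, "infra_overrides", None) or {}
        print(job_vars.get("env"))

asyncio.run(show_deployment_env())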
- Add a quick debug task at the start of your flow to confirm at runtime:
from prefect import task, get_run_logger
import os, pathlib

@task
def report_coiled_auth():
    logger = get_run_logger()
    try:
        import coiled
        who = coiled.whoami()
    except Exception as e:
        who = f"whoami failed: {e}"
    logger.info(f"COILED_API_TOKEN set: {'yes' if os.getenv('COILED_API_TOKEN') else 'no'}")
    logger.info(f"COILED_ACCOUNT: {os.getenv('COILED_ACCOUNT')}")
    logger.info(f"COILED_DEFAULT_REGION: {os.getenv('COILED_DEFAULT_REGION')}")
    logger.info(f"coiled.whoami(): {who}")
    cfg = pathlib.Path.home() / ".config" / "coiled" / "config.yaml"
    logger.info(f"~/.config/coiled/config.yaml exists: {cfg.exists()}")
Notes
- If your Coiled container image is private on Docker Hub/ECR/GCR, ensure your Coiled account has registry credentials configured; otherwise pulls can fail even with a valid Coiled API token. See Coiled: Using Docker images.
- If your work pool is a Process worker, you can also set these env vars in the environment that starts the worker process. For Docker/Kubernetes pools, prefer setting them in the pool’s base job template or the deployment’s job_variables.env.
If you can share your work pool type (Docker vs Process) and whether your image is public/private, I can give exact "where to put the env" steps. Prefect + Coiled guide: Run Prefect on Coiled.
Thomas Cobban
09/08/2025, 5:51 AM
Marvin
09/08/2025, 5:51 AM
Marvin
09/08/2025, 5:51 AM
What it covers
- With a valid Coiled API token, coiled.function(...) can create clusters.
- You can confirm at runtime with a quick check:
from prefect import task

@task
def report_coiled_auth():
    from prefect import get_run_logger
    import os, pathlib
    import coiled
    logger = get_run_logger()
    try:
        who = coiled.whoami()
    except Exception as e:
        who = f"whoami failed: {e}"
    logger.info(f"coiled.whoami(): {who}")
    logger.info(f"COILED_API_TOKEN set: {'yes' if os.getenv('COILED_API_TOKEN') else 'no'}")
    logger.info(f"COILED_ACCOUNT: {os.getenv('COILED_ACCOUNT')}")
    cfg = pathlib.Path.home() / ".config" / "coiled" / "config.yaml"
    logger.info(f"~/.config/coiled/config.yaml exists: {cfg.exists()}")
What it does not cover
- Docker/registry credentials for pulling your container= image. If bathymaps/prefect-coiled:v2.0 is private (or if you hit Docker Hub rate limits), Coiled needs registry auth configured in your Coiled account. Without that, you’ll see “Failed pulling software environment image”.
What to do next
- If the image is private or you suspect rate-limits, add registry credentials in Coiled (Docker Hub/ECR/GCR) and re-run:
- Coiled docs on using Docker images and auth: Using Docker images in Coiled
- For best reliability/performance in ap-southeast-2, push the image to ECR in that region and set container= to the full ECR URL.
- Keep passing region="ap-southeast-2" in your @coiled.function(...) (you already do).
- Optionally add COILED_DEFAULT_REGION via your pool’s Base job template env or the deployment’s job_variables.env, but it’s not required since you set region in code.
Reminder about the Prefect deployment image
- In a Coiled work pool, the “image” you pass to .deploy(image=...) does not control the Coiled worker container; the container that must be pulled by Coiled is the one in your @coiled.function(container=...). Credentials for pulling that image come from your Coiled account’s registry settings, not from the Prefect work pool image.
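To make the distinction concrete, here is a side-by-side sketch (it assumes the flow object is importable; my_flows and heavy_task are placeholder names, not from your code):
import coiled
from my_flows import copernicus_hourly_sst_so_ssh_jaxa_flow  # hypothetical module

# Pulled by Coiled onto VMs in ap-southeast-2; registry auth comes from your
# Coiled account settings
@coiled.function(container="bathymaps/prefect-coiled:v2.0", region="ap-southeast-2")
def heavy_task():
    ...

# Used by the Prefect work pool for the flow-run container; Coiled never pulls it
copernicus_hourly_sst_so_ssh_jaxa_flow.deploy(
    name="copernicus_hourly_sst_so_ssh_jaxa_flow",
    work_pool_name="bathymaps-coiled-pool",
    image="bathymaps/prefect-coiled:v2.0",
    build=False,
)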
If you paste the exact Coiled log lines around the failure (e.g., “denied”, “not found”, or rate-limit text), I can confirm whether it’s an auth vs. availability issue.
Prefect + Coiled guide: Run Prefect on Coiled