<@ULVA73B9P> my azurecontainerworker is showing of...
# marvin-ai
h
@Marvin my azurecontainerworker is showing offline in prefect cloud but I can see active logs from the worker's container instance in Azure. What's the deal?
m
thought for 564.9 seconds
Short answer: "Offline" in Cloud means Prefect hasn't heard a poll/heartbeat from that worker recently. Your ACI container can be running and producing logs, but if it can't reach Prefect Cloud (or is pointed at the wrong workspace/pool), Cloud will show it as Offline.

Most common causes (Prefect 3.x):
- Missing/misconfigured Cloud credentials in the worker container:
  - `PREFECT_API_URL` must be the workspace-scoped URL (it includes your account_id/workspace_id), not just https://api.prefect.cloud
  - `PREFECT_API_KEY` must be present and valid
  - Ref: Connect to Cloud
- The worker is connected to a different workspace, or is polling a different work pool than the one you're viewing in the UI
- Network egress is blocked (VNet, firewall, DNS), or a Cloud IP allowlist is enabled and does not include the worker's egress IP; you can check/manage allowlists with `prefect cloud ip-allowlist ...`
- The ACI container is running but the Prefect process isn't (e.g., the entrypoint isn't running `prefect worker start ...`)
- Rare: running Prefect 2.x code/images by accident

Quick checks to run inside the worker container (a consolidated script follows at the end of this message):
- Verify Prefect can see your Cloud workspace and credentials:

      prefect version
      prefect config view
      prefect cloud workspace ls
  Expect `PREFECT_API_URL` to look like:

      https://api.prefect.cloud/api/accounts/<account_id>/workspaces/<workspace_id>

- Confirm the worker is started against the correct pool:

      ps aux | grep "prefect worker start"

  It should be running something like:

      prefect worker start --pool <your-azure-container-pool>

- If you use IP allowlists, see what's configured:

      prefect cloud ip-allowlist ls

- From inside the container, quickly test reachability:

      curl -sS -H "Authorization: Bearer $PREFECT_API_KEY" "$PREFECT_API_URL/health"
  It should return a simple OK/healthy response.

Notes:
- The healthcheck server (`--with-healthcheck`) is optional and is meant for container-orchestration liveness/readiness probes; the Cloud "Online/Offline" badge is based on recent polls/heartbeats, not on that endpoint. Details: Worker healthchecks
- Azure container worker docs: Azure Container worker

If you can share:
- the exact command/entrypoint you use to start the worker
- the `PREFECT_API_URL` value inside the container (you can redact IDs if you prefer)
- whether your Cloud account uses an IP allowlist
- the work pool name/type you expect the worker to poll

…I can pinpoint the root cause quickly.
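If it helps, here is the same set of checks rolled into one snippet you can paste into a shell inside the container. It's a hypothetical helper, not an official Prefect script; it assumes bash, curl, and pgrep are available in your image:

    # Consolidated version of the checks above; adjust names/paths for your setup.
    echo "PREFECT_API_URL=${PREFECT_API_URL:-<unset>}"
    [ -n "${PREFECT_API_KEY:-}" ] && echo "PREFECT_API_KEY is set" || echo "PREFECT_API_KEY is NOT set"

    # Can we reach Prefect Cloud from inside the container?
    curl -fsS -H "Authorization: Bearer ${PREFECT_API_KEY:-}" "${PREFECT_API_URL:-}/health" \
      && echo "Cloud API reachable" || echo "Cloud API NOT reachable"

    # What does the local Prefect install think it is connected to?
    prefect version
    prefect config view
    prefect cloud workspace ls

    # Is the worker process actually running, and against which pool?
    pgrep -af "prefect worker start" || echo "no 'prefect worker start' process found"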
h
@Marvin what will happen if I do a container-level restart on the worker?
m
thought for 32.2 seconds
Good question. Short version: restarting the ACI container that runs your Azure Container worker is generally safe.

What happens when you restart the worker container:
- Running flow runs are not interrupted. With the Azure Container worker, each flow run executes in its own Azure Container Instance (a separate container group) created by the worker. Those job containers report their state directly to Prefect Cloud, so they continue even if the worker restarts.
- New work will pause briefly. While the worker is restarting, it won't poll your work pool. Any scheduled/queued runs just wait and are picked up once the worker is back online.
- A run that was in the "starting" window may be delayed. In the rare case that you restart exactly while a job is being submitted, that run might stay Scheduled/Late and will be picked up on the next poll; worst case, you can retry it from the UI.
- You may see a "new" worker instance in Cloud. If you don't pass a fixed `--name`, Prefect generates a new worker name on each start. Consider setting `--name` so Cloud shows the same worker identity after restarts.

If your goal is to fix the Offline status: a restart will only help if the problem was transient. If the root cause is credentials or networking, the worker will come back offline. After the restart, verify inside the container:

    prefect config view
    prefect cloud workspace ls
Ensure:
- `PREFECT_API_URL` is your workspace-scoped URL (it includes your account_id/workspace_id)
- `PREFECT_API_KEY` is set and valid
- The worker is started against the expected pool:

      prefect worker start --pool <your-azure-container-pool> --name <your-fixed-name>

Relevant docs:
- Azure Container worker: Azure Container worker
- Worker health/online status: Worker healthchecks

If you want, share your worker start command/entrypoint and the `PREFECT_API_URL` format you see in the container, and I can double-check that it will come back online after the restart.
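For reference, the restart itself is the standard ACI operation (resource group and container group name are placeholders, matching however you created the worker):

    # Restart the worker's container group in place.
    az container restart --resource-group <rg> --name <worker-container-group>

    # Then watch it come back up and start polling again.
    az container logs --resource-group <rg> --name <worker-container-group> --follow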
h
@Marvin it's come back online and the worker version upgraded from 3.4.15 to 3.4.19. It's never been clear to me how/why/when these automatic worker upgrades happen, which they do seem to do. what's the criteria for the worker updating itself? in the past we've seen one case of an unexplained conflict between the worker version and runner version (which we have pegged in our docker image)
m
thought for 181.2 seconds
Great data point: that version jump after a restart is the clue.

Key point: Prefect workers do not "self-update." Any upgrade you see is coming from your container image or your startup script.

What typically triggers a worker version change:
- Floating image tags: if your worker container uses a tag like `prefecthq/prefect:3`, `3-latest`, or `latest`, ACI will pull the newest image for that tag on container start/restart. That's the most common cause of "it upgraded when I restarted."
- Startup scripts that upgrade Prefect: if your container entrypoint or bootstrap runs something like `pip install -U "prefect>=3,<4"` (or just `pip install -U prefect`), it will pull a newer patch on each restart.
- Managed workers only: the only case where Prefect controls the worker version is the `prefect:managed` work pool type. Your Azure Container worker is not managed by Prefect Cloud, so this does not apply here.

How to prevent surprise upgrades (recommended):
- Pin immutable image tags for the worker container:
  - Prefer something like `prefecthq/prefect:3.4.19-python3.11` (or your exact build) instead of `:3` or `:latest`.
  - Even better, use a digest (`@sha256:...`) for fully immutable pulls (a digest-pinning sketch follows at the end of this message).
- Remove/avoid any `pip install -U prefect` in the worker's entrypoint.
- Rebuild and redeploy intentionally when you want to move versions.

About worker vs. runner version "conflicts":
- The worker and the flow-run (runner) containers can run different Prefect versions, but keeping them aligned avoids oddities.
- Best practice: pin both worker and flow-run images to the same Prefect x.y.z version. At minimum, keep them on the same minor version (x.y).
- Mismatches can show up as:
  - job template/variable keys that one side doesn't recognize
  - differences in how the run is started or monitored
- Verify with:

      # inside the worker container
      prefect version

      # inside the flow-run image
      prefect version
- If you're using floating tags for the worker but a pinned Prefect version for the runner, you can end up with exactly the "unexplained conflict" you saw.

If you share:
- the worker container image tag (and whether ACI is pulling a tag like `:3` or `:latest`)
- any entrypoint snippet that installs/updates Prefect at startup
- the flow-run image's Prefect version pin

…I can suggest concrete changes to make upgrades fully predictable.

Helpful docs:
- Azure Container worker: Azure Container worker
- Worker concepts: Workers
h
@Marvin we used the below command one-time to create the worker using the Azure CLI. This pins the version, so where's the upgrade coming from? We're not re-running this command!
    az container create --name una-acp-qa-prefect-3-3-4-workpool-hybrid --resource-group una-acp-qa-prefect-workpool-rg --assign-identity <identity> --image "prefecthq/prefect:3.3.4-python3.12" --secure-environment-variables PREFECT_API_URL=<URL> PREFECT_API_KEY=<KEY> --command-line "/bin/bash -c 'pip install prefect-azure && prefect worker start --pool una-acp-qa-prefect-3-3-4-workpool-hybrid --type azure-container-instance'" --registry-username <user> --registry-password <pass> --registry-login-server index.docker.io
m
thought for 43.0 seconds
You're pinned at the image level, but your startup command is upgrading Prefect at runtime. This bit is the culprit:

    /bin/bash -c 'pip install prefect-azure && prefect worker start --pool ... --type azure-container-instance'

- `prefect-azure` does not pin `prefect` to your image's version. Its dependency spec allows newer Prefect, so pip resolves the latest compatible release and upgrades `prefect` inside the container on every restart.
- That explains why you saw 3.4.15 before and 3.4.19 after a restart, even though the image tag is `prefecthq/prefect:3.3.4-python3.12`.

How to stop surprise upgrades:
- Best: bake dependencies into the image and don't run pip at startup (a minimal sketch follows at the end of this message).
  - Build your own image FROM `prefecthq/prefect:<x.y.z>-python<py>` and add:

        pip install "prefect==<x.y.z>" "prefect-azure==<a.b.c>"

  - Then run only:

        prefect worker start --pool ... --type azure-container-instance
- If you must install at startup, pin explicitly and/or add constraints:

      pip install "prefect==3.4.19" "prefect-azure==<a.b.c>"
      # or keep the minor version fixed:
      pip install "prefect>=3.4.0,<3.5" "prefect-azure>=0.4,<0.5"

  Note: the range form can still drift across patch versions; use exact pins if you want zero drift.
- Alternatively, use digest-pinned images and remove pip entirely.

Why you saw worker/runner mismatches:
- Your runner image is pegged, but the worker upgraded itself via `pip install prefect-azure`. This can cause incompatibilities.
- Pin both worker and runner to the same Prefect version (ideally the exact x.y.z) and the matching `prefect-azure` version.

If you'd like, share the `prefect-azure` version you intend to use and the Prefect version in your runner image; I'll suggest concrete pins and a minimal Dockerfile/ACI command to keep them perfectly aligned.

Docs for reference:
- Azure Container worker: Azure Container worker
- Worker concepts: Workers
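To make the "bake it into the image" option concrete, here is a minimal sketch. The base tag, the `prefect-azure` pin, and the registry/repository names are placeholders or assumptions; substitute versions you have actually tested together:

    # Hypothetical build steps for a pinned worker image, so nothing is installed at container start.
    printf '%s\n' \
      'FROM prefecthq/prefect:3.4.19-python3.12' \
      '# Pin prefect-azure explicitly so pip cannot pull prefect forward at runtime' \
      'RUN pip install --no-cache-dir "prefect-azure==<a.b.c>"' \
      > Dockerfile

    docker build -t <your-registry>/prefect-aci-worker:3.4.19 .
    docker push <your-registry>/prefect-aci-worker:3.4.19

    # The ACI --command-line then shrinks to just:
    #   prefect worker start --pool <your-pool> --type azure-container-instance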
h
@Marvin that explains an upgrade on restart - but other than my manual restart just now, the worker still ends up restarting - is there no scenario whereby the worker chooses to restart itself (not expecting that the version would change of course).
m
thought for 29.7 seconds
Short answer: Prefect workers do not intentionally restart themselves. There is no self-restart or auto-update mechanism in the worker. If your ACI worker "restarts," it's because the container process exited and Azure Container Instances restarted it per its restart policy, or the platform rescheduled it.

Common reasons an ACI worker restarts:
- ACI restartPolicy: if set to Always (ACI's common default), any process exit restarts the container.
- Command chain failures: your startup uses `bash -c 'pip install prefect-azure && prefect worker start ...'`. If the pip step fails (network hiccup, rate limit, dependency resolution), the `&&` prevents `prefect worker start` from running, the shell exits non-zero, and the container restarts.
- Unhandled exceptions in the worker process: e.g., an invalid PREFECT_API_KEY/URL or permission issues with the work pool. The process exits, so the container restarts.
- OOM kill or resource pressure: if memory/CPU is too tight, the process can be killed and restarted.
- Platform events: ACI host maintenance or rescheduling can restart the container.
- Liveness probes (if configured): a failing probe will cause restarts. Your CLI snippet didn't include probes, so this is likely not it.

What to check in Azure to confirm the cause:
- Restart policy and events:

      az container show -g <rg> -n <name> --query "{restartPolicy:restartPolicy, events:containers[0].instanceView.events}"
- Exit codes and previous state:

      az container show -g <rg> -n <name> --query "{current:containers[0].instanceView.currentState, previous:containers[0].instanceView.previousState}"

- Logs around the restart time, looking for pip failures or OOM/Killed messages:

      az container logs -g <rg> -n <name> --follow
How to reduce surprise restarts:
- Remove pip from startup and bake dependencies into the image (best practice): build FROM `prefecthq/prefect:<x.y.z>-python<py>`, run `pip install "prefect==<x.y.z>" "prefect-azure==<a.b.c>"` at build time, and then run only:

      prefect worker start --pool ... --type azure-container-instance

- If you must install at startup, pin and add retries:

      pip install --no-cache-dir --retries 5 "prefect==<x.y.z>" "prefect-azure==<a.b.c>" || exit 1
      prefect worker start --pool ... --type azure-container-instance

- Give the container enough memory/CPU to avoid OOM.
- Set `PREFECT_LOGGING_LEVEL=DEBUG` to capture more detail if the worker exits.

TL;DR: the worker won't choose to restart itself; container restarts are driven by ACI because the process exited. Given that your command includes a runtime pip install, transient failures there are a prime suspect. Baking dependencies into the image will make restarts far less frequent and will stop version drift entirely. (A reworked `az container create` along those lines follows at the end of this message.)

References:
- Workers: Workers
- Azure Container worker: Azure Container worker
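Putting that together, here is what your original `az container create` might look like with a pre-baked, pinned image and no runtime pip. The image and registry values are placeholders; the remaining flags mirror your original command:

    az container create \
      --name una-acp-qa-prefect-3-3-4-workpool-hybrid \
      --resource-group una-acp-qa-prefect-workpool-rg \
      --assign-identity <identity> \
      --image "<your-registry>/prefect-aci-worker:<x.y.z>" \
      --restart-policy Always \
      --secure-environment-variables PREFECT_API_URL=<URL> PREFECT_API_KEY=<KEY> \
      --command-line "prefect worker start --pool una-acp-qa-prefect-3-3-4-workpool-hybrid --type azure-container-instance" \
      --registry-username <user> --registry-password <pass> \
      --registry-login-server <your-registry-login-server>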
h
@Marvin explain "or the platform rescheduled it." which platform? Prefect cloud?
m
thought for 18.0 seconds
By "platform," I meant Azure Container Instances (ACI), not Prefect Cloud.

- Prefect Cloud does not restart user-managed workers. It only coordinates flow runs and records worker heartbeats.
- Your worker was created directly in ACI via `az container create`, so any restarts/rescheduling are driven by ACI:
  - the process exited (your bash/pip step failed or the worker crashed) and restartPolicy=Always brought it back
  - OOM/kill or host maintenance
  - transient platform/network issues during image pulls or startup

To confirm the cause in Azure:

    az container show -g <rg> -n <name> --query "{restartPolicy:restartPolicy, events:containers[0].instanceView.events, current:containers[0].instanceView.currentState, previous:containers[0].instanceView.previousState}"
Look for non-zero exit codes, "Killed"/OOM messages, or maintenance events around the restart time.

Only exception: if you were using a Prefect-managed work pool (`prefect:managed`), Prefect provisions and operates the infrastructure; in that scenario Prefect could restart/replace worker pods. But that's not the case for an Azure Container Instance worker you run yourself.

References:
- Workers overview: Workers
- Azure Container worker: Azure Container worker
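One more quick signal if you want to see how often ACI has been restarting the worker over time (same placeholder resource group/name as above):

    # instanceView.restartCount increments each time ACI restarts the container in place.
    az container show -g <rg> -n <name> --query "containers[0].instanceView.restartCount"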