<@ULVA73B9P> my azurecontainerworker is showing of...
# marvin-ai
h
@Marvin my azurecontainerworker is showing offline in prefect cloud but I can see active logs from the worker's container instance in Azure. What's the deal?
m
thought for 564.9 seconds
Short answer: "Offline" in Cloud means Prefect hasn't heard a poll/heartbeat from that worker recently. Your ACI container can be running and producing logs, but if it can't reach Prefect Cloud (or is pointed at the wrong workspace/pool), Cloud will show it as Offline.

Most common causes (Prefect 3.x):
- Missing/misconfigured Cloud credentials in the worker container:
  - `PREFECT_API_URL` must be the workspace-scoped URL (it includes your account_id/workspace_id), not just https://api.prefect.cloud
  - `PREFECT_API_KEY` must be present and valid
  - Ref: Connect to Cloud
- The worker is connected to a different workspace, or is polling a different work pool than the one you're viewing in the UI
- Network egress is blocked (VNet, firewall, DNS), or a Cloud IP allowlist is enabled and does not include the worker's egress IP; you can check/manage allowlists with `prefect cloud ip-allowlist ...`
- The ACI container is running but the Prefect process isn't (e.g., the entrypoint isn't running `prefect worker start ...`)
- Rare: running Prefect 2.x code/images by accident

Quick checks to run inside the worker container (a consolidated script follows at the end of this message):
- Verify Prefect can see your Cloud workspace and credentials:

      prefect version
      prefect config view
      prefect cloud workspace ls
  Expect `PREFECT_API_URL` to look like:

      https://api.prefect.cloud/api/accounts/<account_id>/workspaces/<workspace_id>

- Confirm the worker is started against the correct pool:

      ps aux | grep "prefect worker start"

  It should be running something like:

      prefect worker start --pool <your-azure-container-pool>

- If you use IP allowlists, see what's configured:

      prefect cloud ip-allowlist ls

- From inside the container, quickly test reachability:

      curl -sS -H "Authorization: Bearer $PREFECT_API_KEY" "$PREFECT_API_URL/health"
  It should return a simple OK/healthy response.

Notes:
- The healthcheck server (`--with-healthcheck`) is optional and is meant for container-orchestration liveness/readiness probes; the Cloud "Online/Offline" badge is based on recent polls/heartbeats, not on that endpoint. Details: Worker healthchecks
- Azure container worker docs: Azure Container worker

If you can share:
- the exact command/entrypoint you use to start the worker
- the `PREFECT_API_URL` value inside the container (you can redact IDs if you prefer)
- whether your Cloud account uses an IP allowlist
- the work pool name/type you expect the worker to poll

…I can pinpoint the root cause quickly.
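If it helps, here is the same set of checks rolled into one snippet you can paste into a shell inside the container. It's a hypothetical helper, not an official Prefect script; it assumes bash, curl, and pgrep are available in your image:

    # Consolidated version of the checks above; adjust names/paths for your setup.
    echo "PREFECT_API_URL=${PREFECT_API_URL:-<unset>}"
    [ -n "${PREFECT_API_KEY:-}" ] && echo "PREFECT_API_KEY is set" || echo "PREFECT_API_KEY is NOT set"

    # Can we reach Prefect Cloud from inside the container?
    curl -fsS -H "Authorization: Bearer ${PREFECT_API_KEY:-}" "${PREFECT_API_URL:-}/health" \
      && echo "Cloud API reachable" || echo "Cloud API NOT reachable"

    # What does the local Prefect install think it is connected to?
    prefect version
    prefect config view
    prefect cloud workspace ls

    # Is the worker process actually running, and against which pool?
    pgrep -af "prefect worker start" || echo "no 'prefect worker start' process found"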
h
@Marvin what will happen if I do a container-level restart on the worker?
m
thought for 32.2 seconds
Good question. Short version: restarting the ACI container that runs your Azure Container worker is generally safe.

What happens when you restart the worker container:
- Running flow runs are not interrupted. With the Azure Container worker, each flow run executes in its own Azure Container Instance (a separate container group) created by the worker. Those job containers report their state directly to Prefect Cloud, so they continue even if the worker restarts.
- New work will pause briefly. While the worker is restarting, it won't poll your work pool. Any scheduled/queued runs just wait and are picked up once the worker is back online.
- A run that was in the "starting" window may be delayed. In the rare case that you restart exactly while a job is being submitted, that run might stay Scheduled/Late and will be picked up on the next poll; worst case, you can retry it from the UI.
- You may see a "new" worker instance in Cloud. If you don't pass a fixed `--name`, Prefect generates a new worker name on each start. Consider setting `--name` so Cloud shows the same worker identity after restarts.

If your goal is to fix the Offline status: a restart will only help if the problem was transient. If the root cause is credentials or networking, the worker will come back offline. After the restart, verify inside the container:

    prefect config view
    prefect cloud workspace ls
Ensure:
- `PREFECT_API_URL` is your workspace-scoped URL (it includes your account_id/workspace_id)
- `PREFECT_API_KEY` is set and valid
- The worker is started against the expected pool:

      prefect worker start --pool <your-azure-container-pool> --name <your-fixed-name>

Relevant docs:
- Azure Container worker: Azure Container worker
- Worker health/online status: Worker healthchecks

If you want, share your worker start command/entrypoint and the `PREFECT_API_URL` format you see in the container, and I can double-check that it will come back online after the restart.
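For reference, the restart itself is the standard ACI operation (resource group and container group name are placeholders, matching however you created the worker):

    # Restart the worker's container group in place.
    az container restart --resource-group <rg> --name <worker-container-group>

    # Then watch it come back up and start polling again.
    az container logs --resource-group <rg> --name <worker-container-group> --follow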
h
@Marvin it's come back online and the worker version upgraded from 3.4.15 to 3.4.19. It's never been clear to me how/why/when these automatic worker upgrades happen, which they do seem to do. what's the criteria for the worker updating itself? in the past we've seen one case of an unexplained conflict between the worker version and runner version (which we have pegged in our docker image)
m
thought for 181.2 seconds
Great data point: that version jump after a restart is the clue.

Key point: Prefect workers do not "self-update." Any upgrade you see is coming from your container image or your startup script.

What typically triggers a worker version change:
- Floating image tags: if your worker container uses a tag like `prefecthq/prefect:3`, `3-latest`, or `latest`, ACI will pull the newest image for that tag on container start/restart. That's the most common cause of "it upgraded when I restarted."
- Startup scripts that upgrade Prefect: if your container entrypoint or bootstrap runs something like `pip install -U "prefect>=3,<4"` (or just `pip install -U prefect`), it will pull a newer patch on each restart.
- Managed workers only: the only case where Prefect controls the worker version is the `prefect:managed` work pool type. Your Azure Container worker is not managed by Prefect Cloud, so this does not apply here.

How to prevent surprise upgrades (recommended):
- Pin immutable image tags for the worker container:
  - Prefer something like `prefecthq/prefect:3.4.19-python3.11` (or your exact build) instead of `:3` or `:latest`.
  - Even better, use a digest (`@sha256:...`) for fully immutable pulls (a digest-pinning sketch follows at the end of this message).
- Remove/avoid any `pip install -U prefect` in the worker's entrypoint.
- Rebuild and redeploy intentionally when you want to move versions.

About worker vs. runner version "conflicts":
- The worker and the flow-run (runner) containers can run different Prefect versions, but keeping them aligned avoids oddities.
- Best practice: pin both worker and flow-run images to the same Prefect x.y.z version. At minimum, keep them on the same minor version (x.y).
- Mismatches can show up as:
  - job template/variable keys that one side doesn't recognize
  - differences in how the run is started or monitored
- Verify with:

      # inside the worker container
      prefect version

      # inside the flow-run image
      prefect version
- If you're using floating tags for the worker but a pinned Prefect version for the runner, you can end up with exactly the "unexplained conflict" you saw.

If you share:
- the worker container image tag (and whether ACI is pulling a tag like `:3` or `:latest`)
- any entrypoint snippet that installs/updates Prefect at startup
- the flow-run image's Prefect version pin

…I can suggest concrete changes to make upgrades fully predictable.

Helpful docs:
- Azure Container worker: Azure Container worker
- Worker concepts: Workers
h
@Marvin we used the below command one-time to create the worker using the Azure CLI. This pins the version, so where's the upgrade coming from? We're not re-running this command!
    az container create --name una-acp-qa-prefect-3-3-4-workpool-hybrid --resource-group una-acp-qa-prefect-workpool-rg --assign-identity <identity> --image "prefecthq/prefect:3.3.4-python3.12" --secure-environment-variables PREFECT_API_URL=<URL> PREFECT_API_KEY=<KEY> --command-line "/bin/bash -c 'pip install prefect-azure && prefect worker start --pool una-acp-qa-prefect-3-3-4-workpool-hybrid --type azure-container-instance'" --registry-username <user> --registry-password <pass> --registry-login-server index.docker.io
m
thought for 43.0 seconds
You're pinned at the image level, but your startup command is upgrading Prefect at runtime. This bit is the culprit:

    /bin/bash -c 'pip install prefect-azure && prefect worker start --pool ... --type azure-container-instance'

- `prefect-azure` does not pin `prefect` to your image's version. Its dependency spec allows newer Prefect, so pip resolves the latest compatible release and upgrades `prefect` inside the container on every restart.
- That explains why you saw 3.4.15 before and 3.4.19 after a restart, even though the image tag is `prefecthq/prefect:3.3.4-python3.12`.

How to stop surprise upgrades:
- Best: bake dependencies into the image and don't run pip at startup (a minimal sketch follows at the end of this message).
  - Build your own image FROM `prefecthq/prefect:<x.y.z>-python<py>` and add:

        pip install "prefect==<x.y.z>" "prefect-azure==<a.b.c>"

  - Then run only:

        prefect worker start --pool ... --type azure-container-instance
- If you must install at startup, pin explicitly and/or add constraints:

      pip install "prefect==3.4.19" "prefect-azure==<a.b.c>"
      # or keep the minor version fixed:
      pip install "prefect>=3.4.0,<3.5" "prefect-azure>=0.4,<0.5"

  Note: the range form can still drift across patch versions; use exact pins if you want zero drift.
- Alternatively, use digest-pinned images and remove pip entirely.

Why you saw worker/runner mismatches:
- Your runner image is pegged, but the worker upgraded itself via `pip install prefect-azure`. This can cause incompatibilities.
- Pin both worker and runner to the same Prefect version (ideally the exact x.y.z) and the matching `prefect-azure` version.

If you'd like, share the `prefect-azure` version you intend to use and the Prefect version in your runner image; I'll suggest concrete pins and a minimal Dockerfile/ACI command to keep them perfectly aligned.

Docs for reference:
- Azure Container worker: Azure Container worker
- Worker concepts: Workers
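To make the "bake it into the image" option concrete, here is a minimal sketch. The base tag, the `prefect-azure` pin, and the registry/repository names are placeholders or assumptions; substitute versions you have actually tested together:

    # Hypothetical build steps for a pinned worker image, so nothing is installed at container start.
    printf '%s\n' \
      'FROM prefecthq/prefect:3.4.19-python3.12' \
      '# Pin prefect-azure explicitly so pip cannot pull prefect forward at runtime' \
      'RUN pip install --no-cache-dir "prefect-azure==<a.b.c>"' \
      > Dockerfile

    docker build -t <your-registry>/prefect-aci-worker:3.4.19 .
    docker push <your-registry>/prefect-aci-worker:3.4.19

    # The ACI --command-line then shrinks to just:
    #   prefect worker start --pool <your-pool> --type azure-container-instance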
h
@Marvin that explains an upgrade on restart - but other than my manual restart just now, the worker still ends up restarting - is there no scenario whereby the worker chooses to restart itself (not expecting that the version would change of course).
m
thought for 29.7 seconds
Short answer: Prefect workers do not intentionally restart themselves. There is no self-restart or auto-update mechanism in the worker. If your ACI worker "restarts," it's because the container process exited and Azure Container Instances restarted it per its restart policy, or the platform rescheduled it.

Common reasons an ACI worker restarts:
- ACI restartPolicy: if set to Always (ACI's common default), any process exit restarts the container.
- Command chain failures: your startup uses `bash -c 'pip install prefect-azure && prefect worker start ...'`. If the pip step fails (network hiccup, rate limit, dependency resolution), the `&&` prevents `prefect worker start` from running, the shell exits non-zero, and the container restarts.
- Unhandled exceptions in the worker process: e.g., an invalid PREFECT_API_KEY/URL or permission issues with the work pool. The process exits, so the container restarts.
- OOM kill or resource pressure: if memory/CPU is too tight, the process can be killed and restarted.
- Platform events: ACI host maintenance or rescheduling can restart the container.
- Liveness probes (if configured): a failing probe will cause restarts. Your CLI snippet didn't include probes, so this is likely not it.

What to check in Azure to confirm the cause:
- Restart policy and events:

      az container show -g <rg> -n <name> --query "{restartPolicy:restartPolicy, events:containers[0].instanceView.events}"
- Exit codes and previous state:

      az container show -g <rg> -n <name> --query "{current:containers[0].instanceView.currentState, previous:containers[0].instanceView.previousState}"

- Logs around the restart time, looking for pip failures or OOM/Killed messages:

      az container logs -g <rg> -n <name> --follow
How to reduce surprise restarts:
- Remove pip from startup and bake dependencies into the image (best practice): build FROM `prefecthq/prefect:<x.y.z>-python<py>`, run `pip install "prefect==<x.y.z>" "prefect-azure==<a.b.c>"` at build time, and then run only:

      prefect worker start --pool ... --type azure-container-instance

- If you must install at startup, pin and add retries:

      pip install --no-cache-dir --retries 5 "prefect==<x.y.z>" "prefect-azure==<a.b.c>" || exit 1
      prefect worker start --pool ... --type azure-container-instance

- Give the container enough memory/CPU to avoid OOM.
- Set `PREFECT_LOGGING_LEVEL=DEBUG` to capture more detail if the worker exits.

TL;DR: the worker won't choose to restart itself; container restarts are driven by ACI because the process exited. Given that your command includes a runtime pip install, transient failures there are a prime suspect. Baking dependencies into the image will make restarts far less frequent and will stop version drift entirely. (A reworked `az container create` along those lines follows at the end of this message.)

References:
- Workers: Workers
- Azure Container worker: Azure Container worker
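Putting that together, here is what your original `az container create` might look like with a pre-baked, pinned image and no runtime pip. The image and registry values are placeholders; the remaining flags mirror your original command:

    az container create \
      --name una-acp-qa-prefect-3-3-4-workpool-hybrid \
      --resource-group una-acp-qa-prefect-workpool-rg \
      --assign-identity <identity> \
      --image "<your-registry>/prefect-aci-worker:<x.y.z>" \
      --restart-policy Always \
      --secure-environment-variables PREFECT_API_URL=<URL> PREFECT_API_KEY=<KEY> \
      --command-line "prefect worker start --pool una-acp-qa-prefect-3-3-4-workpool-hybrid --type azure-container-instance" \
      --registry-username <user> --registry-password <pass> \
      --registry-login-server <your-registry-login-server>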
h
@Marvin explain "or the platform rescheduled it." which platform? Prefect cloud?
m
thought for 18.0 seconds
By "platform," I meant Azure Container Instances (ACI), not Prefect Cloud.

- Prefect Cloud does not restart user-managed workers. It only coordinates flow runs and records worker heartbeats.
- Your worker was created directly in ACI via `az container create`, so any restarts/rescheduling are driven by ACI:
  - the process exited (your bash/pip step failed or the worker crashed) and restartPolicy=Always brought it back
  - OOM/kill or host maintenance
  - transient platform/network issues during image pulls or startup

To confirm the cause in Azure:

    az container show -g <rg> -n <name> --query "{restartPolicy:restartPolicy, events:containers[0].instanceView.events, current:containers[0].instanceView.currentState, previous:containers[0].instanceView.previousState}"
Look for non-zero exit codes, "Killed"/OOM messages, or maintenance events around the restart time.

Only exception: if you were using a Prefect-managed work pool (`prefect:managed`), Prefect provisions and operates the infrastructure; in that scenario Prefect could restart/replace worker pods. But that's not the case for an Azure Container Instance worker you run yourself.

References:
- Workers overview: Workers
- Azure Container worker: Azure Container worker
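One more quick signal if you want to see how often ACI has been restarting the worker over time (same placeholder resource group/name as above):

    # instanceView.restartCount increments each time ACI restarts the container in place.
    az container show -g <rg> -n <name> --query "containers[0].instanceView.restartCount"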