# ask-marvin
@Marvin someone doing lab automation in our organization has workflows involving many pieces of hardware that require high uptime/robustness/tolerance (it's cell growth over weeks and months, and you want to keep the cells from getting sick or dying, so you have to do things like feed them). Internet outages of a few seconds are fine, outages of a few minutes could be difficult, and outages on the order of hours or days may be catastrophic to the experiment ($1000s and needing to start from scratch). Right now, for this specific lab, Prefect is self-hosted locally. There needs to be central visibility and monitoring that includes six other laboratories, most of which have workflows that may be more tolerant of interruptions (e.g., because they usually involve crystal structures or molecules). Migrating entirely to Prefect Cloud would be simple and straightforward, but this particular lab may have the highest risk with such a move. Thoughts and recommendations?
thought for 182.4 seconds
Great question, and a very real concern for lab automation. Assuming Prefect 3.x, here's how I'd think about the tradeoffs, plus a set of patterns we've seen work for "don't-miss-a-feed" environments while still giving you central visibility across labs.

**Key behavior to know (Prefect 3.x)**
- Running code continues if the control plane (Cloud or Server) is unreachable. Your workers and flows run on your infrastructure; connectivity loss mostly affects telemetry (state/logs), scheduling of new runs, and heartbeats.
- Heartbeats detect zombies. By default, runs heartbeat roughly every 30 seconds and are considered "late/crashed" when heartbeats stop for roughly 90 seconds. You can adjust thresholds and use Automations for alerts instead of auto-cancels.
  - Docs: Detect zombie flows (heartbeats, late/crashed)
- Logs/state updates are sent asynchronously. If the API is temporarily unreachable, the runner keeps executing and will retry updates. To be extra safe, also write critical logs locally (file/syslog) so you have a definitive record even during long outages.
  - Logging API: Logging API reference
- Prefect 3 architecture overview: Prefect 3 GA blog

**Recommendations**

My advice is a hybrid approach: keep execution local, use Cloud for central visibility and coordination across labs, and add belt-and-suspenders safeguards for the one critical lab with tight RTO/RPO needs.

1) Use Cloud for central visibility; keep execution local in all labs
- Put each lab in its own work pool; run dedicated local workers on redundant hosts at each site. Cloud gives you shared dashboards, RBAC, and cross-lab monitoring.
- For the critical lab, run workers under a supervisor (systemd, Docker restart=always, k8s) with UPS power. Consider a hot-standby worker on a second box.
- Set worker- and run-level automations to alert on missed heartbeats/"late" runs, but avoid automatic cancellations for this lab.
- Zombie detection/automations: Docs

2) Don't rely solely on Cloud timers for life-critical actions
- Cloud schedules can't trigger new runs at your site while it is unreachable. For actions with hard SLAs (e.g., feed every 6 hours), run a "service flow" that loops locally and performs the action on schedule even when the control plane is down. Use Cloud for observability when connectivity returns.
- Skeleton pattern:
```python
from datetime import datetime, timedelta
import time

from prefect import flow, get_run_logger


@flow(name="cell-feeder-service")
def feeder_service(feed_every_minutes: int = 360):
    logger = get_run_logger()
    while True:
        start = datetime.utcnow()
        try:
            perform_feed()  # your device I/O with robust retries/timeouts
            logger.info("Feed completed")
        except Exception as e:
            logger.exception(f"Feed failed; entering local safe mode: {e}")
            engage_local_safe_mode()  # fallback that keeps cells viable
        # sleep locally; no dependency on Cloud scheduling
        while (datetime.utcnow() - start) < timedelta(minutes=feed_every_minutes):
            time.sleep(30)
```
- This pattern ensures life-critical operations continue without Cloud. When connectivity is back, logs/state updates resume; still write a local log file for ground truth.

3) Add local durability and independent monitoring
- Dual logging: Prefect run logger plus a local rotating file handler. If the network is out for hours, you still have full operational logs.
- Persist results/metadata to a local datastore/NAS in addition to Prefect result storage, so operators can audit actions offline.
- Add a local watchdog (a simple process or service) that checks "time since last feed" and pages locally (SMS gateway, on-prem NMS) if thresholds are crossed. A sketch of both helpers follows below.
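If it helps, here is a minimal sketch of the dual-logging and watchdog pieces, using only the Python standard library. The file paths, the 420-minute threshold, and the `page_local_oncall()` stub are illustrative placeholders (none of this is a Prefect API), so adapt them to your site:

```python
import logging
import time
from logging.handlers import RotatingFileHandler
from pathlib import Path

# Assumed site-specific paths -- placeholders, not anything Prefect provides.
LOCAL_LOG = Path("/var/log/lab/feeder.log")
LAST_FEED_MARKER = Path("/var/run/lab/last_feed")


def get_local_logger(name: str = "lab-ops") -> logging.Logger:
    """Rotating file logger that keeps a local record with or without network access."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = RotatingFileHandler(LOCAL_LOG, maxBytes=10_000_000, backupCount=10)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


def record_feed() -> None:
    """Touch a marker file after each successful feed; the watchdog reads its age."""
    LAST_FEED_MARKER.parent.mkdir(parents=True, exist_ok=True)
    LAST_FEED_MARKER.write_text(str(time.time()))


def page_local_oncall() -> None:
    """Placeholder: wire up to an SMS gateway, e-mail relay, or on-prem NMS."""
    ...


def watchdog(max_age_minutes: float = 420, poll_seconds: int = 60) -> None:
    """Independent local process: page on-site staff if the last feed is too old."""
    logger = get_local_logger("lab-watchdog")
    while True:
        try:
            age_min = (time.time() - float(LAST_FEED_MARKER.read_text())) / 60
        except (FileNotFoundError, ValueError):
            age_min = float("inf")
        if age_min > max_age_minutes:
            logger.error("No feed recorded for %.0f minutes; paging on-call", age_min)
            page_local_oncall()
        time.sleep(poll_seconds)
```

Call `get_local_logger()` and `record_feed()` from the feeder flow alongside the Prefect run logger, and run `watchdog()` as its own supervised process so local alerting does not depend on the worker or the network.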
4) Tune "zombie" detection and alerts for the critical lab
- Increase tolerance before marking runs late/crashed if you expect occasional WAN blips; use Automations to declare incidents/notify instead of cancel.
- Start with "late at ~3–5 min" and "crashed alert/incident at ~10–15 min," then adjust after dry runs. See docs: Detect zombie flows

5) Plan for control-plane outages that last hours
- If your organization can, add WAN redundancy (dual ISP plus an LTE/5G failover router) for the critical lab. It's often the simplest way to make Cloud a non-issue.
- If you need on-prem failover, keep your current self-hosted Prefect Server instance as a cold standby for the critical lab. Document a runbook to switch the worker's `PREFECT_API_URL` profile if Cloud is unavailable beyond N minutes. Practice it quarterly.
- Caveat: there's no built-in "automatic dual control plane" failover. Treat this as a manual emergency procedure for the critical lab only.

6) Keep deployments targeted and safe
- Use a dedicated work pool per lab and carefully scoped work queues, so Cloud only dispatches the critical-lab work to local workers there.
- For truly safety-critical device actions, encapsulate the action in one flow/task with internal retries/timeouts and clearly defined safe-mode fallbacks (sketch below).
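As a rough sketch of that encapsulation (assuming Prefect 3.x; the function names, retry counts, and timeouts are illustrative placeholders, not values from this thread):

```python
from prefect import flow, task, get_run_logger


@task(retries=3, retry_delay_seconds=10, timeout_seconds=120)
def feed_cells():
    """One feed cycle of device I/O; raise on failure so Prefect retries it."""
    ...  # pump/incubator calls go here


@task(retries=1, timeout_seconds=60)
def engage_safe_mode():
    """Fallback that keeps cells viable (e.g., hold temperature, minimal media flow)."""
    ...


@flow(name="feed-once")
def feed_once():
    logger = get_run_logger()
    try:
        feed_cells()
        logger.info("Feed completed")
    except Exception:
        logger.exception("Feed failed after retries; engaging safe mode")
        engage_safe_mode()
        raise  # keep the run marked failed so alerts fire once connectivity allows
```

The final `raise` leaves the flow run in a failed state so Cloud-side automations can notify the wider team when connectivity returns, while the safe-mode task has already protected the cells locally.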
7) Migration strategy that de-risks the critical lab
- Phase 1: Move the six tolerant labs to Cloud with local workers. Leave the critical lab on self-hosted for now.
- Phase 2: In the critical lab, continue to run locally, but mirror deployments to Cloud and run a "read-only" service flow in Cloud for observability. Add the service-flow pattern locally so life-critical actions do not depend on Cloud.
- Phase 3: After hardening network failover and validating automations/logging, switch the critical lab's control plane to Cloud. Keep the self-hosted instance and profile as a documented emergency fallback.

**What this gets you**
- Central monitoring and scheduling across six labs in Cloud.
- The critical lab is protected by local execution, "service flows" that don't depend on Cloud timers, independent local logging, and optional on-prem failover.
- Operators get actionable alerts for truly risky conditions, not just transient connectivity noise.

**Open questions to tailor the plan**
- Are you already on Prefect 3.x? If not, we'll plan that upgrade first.
- What's the current WAN resiliency at the critical lab (dual ISP, LTE/5G failover)?
- Do you require real-time central visibility during an outage, or is near-real-time (on reconnect) acceptable?
- Are there compliance constraints that affect whether cellular failover is allowed?
- Would you want a documented manual failover to your existing self-hosted Server for this one lab?

If you'd like, I can sketch concrete deployment/work-pool layouts and example automations for the critical lab once I know the above.
@Marvin could you create a discussion based on this?
thought for 16.3 seconds
I can do that! Before I create a GitHub Discussion, I need a bit more info to make sure it's useful and not duplicating existing threads. Could you share:
- Proposed title
- A 3–6 sentence summary of the problem/insight and the conclusion (if any)
- Prefect version (assume 3.x unless you say otherwise), environment (Cloud/self-hosted), worker type/infra
- Minimal repro steps and any key code/log snippets (sanitize secrets)
- What you tried and the current status
- Any relevant links (repo, docs, run, logs)
- Your preferred attribution (GitHub handle or anonymous)
- Which repo and category to post under (default is PrefectHQ/prefect Discussions; categories include Q&A, Ideas, Show and tell)

I'll first search existing discussions/issues to ensure it's not already covered, then post. If you'd like me to draft the post for your review before publishing, say "draft it" and share the details above. If you want me to summarize "this thread" directly, please confirm I should use the current Slack context and how you'd like to be credited.