# ask-marvin
@Marvin My work uses an Nvidia RunAI compute cluster that is k8s under the hood. I am thinking about how to deploy a work pool: I can either create a large training job with a lot of pods and spawn a Prefect process worker into it, use the Docker work pool, or use the Kubernetes one. Are there any pros and cons to doing any of them? I do not have access to a kubeconfig but I could try to get one; I'm just not sure if I can directly start pods with it.
thought for 462.6 seconds
Great question: there are a few viable patterns for Run:AI-backed Kubernetes, each with trade-offs. Here's a concise comparison and guidance based on Prefect 3.x.

TL;DR recommendation
- If you can run a worker in the cluster with RBAC to create Jobs: use a Kubernetes work pool. This gives you per-run pods, GPU requests per job, and the best scalability/isolation.
- If you cannot get RBAC/kubeconfig: run Process workers inside long-lived Run:AI pods. It's simple and doesn't need API access, but there's no per-run isolation and scaling is coarser.
- Avoid Docker workers inside Kubernetes. They expect access to a Docker daemon on the host, which you typically don't have (and clusters often run containerd, not Docker).

Option 1: Kubernetes work pool/worker
- Pros
  - Per-flow-run pods with clean isolation and ephemeral environments.
  - Scale across the cluster; easy to set GPU per run (nvidia.com/gpu).
  - You can customize the pod/job spec (image, volumes, SA, node selectors, tolerations, resource limits).
  - No kubeconfig needed if the worker runs in-cluster; it can use the pod's service account.
- Cons
  - Requires RBAC permission to create/list/watch/delete Jobs/Pods in your namespace.
  - You'll need to coordinate with your platform team (and Run:AI policies may add constraints).
- Access model
  - In-cluster: run the Prefect worker as a Deployment/Job in the same cluster and namespace. Bind its ServiceAccount to a Role with the necessary verbs on Jobs/Pods.
  - Out-of-cluster: use a kubeconfig and network access to the API server.
- GPU requests in the base job template
  - Fetch the default template, edit, and upload:
```
prefect work-pool get-default-base-job-template --type kubernetes > k8s-template.json
```
In that JSON, add GPU requests/limits under the primary container (typical k8s path):
```
spec.template.spec.containers[0].resources.requests["nvidia.com/gpu"] = "1"
spec.template.spec.containers[0].resources.limits["nvidia.com/gpu"] = "1"
```
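If you prefer to script the edit, here is a rough sketch. One assumption to verify: the exact nesting of the pod spec inside the exported file (for example, whether it lives under a `job_configuration.job_manifest` key), so adjust the path to match your `k8s-template.json`.
```
import json

# Sketch: patch GPU requests/limits into the exported base job template.
# ASSUMPTION: the Kubernetes Job manifest sits under job_configuration.job_manifest;
# verify against your exported k8s-template.json and adjust the path if needed.
with open("k8s-template.json") as f:
    template = json.load(f)

manifest = template["job_configuration"]["job_manifest"]
container = manifest["spec"]["template"]["spec"]["containers"][0]
resources = container.setdefault("resources", {})
resources.setdefault("requests", {})["nvidia.com/gpu"] = "1"
resources.setdefault("limits", {})["nvidia.com/gpu"] = "1"

with open("k8s-template.json", "w") as f:
    json.dump(template, f, indent=2)
```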
You can also add nodeSelector/tolerations to target GPU nodes as your cluster requires.
- Create the pool and start a worker (install the kubernetes extra):
```
prefect work-pool create "runai-k8s" --type kubernetes --base-job-template k8s-template.json
uv run --with "prefect[kubernetes]" prefect worker start --pool runai-k8s
```
- Docs
  - Kubernetes work pools how-to
  - Manage work pools
  - Developing a custom worker (handy if you want a Run:AI-CRD-based worker)

Option 2: Process worker(s) inside Run:AI pods
- Pros
  - Easiest if you don't have kubeconfig or API permissions: no Jobs are created; flows run as subprocesses in the same pod.
  - Good when you want to pin long-running work to dedicated GPU pods and scale by adding more pods.
- Cons
  - No per-run isolation; all flow runs share the same container environment and resources.
  - Package/env conflicts between flows; memory leaks from one run can affect others.
  - Scaling is coarse: to add capacity, launch more worker pods.
- Pattern
  - Launch N Run:AI pods (each with GPUs as needed); each one starts:
```
prefect worker start --pool my-process-pool
```
- Create a Process work pool (no extras needed) and point deployments at it.
- When to choose
  - You cannot get RBAC to create Jobs.
  - You're comfortable running everything inside a fixed training image.

Option 3: Docker work pool/worker
- Pros
  - Great on plain VMs with Docker; reproducible environments via images.
- Cons
  - Not a good fit for Kubernetes clusters unless you have privileged access to a Docker daemon on the node (rare and not recommended), and many clusters use containerd instead.
- When to choose
  - Only if you have non-Kubernetes nodes that can run Docker containers for your flows.
- Docs
  - Manage work pools (the Docker worker is covered there conceptually)
- Start a worker with the Docker extra:
```
uv run --with "prefect[docker]" prefect worker start --pool my-docker-pool
```
Run:AI-specific considerations
- Some orgs route GPU quotas and scheduling through Run:AI CRDs (not plain k8s Jobs). If your cluster enforces this, submitting vanilla Kubernetes Jobs may be restricted or may not count against quotas. Check with your platform team.
- If you must use Run:AI CRDs to launch training pods, consider building a custom Prefect worker that submits Run:AI jobs (see the custom worker docs above).

Operational notes and commands
- Create a work pool and inspect the default template:
```
prefect work-pool create "runai-k8s" --type kubernetes --paused
prefect work-pool get-default-base-job-template --type kubernetes > k8s-template.json
```
- For GPU customization, edit `k8s-template.json` in the `spec.template.spec.containers[0].resources` section. You can also set:
  - `serviceAccountName`
  - `nodeSelector` / `tolerations`
  - volumes/volumeMounts
  - image, env, etc.
- Start a worker
  - Kubernetes:
```
uv run --with "prefect[kubernetes]" prefect worker start --pool runai-k8s
```
  - Process:
```
prefect work-pool create "runai-process" --type process
prefect worker start --pool runai-process
```
- Deploy flows (Prefect 3.x)
  - Use `flow.from_source(...).deploy(...)` (don't use the removed 2.x build command). You can override job variables at deploy time to tweak things like CPU/memory/GPU per deployment (see the sketch at the end of this message).

Helpful docs
- Manage work pools
- Kubernetes work pools how-to
- Developing a custom worker

A few clarifying questions to steer you to the best setup:
1) Can you run a long-lived pod on Run:AI and install Prefect + your training deps there? If yes, Process workers are a quick win.
2) Can your team grant a ServiceAccount in your namespace RBAC to create/list/watch/delete Jobs/Pods? If yes, Kubernetes workers are likely the best fit.
3) Does your org require using Run:AI CRDs for GPU quota enforcement? If yes, we can discuss a custom worker approach.

If you can share your namespace constraints and whether Jobs are allowed, I can sketch the exact base job template and RBAC you'll need (including a minimal Role/RoleBinding).
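For reference, here is a minimal sketch of that 3.x deploy call; the repository URL, entrypoint, deployment name, and job variables are placeholders to adapt to your setup.
```
from prefect import flow

# Sketch: deploy a flow from source control to a work pool, overriding job variables
# at deploy time. All names below are placeholders.
flow.from_source(
    source="https://github.com/your-org/your-repo",   # placeholder repo
    entrypoint="flows/train.py:train_flow",            # placeholder path:flow_function
).deploy(
    name="train-on-runai",
    work_pool_name="runai-process",                    # or "runai-k8s" for the Kubernetes pool
    job_variables={"env": {"MY_SETTING": "value"}},    # example override; available keys depend on the pool type
)
```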
@Marvin I decided to go with the process worker. Sometimes the system restarts certain pods while my worker is in the middle of executing a long task (one that takes hours, so the grace period doesn't help). Is it possible for the Prefect process worker, once it receives a SIGTERM, to trigger a different flow run, do some graceful retry, and shut down?
thought for 267.9 seconds
Short answer: not automatically. When a Process worker/pod gets SIGTERM mid-run, the in-flight flow run typically ends as Crashed, and Prefect's built-in retries do not fire on Crashed. You have a few workable patterns to get "graceful retry + shutdown" behavior:

Option A: Cooperative shutdown in your flow code (recommended if you can change the long task)
- Goal: Convert a pod preemption into a normal Failure so Prefect retries, and/or explicitly trigger a recovery run.
- How:
  - Catch SIGTERM and set a flag.
  - Make your long-running task check that flag periodically, persist a checkpoint, then raise a regular exception. The run will enter Failed and your configured retries will kick in.
  - If you need to trigger a different flow/deployment, call `run_deployment(...)` before exiting.

Example sketch:
```
import signal
from prefect import flow, task, get_run_logger
from prefect.deployments.flow_runs import run_deployment

stop_requested = False

def _sigterm_handler(signum, frame):
    # keep this fast; just set a flag
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, _sigterm_handler)

@task(retries=3, retry_delay_seconds=300)
def long_training():
    logger = get_run_logger()
    # load last checkpoint if present

    # work_remaining(), do_some_work_chunk(), and save_checkpoint() are placeholders
    # for your own training loop and checkpoint logic
    while work_remaining():
        do_some_work_chunk()
        # periodically check for preemption
        if stop_requested:
            logger.warning("Preemption detected; checkpointing and exiting for retry.")
            save_checkpoint()
            # Optionally trigger a different deployment for recovery
            # run_deployment(name="project/recovery-deployment", parameters={"resume_from": current_step}, as_subflow=False)
            raise RuntimeError("Preempted; retrying from checkpoint")

@flow
def train_flow():
    long_training()
```
Notes
- Keep the signal handler trivial; do real work in the main loop. If your loop can't check frequently, consider adding cooperative checkpoints between epochs/steps.
- Prefect task/flow retries only apply to Failed/Retrying states, not Crashed. The pattern above turns preemption into a Failure so retries apply.
- Retries how-to: Prefect 3 retries guide

Option B: Use an Automation to re-run on Crashed
- Create an Automation in the UI that triggers when a flow run for your deployment enters Crashed, with the action "Run deployment" pointing at a recovery deployment (optionally with parameters to resume from a checkpoint).
- This avoids in-process signal handling and works even if the pod dies too fast for your code to run.

Option C: Wrap the worker with a tiny supervisor
- If you really want "on SIGTERM, trigger another flow run," you can run the worker under a small parent process that traps SIGTERM, fires `run_deployment(...)`, then forwards the signal/terminates the worker.
- Caveat: You still need enough termination grace to make that API call.

Sketch:
```
# worker_supervisor.py
import signal, subprocess, sys
from prefect.deployments.flow_runs import run_deployment

child = subprocess.Popen(["prefect", "worker", "start", "--pool", "my-process-pool"])

def handle(sig, frame):
    try:
        run_deployment(name="project/recovery-deployment", as_subflow=False)
    finally:
        child.terminate()

signal.signal(signal.SIGTERM, handle)
rc = child.wait()
sys.exit(rc)
```
Reality check and next steps
- If your pod is killed quickly (small terminationGracePeriodSeconds or abrupt node loss), there may be no time for any in-process rescue. In that case, Option B (Automation) is the most reliable.
- The most robust approach for multi-hour work is to add frequent checkpoints and cooperative preemption handling (Option A), so a retry (or recovery flow) can resume quickly.
- If you want, share a bit about your long task shape (can it checkpoint every N minutes/steps?), and I can sketch a more concrete pattern for the checkpoint + retry loop.
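For illustration, here is one shape such a checkpoint + retry loop could take. This is a sketch, not a drop-in solution: the checkpoint path, step counts, and the `train_step` helper are hypothetical placeholders.
```
import json
import signal
import time
from pathlib import Path

from prefect import flow, task, get_run_logger

CHECKPOINT = Path("/data/checkpoints/train.json")  # hypothetical checkpoint location
stop_requested = False

def _sigterm_handler(signum, frame):
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, _sigterm_handler)

def train_step(step: int) -> None:
    # placeholder for one real unit of training work
    time.sleep(0.01)

@task(retries=5, retry_delay_seconds=120)
def resumable_training(total_steps: int = 10_000) -> None:
    logger = get_run_logger()
    # resume from the last persisted step, if any
    step = json.loads(CHECKPOINT.read_text())["step"] if CHECKPOINT.exists() else 0

    while step < total_steps:
        train_step(step)
        step += 1
        # checkpoint periodically, and always before bailing out on preemption
        if step % 100 == 0 or stop_requested:
            CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
            CHECKPOINT.write_text(json.dumps({"step": step}))
        if stop_requested:
            logger.warning("Preempted at step %s; failing so a retry can resume from the checkpoint.", step)
            raise RuntimeError(f"Preempted at step {step}")

@flow
def train_flow() -> None:
    resumable_training()
```
The key design point is that preemption surfaces as an ordinary Failure (so retries apply), while the persisted checkpoint lets the retried run skip work that already completed.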
@Marvin Does Prefect work with tini for signal handling?
thought for 68.8 seconds
Yes, Prefect works fine with tini. In fact, running your container with a real init like tini is a good idea for signal forwarding and zombie reaping.

How it behaves
- Kubernetes sends SIGTERM to PID 1. With tini as PID 1, that SIGTERM is forwarded to your Prefect process.
- Prefect's worker and flow engines install their own SIGTERM handlers; when they receive the signal they begin a graceful shutdown and propagate cancellation to child/subprocesses. Your flow code can then checkpoint/exit as discussed earlier.
- Using tini reduces "stuck" zombies from subprocesses (e.g., long trainings, multiprocess libs) and improves reliability during shutdown.

Recommended setups
- Dockerfile:
```
FROM python:3.11-slim
RUN apt-get update && apt-get install -y tini && rm -rf /var/lib/apt/lists/*
ENTRYPOINT ["/usr/bin/tini","--"]
CMD ["prefect","worker","start","--pool","runai-process"]
```
Tip: If you use a shell entrypoint script, end with `exec prefect worker start ...` so the Prefect process replaces the shell and receives signals directly from tini.
- Kubernetes Pod/Deployment (container spec):
```
command: ["/usr/bin/tini","--","prefect","worker","start","--pool","runai-process"]
# optionally add "-g" to tini to forward to the whole process group:
# command: ["/usr/bin/tini","-g","--","prefect","worker","start","--pool","runai-process"]
terminationGracePeriodSeconds: 120
```
The `-g` flag (optional) tells tini to send signals to the entire process group, which can help if your workload spawns deep subprocess trees.

Notes
- Avoid wrapping Prefect in `sh -c` unless you use `exec`; shells can swallow/alter signal propagation.
- tini doesn't guarantee your code runs on shutdown; it just ensures the signal arrives. For "graceful retry + shutdown," keep the cooperative checkpoint pattern in your flow/tasks.
- Retries don't run on the Crashed state; use cooperative shutdown to fail normally (so retries apply) or attach an Automation to re-run on Crashed. Retries guide: Prefect 3 retries.

If you share your current container command/entrypoint, I can suggest the exact tini invocation for your image.
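If you want to sanity-check that SIGTERM actually reaches your process under tini, you can temporarily run a small standard-library script like this inside the container (purely illustrative):
```
import os
import signal
import time

def handler(signum, frame):
    # report which signal arrived and to which PID, then exit cleanly
    print(f"received {signal.Signals(signum).name} in pid {os.getpid()}", flush=True)
    raise SystemExit(0)

signal.signal(signal.SIGTERM, handler)
print(f"waiting for SIGTERM as pid {os.getpid()}", flush=True)

while True:
    time.sleep(1)
```
If stopping or preempting the pod makes the message show up in the logs before the container exits, signal delivery through tini is working.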
@Marvin My Entrypoint:
```
#!/usr/bin/env sh
set -e

# Derive worker name from env with sensible defaults:
# 1) PREFECT_WORKER_NAME (explicit override)
# 2) POD_NAME (Kubernetes downward API)
# 3) HOSTNAME (Kubernetes default)
WORKER_NAME="${PREFECT_WORKER_NAME:-${POD_NAME:-$HOSTNAME}}"

# Allow configuring pool and concurrency limit via env, with defaults
POOL="${PREFECT_WORK_POOL:-runai-test}"
LIMIT="${PREFECT_WORKER_LIMIT:-1}"

exec uv run --python "/opt/venv/bin/python" \
  prefect worker start \
  --pool "$POOL" \
  --limit "$LIMIT" \
  --name "$WORKER_NAME" \
  "$@"
```

My dockerfile:
```
...
# Entrypoint: starts Prefect worker; names it after the pod by default
ENTRYPOINT ["/usr/bin/tini", "--", "/usr/local/bin/prefect-entrypoint.sh"]
```

Would this be fine?
thought for 55.7 seconds
Yes, this is generally solid. You're using tini as PID 1 and `exec` in the entrypoint, so signals should reach the Prefect worker cleanly. A few small tweaks I'd recommend:
- Prefer running Prefect directly instead of via `uv run` in the final exec. It removes an extra process layer and avoids any uncertainty about signal forwarding. For example:
```
exec /opt/venv/bin/prefect worker start \
    --pool "$POOL" \
    --limit "$LIMIT" \
    --name "$WORKER_NAME" \
    "$@"
```

or

```
exec /opt/venv/bin/python -m prefect worker start \
    --pool "$POOL" \
    --limit "$LIMIT" \
    --name "$WORKER_NAME" \
    "$@"
```
If you keep `uv run`, it will likely work, but the direct call is the most predictable for signal handling.
- Consider `tini -g` in the Dockerfile to forward signals to the whole process group, which helps if your workload spawns multiple subprocesses:
```
ENTRYPOINT ["/usr/bin/tini", "-g", "--", "/usr/local/bin/prefect-entrypoint.sh"]
```
- Keep a generous `terminationGracePeriodSeconds` on the Pod so the worker and any long-running subprocesses have time to checkpoint and exit.
- Optional niceties:
  - Allow targeting a specific work queue via env:
```
if [ -n "${PREFECT_WORK_QUEUE:-}" ]; then
  set -- --work-queue "$PREFECT_WORK_QUEUE" "$@"
fi
```
- Make sure `PREFECT_API_URL` and `PREFECT_API_KEY` are present in the environment.

Your current script structure, env-derived worker name, and use of `exec` are all good. If you want, I can share a minimal "direct call" entrypoint with the optional work-queue bit included.