Tom Han
10/01/2025, 4:51 PM
Marvin
10/01/2025, 4:59 PM
- Export the default Kubernetes base job template:
prefect work-pool get-default-base-job-template --type kubernetes > k8s-template.json
In that JSON, add GPU requests/limits under the primary container (typical k8s path):
spec.template.spec.containers[0].resources.requests["nvidia.com/gpu"] = "1"
spec.template.spec.containers[0].resources.limits["nvidia.com/gpu"] = "1"
You can also add nodeSelector/tolerations to target GPU nodes as your cluster requires (a Python sketch for patching the template follows the docs links below).
- Create the pool and start a worker (install k8s extra):
prefect work-pool create "runai-k8s" --type kubernetes --base-job-template k8s-template.json
uv run --with "prefect[kubernetes]" prefect worker start --pool runai-k8s
- Docs
- Kubernetes work pools how-to
- Manage work pools
- Developing a custom worker (handy if you want a Run:AI-CRD-based worker)
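If it helps, here is a minimal Python sketch of patching the exported template, assuming the exported JSON nests the Kubernetes Job manifest under job_configuration.job_manifest (open your k8s-template.json to confirm the exact structure). The nodeSelector/toleration values are placeholders for whatever your cluster uses to mark GPU nodes:
import json

# Sketch: add a GPU request/limit to the exported base job template.
# Assumes the Job manifest lives at job_configuration.job_manifest in the
# exported JSON; verify against your own k8s-template.json before using.
with open("k8s-template.json") as f:
    template = json.load(f)

pod_spec = template["job_configuration"]["job_manifest"]["spec"]["template"]["spec"]
container = pod_spec["containers"][0]

resources = container.setdefault("resources", {})
resources.setdefault("requests", {})["nvidia.com/gpu"] = "1"
resources.setdefault("limits", {})["nvidia.com/gpu"] = "1"

# Optional: steer runs onto GPU nodes. The label and taint below are
# cluster-specific placeholders; use whatever your platform team has configured.
pod_spec["nodeSelector"] = {"nvidia.com/gpu.present": "true"}
pod_spec["tolerations"] = [
    {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
]

with open("k8s-template.json", "w") as f:
    json.dump(template, f, indent=2)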
Option 2: Process worker(s) inside Run:AI pods
- Pros
- Easiest if you don’t have kubeconfig or API permissions — no Jobs are created; flows run as subprocesses in the same pod.
- Good when you want to pin long-running work to dedicated GPU pods and scale by adding more pods.
- Cons
- No per-run isolation; all flow runs share the same container environment and resources.
- Package/env conflicts between flows; memory leaks from one run can affect others.
- Scaling is coarse: to add capacity, launch more worker pods.
- Pattern
- Launch N Run:AI pods (each with GPUs as needed), each starts:
prefect worker start --pool my-process-pool
- Create a Process work pool (no extras needed) and point deployments at it.
- When to choose
- You cannot get RBAC to create Jobs.
- You’re comfortable running everything inside a fixed training image.
Option 3: Docker work pool/worker
- Pros
- Great on plain VMs with Docker; reproducible environments via images.
- Cons
- Not a good fit for Kubernetes clusters: it requires privileged access to a Docker daemon on the node (rare and not recommended), and many clusters use containerd instead.
- When to choose
- You're on plain VMs with a Docker daemon available rather than inside a Kubernetes cluster.
- Start a worker (install the docker extra):
uv run --with "prefect[docker]" prefect worker start --pool my-docker-pool
Run:AI-specific considerations
- Some orgs route GPU quotas and scheduling through Run:AI CRDs (not plain k8s Jobs). If your cluster enforces this, submitting vanilla Kubernetes Jobs may be restricted or may not count against quotas. Check with your platform team.
- If you must use Run:AI CRDs to launch training pods, consider building a custom Prefect worker that submits Run:AI jobs (see custom worker docs above).
Operational notes and commands
- Create a work pool and inspect the default template:
prefect work-pool create "runai-k8s" --type kubernetes --paused
prefect work-pool get-default-base-job-template --type kubernetes > k8s-template.json
- For GPU customization, edit k8s-template.json in the spec.template.spec.containers[0].resources section. You can also set:
- serviceAccountName
- nodeSelector / tolerations
- volumes/volumeMounts
- image, env, etc.
- Start a worker
- Kubernetes:
uv run --with "prefect[kubernetes]" prefect worker start --pool runai-k8s
- Process:
prefect work-pool create "runai-process" --type process
prefect worker start --pool runai-process
- Deploy flows (Prefect 3.x)
- Use flow.from_source(...).deploy(...) (don’t use the removed 2.x build command). You can override job variables at deploy time to tweak things like CPU/memory/GPU per deployment.
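For example, a deployment targeting the runai-k8s pool with a per-deployment image override might look like the sketch below (the repo URL, entrypoint, and image are placeholders):
from prefect import flow

# Sketch: deploy a flow from source control to the Kubernetes pool created above.
# The repo URL, entrypoint, and image below are placeholders for your project.
flow.from_source(
    source="https://github.com/your-org/your-repo",
    entrypoint="flows/train.py:train_flow",
).deploy(
    name="gpu-training",
    work_pool_name="runai-k8s",
    # job_variables override fields the work pool's base job template exposes
    job_variables={"image": "your-registry/train:latest"},
)
Note that job_variables can only override fields the base job template exposes as variables; anything hard-coded directly in the manifest (such as a GPU count added there) applies to every run on that pool.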
Helpful docs
- Manage work pools
- Kubernetes work pools how-to
- Developing a custom worker
A few clarifying questions to steer you to the best setup
1) Can you run a long-lived pod on Run:AI and install Prefect + your training deps there? If yes, Process workers are a quick win.
2) Can your team grant a ServiceAccount in your namespace with RBAC to create/list/watch/delete Jobs/Pods? If yes, Kubernetes workers are likely the best fit.
3) Does your org require using Run:AI CRDs for GPU quota enforcement? If yes, we can discuss a custom worker approach.
If you can share your namespace constraints and whether Jobs are allowed, I can sketch the exact base job template and RBAC you'll need (including a minimal Role/RoleBinding).
Tom Han
10/01/2025, 5:15 PM
Marvin
10/01/2025, 5:19 PM
Option A — Cooperative shutdown in your flow: catch SIGTERM, checkpoint, and fail so retries apply; optionally trigger a recovery deployment with run_deployment(...) before exiting.
Example sketch:
import signal
from prefect import flow, task, get_run_logger
from prefect.deployments.flow_runs import run_deployment

stop_requested = False

def _sigterm_handler(signum, frame):
    # keep this fast; just set a flag
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, _sigterm_handler)

@task(retries=3, retry_delay_seconds=300)
def long_training():
    # work_remaining(), do_some_work_chunk(), save_checkpoint() are placeholders
    # for your own training loop and checkpoint logic
    logger = get_run_logger()
    # load last checkpoint if present
    while work_remaining():
        do_some_work_chunk()
        # periodically check for preemption
        if stop_requested:
            logger.warning("Preemption detected; checkpointing and exiting for retry.")
            save_checkpoint()
            # Optionally trigger a different deployment for recovery
            # run_deployment(name="project/recovery-deployment", parameters={"resume_from": current_step}, as_subflow=False)
            raise RuntimeError("Preempted; retrying from checkpoint")

@flow
def train_flow():
    long_training()
Notes
- Keep the signal handler trivial; do real work in the main loop. If your loop can’t check frequently, consider adding cooperative checkpoints between epochs/steps.
- Prefect task/flow retries only apply to Failed/Retrying states; not Crashed. The pattern above turns preemption into a Failure so retries apply.
- Retries how-to: Prefect 3 retries guide
Option B — Use an Automation to re-run on Crashed
- Create an Automation in the UI that triggers when a flow run for your deployment enters Crashed, and action “Run deployment” for a recovery deployment (optionally with parameters to resume from a checkpoint).
- This avoids in-process signal handling and works even if the pod dies too fast for your code to run.
Option C — Wrap the worker with a tiny supervisor
- If you really want “on SIGTERM, trigger another flow run,” you can run the worker under a small parent process that traps SIGTERM, fires run_deployment(...), then forwards the signal/terminates the worker.
- Caveat: You still need enough termination grace to make that API call.
Sketch:
# worker_supervisor.py
import signal, subprocess, sys
from prefect.deployments.flow_runs import run_deployment

child = subprocess.Popen(["prefect", "worker", "start", "--pool", "my-process-pool"])

def handle(sig, frame):
    try:
        run_deployment(name="project/recovery-deployment", as_subflow=False)
    finally:
        child.terminate()

signal.signal(signal.SIGTERM, handle)
rc = child.wait()
sys.exit(rc)
Reality check and next steps
- If your pod is killed quickly (small terminationGracePeriodSeconds or abrupt node loss), there may be no time for any in-process rescue. In that case, Option B (Automation) is the most reliable.
- The most robust approach for multi-hour work is to add frequent checkpoints and cooperative preemption handling (Option A), so a retry (or recovery flow) can resume quickly.
- If you want, share a bit about your long task shape (can it checkpoint every N minutes/steps?), and I can sketch a more concrete pattern for the checkpoint + retry loop.
Tom Han
10/01/2025, 6:03 PM
Marvin
10/01/2025, 6:04 PM
- Dockerfile (tini as PID 1 so shutdown signals reach the worker)
FROM python:3.11-slim
RUN apt-get update && apt-get install -y tini && rm -rf /var/lib/apt/lists/*
ENTRYPOINT ["/usr/bin/tini","--"]
CMD ["prefect","worker","start","--pool","runai-process"]
Tip: If you use a shell entrypoint script, end with exec prefect worker start ... so the Prefect worker replaces the shell and receives the signals tini forwards.
- Kubernetes Pod/Deployment (container spec)
command: ["/usr/bin/tini","--","prefect","worker","start","--pool","runai-process"]
# optionally add "-g" to tini to forward to the whole process group:
# command: ["/usr/bin/tini","-g","--","prefect","worker","start","--pool","runai-process"]
terminationGracePeriodSeconds: 120
The -g flag (optional) tells tini to send signals to the entire process group, which can help if your workload spawns deep subprocess trees.
Notes
- Avoid wrapping Prefect in sh -c unless you use `exec`; shells can swallow/alter signal propagation.
- tini doesn’t guarantee your code runs on shutdown; it just ensures the signal arrives. For “graceful retry + shutdown,” keep the cooperative checkpoint pattern in your flow/tasks.
- Retries don’t run on Crashed state; use cooperative shutdown to fail normally (so retries apply) or attach an Automation to re-run on Crashed. Retries guide: Prefect 3 retries.
If you share your current container command/entrypoint, I can suggest the exact tini invocation for your image.
Tom Han
10/01/2025, 6:26 PM
#!/usr/bin/env sh
set -e
# Derive worker name from env with sensible defaults:
# 1) PREFECT_WORKER_NAME (explicit override)
# 2) POD_NAME (Kubernetes downward API)
# 3) HOSTNAME (Kubernetes default)
WORKER_NAME="${PREFECT_WORKER_NAME:-${POD_NAME:-$HOSTNAME}}"
# Allow configuring pool and concurrency limit via env, with defaults
POOL="${PREFECT_WORK_POOL:-runai-test}"
LIMIT="${PREFECT_WORKER_LIMIT:-1}"
exec uv run --python "/opt/venv/bin/python" \
prefect worker start \
--pool "$POOL" \
--limit "$LIMIT" \
--name "$WORKER_NAME" \
"$@"
My dockerfile:
...
# Entrypoint: starts Prefect worker; names it after the pod by default
ENTRYPOINT ["/usr/bin/tini", "--", "/usr/local/bin/prefect-entrypoint.sh"]
Would this be fine?
Marvin
10/01/2025, 6:42 PM
Yes, this looks fine: you're using tini and exec in the entrypoint, so signals should reach the Prefect worker cleanly.
A few small tweaks I’d recommend:
- Prefer running Prefect directly instead of via uv run in the final exec. It removes an extra process layer and avoids any uncertainty about signal forwarding. For example:
exec /opt/venv/bin/prefect worker start \
--pool "$POOL" \
--limit "$LIMIT" \
--name "$WORKER_NAME" \
"$@"
or
exec /opt/venv/bin/python -m prefect worker start \
--pool "$POOL" \
--limit "$LIMIT" \
--name "$WORKER_NAME" \
"$@"
If you keep uv run, it will likely work, but the direct call is the most predictable for signal handling.
- Consider tini -g in the Dockerfile to forward signals to the whole process group, which helps if your workload spawns multiple subprocesses:
ENTRYPOINT ["/usr/bin/tini", "-g", "--", "/usr/local/bin/prefect-entrypoint.sh"]
- Keep a generous terminationGracePeriodSeconds on the Pod so the worker and any long-running subprocesses have time to checkpoint and exit.
- Optional niceties:
- Allow targeting a specific work queue via env:
if [ -n "${PREFECT_WORK_QUEUE:-}" ]; then
  set -- --work-queue "$PREFECT_WORK_QUEUE" "$@"
fi
- Make sure PREFECT_API_URL and PREFECT_API_KEY are present in the environment.
Your current script structure, env-derived worker name, and use of exec are all good. If you want, I can share a minimal “direct call” entrypoint with the optional work-queue bit included.