<@ULVA73B9P> I am running prefect through helm on ...
# ask-marvin
h
@Marvin I am running Prefect through Helm on Kubernetes and some of my pods are getting OOMKilled by the cluster. I would like to increase memory limits. As far as I've understood, this can be done by changing the BaseJobTemplate of the Kubernetes worker. Is that correct? How would I set some higher memory usage and also a limit?
m
thought for 199.0 seconds
Short answer: yes, if the pods being OOMKilled are your flow run pods created by the Kubernetes worker, you should raise memory/CPU requests/limits in the Kubernetes work pool's Base Job Template. If the OOMs are happening on the worker pod itself (or other Helm-managed Prefect pods), you should raise the Helm chart's container resources for that component instead.

A few quick questions to confirm so we target the right fix:
- Are the OOMKilled pods the short-lived flow run Job pods (often named like prefect-job-...) or the long-running worker pod from your Helm release?
- Which namespace/pod names are impacted?

If it's flow run jobs (most common):
1) Export the default base job template for the Kubernetes work pool:
Copy code
prefect work-pool get-default-base-job-template --type kubernetes -f k8s-job-template.json
2) Edit k8s-job-template.json and add resources under the container in spec.template.spec.containers[0].resources. For example:
Copy code
{
  "apiVersion": "batch/v1",
  "kind": "Job",
  "spec": {
    "template": {
      "spec": {
        "containers": [
          {
            "name": "flow-runner",
            "resources": {
              "requests": {
                "cpu": "500m",
                "memory": "1Gi"
              },
              "limits": {
                "cpu": "2",
                "memory": "2Gi"
              }
            }
          }
        ]
      }
    }
  }
}
Notes:
- Keep the rest of the template content intact; just add/modify the resources block on the flow container.
- Units: use "Mi"/"Gi" for memory and millicores ("m") for CPU where appropriate.

3) Update your work pool to use the edited template:
Copy code
prefect work-pool update "<your-pool-name>" --base-job-template k8s-job-template.json
4) Run a new flow; confirm the spawned Pod has the new requests/limits:
Copy code
kubectl describe pod <pod-name> -n <ns> | grep -A5 "Limits\|Requests"
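To confirm a container was actually OOMKilled (rather than evicted), you can also check its last terminated state. This is plain kubectl, not Prefect-specific, and only works while the Job pod still exists:
Copy code
kubectl get pod <pod-name> -n <ns> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'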
If it's the worker pod (or any Helm-managed Prefect pod) being OOMKilled:
- Increase that Deployment's container resources in your Helm values and upgrade the release. A typical snippet looks like:
Copy code
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "2"
    memory: "2Gi"
Apply this under the component section for the worker (or the specific service) in your values file, then:
Copy code
helm upgrade <release> <chart> -n <ns> -f values.yaml
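For example, with the prefect-worker chart the resources block usually sits under the worker key; the exact path can vary by chart version, so confirm with helm show values before applying. Illustrative values.yaml sketch:
Copy code
# values.yaml (prefect-worker chart; key layout may differ in your chart version)
worker:
  resources:
    requests:
      cpu: "500m"
      memory: "1Gi"
    limits:
      cpu: "2"
      memory: "2Gi"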
Helpful docs:
- Kubernetes worker + job template guidance: Prefect 3.x: Kubernetes deployments and workers
- Managing work pools (including base job templates): Prefect 3.x: Manage work pools

If you can share the exact pod names/namespaces that are OOMKilled, I can point you to the precise place to change resources (job template vs Helm values) and provide a tailored example.
h
@Marvin what happens when this is not specified? Would the flows just try to use as much memory as they can, depending on what's available?
m
thought for 66.2 seconds
Great question. Short answer: if you don't set resources in the Kubernetes job template, Kubernetes treats your flow-run container as BestEffort (unless your namespace has a LimitRange that injects defaults). That means:
- No requests: the scheduler reserves 0 CPU/memory for the pod. It can land anywhere and is first to be evicted under pressure.
- No limits: there's no cgroup cap on memory/CPU. The container can use as much as is available on the node. If the node runs out of memory, the kernel may kill the container (you may see the container terminated with reason OOMKilled) or the pod may be evicted.

A few nuances:
- If your namespace has a LimitRange, it may automatically set default requests/limits even if you don't specify them in the template. In that case, exceeding the memory limit will cause the container to be OOMKilled.
- QoS classes: BestEffort (no requests/limits), Burstable (requests set but not equal to limits), Guaranteed (requests = limits for cpu and memory). Guaranteed pods are last to be evicted.

How to get predictable behavior:
- Set both requests and limits in the work pool's Base Job Template for the Kubernetes worker.
- If you want Guaranteed QoS, set requests equal to limits for both cpu and memory.

Example resources to add to your base job template container:
Copy code
"resources": {
  "requests": {
    "cpu": "500m",
    "memory": "1Gi"
  },
  "limits": {
    "cpu": "500m",
    "memory": "1Gi"
  }
}
Or allow some burst while capping memory:
Copy code
"resources": {
  "requests": {
    "cpu": "500m",
    "memory": "1Gi"
  },
  "limits": {
    "cpu": "2",
    "memory": "2Gi"
  }
}
Useful checks
- See what your pod actually had:
Copy code
kubectl describe pod <pod> -n <ns> | grep -A5 "Limits\|Requests"
- Check for namespace defaults:
Copy code
kubectl get limitrange -n <ns>
kubectl describe limitrange -n <ns>
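For context, a namespace default that would be injected into pods without explicit resources looks roughly like this (illustrative LimitRange manifest, not taken from your cluster):
Copy code
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
spec:
  limits:
    - type: Container
      defaultRequest:   # applied as the request when none is set
        memory: "256Mi"
      default:          # applied as the limit when none is set
        memory: "512Mi"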
Docs and how-tos
- Prefect 3.x Kubernetes workers and templates: Kubernetes deployments and workers
- Managing work pools (including base job templates): Manage work pools

If you want, share a kubectl describe of one of the OOMKilled pods and your namespace LimitRange (if any) and I can suggest exact request/limit values.
h
@Marvin I am getting this more often as I am now scheduling around 30-40 deployments at the same time through the same worker. To me it makes sense that when all of them run at around 8pm, eventually some don't have enough resources and get OOMKilled. Would it also make sense to set concurrency limits to make sure they are not all run in parallel?
m
thought for 182.8 seconds
Yes, setting a work pool concurrency limit is a great way to avoid a "stampede" of 30–40 flow runs starting at once and contending for cluster memory.

What it does
- Work pool concurrency limits cap how many flow runs from that pool can run at the same time. For Kubernetes workers, that directly limits how many Job/Pods the worker creates concurrently.
- Remaining runs stay in a PENDING state and are picked up as slots free up.
- This complements setting proper CPU/memory requests/limits in your Base Job Template; both together reduce OOMs.

How to set it:
Copy code
prefect work-pool set-concurrency-limit "<your-pool-name>" 10
- Pick a number that fits your cluster capacity and each run's resource requests. For example, if each run requests ~1Gi and you can comfortably run 10 in parallel, set it to 10.
- To remove it later:
Copy code
prefect work-pool clear-concurrency-limit "<your-pool-name>"
Related notes
- Tag-based concurrency limits (via prefect concurrency-limit ...) throttle task runs by tag, not flow runs. They're useful inside flows but won't stop the worker from creating Pods in the first place.
- If all deployments are scheduled for the same minute, consider staggering their schedules slightly to reduce burstiness (see the sketch below).
- Still set CPU/memory requests/limits in the Kubernetes Base Job Template so each Pod gets reserved resources and has a memory cap that prevents node-wide pressure.

Docs
- Work pools and concurrency: Manage work pools (Prefect 3.x)
- Kubernetes worker and job templates: Kubernetes deployments and workers (Prefect 3.x)

If you share your cluster size and typical per-run memory needs, I can help estimate a good starting concurrency limit.
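As a rough sketch of the staggering idea, you could offset each deployment's cron start minute when deploying. The flow, deployment names, pool name, and image below are placeholders; adjust to however you deploy today:
Copy code
from prefect import flow

@flow
def kl_example():
    ...

if __name__ == "__main__":
    # Offset the start minute per deployment so runs don't all launch at 20:00 sharp
    for i, name in enumerate(["kl_a", "kl_b", "kl_c"]):   # placeholder deployment names
        kl_example.deploy(
            name=name,
            work_pool_name="<your-k8s-pool>",
            image="<your-flow-image>",   # placeholder; reuse the image you already deploy with
            build=False,                 # assume the image already exists; skip building/pushing
            push=False,
            cron=f"{i * 5} 20 * * *",    # 20:00, 20:05, 20:10, ...
        )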
h
@Marvin I have now set a limit of 10, which should be okay. Can I somehow trigger all deployments that have a "kl_" in their name through the CLI? I would like to test this and not go through the UI one by one.
m
thought for 137.1 seconds
Yes, that's a good approach. There isn't a built-in CLI filter-by-pattern flag, but you can combine listing + shell filtering to trigger them in bulk.

Bash/Zsh (Linux/macOS)
- Preview (dry run) the commands you'll execute:
Copy code
prefect deployment ls | awk '$1 ~ /\/kl_/ {print $1}' | xargs -n1 -I {} echo prefect deployment run "{}"
- Execute:
Copy code
prefect deployment ls | awk '$1 ~ /\/kl_/ {print $1}' | xargs -n1 -I {} prefect deployment run "{}"
This matches NAME entries containing "/kl_" (the NAME column is flow/deployment, so it catches deployment names that start with kl_) and runs each one.

PowerShell (Windows)
- Preview:
Copy code
prefect deployment ls | Select-String '/kl_' | ForEach-Object { ($_ -split '\s+')[0] } | ForEach-Object { "prefect deployment run $($_)" }
- Execute:
Copy code
prefect deployment ls | Select-String '/kl_' | ForEach-Object { ($_ -split '\s+')[0] } | ForEach-Object { prefect deployment run $_ }
Notes
- Your work pool concurrency limit of 10 will ensure only 10 flow runs execute at once; the rest will queue.
- If you want to stagger starts slightly, you can add a small sleep in the loop (e.g., sh -c 'prefect deployment run "{}"; sleep 0.5' inside xargs), as shown below.

Docs
- Prefect 3.x CLI: deployment
- Run deployments
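Putting that together, a sketch of the staggered variant with a short pause between submissions (adjust the sleep to taste):
Copy code
prefect deployment ls | awk '$1 ~ /\/kl_/ {print $1}' | xargs -n1 -I {} sh -c 'prefect deployment run "{}"; sleep 2'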
h
@Marvin what's currently the best way to set up Teams notifications when a flow fails?
m
thought for 84.8 seconds
Great question. It depends a bit on whether you're using Prefect Cloud or self-hosted Prefect 3. First, are you on Prefect Cloud or a self-hosted Prefect Server? I'll outline both.

Option A: Prefect Cloud (recommended for this use case)
Use an Automation that calls a Teams incoming webhook when a flow run fails. Steps:
1) Create a Teams incoming webhook
   - In the Teams channel, add the "Incoming Webhook" connector and copy the URL.
   - Docs: Add an incoming webhook (Microsoft Teams)
2) Create a Prefect Automation (UI)
   - Trigger: Flow run enters state Failed (you can optionally include Crashed).
   - Optional filters: specific deployments, tags, or work pool.
   - Action: Call a webhook
     - URL: your Teams webhook URL
     - Method: POST
     - Headers: Content-Type: application/json
     - Body: a Teams card payload, e.g.:
Copy code
{
  "@type": "MessageCard",
  "@context": "<http://schema.org/extensions>",
  "themeColor": "E81123",
  "summary": "Prefect flow run failed",
  "title": "Flow run failed",
  "sections": [
    {
      "activityTitle": "A flow run failed",
      "facts": [
        {
          "name": "Flow",
          "value": "{{ event.payload.flow_run.name }}"
        },
        {
          "name": "Deployment",
          "value": "{{ event.payload.deployment.name }}"
        },
        {
          "name": "Run ID",
          "value": "{{ event.payload.flow_run.id }}"
        },
        {
          "name": "State",
          "value": "{{ event.payload.state.name }}"
        },
        {
          "name": "Time (UTC)",
          "value": "{{ event.time }}"
        }
      ]
    }
  ],
  "potentialAction": [
    {
      "@type": "OpenUri",
      "name": "Open in Prefect",
      "targets": [
        {
          "os": "default",
          "uri": "{{ event.payload.flow_run.url }}"
        }
      ]
    }
  ]
}
Notes:
- The example uses templating placeholders typical for Prefect Cloud automations. If you prefer, you can start with static text and add fields incrementally.
- Docs: Prefect Cloud: Create a webhook (general webhook how-to; the Automation UI uses the same principles)

Option B: Self-hosted (no Automations)
Post to Teams from your flow when it fails. A simple, reliable pattern is to catch exceptions at the top of your flow, send the Teams webhook, then re-raise so the run is still marked failed. Example:
Copy code
import os
import httpx
from prefect import flow

TEAMS_WEBHOOK_URL = os.environ["TEAMS_WEBHOOK_URL"]

def send_teams_card(title: str, text: str, link: str = ""):
  card = {
    "@type": "MessageCard",
    "@context": "<http://schema.org/extensions>",
    "themeColor": "E81123",
    "summary": title,
    "title": title,
    "sections": [{ "text": text }],
  }
  if link:
    card["potentialAction"] = [{
      "@type": "OpenUri",
      "name": "Open in Prefect",
      "targets": [{ "os": "default", "uri": link }]
    }]
  httpx.post(TEAMS_WEBHOOK_URL, json=card, timeout=10).raise_for_status()

@flow
def my_flow():
  try:
    # your flow logic here
    ...
  except Exception as exc:
    # optionally construct a Prefect URL to the flow run if you store it in context
    send_teams_card(
      title="Prefect flow run failed",
      text=f"my_flow failed with: {exc}"
    )
    raise
- You can enrich the message with run IDs and links if you capture them (e.g., via runtime context or logging); see the sketch below.
- Teams webhook doc: Teams incoming webhook

Which path do you want to take? If you're on Cloud, I can give you the exact event fields to include in the Automation payload. If you're self-hosted, I can help wire up a lightweight helper that posts rich cards (and include a link to the run in your Prefect UI).
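As a small sketch of the "enrich with run IDs" point: recent Prefect 2.x and 3.x expose the current run's metadata via prefect.runtime, which you could drop into the Teams text (check availability in your version):
Copy code
from prefect.runtime import flow_run

# Inside a running flow these resolve to the current run; outside a run they are empty/None
run_id = flow_run.id
run_name = flow_run.name
text = f"my_flow run {run_name} ({run_id}) failed"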
h
@Marvin I am using Prefect on-premises via the Kubernetes Helm charts.
m
thought for 224.2 seconds
Got it: for self-hosted Prefect (OSS) via Helm, the most reliable way is to send a Teams incoming webhook from your flow when it fails. Prefect Cloud Automations aren't available in OSS, so the recommended pattern is a flow state hook.

What to do
1) Create a Teams incoming webhook
   - In the Teams channel, add the "Incoming Webhook" connector and copy the URL.
   - Docs: Teams incoming webhook
2) Make the webhook URL available to your flow-run pods
   - Best practice: store the URL in a K8s Secret and inject it into flow-run pods via your Kubernetes work pool's Base Job Template (a sketch of creating that Secret follows the template snippet below).
   - Add an env var to the container in your base job template:
Copy code
"spec": {
  "template": {
    "spec": {
      "containers": [
        {
          "name": "flow-runner",
          "env": [
            {
              "name": "TEAMS_WEBHOOK_URL",
              "valueFrom": {
                "secretKeyRef": { "name": "teams-webhook", "key": "url" }
              }
            },
            {
              "name": "PREFECT_UI_URL",
              "value": "<https://your-prefect-ui.example.com>"
            }
          ]
        }
      ]
    }
  }
}
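The snippet above assumes a Secret named teams-webhook with a url key already exists in the namespace where flow-run pods are created; one way to create it (placeholder URL):
Copy code
kubectl create secret generic teams-webhook -n <ns> --from-literal=url='<your-teams-webhook-url>'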
3) Add a flow failure hook to post to Teams
   - The hook receives (flow, flow_run, state). Attach it with the flow decorator's on_failure/on_crashed.
Copy code
import os
import httpx
from prefect import flow

def send_teams_card(title: str, text: str, link: str | None = None):
    webhook = os.environ["TEAMS_WEBHOOK_URL"]
    card = {
        "@type": "MessageCard",
        "@context": "<http://schema.org/extensions>",
        "themeColor": "E81123",
        "summary": title,
        "title": title,
        "sections": [{ "text": text }],
    }
    if link:
        card["potentialAction"] = [{
            "@type": "OpenUri",
            "name": "Open in Prefect",
            "targets": [{ "os": "default", "uri": link }]
        }]
    r = httpx.post(webhook, json=card, timeout=10)
    r.raise_for_status()

def notify_teams_on_failure(flow, flow_run, state):
    ui_base = os.environ.get("PREFECT_UI_URL", "").rstrip("/")
    run_url = f"{ui_base}/flow-runs/flow-run/{flow_run.id}" if ui_base else None
    title = f"Flow failed: {flow_run.name}"
    text = (
        f"- Flow: {flow.name}\n"
        f"- Deployment: {getattr(flow_run, 'deployment_id', '')}\n"
        f"- Run ID: {flow_run.id}\n"
        f"- State: {state.name}\n"
        f"- Message: {state.message or ''}"
    )
    send_teams_card(title, text, run_url)

@flow(on_failure=notify_teams_on_failure, on_crashed=notify_teams_on_failure)
def my_flow(...):
    ...
Notes and tips
- You can reuse the same hook function across all flows by importing it and attaching it via the flow decorator.
- If you prefer not to use hooks, you can wrap your flow body in try/except, send the Teams message, then re-raise.
- To reduce noise, consider using retries and only notify after final failure.
- If your Base Job Template doesn't currently inject env vars, update it and apply to your work pool:
Copy code
prefect work-pool update "<your-k8s-pool>" --base-job-template k8s-job-template.json
Want me to draft a full Base Job Template snippet (with both resources and these env vars) tailored to your work pool? If you share your pool name and any existing template content, I’ll put one together.
h
@Marvin how is that different from using "Notifications", and "Microsoft Teams Webhook" in particular, as a setting in the Prefect UI?
m
thought for 101.2 seconds
Short version:
- If you're on Prefect Cloud: the "Notifications" UI with "Microsoft Teams Webhook" is the built-in, no-code way. You define a Teams destination once, then create an Automation that triggers on "Flow run failed" and sends a message. Secrets are stored/managed in Cloud, you can scope by deployments/tags, and you don't have to touch your flow code.
- If you're truly on self-hosted (OSS) via Helm: the Notifications/Automations UI is not available. In OSS you'll need to send the Teams webhook yourself (e.g., via a flow on_failure/on_crashed hook like I shared) or run an external service that listens for run-state changes and posts to Teams.

Key differences
- Cloud Notifications (UI)
  - No code changes; one place to manage.
  - Built-in scoping (by deployment, tags, etc.), templates, centralized secrets.
  - Good for org-wide consistency and auditing.
  - Docs: Create a webhook (Cloud)
  - Cloud vs OSS: Cloud vs OSS
- OSS (self-hosted)
  - Use flow hooks (on_failure/on_crashed) or your own webhook service.
  - You manage the Teams webhook secret (e.g., K8s Secret + env var).
  - Per-flow code, but maximum flexibility and works fully on-prem.
  - Teams webhook doc: Teams incoming webhook

Quick clarification to confirm next steps: In your self-hosted UI, do you actually see a Notifications section where you can add a "Microsoft Teams Webhook" destination? If yes, you might be connected to a Cloud workspace; if not, we'll stick with the OSS flow-hook approach and I can help you standardize it across flows.
h
@Marvin yet it is available in the UI? I have also used this before and it worked with OSS Prefect. Has this changed at some point?
m
thought for 317.5 seconds
You're right to call that out; thanks for the nudge. In Prefect 3.x, Automations and Notifications are available in OSS as well as Cloud. So in your self-hosted (Helm) setup, using the UI "Notification" with "Microsoft Teams Webhook" is valid and supported.

What's the difference vs the flow-hook (code) approach?
- Where it runs: UI Notifications are sent by the Prefect server's automation service; flow hooks send from your flow-run container.
- Network path: With UI Notifications, only the server needs egress to Teams. With hooks, every flow-run pod needs egress to Teams.
- Reliability: Automations fire on state transitions observed by the orchestrator; they don't depend on your flow code running a callback. Hooks won't run if the process dies before the hook executes.
- Management: UI is centralized, no code changes, easy to scope by deployments/tags. Hooks are per-flow code but give you full control.

If you already have Teams as a Notification destination in the UI, the "best" path on OSS is:
- Create/verify a Teams incoming webhook URL in your channel: Teams incoming webhook
- In the Prefect UI, create a Notification destination (Microsoft Teams Webhook) with that URL.
- Create an Automation:
  - Trigger: Flow run enters state Failed (optionally include Crashed)
  - Scope: all deployments or specific ones/tags
  - Action: Send notification -> choose your Teams destination

Useful references
- Blocks (incl. notification blocks like MicrosoftTeamsWebhook): Prefect 3.x Blocks
- Automations (CLI exists in OSS: prefect automation ls/inspect/pause/resume)

If you prefer the code approach for certain projects (e.g., custom cards), you can still use a flow `on_failure`/`on_crashed` hook and post to the Teams webhook URL from the flow-run pod. Want me to sanity-check your current Automation trigger/action setup or provide a templated Teams message payload to include extra run details (links, parameters, etc.)?
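For reference, a minimal sketch of registering and testing the MicrosoftTeamsWebhook block from Python rather than the UI; the block name "teams-alerts" and the URL are placeholders, and availability depends on your Prefect version:
Copy code
from prefect.blocks.notifications import MicrosoftTeamsWebhook

# Store the webhook URL once as a reusable block document
MicrosoftTeamsWebhook(url="<your-teams-webhook-url>").save("teams-alerts", overwrite=True)

# Load it anywhere (flows, hooks, scripts) and send a test message
MicrosoftTeamsWebhook.load("teams-alerts").notify("Test notification from Prefect")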