# ask-marvin
Hey @Marvin I am running Prefect in a self-hosted environment in a k8s cluster. I've just built a simple deployment that executes this flow as an example:
```python
import time
from prefect import flow

@flow
def hello():
    time.sleep(10)
    print("hello world")

if __name__ == "__main__":
    hello()
```
I've created a Docker image to hold this flow. It starts from python:3.11-slim, and the only dependency we install is the prefect library so we can decorate with @flow:
```dockerfile
# Use Python 3.11 slim image for smaller size
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Declare build arg
ARG FLOW_NAME

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1 \
	PYTHONUNBUFFERED=1 \
	PIP_NO_CACHE_DIR=1 \
	PIP_DISABLE_PIP_VERSION_CHECK=1

# Install Prefect first
RUN pip install --no-cache-dir prefect

# Copy requirements file
COPY src/flows/${FLOW_NAME}/requirements.txt src/flows/${FLOW_NAME}/

# Install Prefect and other dependencies
RUN pip install --no-cache-dir -r src/flows/${FLOW_NAME}/requirements.txt

# Copy flow code and related files
COPY src/helpers src/helpers
COPY src/flows/${FLOW_NAME} src/flows/${FLOW_NAME}

# Create non-root user for security
RUN useradd -m -u 1000 prefect && \
	chown -R prefect:prefect /app

# Switch to non-root user
USER prefect

CMD ["python", "-m", "src/flows/${FLOW_NAME}/main.py"]
We have a Kubernetes work pool, and in the deployment we override the job to pass our Docker image. Running the deployment generates a new pod. As you can see, our current code just executes a time.sleep and prints hello world; nothing fancy, nothing resource-consuming. Yet each of those pods takes 160-200MB of RAM to do the sleep and printing. With Prefect we aim to orchestrate thousands of executions that only call an HTTP endpoint and do nothing more. Let's say at some point we need to run 100,000 runs at 200MB each; we would need 20,000,000MB (~20TB) of RAM ready, which is not manageable for us. Is there something I am doing wrong? Or is Prefect perhaps not a solution for our use case?
thought for 362.2 seconds
Great question, and thanks for the detailed context.

Short answer
- What you're seeing (160–200Mi per flow-run pod) is normal for Python + Prefect in Kubernetes. Each run is a full Python interpreter plus the Prefect engine and deps.
- Prefect is a good fit for your use case, but you won't want 100k pods at once. The trick is to:
  - limit concurrent runs at the work pool level, and/or
  - model your workload to run many lightweight HTTP calls inside a single (or a few) flow runs using concurrency.

Why each pod uses ~200Mi
A K8s work pool launches a fresh pod per flow run. That pod has a Python process, the Prefect engine, an HTTP client, logging, etc. A ~160–200Mi RSS baseline is typical for that stack on python:3.11-slim.

How to scale this kind of workload efficiently

1) Limit concurrent pods (work pool concurrency)
Only allow as many simultaneous runs as your cluster can handle. Prefect will queue the rest and execute them as capacity frees up.
- Set a work pool concurrency limit:
```bash
prefect work-pool set-concurrency-limit "YOUR-K8S-POOL" 50
```
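If you prefer to set this from Python, a rough equivalent using the orchestration client is sketched below (assuming a Prefect 3 client; exact method and schema names may vary by version):
```python
import asyncio

from prefect import get_client
from prefect.client.schemas.actions import WorkPoolUpdate

async def cap_pool(pool_name: str, limit: int) -> None:
    # Same effect as the CLI call above: cap concurrent runs for the pool.
    async with get_client() as client:
        await client.update_work_pool(
            work_pool_name=pool_name,
            work_pool=WorkPoolUpdate(concurrency_limit=limit),
        )

asyncio.run(cap_pool("YOUR-K8S-POOL", 50))
```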
Docs: CLI: work-pool, Kubernetes work pools

2) Run many HTTP calls inside one flow run
Instead of 1 flow run = 1 HTTP call, use a single flow run (or a small number of runs) to fan out many calls concurrently. This keeps memory to a few hundred Mi per run while driving very high throughput.
- Threaded example using a task runner and streaming results:
```python
import httpx
from prefect import flow, task
from prefect.futures import as_completed
from prefect.task_runners import ThreadPoolTaskRunner

@task
def call_endpoint(url: str, timeout: float = 10.0) -> int:
    with httpx.Client(timeout=timeout) as client:
        r = client.get(url)
        return r.status_code

@flow(task_runner=ThreadPoolTaskRunner(max_workers=200))
def bulk_http(urls: list[str]) -> list[int]:
    futures = [call_endpoint.submit(u) for u in urls]
    results = []
    for fut in as_completed(futures):
        results.append(fut.result())
    return results
```
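A hypothetical driver for the 100k case could then batch the URL list and invoke this flow once per batch, so you run a handful of flow-run pods instead of one pod per call (the endpoint and batch size below are illustrative):
```python
# Illustrative only: fan 100,000 calls out across 10 flow runs of 10,000 calls each.
urls = [f"https://api.example.com/item/{i}" for i in range(100_000)]  # hypothetical endpoint

BATCH_SIZE = 10_000
for start in range(0, len(urls), BATCH_SIZE):
    # Each call below is a single flow run that performs 10,000 HTTP requests internally.
    bulk_http(urls[start : start + BATCH_SIZE])
```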
- Async example (efficient for I/O-bound calls) with concurrency caps:
```python
import asyncio
import httpx
from prefect import flow, task

@task
async def call_endpoint_async(client: httpx.AsyncClient, url: str) -> int:
    r = await client.get(url, timeout=10.0)
    return r.status_code

@flow
async def bulk_http_async(urls: list[str], concurrency: int = 200) -> list[int]:
    sem = asyncio.Semaphore(concurrency)
    limits = httpx.Limits(max_connections=concurrency)
    async with httpx.AsyncClient(limits=limits) as client:
        async def bounded(u):
            async with sem:
                return await call_endpoint_async(client, u)
        return await asyncio.gather(*(bounded(u) for u in urls))
```
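As a minimal sketch of invoking the async variant locally (the URL list is again illustrative; in production this would run from a deployment):
```python
import asyncio

if __name__ == "__main__":
    urls = [f"https://api.example.com/item/{i}" for i in range(1_000)]  # hypothetical endpoint
    statuses = asyncio.run(bulk_http_async(urls, concurrency=200))
    print(f"{sum(s == 200 for s in statuses)} of {len(statuses)} calls returned 200")
```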
3) Apply global concurrency limits to protect external systems
If multiple flows/deployments call the same API, use Prefect's tag-based concurrency limits to enforce a global cap across your entire workspace.
- Tag your task:
```python
@task(tags=["external-api"])
def call_endpoint(url: str, timeout: float = 10.0) -> int:
    ...  # same body as above; only the added tag matters here
```
- Create a concurrency limit for that tag:
```bash
prefect concurrency-limit create external-api 200
```
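Note that the CLI command above creates a tag-based task concurrency limit. Prefect also has a separate global concurrency limits feature with a `concurrency` context manager; a sketch of that variant, assuming a limit named external-api created as a global concurrency limit (e.g. `prefect gcl create external-api --limit 200` on recent versions):
```python
import httpx
from prefect import task
from prefect.concurrency.sync import concurrency

@task
def call_endpoint(url: str, timeout: float = 10.0) -> int:
    # Block until a slot in the "external-api" global concurrency limit is free,
    # then hold it for the duration of the request.
    with concurrency("external-api", occupy=1):
        with httpx.Client(timeout=timeout) as client:
            return client.get(url).status_code
```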
Docs: Global concurrency limits

4) Consider a Process worker in Kubernetes for short jobs
If your flows are extremely short-lived, running a long-lived Process worker as a single Deployment in your cluster (instead of spinning up a new K8s pod per flow run) can reduce pod churn. Each run still uses a Python process, but you avoid the overhead of creating and destroying pods thousands of times. Docs: Workers and work pools
5) Right-size K8s resources in your job template
For Kubernetes work pools, set your pod requests/limits appropriately in the base job template (e.g., request 100–200Mi, limit 256–512Mi) and validate via load testing. Prefect's K8s guide shows where these fields live in the job template. Docs: Kubernetes work pools

Notes on your Dockerfile/CMD
- Your CMD uses `-m` with a path containing slashes, but Python expects a dotted module path with `-m`. If it's working as-is, great; otherwise consider:
```dockerfile
CMD ["python", "-m", "src.flows.${FLOW_NAME}.main"]
or
```dockerfile
CMD ["python", "src/flows/${FLOW_NAME}/main.py"]
One caveat on both variants: Docker does not expand ${FLOW_NAME} in an exec-form CMD, and a build ARG is not available at runtime anyway, so either hard-code the resolved path at build time or persist FLOW_NAME as an ENV and use the shell form.

Where I'd start
- Decide the maximum concurrent HTTP calls you actually need (e.g., 200–1000), not how many total you need to run.
- Implement the single-run fan-out pattern (threaded or async).
- Set a work pool concurrency limit so you never launch more pods than your cluster can handle.
- Optionally add a global concurrency limit tag if multiple deployments hit the same API.

A few verified CLI references
- Set work pool concurrency:
```bash
prefect work-pool set-concurrency-limit "my-pool" 50
```
- Create a concurrency limit on a tag:
```bash
prefect concurrency-limit create external-api 200
```
- Start a worker (if you go the Process worker route):
```bash
prefect worker start --pool "my-process-pool"
```
Questions to help tailor this further
- Do you need all 100k calls to be simultaneous? What's your target QPS and acceptable end-to-end latency?
- Are you open to batching the calls into a small number of high-throughput flow runs?
- Roughly how big are the responses, and how long does each request take?

Links for deeper dives
- Work pools and workers: Concepts: Workers
- Kubernetes work pools and job templates: Kubernetes how-to
- CLI reference for work pools: CLI: work-pool
- Global concurrency limits: Global concurrency limits
- Scaling patterns with mapping/concurrency (background): Prefect blog: mapping and scale

Happy to sketch a concrete example with your target concurrency and endpoints if you can share those details.