Sergio Luceno
10/15/2025, 6:30 PM

import time
from prefect import flow
@flow
def hello():
    time.sleep(10)
    print("hello world")

if __name__ == "__main__":
    hello()
I've created a Docker image to hold this flow. It starts from python:3.11-slim, and the only dependency we install is the prefect library so we can decorate with @flow:
# Use Python 3.11 slim image for smaller size
FROM python:3.11-slim
# Set working directory
WORKDIR /app
# Declare build arg
ARG FLOW_NAME
# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1
# Install Prefect first
RUN pip install --no-cache-dir prefect
# Copy requirements file
COPY src/flows/${FLOW_NAME}/requirements.txt src/flows/${FLOW_NAME}/
# Install the flow's other dependencies
RUN pip install --no-cache-dir -r src/flows/${FLOW_NAME}/requirements.txt
# Copy flow code and related files
COPY src/helpers src/helpers
COPY src/flows/${FLOW_NAME} src/flows/${FLOW_NAME}
# Create non-root user for security
RUN useradd -m -u 1000 prefect && \
    chown -R prefect:prefect /app
# Switch to non-root user
USER prefect
CMD ["python", "-m", "src/flows/${FLOW_NAME}/main.py"]
We have a Kubernetes work pool, and in the deployment we override the job to pass our Docker image.
Running the deployment generates a new pod. As you can see, our current code just executes a time.sleep and prints hello world; nothing fancy, nothing resource-consuming.
Each of those pods takes 160-200 MB of RAM just to do the sleep and the printing.
With Prefect we aim to orchestrate thousands of executions that only call an HTTP endpoint and do nothing more.
Let's say at some point we need to run 100,000 runs, each taking 200 MB of RAM. I would need 20,000,000 MB (~20 TB) of RAM ready, which is not manageable for us...
Is there something I am doing wrong? Or is Prefect perhaps not a solution for our use case?
Marvin
10/15/2025, 6:36 PM

You're not doing anything wrong; with a Kubernetes work pool, every flow run gets its own pod and Python process, and 160-200 MB per run is a normal baseline for that. The fix is to change the execution pattern rather than provisioning memory for 100k simultaneous pods. A few options:

1) Cap concurrent runs with a work pool concurrency limit
This bounds how many pods exist at once, no matter how many runs are queued:
prefect work-pool set-concurrency-limit "YOUR-K8S-POOL" 50
Docs: CLI: work-pool, Kubernetes work pools
2) Run many HTTP calls inside one flow run
Instead of 1 flow run = 1 HTTP call, use a single flow run (or a small number of runs) to fan out many calls concurrently. This keeps memory to a few hundred Mi per run while driving very high throughput.
- Threaded example using a task runner and streaming results:
import httpx
from prefect import flow, task
from prefect.futures import as_completed
from prefect.task_runners import ThreadPoolTaskRunner

@task
def call_endpoint(url: str, timeout: float = 10.0) -> int:
    with httpx.Client(timeout=timeout) as client:
        r = client.get(url)
        return r.status_code

@flow(task_runner=ThreadPoolTaskRunner(max_workers=200))
def bulk_http(urls: list[str]) -> list[int]:
    futures = [call_endpoint.submit(u) for u in urls]
    results = []
    for fut in as_completed(futures):
        results.append(fut.result())
    return results
- Async example (efficient for I/O-bound calls) with concurrency caps:
import asyncio
import httpx
from prefect import flow, task

@task
async def call_endpoint_async(client: httpx.AsyncClient, url: str) -> int:
    r = await client.get(url, timeout=10.0)
    return r.status_code

@flow
async def bulk_http_async(urls: list[str], concurrency: int = 200) -> list[int]:
    sem = asyncio.Semaphore(concurrency)
    limits = httpx.Limits(max_connections=concurrency)
    async with httpx.AsyncClient(limits=limits) as client:
        async def bounded(u):
            async with sem:
                return await call_endpoint_async(client, u)
        return await asyncio.gather(*(bounded(u) for u in urls))
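Either variant takes the whole batch as a single parameter. A hypothetical invocation (the endpoint URL is a placeholder):

if __name__ == "__main__":
    # 10k HTTP calls in one flow run / one pod instead of 10k pods
    urls = [f"https://api.example.com/items/{i}" for i in range(10_000)]
    statuses = bulk_http(urls)  # or: asyncio.run(bulk_http_async(urls))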
3) Apply global concurrency limits to protect external systems
If multiple flows/deployments call the same API, use Prefect’s concurrency limits by tag to enforce a global cap across your entire workspace.
- Tag your task:
@task(tags=["external-api"])
def call_endpoint(...):
    ...
- Create a global concurrency limit:
prefect concurrency-limit create external-api 200
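If you'd rather throttle a specific code section than a whole task tag, newer Prefect versions also provide a concurrency context manager backed by global concurrency limits. A sketch, assuming the limit was created first (e.g. prefect gcl create external-api --limit 200):

import httpx
from prefect import task
from prefect.concurrency.sync import concurrency

@task
def call_endpoint(url: str) -> int:
    # Blocks until one of the "external-api" slots is free
    with concurrency("external-api", occupy=1):
        with httpx.Client(timeout=10.0) as client:
            return client.get(url).status_code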
Docs: Global concurrency limits
4) Consider a Process worker in Kubernetes for short jobs
If your flows are extremely short-lived, running a long-lived Process worker as a single Deployment in your cluster (instead of spinning up a new K8s pod per flow run) can reduce Kubernetes pod churn. Each run still uses a Python process, but you avoid the overhead of creating and destroying pods thousands of times.
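A minimal sketch of such a worker as a Kubernetes Deployment (image, pool name, and API URL are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prefect-process-worker
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prefect-process-worker
  template:
    metadata:
      labels:
        app: prefect-process-worker
    spec:
      containers:
        - name: worker
          # Image must include prefect plus your flow code and deps
          image: your-registry/your-flow-image:latest
          command: ["prefect", "worker", "start", "--pool", "my-process-pool"]
          env:
            - name: PREFECT_API_URL
              value: "https://your-prefect-api/api"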
Docs: Workers and work pools
5) Right-size K8s resources in your job template
Lower the memory/CPU requests in the work pool's base job template so each pod only reserves what it actually uses.
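For instance, in the container section of the pool's base job template (it's JSON, editable in the UI; the numbers below are placeholders to tune against your observed usage):

"resources": {
  "requests": {"cpu": "100m", "memory": "128Mi"},
  "limits": {"memory": "256Mi"}
}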
Marvin
10/15/2025, 6:36 PM

One more note on your Dockerfile: the CMD uses -m with a path containing slashes. Python expects a module path with dots when using -m. If it's working as-is, great; otherwise consider:
CMD ["python", "-m", "src.flows.${FLOW_NAME}.main"]
or
CMD ["python", "src/flows/${FLOW_NAME}/main.py"]
Where I’d start
- Decide the maximum concurrent HTTP calls you actually need (e.g., 200–1000), not how many total you need to run.
- Implement the single-run fan-out pattern (threaded or async).
- Set a work pool concurrency limit so you never launch more pods than your cluster can handle.
- Optionally add a global concurrency limit tag if multiple deployments hit the same API.
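If you go the batching route, a minimal dispatcher sketch (the deployment name and batch size are hypothetical) that chunks the URL list across a handful of bulk_http runs:

from prefect import flow
from prefect.deployments import run_deployment

@flow
def dispatch_batches(urls: list[str], batch_size: int = 10_000):
    # 100k URLs -> 10 flow runs of 10k calls each, instead of 100k pods
    for i in range(0, len(urls), batch_size):
        run_deployment(
            name="bulk-http/k8s",  # hypothetical "flow/deployment" name
            parameters={"urls": urls[i : i + batch_size]},
            timeout=0,  # fire and forget; don't block on completion
        )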
A few verified CLI references
- Set work pool concurrency:
prefect work-pool set-concurrency-limit "my-pool" 50
- Create a concurrency limit on a tag:
prefect concurrency-limit create external-api 200
- Start a worker (if you go the Process worker route):
prefect worker start --pool "my-process-pool"
Questions to help tailor this further
- Do you need all 100k calls to be simultaneous? What’s your target QPS and acceptable end-to-end latency?
- Are you open to batching the calls into a small number of high-throughput flow runs?
- Roughly how big are the responses and how long does each request take?
Links for deeper dives
- Work pools and workers: Concepts: Workers
- Kubernetes work pools and job templates: Kubernetes how-to
- CLI reference for work pools: CLI: work-pool
- Global concurrency limits: Global concurrency limits
- Scaling patterns with mapping/concurrency (background): Prefect blog: mapping and scale
Happy to sketch a concrete example with your target concurrency and endpoints if you can share those details.