# ask-marvin
j
@Marvin I moved from Docker to process workers. Now I have concurrency issues with the pull steps of my deployment: the git_clone and pip_install_requirements steps collide. I also can't wrap my head around cloning the same code over and over for every flow run, even though the code might not change for days. ...
m
✅ thought for 457.2 seconds
j
no idea?
m
You’re not imagining it. With process workers, every flow run executes the pull steps on the host, and if those runs share a working directory you’ll get lock/contention failures in git_clone and pip_install_requirements. And yes: out of the box, you’ll clone and install on every run unless you change the shape of the deployment.

Here are practical options, from quickest fix to most efficient:

1) Isolate the working directory per run (fixes the concurrency errors immediately)
- Give each run its own directory so git and pip don’t step on each other.
- Use set_working_directory before git_clone. Step templating supports runtime variables, so you can use the flow run id.

Example (prefect.yaml pull steps):
```yaml
pull:
  # set_working_directory does not create the directory, so create it first
  - prefect.deployments.steps.utility.run_shell_script:
      script: mkdir -p "/srv/prefect/runs/{{ flow_run.id }}"
  - prefect.deployments.steps.pull.set_working_directory:
      directory: "/srv/prefect/runs/{{ flow_run.id }}"
  - prefect.deployments.steps.pull.git_clone:
      id: clone-step
      repository: "https://github.com/org/repo.git"
      branch: "main"
  - prefect.deployments.steps.utility.pip_install_requirements:
      directory: "{{ clone-step.directory }}"
      requirements_file: "requirements.txt"
```

Notes:
- This removes the contention, but you’ll still clone/install per run and will need periodic cleanup to avoid disk bloat (see the cleanup sketch after option 3).
- git_clone clones into the current working directory; `directories` is for sparse checkouts (only certain paths), not the destination.

Docs:
- git_clone (params): https://docs-3.prefect.io/v3/api-ref/python/prefect-deployments-steps-pull
- pip_install_requirements: https://docs-3.prefect.io/v3/api-ref/python/prefect-deployments-steps-utility

2) Stop installing on every run (prebuilt virtualenv)
- Create a virtualenv once on the host and point your process worker at it, then remove pip_install_requirements from your deployment steps.
- Quick approach: start the worker from that venv so runs inherit its Python and site-packages:
```bash
# build the env once on the worker host
python -m venv /opt/prefect/envs/myflow
/opt/prefect/envs/myflow/bin/pip install -r requirements.txt
# start the worker from that env so flow runs inherit it
/opt/prefect/envs/myflow/bin/prefect worker start -p my-pool
```
- More formal approach: set the Python executable/command in your Process work pool’s job template (UI → Work Pools → your pool → Advanced). Have it run your venv’s python so runs inherit the environment.

Docs:
- Manage work pools and job templates: https://docs-3.prefect.io/v3/how-to-guides/deployment_infra/manage-work-pools

3) Avoid re-cloning the full repo (shared cache + per-run worktrees)
- Keep a bare mirror of the repo as a shared cache, update it under a lock, then create a per-run worktree. This avoids downloading the entire repo repeatedly while still giving each run its own working directory.

Example using the run_shell_script step:
```yaml
pull:
  - prefect.deployments.steps.utility.run_shell_script:
      script: |
        set -euo pipefail
        BASE="/srv/prefect/cache/repo.git"
        DEST="/srv/prefect/runs/{{ flow_run.id }}"
        REPO_URL="https://github.com/org/repo.git"
        # check out a fixed ref; pin a specific commit SHA here if you deploy one
        COMMIT="origin/main"
        mkdir -p "$(dirname "$BASE")" "$(dirname "$DEST")"
        if [ ! -d "$BASE" ]; then
          git clone --mirror "$REPO_URL" "$BASE"
        fi
        cd "$BASE"
        # serialize cache updates so concurrent runs don't fight over the mirror
        exec 9> .fetch.lock
        flock 9
        git fetch --prune --tags origin
        flock -u 9
        # create the per-run checkout
        git worktree add "$DEST" "$COMMIT" || {
          git worktree remove -ff "$DEST"
          git worktree add "$DEST" "$COMMIT"
        }
  # the worktree now exists, so it's safe to chdir into it
  - prefect.deployments.steps.pull.set_working_directory:
      directory: "/srv/prefect/runs/{{ flow_run.id }}"
```
Then skip the git_clone step entirely; pip_install_requirements is also optional if you go with the venv from (2).
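One housekeeping note on (1) and (3): the per-run directories accumulate, so schedule something like the following from cron. The paths match the examples above, and the two-day retention is an arbitrary choice:
```bash
#!/usr/bin/env bash
set -euo pipefail

# prune per-run checkout directories older than two days
find /srv/prefect/runs -mindepth 1 -maxdepth 1 -type d -mtime +2 -exec rm -rf {} +

# with option 3, also drop the mirror's stale worktree registrations
if [ -d /srv/prefect/cache/repo.git ]; then
  git -C /srv/prefect/cache/repo.git worktree prune
fi
```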
4) If you can live with “clone once per deployment” (not per run)
- Pick a deployment-specific path (e.g., /srv/prefect/deployments/my-deployment) and guard the update step with a lock so concurrent runs don’t fight; all runs then read from the same tree (see the sketch below).
- Caveat: if you check out different commits per run, this won’t work; otherwise it’s very efficient.
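A minimal run_shell_script sketch of that guarded update. The path, the main branch, and the lock-file location are all illustrative, not a verified recipe:
```yaml
pull:
  - prefect.deployments.steps.utility.run_shell_script:
      script: |
        set -euo pipefail
        DIR="/srv/prefect/deployments/my-deployment"
        REPO_URL="https://github.com/org/repo.git"
        mkdir -p "$(dirname "$DIR")"
        # one lock covers both the initial clone and later updates
        exec 9> "${DIR}.lock"
        flock 9
        if [ ! -d "$DIR/.git" ]; then
          git clone "$REPO_URL" "$DIR"
        else
          git -C "$DIR" fetch --prune origin
          git -C "$DIR" reset --hard origin/main
        fi
        flock -u 9
  - prefect.deployments.steps.pull.set_working_directory:
      directory: "/srv/prefect/deployments/my-deployment"
```
Runs that arrive while another run is updating simply wait on the lock instead of corrupting the shared tree.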
5) Throttle while you transition
- Set concurrency limits on the work pool or work queue to keep a lid on simultaneous pull steps until you’ve implemented one of the above. CLI (verified):
```bash
# cap concurrent runs for a work pool
prefect work-pool set-concurrency-limit "my-pool" 10
# cap concurrent runs for a work queue
prefect work-queue set-concurrency-limit "my-queue" 5
```

Docs:
- Work pools: https://docs-3.prefect.io/v3/concepts/work-pools
- Global concurrency limits: https://docs-3.prefect.io/v3/how-to-guides/workflows/global-concurrency-limits

A few clarifications that often help:
- git_clone destination: it clones into the current working directory; to control the path, use set_working_directory first.
- `directories` in git_clone enables sparse checkout of subpaths, which can speed things up (less to fetch) but doesn’t change where it clones.
- If you pin a commit_sha in the deployment, make sure your checkout strategy (worktree or shared deployment dir) is compatible with it.

Questions to tailor the solution:
- Are your process workers running on a single host with a shared filesystem, or on multiple hosts?
- Roughly how large is your repo, and how often do dependencies change?
- Do you prefer to manage a long-lived venv per deployment/project on the worker host(s), or would you rather script the git caching approach?

If you want quick relief: do (1) to stop the lock errors, then implement (2) to eliminate the pip installs, and optionally (3) to stop the full re-clones; a minimal (1)+(2) combination follows below. Your future self will thank you, even if I won’t.
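For reference, a minimal sketch of that (1)+(2) combination, reusing the placeholder repo URL and paths from the examples above and assuming the worker was started from the prebuilt venv:
```yaml
pull:
  # option 1: per-run directory, so concurrent runs never collide
  - prefect.deployments.steps.utility.run_shell_script:
      script: mkdir -p "/srv/prefect/runs/{{ flow_run.id }}"
  - prefect.deployments.steps.pull.set_working_directory:
      directory: "/srv/prefect/runs/{{ flow_run.id }}"
  - prefect.deployments.steps.pull.git_clone:
      repository: "https://github.com/org/repo.git"
      branch: "main"
  # option 2: no pip_install_requirements step; the worker's venv supplies dependencies
```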