# ask-community
r
I find it a bit clunky that we have to either write all our code in one giant file, or rebuild + upload a docker image every time any of the other files change. Is Orion going to have a cleaner way to structure flows in multiple files?
👀 1
upvote 2
c
If you are using KubernetesAgent with a custom job template, you can add a Kubernetes command (i.e. entrypoint) for the job container that pip installs your Git repo. With this setup, every time a new flow run is executed, the KubernetesAgent spawns a new job which first pip installs the latest user-defined code directly from Git. This decouples user code (which changes often and is pulled on every new flow run) from external dependencies (which are baked into the job's image and do not have to be rebuilt after every user file change).
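A rough sketch of how such a custom template can be wired up in Prefect 1.x (the template path and image name here are placeholders, and the template could instead be supplied to the agent itself):
```python
# Minimal sketch, Prefect 1.x assumed: point the flow's run config at the custom
# job template so every flow-run job pip installs the latest code from Git first.
from prefect.run_configs import KubernetesRun

run_config = KubernetesRun(
    job_template_path="job_template.yaml",  # template with the pip-install entrypoint
    image="my-registry/flow-deps:latest",   # image only carries the heavy dependencies
)
```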
🤔 1
Not sure about Orion though! Would like a breakdown for that
e
https://skaffold.dev/ makes rebuilding and updating a lot less painful
a
@Ryan Sattler It could be just a matter of packaging. I assume you use DockerStorage, correct? You could have a look at using script-based storage with either:
• one of the Git storage classes (GitHub, Bitbucket, GitLab, Git)
• one of the cloud storage classes (S3, GCS, Azure)
Then you could pass your container image (containing all dependencies needed by the flow) to your run configuration (e.g. `KubernetesRun`), and Prefect will grab the flow code either from Git or from cloud storage during the flow run. This way, you don't have to rebuild your Docker image when the flow code changes. Does that make sense for your use case? When it comes to Orion, you're right that it will likely be easier, since Orion decouples a flow from its deployment and allows attaching multiple deployments to a flow.
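A minimal sketch of that combination in Prefect 1.x, assuming an S3 bucket and a dependencies-only image (all names below are placeholders):
```python
# Minimal sketch, Prefect 1.x assumed: the flow file is stored as a script in S3,
# while the image passed to KubernetesRun only carries dependencies, so code
# changes never require an image rebuild.
from prefect import Flow, task
from prefect.run_configs import KubernetesRun
from prefect.storage import S3

@task
def say_hello():
    print("hello from a script-based flow")

with Flow(
    "example-flow",
    storage=S3(
        bucket="my-prefect-flows",                  # placeholder bucket
        stored_as_script=True,
        local_script_path="flows/example_flow.py",  # uploaded at registration time
    ),
    run_config=KubernetesRun(image="my-registry/flow-deps:latest"),
) as flow:
    say_hello()
```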
z
We'll also be revisiting user/local dependencies in Orion to streamline this experience. The flow data attached to a deployment is just a blob: it can be a pickle of the flow or an encoded location to pull the flow from. In the future, it could be a tarball of the flow and its dependencies, a description of a virtual environment, a deep-cloudpickle that contains all of the dependencies inline, etc. The design is better suited to handling this use case; we just need to determine the best way to expose it.
upvote 1
r
Thanks Anna, I'm currently using S3 storage so that part is OK; the problem is when local Python files other than the main flow script itself change. Michael, something along those lines sounds good.
👍 1
c
`job_template.yaml` (custom job template for the K8s agent's jobs):
```yaml
apiVersion: batch/v1
kind: Job
spec:
  template:
    spec:
      containers:
      - name: flow
        # Run prepare.sh (below) before handing off to the Prefect flow-run command
        command: ["tini", "-g", "--", "/usr/bin/prepare.sh"]
        env:
          # Repo details come from a ConfigMap, so the image never needs rebuilding
          - name: IS_PIP_PACKAGE
            valueFrom:
              configMapKeyRef:
                name: flow-env-vars
                key: is-pip-package
          - name: REPO_NAME
            valueFrom:
              configMapKeyRef:
                name: flow-env-vars
                key: repo-name
          - name: GIT_REF
            valueFrom:
              configMapKeyRef:
                name: flow-env-vars
                key: git-ref
          - name: GITHUB_ACCESS_TOKEN
            valueFrom:
              secretKeyRef:
                name: github-auth
                key: password
```
With `prepare.sh`:
```bash
#!/bin/bash
set -x

# If the flow code is packaged as a pip package, install the latest version
# straight from Git before the flow run starts
if [ "$IS_PIP_PACKAGE" ]; then
    echo "IS_PIP_PACKAGE environment variable found."
    "$CONDA_DIR/envs/$conda_env/bin/pip" install "git+https://$GITHUB_ACCESS_TOKEN@github.com/$REPO_NAME.git@$GIT_REF"
fi

# Run the original container command (the Prefect flow-run entrypoint)
exec "$@"
```
Hello @Ryan Sattler, if you are using Kubernetes you can consider using the setup above. Package up all your Python and non-Python code with `setuptools`, then set $GIT_REF in the K8s namespace's ConfigMap to "main" or the PR ref that the developer is working on, and use absolute imports for everything (in development you can install your user-defined package with `pip install -e .`).
This seems to work really well alongside a CI/CD pipeline with automated K8s agent deployments. A new agent is deployed to a separate namespace every time a new PR is created (with the $GIT_REF value in the ConfigMap changed to "feat/pr-16-branch", for example). I'm using Helm to manage these ConfigMap deployments. Moreover, your "prod" agent with GIT_REF=main will always pull the latest changes packaged with `setuptools` every time it spins up a new job to execute a flow run. These changes can be Python modules used in your flow or even non-Python files (https://setuptools.pypa.io/en/latest/userguide/datafiles.html). Hope this clarifies my previous reply!
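For anyone following along, a minimal sketch of the kind of `setup.py` this implies (the package name, version, and requirements are placeholders):
```python
# Minimal sketch of the setuptools packaging described above; everything here is
# a placeholder for your own repo layout.
from setuptools import setup, find_packages

setup(
    name="my_flows",              # installed by prepare.sh via pip install git+https://...@$GIT_REF
    version="0.1.0",
    packages=find_packages(),     # picks up my_flows/ and its submodules (use absolute imports)
    include_package_data=True,    # also ship non-Python files declared in MANIFEST.in
    install_requires=[],          # heavy dependencies stay baked into the Docker image
)
```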
r
Thanks - for now we are just installing our code (which is configured as a Python package with a setup.py) at runtime by including the internal GitHub URL (in pip format) in the `EXTRA_PIP_PACKAGES` env var in KubernetesRun, which seems to work.
This does require the code to be committed to Git, but at least it avoids rebuilding the Docker image.
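Roughly what that looks like on the run config (Prefect 1.x; the image name and repository URL are placeholders for an internal GitHub):
```python
# Rough sketch: Prefect's images pip install whatever is listed in EXTRA_PIP_PACKAGES
# when the container starts, so the flow package is pulled fresh from Git on every run.
from prefect.run_configs import KubernetesRun

run_config = KubernetesRun(
    image="my-registry/flow-deps:latest",
    env={
        "EXTRA_PIP_PACKAGES": "git+https://github.internal.example.com/my-org/my-flows.git@main",
    },
)
```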
s
^ I was considering exactly this @Ryan Sattler! Do you require that the code be merged to your default branch when you do this, or can you install the package from a particular branch in the source repo w/ a corresponding GitHub URL?
z
The $GIT_REF can be a commit or branch
r
@Sean Talia You can install from a branch; see the first answer here for the URL syntax: https://stackoverflow.com/questions/20101834/pip-install-from-git-repo-branch