Pedro Machado

02/25/2021, 4:56 PM
Hi there. A couple of questions about "git repo" storage:
1. Has anyone looked at Azure Repos? Any idea of the level of effort to implement a new storage class for Azure Repos?
2. I don't believe the storage classes are designed to work with multiple files (a file that has the flow plus other "utility" modules). Am I correct? Is it possible to use the existing classes to implement this pattern, or do we necessarily have to create a Python package?
I have a client who is considering Kubernetes orchestration but doesn't like the idea of baking the flow code into the Docker image. They'd like the flow code (and additional utility modules) to be pulled from a repo every time a flow needs to run.

Zanie

02/25/2021, 5:30 PM
Not sure about the difficulty of implementing an AzureRepo storage type but I know we'd be interested in supporting it.
As to your second point, the flow can't be run dynamically because the DAG of the running flow needs to match the DAG of the registered flow. You could dynamically import functions within your tasks, but the tasks/dataflow need to be consistent with the flow metadata in the backend.
You could also use CI to update a base image on each commit to the repo, then use a `DockerRun` referencing the 'latest' version of that base image so the repo is always up to date in your flow's run environment.
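A minimal sketch of that CI-driven pattern, assuming a Prefect 0.14-style `DockerRun` run config; the image name and task are hypothetical:

from prefect import Flow, task
from prefect.run_configs import DockerRun

@task
def say_hello():
    print("hello from the latest image")

with Flow("always-latest-image") as flow:
    say_hello()

# Assumption: CI rebuilds and pushes this tag on every commit, so each
# flow run executes against the current contents of the repo
flow.run_config = DockerRun(image="myregistry/flows-base:latest")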

Pedro Machado

02/25/2021, 5:58 PM
Thanks, Michael. Regarding the second point above, assume we register the flow every time we change it. Is there a way to use the existing storage to manage other Python files that would go along with the flow? They'd like to have a stable base image that changes infrequently, paired with faster-changing flow code + utils.
The use case is to orchestrate some model scoring code that runs in a container or in Azure ML. The flow code won't change very frequently, but the model scoring code might. They want to be able to pull the model scoring code dynamically from a repo and have Prefect run it.
In other words, suppose you have a simple flow that:
1. triggers a dbt run in dbt Cloud
2. runs Python code that pulls data from Snowflake, scores it using a model, and pushes it back to Snowflake
3. runs dbt again
They want to be able to change the code that runs in step 2 independently of the flow code, just like they can update the dbt code in a separate repo and have dbt Cloud fetch it before each run.
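A skeleton of that three-step flow might look like the sketch below; the flow name and both task bodies are hypothetical placeholders, and step 2 is exactly the part that would be pulled dynamically:

from prefect import Flow, task

@task
def run_dbt_cloud_job():
    # Hypothetical placeholder: trigger a dbt Cloud job via its API
    pass

@task
def score_models():
    # Hypothetical placeholder: pull data from Snowflake, score it
    # with the model, and push the results back
    pass

with Flow("dbt-score-dbt") as flow:
    first_run = run_dbt_cloud_job()
    scoring = score_models(upstream_tasks=[first_run])
    run_dbt_cloud_job(upstream_tasks=[scoring])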

Zanie

02/25/2021, 6:07 PM
There's not a way to manage other Python files alongside the flow right now. I'd recommend doing something like:
from prefect import Flow, task

def download_files_from_repo(repo_path, local_path):
    # Here you'd setup your download
    pass

@task
def get_newest_utils():
    # Utils should be installed with `pip install -e /my_utilities` in the base image
    download_files_from_repo("my-repo/src/my_utilities", "/my_utilities")
    # Now we'll import the module dynamically to use the new files
    import my_utilities
    return my_utilities

@task
def do_something(utils):
    print(dir(utils))


with Flow("example") as flow:
    utilities = get_newest_utils()
    do_something(utilities)

flow.run()
Does that make sense?

Pedro Machado

02/25/2021, 6:21 PM
So the image would have the utilities preinstalled in editable mode but then you'd overwrite that path at run time?
I suppose that if it's a single file we could just store it in a dir that is in the PYTHONPATH. Correct?

Zanie

02/25/2021, 6:35 PM
Yep! You could do that as well.
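For completeness, a minimal sketch of that single-file variant; the /opt/flow_utils path and the scoring module name are assumptions, and the download stub mirrors the one in the example above:

import importlib

from prefect import task

def download_files_from_repo(repo_path, local_path):
    # Same stub as in the example above
    pass

@task
def get_newest_scoring_module():
    # Assumption: /opt/flow_utils is already on the PYTHONPATH baked
    # into the base image, and scoring.py is the single utility file
    download_files_from_repo("my-repo/src/scoring.py", "/opt/flow_utils/scoring.py")
    import scoring
    # Reload in case an older copy of the module was imported earlier
    # in this process
    return importlib.reload(scoring)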