    Pedro Machado

    1 year ago
    Hi there. A couple of questions about "git repo" storage:
    1. Has anyone looked at Azure Repos? Any idea of the level of effort to implement a new storage class for Azure Repos?
    2. I don't believe the storage classes are designed to work with multiple files (a file that has the flow plus other "utility" modules). Am I correct? Is it possible to use the existing classes to implement this pattern, or do we necessarily have to create a Python package?
    I have a client who is considering Kubernetes orchestration but doesn't like the idea of baking the flow code into the Docker image. They'd like the flow code (and additional utility modules) to be pulled from a repo every time a flow needs to run.
    Michael Adkins

    1 year ago
    Not sure about the difficulty of implementing an AzureRepo storage type but I know we'd be interested in supporting it.
    As to your second point, the flow can't be restructured dynamically at runtime because the DAG of the running flow needs to match the DAG of the registered flow. You could dynamically import functions within your tasks, but the tasks/dataflow need to be consistent with the flow metadata in the backend.
    You could also use CI to update a base image on each commit to the repo, then use a `DockerRun` referencing the `latest` tag of that base image so your repo is always up to date in your flow's run environment.
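A minimal sketch of the "dynamically import functions within your tasks" idea, using only the standard library. The module name `my_utilities` and the temp directory are hypothetical stand-ins for code pulled from a repo:

```python
# Minimal sketch: the task graph stays fixed, but the code a task calls
# is imported at run time. "my_utilities" and the temp directory are
# hypothetical stand-ins for code pulled from a repo.
import importlib
import pathlib
import sys
import tempfile

# Simulate a freshly downloaded utilities file.
download_dir = pathlib.Path(tempfile.mkdtemp())
(download_dir / "my_utilities.py").write_text("def score(x):\n    return x * 2\n")

sys.path.insert(0, str(download_dir))   # make the download location importable
my_utilities = importlib.import_module("my_utilities")

print(my_utilities.score(21))           # -> 42
```

The key point is that only the call *inside* the task resolves at run time; the flow's DAG (the tasks and their edges) is unchanged between registration and execution.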
    Pedro Machado

    1 year ago
    Thanks, Michael. Regarding the second point above, assume we register the flow every time we change it. Is there a way to use the existing storage to manage other python files that would go along with the flow? They'd like to have a stable base image that changes infrequently paired with faster changing flow code + util.
    The use case is to orchestrate some model scoring code that runs on a container or in Azure ML. The flow code won't change very frequently but the model scoring code might. They wanted to be able to pull the model scoring code dynamically from a repo and have prefect run it.
    In other words, suppose you have a simple flow that:
    1. triggers a `dbt` run in dbt Cloud
    2. runs Python code that pulls data from Snowflake, scores it using a model, and pushes it back to Snowflake
    3. runs `dbt` again
    They want to be able to change the code that is run in step 2 independently of the flow code, just like they can update the dbt code in a separate repo and have dbt Cloud fetch it before each run.
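As a framework-agnostic sketch of that three-step pattern (all names here are hypothetical placeholders, not Prefect or dbt APIs): the orchestration is fixed, but the scoring code used in step 2 is looked up at run time, so it can change independently.

```python
# Framework-agnostic sketch: steps 1-3 are fixed, but step 2 resolves its
# scoring code by module name at run time. All names are hypothetical.
import importlib
import sys
import types

def trigger_dbt_run():             # step 1: would call the dbt Cloud API
    return "dbt run 1 done"

def run_scoring(module_name):      # step 2: resolve scoring code at run time
    scoring = importlib.import_module(module_name)
    return scoring.score([1, 2, 3])

def trigger_dbt_again():           # step 3
    return "dbt run 2 done"

# A stand-in scoring module; in practice this would be pulled from a repo.
stub = types.ModuleType("model_scoring")
stub.score = lambda rows: [r * 10 for r in rows]
sys.modules["model_scoring"] = stub

print(trigger_dbt_run())
print(run_scoring("model_scoring"))    # -> [10, 20, 30]
print(trigger_dbt_again())
```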
    Michael Adkins

    1 year ago
    There's not a way to manage other python files alongside the flow right now. I'd recommend doing something like:
    from prefect import Flow, task
    
    def download_files_from_repo(repo_path, local_path):
        # Here you'd set up your download (e.g. a git clone/pull into local_path)
        pass
    
    @task
    def get_newest_utils():
        # Utils should be `pip install -e /my_utilities` in the base image
        download_files_from_repo("my-repo/src/my_utilities", "/my_utilities")
        # Now we'll import the module dynamically to use the new files
        import my_utilities
        return my_utilities
    
    @task
    def do_something(utils):
        print(dir(utils))
    
    
    with Flow("example") as flow:
        utilities = get_newest_utils()
        do_something(utilities)
    
    flow.run()
    Does that make sense?
    Pedro Machado

    1 year ago
    So the image would have the utilities preinstalled in editable mode but then you'd overwrite that path at run time?
    I suppose that if it's a single file we could just store it in a dir that is on the `PYTHONPATH`. Correct?
    Michael Adkins

    1 year ago
    Yep! You could do that as well.
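A sketch of that single-file variant, assuming a directory that is already on `sys.path` (i.e. on the `PYTHONPATH`); file and directory names are hypothetical. A `reload` picks up a newly downloaded copy of the file:

```python
# Sketch of the single-file variant: the file lives in a directory on
# sys.path, and importlib.reload picks up a newly downloaded copy.
# File and directory names are hypothetical.
import importlib
import pathlib
import sys
import tempfile

sys.dont_write_bytecode = True                # always re-read source on reload

utils_dir = pathlib.Path(tempfile.mkdtemp())  # stands in for a dir on PYTHONPATH
sys.path.insert(0, str(utils_dir))

(utils_dir / "scoring_utils.py").write_text("VERSION = 'v1'\n")
import scoring_utils

# A new copy is "downloaded" over the same file at run time...
(utils_dir / "scoring_utils.py").write_text("VERSION = 'v2'\n")
importlib.reload(scoring_utils)               # ...and reload picks it up
print(scoring_utils.VERSION)                  # -> v2
```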