# prefect-community
a
how does Prefect 2.0 deal with dependencies? Does the agent need all the deps that the flow requires? Also, what if the flow file depends on local files? I.e., here is my flow directory:
(venv) ➜  src git:(main) ✗ ls -la flow                                 
drwxr-xr-x  6 avysotsky  staff   192 May  9 08:47 .
drwxr-xr-x  5 avysotsky  staff   160 May  9 08:34 ..
-rw-r--r--  1 avysotsky  staff  3301 May  9 08:47 flow.py
-rw-r--r--  1 avysotsky  staff   870 May  8 10:11 graphql.py
-rw-r--r--  1 avysotsky  staff   611 May  9 08:47 prefect_client.py
And here is how I create the deployment:
from prefect.deployments import DeploymentSpec
from prefect.orion.schemas.schedules import CronSchedule

d = DeploymentSpec(
    flow_location="./flow/flow.py",
    name=name,
    schedule=CronSchedule(cron=schedule),
    tags=[
        f"user_id:{user_id}",
        f"job_id:{job_id}",
    ],
    # parameter keys must be strings matching the flow's argument names
    parameters={
        "user_id": user_id,
        "job_id": job_id,
    },
)
Is Prefect smart enough to pull in all the deps that flow.py needs?
a
Does the agent need all the deps that the flow requires?
not at all! There is a separation of concerns now: the flow_runner is responsible for deploying the relevant infrastructure, such as a Docker container or a Kubernetes job. The agent is only responsible for picking up scheduled runs from the work queue; the flow_runner takes care of the entire infrastructure work.
Is Prefect smart enough to pull in all the deps that flow.py needs?
it depends on the flow_runner you choose - you may create a virtual environment with conda and point your flow runner at the relevant environment that has all the dependencies - see the sketch below
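A minimal sketch of that approach, assuming the Prefect 2.0 beta's SubprocessFlowRunner and its condaenv argument (the deployment name and the environment name "etl" are hypothetical):

from prefect.deployments import DeploymentSpec
from prefect.flow_runners import SubprocessFlowRunner

DeploymentSpec(
    flow_location="./flow/flow.py",
    name="conda-example",
    # run the flow inside a pre-built conda environment that already
    # contains every dependency flow.py imports
    flow_runner=SubprocessFlowRunner(condaenv="etl"),
)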
a
So, let me rephrase it. Say my flow.py has a required dependency file graphql.py, and I point DeploymentSpec to flow.py. In this case, DeploymentSpec is smart enough to package graphql.py and then the agent is smart enough to use that dependency?
I think I actually don’t understand how exactly Prefect sends a given flow to an agent
a
maybe you can check the relevant flow runner code? I think the virtual environment sounds like a good approach to test out in your use case - if graphql.py is available in that conda environment, you should be good to go, but the easiest would be to give it a try and see which flow runner works best for you. My main point was that the flow runner is the right place for you to explore dependency management
a
I see
I have to explicitly specify the runner
a
is smart enough to package graphql.py
by default, Prefect doesn't package any code for you; you would need to do it yourself, e.g. as part of CI/CD, and point at it in your flow runner
a
Do you have an example?
a
Orion docs have the best examples so far, but it depends on the flow runner type. E.g., with the Docker flow runner, you need to build an image and point the runner at this image - see the sketch below
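For instance - a sketch, not official docs: the image name is hypothetical, and the image is assumed to have been built and pushed beforehand (e.g. via docker build / docker push) with graphql.py and all pip dependencies baked in:

from prefect.flow_runners import DockerFlowRunner

# point the runner at a pre-built image that contains the flow's dependencies
flow_runner = DockerFlowRunner(image="my-registry/my-flow-deps:latest")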
a
I can’t find anywhere in the docs where it says that I have to build a docker image https://orion-docs.prefect.io/tutorials/docker-flow-runner/
a
well, you don't always have to do that - if you don't require any dependencies other than Prefect, then just doing:

flow_runner=DockerFlowRunner()

will be enough. To be fair, though, your point is totally valid, and we are working on an easier way of packaging code dependencies - it will get easier in the future 🤞 LMK if you need more help building a Docker image, I can try to build an example
a
So, let me clarify why I got confused. Since the storage configuration is a required step to run a Prefect flow, I assumed that packaging is a solved problem. Why else do you need the storage then? I.e., what is the point of creating a Docker image AND storing the flow file on a blob store? Why not just package the entire flow into a Docker image?
a
Why else do you need the storage then?
mainly due to the hybrid execution model, to respect your privacy; think of Storage as a map to where your flow is located - it can point to an object in S3, to a local file in Local storage, or to a flow file on GitHub (not available yet, on the roadmap)
Storage = flow code
Flow runner = infrastructure and code dependencies
what is the point of creating a Docker image AND storing the flow file on a blob store?
I can understand the confusion, but think of a use case where your flow code may change very frequently while your code dependencies don't - you may always rely on the same Snowflake, Pandas, scikit-learn, and dbt package versions, but your definition of the data flow (your data transformations, ML models, etc.) may change frequently as your business use case evolves. It also allows you to reuse the same image across multiple flows, which is often required by teams who don't wish to maintain one image per flow - per-flow images can get "heavy" and storage-intensive. See the sketch below.
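To make the split concrete, here is a sketch (all names are hypothetical) of one shared dependency image reused by several deployments, while Storage supplies each flow's code:

from prefect.deployments import DeploymentSpec
from prefect.flow_runners import DockerFlowRunner

# one dependency image shared by many flows; only the flow *code*
# (resolved via Storage) differs between the deployments
shared_runner = DockerFlowRunner(image="my-registry/team-deps:2022.05")

DeploymentSpec(
    flow_location="./flows/ingest.py",
    name="ingest",
    flow_runner=shared_runner,
)
DeploymentSpec(
    flow_location="./flows/transform.py",
    name="transform",
    flow_runner=shared_runner,
)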