Artem Vysotsky
05/09/2022, 12:51 PM
flow directory:
(venv) ➜ src git:(main) ✗ ls -la flow
drwxr-xr-x 6 avysotsky staff 192 May 9 08:47 .
drwxr-xr-x 5 avysotsky staff 160 May 9 08:34 ..
-rw-r--r-- 1 avysotsky staff 3301 May 9 08:47 flow.py
-rw-r--r-- 1 avysotsky staff 870 May 8 10:11 graphql.py
-rw-r--r-- 1 avysotsky staff 611 May 9 08:47 prefect_client.py
And here is how I create the deployment:
from prefect.deployments import DeploymentSpec
from prefect.orion.schemas.schedules import CronSchedule

d = DeploymentSpec(
    flow_location="./flow/flow.py",
    name=name,
    schedule=CronSchedule(cron=schedule),
    tags=[
        f"user_id:{user_id}",
        f"job_id:{job_id}",
    ],
    # parameter keys are strings matching the flow function's argument names
    parameters={
        "user_id": user_id,
        "job_id": job_id,
    },
)
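For reference, the deployment's parameters map onto the flow function's arguments. A minimal flow/flow.py consistent with this spec might look like the sketch below; the real file wasn't shared, so the flow name and the run_query helper are illustrative:

from prefect import flow

# `graphql` here is the sibling file flow/graphql.py from the listing above;
# whatever environment eventually runs the flow must be able to import it.
from graphql import run_query  # run_query is a hypothetical helper

@flow
def scheduled_job(user_id: str, job_id: str):
    # The deployment's `parameters` dict supplies these arguments at run time.
    return run_query(user_id=user_id, job_id=job_id)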
Is Prefect smart enough to pull in all the deps that flow.py needs?

Anna Geller
05/09/2022, 1:47 PM
> Does agent need all deps that flow requires?
Not at all! There is a separation of concerns now: the flow_runner is responsible for deploying the relevant infrastructure, such as a Docker container or a Kubernetes job, while the agent is only responsible for picking up scheduled runs from the work queue. The flow_runner takes care of the entire infrastructure work.
> Is Prefect smart enough to pull in all the deps that flow.py needs?
That depends on the flow_runner you choose - you may create a virtual environment with conda and point your flow runner at the relevant environment that has all the dependencies - see
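As a sketch of that approach, assuming the Prefect 2.0b SubprocessFlowRunner and its condaenv option (the environment name my-flow-env is illustrative):

from prefect.deployments import DeploymentSpec
from prefect.flow_runners import SubprocessFlowRunner

DeploymentSpec(
    flow_location="./flow/flow.py",
    name="conda-example",
    # Runs the flow in a subprocess inside an existing conda environment
    # that you created up front with all the dependencies installed.
    flow_runner=SubprocessFlowRunner(condaenv="my-flow-env"),
)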
Anna Geller
05/09/2022, 1:56 PM
> is smart enough to package graphql.py
By default, Prefect doesn't package any code for you; you would need to do it yourself, e.g. as part of CI/CD, and point at it in your flow runner.
Anna Geller
05/09/2022, 2:33 PM
flow_runner=DockerFlowRunner()
will be enough.
But to be fair, your point is totally valid, and we are working on an easier way of packaging code dependencies - it will get easier in the future 🤞
LMK if you need more help building a Docker image, I can try to build an example.
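A sketch along those lines, assuming the 2.0b flow runner API and a hypothetical image my-registry/flow-deps:latest built in CI/CD (e.g. from a Dockerfile that COPYs the flow/ directory, including graphql.py, and pip-installs the requirements):

from prefect.deployments import DeploymentSpec
from prefect.flow_runners import DockerFlowRunner
from prefect.orion.schemas.schedules import CronSchedule

DeploymentSpec(
    flow_location="./flow/flow.py",
    name="docker-example",
    schedule=CronSchedule(cron="0 * * * *"),
    # The agent only picks up scheduled runs; this runner then starts a
    # container from the image, so graphql.py and all pip dependencies
    # must already be baked into it.
    flow_runner=DockerFlowRunner(image="my-registry/flow-deps:latest"),
)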
Anna Geller
05/09/2022, 2:48 PM
> Why else do you need the storage then?
Mainly due to the hybrid execution model, to respect your privacy. Think of Storage as a map to where your flow is located - it can point to an object in S3, to a local file in Local storage, or to a flow file on GitHub (not available yet, on the roadmap).
> what is the point of creating docker image AND storing the flow file on a blob store?
I can understand the confusion, but think of a use case where your flow code may change very frequently while your code dependencies don't - you may always rely on the same Snowflake, Pandas, scikit-learn, and dbt package versions, but your definition of the data flow (your data transformations, ML models, etc.) may change frequently as your business use case evolves. It also allows you to reuse the same image across multiple flows, which is often required by teams who don't wish to have one image per flow, which can also get "heavy" and storage-intensive.
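To illustrate that last point with a sketch (the flow locations and image tag are made up), several deployments can share one dependency image while each flow file, kept in storage, evolves on its own:

from prefect.deployments import DeploymentSpec
from prefect.flow_runners import DockerFlowRunner

# One "heavy" image holding the stable dependencies
# (Snowflake connector, Pandas, scikit-learn, dbt, ...).
shared_runner = DockerFlowRunner(image="my-registry/team-deps:2022.05")

# Each flow file lives in Storage (e.g. S3) and can change frequently
# without rebuilding or re-pushing the image.
DeploymentSpec(flow_location="./flows/ingest.py", name="ingest",
               flow_runner=shared_runner)
DeploymentSpec(flow_location="./flows/transform.py", name="transform",
               flow_runner=shared_runner)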