# prefect-community
f
I have a question if anyone has a second. Our group is starting to port a lot of airflow DAGs to prefect, and we’ve been using some patterns that include defining a storage option when a Flow is defined. I have just worked up a new Flow (my first one) and I started from scratch. I have not had to configure or use storage, and I don’t really think I understand what the role of storage is in a flow’s definition. I’ve read the linked docs entry, and I still am confused. Can anyone explain what storage is at a high level? Thank you!
e
Welcome 🙂 Storage specifies where and how your flow (and thus the logic of your workflow) is stored. It contains the logic (code) of all the tasks within your flow, and also how the tasks depend on each other. The need for storage, imo, is that no other component in prefect necessarily stores your flow’s logic:
• The prefect server can show what tasks there are in your flow and how they depend on each other, but it lacks the logic contained within each task (the code within init, run, and other methods of your task). This is due to the hybrid execution model of prefect, and it allows you to keep your logic within your company, even if the prefect server is managed outside.
• The run config has to provide a python runtime and any required libraries for your flow. In many cases this ends up including your flow itself as well (you just throw in your python project and install requirements), but you don’t need to include your flow code in the run config. This way, you can reuse the same run config for multiple flows by only changing their storages: think of a docker image that has all your machine learning libraries, and running both training and batch prediction flows on that same image, but with different prefect storages.
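To make the separation concrete, here is a minimal sketch in Prefect 1.x terms (the image name, repo, and file paths are made up for illustration):

```python
from prefect import Flow, task
from prefect.run_configs import DockerRun
from prefect.storage import GitHub


@task
def train_model():
    ...


@task
def batch_predict():
    ...


# One run config: the runtime (a docker image with your ML libraries installed).
ml_runtime = DockerRun(image="my-registry/ml-runtime:latest")

# Two flows share that run config but use different storages,
# so the agent knows where to fetch each flow's code from.
with Flow("train") as train_flow:
    train_model()
train_flow.run_config = ml_runtime
train_flow.storage = GitHub(repo="my-org/flows", path="flows/train.py")

with Flow("batch-predict") as predict_flow:
    batch_predict()
predict_flow.run_config = ml_runtime
predict_flow.storage = GitHub(repo="my-org/flows", path="flows/predict.py")
```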
🙏 1
❤️ 1
👀 1
f
Emre, thank you so much for taking the time to describe that. May I pose a question back to you? I have this conception of what is happening in my development process, and I’m curious if it matches up to reality. I’ve got a python script which has a number of tasks, some of which use various python packages, and a flow defined that is used to serialize (for lack of a better word) the order of these tasks into a DAG. This script is developed in a docker container which has been prepared with all of the support libraries that may be needed. When the `flow.register()` method is called, the individual tasks are compiled into python bytecode and pickled using `cloudpickle`. This frozen representation of the compiled tasks is sent over the network up to Prefect’s cloud service. (We’re a subscriber, if that is helpful information.) At the same time, I have an agent running. This agent runs from the same docker image, and in fact is the entry point for the container. In this way, the docker image serves both as a development environment and as the environment for the agent to run in. When the flow is kicked off, per the schedule or as a one-off execution, the compiled bytecode is sent from the prefect cloud service down to the agent, where it is executed. This bytecode expects any library or package dependencies that were in place when it was first compiled to be available to the agent.
I realize that this second question may seem a bit of a departure from the first, but I think I’m struggling to get a bigger picture of what is happening behind the scenes with the moving parts of the prefect ecosystem.
e
You got it mostly right; here are the parts that I think are missing:
• Once you `register` your flow, the entire flow object is serialized with `cloudpickle` (this includes tasks and edges).
• The bytecode IS NOT sent to prefect cloud; only metadata about your flow (task names, edges, storage definition, run configs, etc.) is sent to prefect cloud. This prevents prefect cloud from seeing your code.
• The bytecode is sent to whatever you have configured for the flow storage.
• Your agent sees that you want to run a flow, and tries to execute it. I am presuming you are running a `LocalAgent` within a docker container.
• From prefect cloud, your `LocalAgent` receives not the flow bytecode, but metadata about the flow storage. The agent uses this metadata to find and fetch the bytecode itself. Then the bytecode is deserialized and executed within the python environment that you have prepared for your flow (your docker container).
If you share the values for your storage and run configs (or confirm that they aren't modified), we can go over a concrete example.
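A tiny sketch of that flow of information at registration time (Prefect 1.x, with an explicit `Local` storage; the project name is just a placeholder):

```python
from prefect import Flow, task
from prefect.storage import Local


@task
def say_hello():
    print("hello")


with Flow("example-flow") as flow:
    say_hello()

# With Local storage, register() cloudpickles the flow to a file on this
# machine (by default under ~/.prefect/flows) -- the bytecode never leaves it.
flow.storage = Local()

# Only metadata (task names, edges, storage and run-config definitions)
# goes to Prefect Cloud here; the agent later uses that storage metadata
# to locate and load the pickled flow itself.
flow.register(project_name="my-project")
```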
f
You read my mind, thank you! I’m getting you a link. It happens that where I work, all of our repos are open source, so I can show you the exact thing. I’m pushing it up to github right now
I am very grateful for your time to look at this and your insight!
❤️ 1
Two directories up from that file is a Dockerfile that defines the environment that the agent runs in. I am also developing the flow while attached to that same docker container.
Coming back to storage: I think all of the assumptions you laid out above are in fact true. I am defaulting into the local storage by not defining what I want to use, and I am running a local agent (`prefect agent local start --no-hostname-label -l vision-zero -l atd-data03`).
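For what it’s worth, those defaults work out to roughly the same thing as setting this explicitly before registering (a sketch; the flow name is hypothetical, and the labels are copied from the agent command above):

```python
from prefect import Flow
from prefect.run_configs import UniversalRun
from prefect.storage import Local

flow = Flow("vision-zero-flow")  # hypothetical name; stands in for the real flow

# The run's labels must be a subset of the agent's labels
# (-l vision-zero -l atd-data03) for that agent to pick the run up.
flow.run_config = UniversalRun(labels=["vision-zero", "atd-data03"])
flow.storage = Local()  # flow code stays on the machine/image that registered it
```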
e
Yeah, afaik `UniversalRun` just matches with any type of `Agent`, and hopes everything works out 😅. In your case `UniversalRun` simulates a `LocalRun`, since your flow run is captured by a `LocalAgent`.
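If you ever want to be explicit instead, swapping in a `LocalRun` is a small change (a sketch reusing the `flow` object from the sketch above; the working dir and env var values are made up):

```python
from prefect.run_configs import LocalRun

# Same label matching as before, but pins details of the local process
# the LocalAgent will spawn for the flow run.
flow.run_config = LocalRun(
    labels=["vision-zero", "atd-data03"],
    working_dir="/app",                      # hypothetical path inside the container
    env={"PREFECT__LOGGING__LEVEL": "DEBUG"},
)
```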
f
I feel that by packing everything onto a single running container, I in a way “got lucky” with the default storage configuration lining up and becoming essentially invisible to me. Now that I see that bytecode does not travel up through prefect cloud to the agents, there has to be a storage definition: it is the mechanism by which the actual payload of logic/code being registered is conveyed to the agent for execution.
☝️ 1
e
As for storage, the default storage is in fact `Local` storage, with `stored_as_script` enabled and the path field set to whatever file your flow code was in. This works out because your registration and execution environments are identical. Well, not identical, since they could have been different docker containers, but the same docker image, and therefore the same file layout and the same python environment.
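Spelled out, that works out to roughly the following (a sketch of the behaviour described above; the path is hypothetical, and exact defaults can vary between Prefect versions):

```python
from prefect.storage import Local

# The agent's container has the same file layout as the registration
# container, so it can open the same path and re-import the flow.
flow.storage = Local(
    path="/app/flows/vision_zero_flow.py",  # hypothetical path to the flow's .py file
    stored_as_script=True,
)
```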
f
Emre - you have helped me a ton! Thank you for taking the time to think about this and chat about it. I owe you one!
e
Sure, it was very enjoyable 😊
getting "lucky" doesnt mean it is bad design btw, I use the same docker image for reigstry and execution as well. Theres a reason those were the default options
🧠 1