Thread about loading flows from modules instead of py files Prefect Community #ask-community

# ask-community

Ben Ayers-Glassey

05/05/2022, 5:27 AM

Thread about loading flows from modules instead of .py files or cloudpickle-serialized blobs =>

Ben Ayers-Glassey

05/05/2022, 5:27 AM

There seem to be 3 ways to load flows at a low level: the functions `extract_flow_from_file`, `extract_flow_from_module`, and `flow_from_bytes_pickle`. They all live in `prefect.utilities.storage`: https://github.com/PrefectHQ/prefect/blob/6d6c7bff2b9fc8db7d44ad7f9596fcd67b5b1691/src/prefect/utilities/storage.py

Now, loading .py files with `eval()` and loading Flow objects using cloudpickle are kind of cool, but it makes it difficult to write traditional Python code. I'm trying to write a bunch of flows, living in separate .py files within the same Git repo, and I find myself leaving big comments all over the place warning not to treat these .py files as modules, because when they get loaded by the Docker storage class, it'll be using cloudpickle, etc etc. Except that when writing unit tests, you of course do treat these files as modules (to import functions out of them to test). So that's already difficult to explain to other devs.

And in order for these flows to share code, I need to make an entire pip-installable package; and since our code is private, I need to host that package in a private PyPI server; and then I need to convince the Docker class to load requirements from that server; etc etc.

So it seems like it would be much nicer to use `extract_flow_from_module`, but looking at the source code, only 2 storage classes use it: `Local` and `Module`. So it seems like it would be nice if more storage classes were able to load from modules, maybe with a `load_from_module=True` kwarg or something. For instance, the `GitHub` storage class originally seemed like the most straightforward; but it can only load flows by calling `eval()` on single .py files from the Git repo. It would be really nice if it could just clone the whole repo and then behave like the `Module` storage class. Does that make sense?

At the moment, we're considering moving almost all our code out of the individual flow .py files and into the pip-installable package; so then everything becomes "regular old Python modules", and we would just need one small component which knows how to import Flow objects from that module and register them. Does this all sound like something people have already grappled with, and come up with good solutions for?
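The "one small component" described at the end of this message could look roughly like the sketch below. It is not Prefect code: `iter_package_modules`, `collect_objects`, and `is_value_error_class` are hypothetical helpers written for illustration, and the standard-library `json` package stands in for a private flows package so the example runs anywhere.

```python
import importlib
import pkgutil
from types import ModuleType
from typing import Any, Callable, Dict, Iterator


def iter_package_modules(package_name: str) -> Iterator[ModuleType]:
    """Import a package and yield it, followed by each of its submodules."""
    package = importlib.import_module(package_name)
    yield package
    # Single-file modules have no __path__, so there is nothing to walk.
    search_path = getattr(package, "__path__", None)
    if search_path is None:
        return
    prefix = package.__name__ + "."
    for info in pkgutil.walk_packages(search_path, prefix=prefix):
        # Note: importing a module executes its top-level code.
        yield importlib.import_module(info.name)


def collect_objects(
    package_name: str, predicate: Callable[[Any], bool]
) -> Dict[str, Any]:
    """Map "module:attribute" names to every object accepted by predicate."""
    found: Dict[str, Any] = {}
    for module in iter_package_modules(package_name):
        # Copy the namespace so it cannot change while we iterate over it.
        for attr_name, value in list(vars(module).items()):
            if predicate(value):
                found[f"{module.__name__}:{attr_name}"] = value
    return found


def is_value_error_class(obj: Any) -> bool:
    """Stand-in predicate: accept classes derived from ValueError."""
    return isinstance(obj, type) and issubclass(obj, ValueError)


# Stand-in usage: find ValueError subclasses defined or re-exported in json.
found = collect_objects("json", is_value_error_class)
for name in sorted(found):
    print(name)
```

With Prefect 1.x installed, the predicate would presumably become an `isinstance(obj, prefect.Flow)` check, and the caller would loop over the result and call `flow.register(...)` on each entry. That part is an assumption about how the pieces would be wired together, not something shown in this thread.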

Anna Geller

05/05/2022, 10:03 AM

This seems to be an issue with packaging code dependencies, correct? Did you try building a Python package to make those other flows and callables importable?

> they get loaded by the Docker storage class, it'll be using cloudpickle, etc etc

You don't have to use Docker storage. Docker storage is a convenient way of packaging flow code with its dependencies, but you can totally use a containerized execution platform with your own image build process.

> since our code is private, I need to host that package in a private PyPI server

You don't have to host the package anywhere, you only need to install it in your execution platform. You can install it from setup.py rather than from pip, check this simple example. I'm sure you can find a way to package your code without having to modify the Storage source code.

> we're considering moving almost all our code out of the individual flow .py files and into the pip-installable package

You could do that, but the easiest would be to build a package only for code that needs to be shared across flows.

> Does this all sound like something people have already grappled with, and come up with good solutions for?

Yes, absolutely. Usually building a package and installing it in the execution environment or Docker image solves the problem.

Anna Geller

05/05/2022, 10:04 AM

Generally, you can think of:
• Storage as a way to package your flow code only.
• Your Docker image and execution layer as a place to package code dependencies.
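Since this split makes the image responsible for code dependencies, it can help to check inside the execution environment that the shared package is actually importable before a flow runs. The following is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper, and `my_shared_flows_pkg` is a placeholder name for a private package, not something from this thread.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the names that cannot be found on the current import path."""
    missing: List[str] = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except ModuleNotFoundError:
            # Raised for dotted names whose parent package is absent.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# "json" ships with Python; the second name is a placeholder that is
# assumed not to be installed.
print(missing_packages(["json", "my_shared_flows_pkg"]))
```

Running a check like this as part of the image build, or at container start, surfaces a missing install early instead of at flow load time.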

Ben Ayers-Glassey

05/05/2022, 5:08 PM

Thanks for the response!

> You don't have to host the package anywhere, you only need to install it in your execution platform. You can install it from setup.py rather than from pip, check this simple example

I see. Actually I found this file was the one which showed me what I was confused about, in particular seeing which kwargs of `Docker` it used:

docker_storage = Docker(
    image_name="community",
    image_tag="latest",
    registry_url=f"{AWS_ACCOUNT_ID}.dkr.ecr.eu-central-1.amazonaws.com",
    stored_as_script=True,
    path=f"/opt/prefect/flows/{FLOW_NAME}.py",
)

...so it's not using the `base_image` and `python_dependencies` kwargs, and furthermore `build=False` is being passed to `flow.register(...)`; so basically instead of having the `Docker` class build a Dockerfile behind the scenes, you're just writing your own Dockerfile by hand, giving you full freedom over what goes into it.

> I'm sure you can find a way to package your code without having to modify the Storage source code

Yes, it seems like it. I still think it would be useful to be able to specify flows using a dotted path instead of a file path when using the `Docker` storage class, but seeing how much easier it is to package up shared code when you write the Dockerfile yourself, I don't think that's a huge issue! Thanks again 🙂
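For readers wondering what "a dotted path instead of a file path" would involve, here is a minimal sketch of resolving an object from an importable module. The `package.module:attribute` spelling and the `load_by_dotted_path` helper are assumptions made for this example; it does not reproduce Prefect's own `extract_flow_from_module`.

```python
import importlib
from typing import Any


def load_by_dotted_path(path: str) -> Any:
    """Resolve "pkg.module" to a module, or "pkg.module:attr" to an object."""
    module_name, separator, attribute_path = path.partition(":")
    module = importlib.import_module(module_name)
    if not separator:
        return module
    target: Any = module
    # Walk nested attributes such as "SomeClass.some_method".
    for name in attribute_path.split("."):
        target = getattr(target, name)
    return target


# Stand-in usage with the standard library instead of a flows package.
dumps = load_by_dotted_path("json:dumps")
print(dumps({"flow": "example"}))
```

Because the module is imported the normal way, relative imports and shared helpers inside the package behave as they do in unit tests, which is the property the thread is after.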

Anna Geller

05/06/2022, 12:36 PM

> you're just writing your own Dockerfile by hand, giving you full freedom over what goes into it.

Exactly! You're very welcome, LMK if you have any open questions
