Ben Ayers-Glassey

05/05/2022, 5:27 AM
Thread about loading flows from modules instead of .py files or cloudpickle-serialized blobs =>
There seem to be 3 ways to load flows at a low level: the functions `extract_flow_from_file`, `extract_flow_from_module`, and `flow_from_bytes_pickle`. They all live in `prefect.utilities.storage`: https://github.com/PrefectHQ/prefect/blob/6d6c7bff2b9fc8db7d44ad7f9596fcd67b5b1691/src/prefect/utilities/storage.py
Now, loading .py files with `eval()` and loading Flow objects using cloudpickle are kind of cool, but they make it difficult to write traditional Python code. I'm trying to write a bunch of flows, living in separate .py files within the same Git repo, and I find myself leaving big comments all over the place warning not to treat these .py files as modules, because when they get loaded by the Docker storage class, it'll be using cloudpickle, etc etc. Except that when writing unit tests, you of course do treat these files as modules (to import functions out of them to test). So that's already difficult to explain to other devs.
And in order for these flows to share code, I need to make an entire pip-installable package; and since our code is private, I need to host that package in a private PyPI server; and then I need to convince the Docker class to load requirements from that server; etc etc.
So it seems like it would be much nicer to use `extract_flow_from_module`, but looking at the source code, only 2 storage classes use it: `Local` and `Module`. So it seems like it would be nice if more storage classes were able to load from modules, maybe with a `load_from_module=True` kwarg or something. For instance, the `GitHub` storage class originally seemed like the most straightforward; but it can only load flows by calling `eval()` on single .py files from the Git repo. It would be really nice if it could just clone the whole repo and then behave like the `Module` storage class. Does that make sense?
At the moment, we're considering moving almost all our code out of the individual flow .py files and into the pip-installable package; so then everything becomes "regular old Python modules", and we would just need one small component which knows how to import Flow objects from that module and register them. Does this all sound like something people have already grappled with, and come up with good solutions for?
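To make the difference between the two loading styles concrete, here is a minimal sketch using only the standard library. It is not Prefect's implementation: `load_from_file`, `load_from_module`, the `demo_flows` package, and the dict standing in for a Flow object are all hypothetical names made up for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def load_from_file(path, name):
    """Run a .py file's source in a fresh namespace and return one object from it."""
    namespace = {"__file__": str(path)}
    exec(compile(Path(path).read_text(), str(path), "exec"), namespace)
    return namespace[name]


def load_from_module(dotted_path, name):
    """Import a module by dotted path and return one attribute from it."""
    return getattr(importlib.import_module(dotted_path), name)


# Build a throwaway package on disk: demo_flows/shared.py and demo_flows/etl.py.
# (Hypothetical layout, only here so the sketch is self-contained.)
root = Path(tempfile.mkdtemp())
pkg = root / "demo_flows"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "shared.py").write_text("def greet():\n    return 'hello'\n")
(pkg / "etl.py").write_text(
    "from demo_flows.shared import greet\n"
    "flow = {'name': 'etl', 'greeting': greet()}  # stand-in for a Flow object\n"
)
sys.path.insert(0, str(root))
importlib.invalidate_caches()

from_module = load_from_module("demo_flows.etl", "flow")
from_file = load_from_file(pkg / "etl.py", "flow")

# Do both loaders find an equal object?
print(from_module == from_file)
# Does a second module load return the very same object?
print(from_module is load_from_module("demo_flows.etl", "flow"))
# Does a second file load return the very same object?
print(from_file is load_from_file(pkg / "etl.py", "flow"))
```

Both loaders produce equal objects, but repeated module loads hand back the same object a unit test would get from a normal import, while every file load executes the source again and builds a fresh copy.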

Anna Geller

05/05/2022, 10:03 AM
This seems to be an issue with packaging code dependencies, correct? Did you try building a Python package to make those other flows and callables importable?
they get loaded by the Docker storage class, it'll be using cloudpickle, etc etc
You don't have to use Docker storage. It is a convenient way of packaging flow code together with its dependencies, but you can just as well run on a containerized execution platform using your own image build process.
since our code is private, I need to host that package in a private PyPi server
You don't have to host the package anywhere, you only need to install it in your execution platform. You can install it from setup.py rather than from pip, check this simple example. I'm sure you can find a way to package your code without having to modify the Storage source code.
we're considering moving almost all our code out of the individual flow .py files and into the pip-installable package
You could do that, but the easiest would be to build a package only for code that needs to be shared across flows
Does this all sound like something people have already grappled with, and come up with good solutions for?
Yes, absolutely. Usually building a package and installing it in the execution environment or Docker image solves the problem
Generally, you can think of:
• Storage as a way to package your flow code only.
• Your Docker image and execution layer as a place to package code dependencies.

Ben Ayers-Glassey

05/05/2022, 5:08 PM
Thanks for the response!
You don't have to host the package anywhere, you only need to install it in your execution platform. You can install it from setup.py rather than from pip, check this simple example
I see. Actually I found this file was the one which showed me what I was confused about, in particular seeing which kwargs of `Docker` it used:
docker_storage = Docker(
    image_name="community",
    image_tag="latest",
    registry_url=f"{AWS_ACCOUNT_ID}.dkr.ecr.eu-central-1.amazonaws.com",
    stored_as_script=True,
    path=f"/opt/prefect/flows/{FLOW_NAME}.py",
)
...so it's not using the `base_image` and `python_dependencies` kwargs, and furthermore `build=False` is being passed to `flow.register(...)`; so basically instead of having the `Docker` class build a Dockerfile behind the scenes, you're just writing your own Dockerfile by hand, giving you full freedom over what goes into it.
I'm sure you can find a way to package your code without having to modify the Storage source code
Yes, it seems like it. I still think it would be useful to be able to specify flows using a dotted path instead of a file path when using the `Docker` storage class, but seeing how much easier it is to package up shared code when you write the Dockerfile yourself, I don't think that's a huge issue! Thanks again 🙂

Anna Geller

05/06/2022, 12:36 PM
you're just writing your own Dockerfile by hand, giving you full freedom over what goes into it.
Exactly! You're very welcome, LMK if you have any open questions