i have the following setup, where `flows.py` has m...
# ask-community
d
i have the following setup, where
flows.py
has my flow definition, and
utils.py
hosts some number of helper functions. And I'm using Github storage for the flow. Since Github storage only specify the path of the
flows.py
, it got module not found during execution. I wonder if Github storage does not support module? I can't seem to find a good descrption in the documentation. Thanks in advance if anyone can give me a pointer.
e
Hey, imported modules are not stored by flow storage options, like Github storage. Generally, supplying the necessary modules for your flow is the responsibility of
run_config
. Things like picking
DockerRun
with a docker image that has all the necessary modules present. Explanation below:
👍 2
Imports are not serialized and stored with your flow storage, due to how cloudpickle works. Cloudpickle stores imported modules as references to modules, instead of entire implementations embedded into the pickled thing. Say you are using
utils.sonme_fun()
within a task in your flow. • Github Storage serializes your flow (and tasks) via cloudpickle • cloudpickle stores
utils.some_fun()
as, follows: In a module named
utils
, there should be a function named
some_fun
, call that function. • Serialized flow is pushed to github (probably, I use other storages) So, flow doesn't store external modules, It doesn't even store the prefect module, it depends on the modules present in the executing environment
👍 3
d
thank you very much for the explanation, I wasn't able to put these pieces together before
👍 1
k
That’s a great explanation @emre!
😊 1
m
Does that mean that all of the functions for
flows.py
need to be in that same script? I was under the impression that the whole repo is cloned and we could import from modules defined in the repo (such as
utils.py
).
k
For Github storage, the whole repo is not cloned. For Git storage, it is cloned. But the Git storage was intended to pull stuff like .sql files. You might be able to get it to work if you handle the paths right on the agent and in the files, but if the repo is a Python module, then we’d probably recommend Docker Storage.
👍 1