# ask-community
d
Hi, is there a way to use the GitHub flow storage backend not only to store and retrieve the flow to run, but also some common code shared between flows, like a file commons.py? ... if I deploy commons.py on the agent side, it can be imported, but that restricts me to one version, whereas the flows can always be the newest version from a git branch. thx
k
Hey @Daniel Bast, so the quick answer is not really, as we recommend using Docker storage and pip installing the project as a module inside the container. But this might be doable with Git storage. Git storage clones the whole repo, but it does not install it as a package, because that would be reinventing how Python packages are set up, so we leave that to the user; the clone is intended for stuff like `.yml` and `.sql` files. The Git storage class clones the whole repo, runs the script with the flow code, and then deletes that temporary repo. So in order to get this to work: first, you might have to change your imports (not super sure, but I think they should all be absolute rather than relative, and imported from the project root). Second, all the imports must happen before tasks start executing (so put them at the top of the file), because the repo will get deleted before the flow runs. A bit more docs here
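To make that concrete, here is a minimal sketch of a flow file laid out the way described above: the shared module sits in the repo root next to the flow file and is imported absolutely at the top of the file, before any task executes. The repo name, module names, and the `Git` storage arguments are illustrative assumptions, not taken from this thread.

```python
# Hypothetical layout (assumption):
#   repo-root/
#     commons.py   <- shared code
#     my_flow.py   <- this file
from prefect import Flow, task
from prefect.storage import Git

# Absolute import resolved at load time, i.e. while the temporary
# clone made by Git storage still exists.
import commons

@task
def say_hello():
    # commons is already bound in this module's namespace, so it stays
    # usable even after the temporary clone has been deleted.
    return commons.greeting()

with Flow("git-storage-example") as flow:
    say_hello()

# Argument names are approximate; check the docs for your Prefect version.
flow.storage = Git(repo="my-org/my-repo", flow_path="my_flow.py")
```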
d
Thanks... I just read the GitHub storage code from beginning to end... didn't think to read the plain Git storage backend code... will try that...
👍 1
if everything is in the root of the repo, then `prefect build`/`prefect register` and importing during the flow run is fine... if everything is in one subfolder (flows + common code), it then requires sys.path.append hacks 😕
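For reference, the kind of sys.path hack mentioned here looks roughly like this when the flow and the shared code both live in a subfolder; folder and module names are made up, and it assumes `__file__` is set when the file is executed.

```python
# Hypothetical layout (assumption):
#   repo-root/
#     subproject/
#       commons.py
#       my_flow.py   <- this file
import os
import sys

# runpy/exec does not put this subfolder on sys.path, so add it manually
# at import time, while the temporary clone still exists.
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

import commons  # noqa: E402  (import after the path tweak)

from prefect import Flow, task

@task
def use_commons():
    return commons.greeting()

with Flow("subfolder-flow") as flow:
    use_commons()
```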
c
Hey @Daniel Bast, I'm also using github storage with my tasks in a different subdir than my flows. Not quite sure if this is what you are looking for...
I found that the best solution is to package your tasks and flows (with pyproject.toml, setup.py / setup.cfg files), then install the package via pip+git into the executor environment
Lastly, replace all relative imports with absolute ones (e.g. `import mypkg.common_code`) to avoid all the sys.path.append hacks!
I guess this setup might assume that you are spinning up an ephemeral Dask executor (that installs the latest dependencies before running the flow)....
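As a rough illustration of that packaging approach (package name and repo URL are placeholders, not from this thread), a minimal `setup.py` at the repo root could look like this, installed into the executor environment via pip+git:

```python
# setup.py (minimal sketch; setup.cfg or pyproject.toml work just as well)
from setuptools import find_packages, setup

setup(
    name="mypkg",
    version="0.1.0",
    packages=find_packages(),  # picks up mypkg/ with common_code, tasks, flows
)

# Install into the executor environment, e.g.:
#   pip install "git+https://github.com/my-org/my-repo@main#egg=mypkg"
```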
k
Hey @Daniel Bast, unsure if it’ll work, but the flow code is run before the repo is deleted, so maybe it’s a matter of changing the working directory before the temp repo gets deleted?
d
yeah... that is what I do right now... have a setup.py and install everything in the executor/agent environment... the problem is, if flows are loaded from Git storage, they can be newer than the rest of the code that was previously installed into the environments... that can lead to mismatches
agents are long-running, with a connected Dask cluster as the executor
k
You could also clone the repo in your flow yourself and manage it that way? You might be able to use this utility class, but just note it’s not public facing so we might change it.
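A do-it-yourself version of that idea, without the internal utility class, might look roughly like this; the repo URL and module name are placeholders, and shelling out to `git` is just one way to do the clone.

```python
import subprocess
import sys
import tempfile

from prefect import Flow, task

@task
def run_with_shared_code():
    # Clone the repo at task run time and make it importable on the worker.
    workdir = tempfile.mkdtemp()
    subprocess.check_call(
        ["git", "clone", "--depth", "1",
         "https://github.com/my-org/my-repo.git", workdir]
    )
    sys.path.insert(0, workdir)
    import commons  # imported from the fresh clone

    return commons.greeting()

with Flow("self-managed-clone") as flow:
    run_with_shared_code()
```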
d
well, cloning is not the problem so far... the problem is with https://docs.python.org/3/library/runpy.html, which loads the flow... the flow then tries to import the common code from the repo... but runpy doesn't set the import path in a way that makes imports from the repo subfolder work
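A tiny standalone reproduction of that import problem (paths are made up): when a script is executed via `runpy.run_path`, its containing folder is not prepended to `sys.path` the way it is for `python repo/subfolder/my_flow.py`, so a sibling `import commons` fails unless the path is added by hand.

```python
import runpy

# repo/subfolder/my_flow.py does `import commons`, and
# repo/subfolder/commons.py sits right next to it.
try:
    runpy.run_path("repo/subfolder/my_flow.py")
except ModuleNotFoundError as exc:
    # The subfolder is not on sys.path, so the sibling import fails.
    print("import failed:", exc)
```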
k
Ah I see what you mean. So it seems you’d have to `pip install` it as a module? If you have a `setup.py`, could you clone it then run a shell command to `pip install`? Though I suppose that’s way more involved than the original intention
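If one did go down that road, the clone-then-install step could be scripted roughly like this (repo URL is a placeholder, and installing into a long-running environment at flow run time has obvious downsides):

```python
import subprocess
import sys
import tempfile

def install_latest_from_git(repo_url="https://github.com/my-org/my-repo.git"):
    """Clone the repo and pip install it into the current environment."""
    workdir = tempfile.mkdtemp()
    subprocess.check_call(["git", "clone", "--depth", "1", repo_url, workdir])
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", workdir])
```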
d
it works with pip install... that is what I want to avoid... I may have found a solution... reading more runpy code and testing
k
Would love to learn how if you get it working.
d
finishing for today... will look into that again on Monday
seems like it would be possible with just a local executor... with Dask there is another pickle step in between, and I haven't yet found the exact code path to understand it
k
Ah I see. I think it might be related to this. We use cloudpickle to bring things to Dask. But you might be able to do it with a Dask `register_worker_plugin` call if you have the repo?
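A rough sketch of that suggestion (the repo URL, scheduler address, and plugin name are placeholders; `register_worker_plugin` is the Dask distributed API of that era): a worker plugin that pip installs the latest shared code from git on each worker.

```python
import subprocess
import sys

from distributed import Client, WorkerPlugin

class InstallSharedCode(WorkerPlugin):
    """Pip install the shared package from git on every Dask worker."""

    def __init__(self, repo_url="git+https://github.com/my-org/my-repo@main"):
        self.repo_url = repo_url

    def setup(self, worker):
        # Runs on each worker when the plugin is registered (and on new workers).
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "--upgrade", self.repo_url]
        )

client = Client("tcp://dask-scheduler:8786")  # address is a placeholder
client.register_worker_plugin(InstallSharedCode(), name="install-shared-code")
```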