# ask-community
m
Hello everyone, I'm having difficulty trying to use GitStorage on my Prefect instance. I have set up a Prefect server and a flow that is being pulled via GitLab. Everything is working fine on that side: my flow and all of the directories where my flow is stored are pulled. But some of my scripts can't find my modules. I have added the root directory to sys.path in my flow.py using:
Copy code
import os, sys
from pathlib import Path

parent_path = Path(__file__).resolve().parent
sys.path.append(os.path.relpath(parent_path))
My structure looks like this:
Copy code
.
├── flow.py
├── functions
│   ├── function_mapping
│   │   ├── function_a.py
│   │   └── function_b.py
│   └── function_transform
│       ├── function_c.py
│       └── function_d.py
└── task
    ├── mapping
    │   └── task_a.py
    └── transform
        ├── task_b.py
        └── task_c.py
In my flow.py, I import my task like this:
Copy code
# in flow.py
from task.mapping.task_a import task_a
And it works for the task module. But in my tasks, when I try to use my functions, it doesn't find them:
Copy code
# in task_a.py
from functions.function_mapping.function_a import function_a
I always get the error message:
ModuleNotFoundError: No module named 'functions'
I don't get why it wouldn't find my functions module, since I make the import from the root directory and that directory is added to sys.path. If anyone has any idea how I could make it work, or what I should try to debug this situation, it would be greatly appreciated. Thanks everyone in advance.
a
Hi @Mourad Hamou-Mamar, which agent and run configuration do you use? I think the problem is that your Git storage is only pulling your flow file and not the other modules, so those extra modules are not installed/available in your environment. There are a couple of ways you could solve it:
1. Build a custom Docker image and install your package inside the image - here is a walkthrough showing how you can do this.
2. Leverage Module storage instead of GitLab (see the sketch after the snippet below).
3. A quite hacky workaround would be to clone your repo as an initial task (perhaps also install it with a subprocess: pip install .), and do the imports in the downstream tasks, e.g.
Copy code
from prefect import task
import pygit2


@task
def pull_your_repo(repo_url: str):
    # clone the repo so downstream tasks can import from it
    pygit2.clone_repository(url=repo_url, path="your_module")


@task
def use_cloned_module():
    import your_module.XYZ
    pass  # your code using the module
#1 would be the recommended approach.
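For option 2, a rough sketch of Module storage (this assumes your flow lives in an importable package on the agent; the module path my_flows.flow is just an illustration):
Copy code
# hypothetical module path; the flow must be importable from it on the agent
from prefect.storage import Module

flow.storage = Module("my_flows.flow")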
You could also check your PYTHONPATH:
Copy code
import sys
print(sys.path)
m
Hello, thanks for the answer. I use a LocalAgent and I didn't set up the run config, so I guess I used the default one. It seems weird because I have tested that all the files were pulled. I used the Prefect logger in my flow.py to print the paths of all the files pulled into the /tmp/ directory on the run, and all of the files are correctly pulled. It has no problem accessing my task module files. I have also checked sys.path, since I add the root directory in my flow file to make sure all of my modules can be accessed locally. I will look into your solution and also the run configuration option.
a
@Mourad Hamou-Mamar you can also explicitly add import paths that the local agent will add to all flow runs using the -p flag:
Copy code
prefect agent local start -p /Your/path/to/extra/module
And in the same way, you can add an explicit path to the run configuration:
Copy code
from prefect.run_configs import LocalRun

flow.run_config = LocalRun(working_dir="/path/to/working-directory")
m
Is there a difference between adding the path to the agent like you specified and adding the path to my extra module to sys.path like I did? Also, in my understanding, the LocalRun config will let me specify a working directory, but it won't pull the code into that directory. If I understand correctly, my output files and other working files will be placed in the working directory, but the code itself will remain pulled into a temporary directory. Is it possible to specify the directory where the code from GitLab should be pulled? Might not be a Prefect problem after all, more of a Python one.
a
1. I think there is no difference per se, but adding this path to the agent is a bit cleaner, since you no longer need boilerplate code modifying the Python path in your flow code.
2. LocalRun only sets the working directory, it doesn't pull any files. But it allows you to run the flow from a directory that has access to your modules or custom configuration.
3. GitLab storage would only pull the flow file, without your custom modules. If you use a LocalAgent, Local storage or Module storage seem much more convenient than GitLab storage, especially if you have custom modules your flows need access to. But if you want, you can use GitLab storage and inject your module dependencies in some other way, e.g. installing them as a package on your agent (see the sketch below).
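A minimal sketch of that last option, assuming you add a setup.py at the repo root (the package name is just an illustration, and each package directory needs an __init__.py for find_packages to pick it up):
Copy code
# setup.py at the repo root (illustrative)
from setuptools import setup, find_packages

setup(
    name="my_flow_utils",      # hypothetical package name
    version="0.1.0",
    packages=find_packages(),  # finds functions/ and task/ if they contain __init__.py
)
Then pip install . on the agent machine makes the functions and task packages importable everywhere, independent of where the flow file gets pulled.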
m
1. I understand, it's true that it's cleaner. But if I need to add another flow using another path, I will need to restart my agent to add the new path. I like the idea that in my flow.py file I can dynamically add the path that I need; it only takes three lines of code in one file for a project.
2. I understand, and it will work if I use the option you describe in point 3. But I would rather keep everything as dynamic as possible, since the machine hosting the agent will probably change over time. And from the documentation, I would have to create the working directory every time.
I have found the problem, yet I don't understand why. I have to apologize, because I have not been precise: I'm not actually using GitLab storage but Git storage. As the documentation explains, Git storage pulls all the files, directories and subdirectories in the same folder as your flow.py, so everything was correctly pulled. It doesn't seem to be the same with GitLab storage. The problem seems to come from the executor I was using. I used DaskExecutor for my flow, and when I changed to LocalDaskExecutor, for some reason all imports were resolved with no problem. I'm probably missing a lot of knowledge to understand why it worked like that, but there's probably a logical explanation. Thanks for the help and sorry for the loss of time 😅 At least I learned about RunConfig and other storage solutions.
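For reference, a rough sketch of the Git storage setup (parameter names are from the Prefect 1.x Git storage API as I recall it, and the repo slug is illustrative, so double-check against the docs):
Copy code
from prefect.storage import Git

flow.storage = Git(
    flow_path="flow.py",          # path to the flow file inside the repo
    repo="your-group/your-repo",  # illustrative repo slug
    repo_host="gitlab.com",       # Git storage defaults to github.com
)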
a
Perhaps you could start another agent then, and match the flow and your agent via labels? I think it has a lot to do with how you structure your repo. If you use a mono-repo, then you only need to set this path once on the agent.
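A rough sketch of matching by labels (the label name and extra path are just placeholders):
Copy code
# on the machine running the agent (shell):
#   prefect agent local start --label shared-modules -p /srv/shared/modules
from prefect.run_configs import LocalRun

flow.run_config = LocalRun(labels=["shared-modules"])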
Don’t be sorry, I’d like to help as much as I can 🙂 When it comes to why LocalDaskExecutor worked and DaskExecutor didn’t: LocalDaskExecutor parallelizes your code using local threads and processes on your machine, rather than on a separate cluster. That’s why this executor could resolve all your imports without any issues. A DaskExecutor, on the other hand, spins up a local Dask cluster under the hood, and this setup is a bit more involved.
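A minimal sketch of the two executors (the scheduler address is purely illustrative):
Copy code
from prefect.executors import DaskExecutor, LocalDaskExecutor

# runs tasks in local threads/processes, on the same filesystem, so local imports resolve
flow.executor = LocalDaskExecutor(scheduler="threads")

# connects to (or spins up) a Dask cluster; the workers also need your modules available
# flow.executor = DaskExecutor(address="tcp://dask-scheduler:8786")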
m
Well, I think I'm going to reconsider the structure of my repo, since it's not very viable with multiple flows. I will consider changing it in order to be able to use only one path and add it to the agent. Thank you!
👍 1
As for the explanation around DaskExecutor, I think I understand a little bit more with your comment and the docs I could find about why it wasn't working. Thank you for your time.
a
absolutely, LMK if you have any other issues
k
If it works for LocalDaskExecutor but not for DaskExecutor, it’s because the DaskExecutor workers do not have access to those files. Even if you pip install something on the scheduler, it needs to be pip installed on the workers as well. You would likely need to put these files in the container that the DaskExecutor workers use.
upvote 1