
Daniel Davee

06/03/2021, 9:39 PM
I'm using a Dask executor and a Kubernetes run config to run a flow on a Kubernetes cluster, but it doesn't seem to want to import my submodules when it runs on the cluster. Do I need to include them on the workers, or is there a way they can be uploaded with the flow?

Kevin Kho

06/03/2021, 9:40 PM
Hey @Daniel Davee! By submodules you mean other code related to your flow that you import, right? The best thing here is probably to package them into a Docker image so everything gets loaded together.

Daniel Davee

06/03/2021, 9:41 PM
Yes I am.
Do you mean I need to create a new image for each ETL I have?

nicholas

06/03/2021, 9:45 PM
@Daniel Davee - it could be a bit of that, but a better approach is a base image that has your shared submodules installed, which you either use directly for your flows or extend for those that require anything extra
But regardless, the submodules will need to be installed on whatever image you use so that they're accessible at runtime

Daniel Davee

06/03/2021, 9:53 PM
So what I'm trying to do is build the DAGs dynamically at runtime. It seems like I would need to keep all the ETL code in the same environment.

nicholas

06/03/2021, 9:54 PM
That's correct - any code that could be included in the dynamic DAG would need to be available to the process though, right?

Daniel Davee

06/03/2021, 9:56 PM
Yeah, I guess I'd hoped to be able to keep the code off of the image. Would a sidecar help? Sorry, I literally just learned Docker and Kubernetes trying to get this to work.
I'm not totally sure what a sidecar is or does, but could I keep all the code on one node and move it into the environment?

nicholas

06/03/2021, 10:02 PM
No worries! This is really up to preference, but the most straightforward implementation would be to include the modules you need at container build time. This has the added benefit of pinning your code at build time instead of doing something like pulling from source, which makes it difficult to understand if/when bugs are introduced or code changes. Since your flow will be deployed on a pod with the storage you built at registration time, you end up with really good encapsulation for your flow.
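As a concrete illustration of "storage you built at registration time", here is a minimal sketch assuming Prefect's Docker storage (the 0.14/0.15-era API); the registry URL, image name, module name, and paths are placeholders, not values from this thread:
```python
from prefect import Flow
from prefect.storage import Docker

# Bake the shared submodules and extra dependencies into the image that is
# built when the flow is registered. All names/paths below are hypothetical.
storage = Docker(
    registry_url="gcr.io/my-project",             # hypothetical registry
    image_name="my-etl-flows",
    python_dependencies=["pandas", "numpy"],      # extra PyPI packages to install
    files={"/local/path/to/my_etl": "/opt/src/my_etl"},  # copy submodules into the image
    env_vars={"PYTHONPATH": "/opt/src"},          # so `import my_etl` works at runtime
)

with Flow("dynamic-etl", storage=storage) as flow:
    ...  # tasks that import from my_etl
```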

Kevin Kho

06/03/2021, 11:00 PM
You can install the ETL code as a Python module on that base image for your flows.

Daniel Davee

06/03/2021, 11:10 PM
How does that work?

Kevin Kho

06/03/2021, 11:14 PM
I have this repo as an example. There is a `Dockerfile` there. The last line of the Dockerfile is `RUN pip install -e .`, which makes the module available in the image. Note that you need a `setup.py` for this to work. This guide will have more info.
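For reference, a minimal setup.py sketch that makes `RUN pip install -e .` work; the package name and dependencies are placeholders rather than the actual contents of Kevin's repo:
```python
# setup.py at the repository root
from setuptools import find_packages, setup

setup(
    name="my_etl",                 # hypothetical module name
    version="0.1.0",
    packages=find_packages(),      # picks up my_etl/ and its submodules
    install_requires=["prefect"],  # plus whatever the ETL tasks need
)
```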

Daniel Davee

06/04/2021, 12:10 AM
Could I have a private pip index on a container in the pod that can be installed from dynamically?

Kevin Kho

06/04/2021, 12:29 AM
Like you have an internal copy of PyPI for security purposes and you need to pip from there?

Daniel Davee

06/04/2021, 12:33 AM
Yeah, basically I'm trying to glue different functions together into a DAG dynamically. But really I need a place to store the code in some directory-like structure, so that when I send the job to Dask it's able to import the code. Honestly, if there is a way to set it up so that a directory on another container is somehow on the Dask scheduler's path, that would probably work best. But like I said, I barely learned Docker the other day, so I don't even know if that's possible.
I suppose if I had a giant package that I pip installed from a private pip index, that could work too.

Kevin Kho

06/04/2021, 12:49 AM
The Docker container would serve the same purpose: you install all your Python packages there, like pandas, numpy, scipy, dask, etc. (The default Prefect image contains dask, and dask pulls in a lot of those, so it might just be a matter of adding your module on top.)
How do you spin up a Dask cluster?
About the private pip: you shouldn't need it in this scenario if the Docker image has everything. It's normally something your security team sets up for you; they give you a certificate, you save it somewhere, and pip uses it for the private PyPI.

Daniel Davee

06/04/2021, 4:18 PM
Sorry for the late reply. So I have set up a Dask executor on GCP which I connect to from the Prefect server to run the code. It seems like the private pip isn't really what I am looking for. Is there a way to add to the Dask executor a Python path that would be housed on a different container? I could have all the code on the Dask executor, but it seems like that would cause scaling issues. Am I correct?

Kevin Kho

06/04/2021, 4:22 PM
By DaskExecutor on GCP, do you mean GKE? I think it might be easier if you point to the address of the Dask cluster. You can do `DaskExecutor(address)` and point it there. Prefect will then connect and execute the code there, assuming it can make the connection.
We have a partner that specializes in infrastructure that I can introduce you to btw. A quick call with them might clear up the path you want to take for your deployment.
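For illustration, a sketch of what pointing at an existing Dask cluster can look like with the Prefect 0.14/0.15 API; the scheduler address and image are placeholders, and the job image would need the same submodules installed as the Dask workers:
```python
from prefect import Flow
from prefect.executors import DaskExecutor
from prefect.run_configs import KubernetesRun

with Flow("dynamic-etl") as flow:
    ...  # tasks

# Connect to the already-running Dask scheduler instead of spinning one up.
flow.executor = DaskExecutor(address="tcp://dask-scheduler.default:8786")
# The flow-run image should match what the Dask workers have installed.
flow.run_config = KubernetesRun(image="gcr.io/my-project/my-etl-flows:latest")
```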

Daniel Davee

06/04/2021, 4:24 PM
Yes, sorry, too many GXX acronyms. This is what I already do. The problem is matching environments and code access.
That would be so great. This is my weakest part; I think I've almost got it, but I can't really be sure.

Kevin Kho

06/04/2021, 4:32 PM
@George Coyne
No, it's fine. I probably can't help you as well as our partner can.