# prefect-server
d
I'm using a Dask executor and a Kubernetes run config to run a flow on a Kubernetes cluster, but it doesn't seem to want to import my submodules when it runs on the cluster. Do I need to include them on the worker, or is there a way they can be uploaded with the flow?
k
Hey @Daniel Davee! By submodules you mean other code related to your flow that you import, right? The best thing here is probably to package them into a Docker container so they're loaded all together.
d
Yes, I am.
Do you mean I need to create a new image for each ETL I have?
n
@Daniel Davee - it could be a little bit of that, but perhaps use a base image that has your shared submodules installed, which you either use directly for your flows or extend for the ones that need anything extra
But regardless, the submodules will need to be installed on whatever image you use so that they're accessible at runtime
d
So what I'm trying to do is build the DAGs dynamically at run time. It seems like I would need to keep all the ETL code in the same environment.
n
That's correct - any code that could be included in the dynamic DAG would need to be available to the process though, right?
d
Yeah, I guess I hoped to be able to keep the code off of the image. Would a sidecar help? Sorry, I literally just learned Docker and Kubernetes trying to get this to work.
I'm not totally sure what a sidecar is or does, but could I keep all the code on one node and move it into the environment?
n
No worries! This is really up to preference, but the most straightforward implementation would be to install the modules you need at container build time; this has the added benefit of pinning your code at build time instead of doing something like pulling from source, which makes it difficult to understand if/when bugs are introduced or code changes. Since your flow will be deployed on a pod with the storage you built at registration time, you end up with really good encapsulation for your flow
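For illustration, a minimal sketch of that pattern, assuming a Prefect 1.x flow with Docker storage and a Kubernetes run config (the registry URL, base image name, and dependency list are placeholders, not part of the original discussion):
```python
from prefect import Flow, task
from prefect.run_configs import KubernetesRun
from prefect.storage import Docker

@task
def extract():
    # stand-in for an ETL step that imports your shared submodules
    return [1, 2, 3]

with Flow("example-etl") as flow:
    extract()

# Bake everything into an image at registration time: the base image already
# has the shared submodules installed, so they're importable when the flow
# runs on the cluster.
flow.storage = Docker(
    registry_url="gcr.io/my-project",         # placeholder registry
    base_image="my-org/etl-base:latest",      # placeholder base image with submodules installed
    python_dependencies=["pandas", "numpy"],  # extra packages to install on top
)
flow.run_config = KubernetesRun()
```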
k
You can install the ETL code as a Python module on that base image for your flows.
d
How does that work?
k
I have this repo as an example. There is a `Dockerfile` there. The last line of the Dockerfile is `RUN pip install -e .` This makes the module available in the image. Note you need a `setup.py` for this to work. This guide will have more info
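For reference, a minimal `setup.py` along those lines might look like this (the package name and dependencies are placeholders):
```python
from setuptools import find_packages, setup

setup(
    name="my_etl",                # placeholder package name
    version="0.1.0",
    packages=find_packages(),     # discovers your submodules
    install_requires=["pandas"],  # whatever your ETL code depends on
)
```
With that in place, `RUN pip install -e .` in the Dockerfile installs the package into the image so any flow running there can import it.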
d
Could I have a private pip index on a container in the pod that can be installed from dynamically?
k
Like you have an internal copy of PyPI for security purposes and you need to pip install from there?
d
Yeah, basically I'm trying to glue different functions together into a DAG dynamically. But really I need a place to store the code in some directory-like structure, so that when I send the job to Dask it's able to import the code. Honestly, if there is a way to set it up so that this directory on another container is somehow on the Dask scheduler's path, that would probably work best. But like I said, I barely learned Docker the other day, so I don't even know if that's possible.
I suppose if I had a giant package that I pip installed from a private pip index, that could work too.
k
The Docker container would serve the same purpose: you install all the Python packages there, like pandas, numpy, scipy, dask, etc. (the default Prefect image contains dask, and dask pulls in a lot of those, so it might just be a matter of adding your module on top)
How do you spin up a Dask cluster?
About the private pip index: you shouldn't need it in this scenario if Docker has everything. It's normally something your security team sets up for you; they give you a certificate, you save it somewhere, and pip uses it for the private PyPI
d
Sorry for the late reply. So I have set up a Dask executor on GCP which I connect to from the Prefect server to run the code. It seems like the private pip index isn't really what I'm looking for. Is there a way to add a Python path to the Dask executor that would be housed on a different container? I could have all the code on the Dask executor, but it seems like that would cause scaling issues. Am I correct?
k
By DaskExecutor on GCP, do you mean GKE? I think it might be easier if you point to the address of the Dask cluster. You can do `DaskExecutor(address)` and point it there. Prefect will then connect and execute the code there, assuming it can make the connection.
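Something like this, assuming a Prefect 1.x flow and a reachable Dask scheduler (the scheduler address and image name are placeholders):
```python
from prefect import Flow, task
from prefect.executors import DaskExecutor
from prefect.run_configs import KubernetesRun

@task
def say_hi():
    print("hello from the Dask cluster")

with Flow("existing-dask-cluster") as flow:
    say_hi()

# Point the flow at the existing Dask scheduler instead of spinning one up;
# the Dask workers still need an image/environment with the same code installed
# so imports resolve at run time.
flow.executor = DaskExecutor(address="tcp://dask-scheduler.example.com:8786")  # placeholder address
flow.run_config = KubernetesRun(image="gcr.io/my-project/example-etl:latest")  # placeholder image
```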
We have a partner that specializes in infrastructure that I can introduce you to btw. A quick call with them might clear up the path you want to take for your deployment.
d
Yes, sorry, too many GXX things. This is what I already do. The problem is matching the environments and code access
That would be so great. This is my weakest part; I think I've almost got it, but I can't really be sure
k
@George Coyne
No, it's fine. I probably can't help you as well as our partner can