# ask-community
Hi All, We're starting to onboard more and more people from our organization onto Prefect, and as we do so, one of the things we're noticing is that the flows many people are authoring share the same handful of Python library dependencies – maybe 4 or 5 of the same libraries (e.g. `numpy`, `pandas`, `snowflake-connector-python`). Then our users will go and write a handful of their own custom Python classes and modules that they need for their flow. To make these custom modules available in their flows, people have been creating slight variations of the same Docker image: each has that same set of Python packages installed, plus a `COPY` of their project's code – at that point, their RunConfig image has everything needed to run their flow.

One of the issues I'm foreseeing with this approach is that it's going to lead to a lot of image bloat in terms of the number of images we'll have in use across our flows – images whose Dockerfiles might live in several different repositories – so we'll be maintaining a lot of images that hardly differ from one another save for a handful of custom Python modules copied into them. I'm trying to see if there's an approach that avoids this, or at least one with a favorable tradeoff. Maybe instead of these custom modules needing to be available at registration/build time, they could simply be retrieved at runtime from S3, for example? If that were possible, the management overhead would move to S3 rather than our image repository, which I think is easier to deal with; plus, many of our users who need/want to build these flows don't necessarily want to be in the business of building and managing Docker images.
@Sean Talia We have a similar approach in my organization running on Kubernetes. We have a base image pre-installed with our most common dependencies, and flow code is stored in Git and pulled at flow runtime. There's a variety of storage options such as Git, S3, even Webhooks (apparently). https://docs.prefect.io/orchestration/execution/storage_options
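Roughly, it looks like this with Prefect's Git storage – a minimal sketch, where the repo name and path are placeholders for your own:

from prefect import Flow
from prefect.storage import Git

# The image only needs the shared dependencies; the flow code itself
# is cloned from Git when the flow run starts.
flow = Flow(
    "git-flow",
    storage=Git(
        repo="my-org/my-flows",        # placeholder repo
        flow_path="flows/my_flow.py",  # placeholder path within the repo
        repo_host="github.com",
    ),
)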
Plus, if you happen to be using Kubernetes, you can have extra pip packages installed at flow launch by specifying them in your KubernetesRun config. https://docs.prefect.io/api/latest/run_configs.html#kubernetesrun
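For example – a sketch, assuming you're running on Prefect's official base image, which installs anything listed in the EXTRA_PIP_PACKAGES environment variable when the flow-run container starts:

from prefect.run_configs import KubernetesRun

# Prefect's official images pip-install the packages in
# EXTRA_PIP_PACKAGES at container startup, before the flow runs.
run_config = KubernetesRun(
    image="prefecthq/prefect:latest",
    env={"EXTRA_PIP_PACKAGES": "numpy pandas snowflake-connector-python"},
)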
@Sam Cook brings up a super good suggestion with the extra pip installs. You can also install them on a Dask cluster on the fly I think with this worker plugin
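If it's Dask's PipInstall plugin, presumably something like this – a rough sketch, with a placeholder scheduler address:

from dask.distributed import Client, PipInstall

client = Client("tcp://<scheduler-address>:8786")  # placeholder address

# Installs the listed packages on every current and future worker
client.register_worker_plugin(
    PipInstall(packages=["numpy", "pandas"], restart=False)
)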
Oh, I suppose I should have mentioned that we're already using S3 for our flow storage. I must be missing something very obvious here then? I'm looking at the code example for the S3 storage option, which looks like:
from prefect import Flow
from prefect.storage import S3

# At registration, the flow is serialized and uploaded to the bucket
flow = Flow("s3-flow", storage=S3(bucket="<my-bucket>"))

flow.storage.build()
Whenever we define our flows, we always define the flow body as well, and it's the flow body that needs to use these lightweight custom modules that users are authoring.
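To make that concrete – a hypothetical flow, where `my_org.transforms` stands in for one of those custom modules; S3 storage uploads the pickled flow, but this import still has to resolve in whatever environment the flow actually runs in:

from prefect import Flow, task
from prefect.storage import S3

# Hypothetical custom module: the pickled flow references it, so it
# must be importable at runtime, not just at registration time.
from my_org.transforms import clean_records

@task
def transform(raw):
    return clean_records(raw)

with Flow("s3-flow", storage=S3(bucket="<my-bucket>")) as flow:
    transform(["raw-record"])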
Hello @Sean Talia, I think I've got a similar setup to @Sam Cook. Perhaps this example `job_template.yaml` can clarify the approach outlined above? https://prefect-community.slack.com/archives/CL09KU1K7/p1634175625338700?thread_ts=1634016930.185300&cid=CL09KU1K7