# prefect-community
s
I'm completely new to Prefect and had some questions about pickling dependencies for containerized execution environments. We are working on a project that uses a thin wrapper over Prefect Flows to include some additional input and output information https://github.com/developmentseed/example-pipeline/blob/fargate_test/recipe/pipeline.py. My question is about how `cloudpickle` serializes my Flow's dependencies with S3 storage. Based on some information outlined here https://prefect-community.slack.com/archives/C014Z8DPDSR/p1605200879483400 I've got S3 storage configured and working, but it seems that my Flow's upstream dependencies are not pickled when I register the flow as indicated in https://docs.prefect.io/core/advanced_tutorials/task-guide.html#task-inputs-and-outputs. Specifically, I receive the following dependency error: `Unexpected error: ModuleNotFoundError("No module named 'h5netcdf'")`. I can build an image with the necessary dependencies for use with the `DaskExecutor` and everything works correctly, but our goal is to decouple our execution environment from Flows. Am I misunderstanding how Flow dependencies should be serialized by `cloudpickle`? Is there another approach I should be considering in this case? Thanks in advance.
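For context, the behaviour described here can be reproduced outside Prefect: `cloudpickle` serializes a function's code by value, but any third-party package the function imports is only resolved when the function actually runs. A minimal sketch (independent of Prefect, with `h5netcdf` standing in for any package missing from the execution environment):

```python
import cloudpickle

def read_version():
    # The import only runs when the function is called, not when it is pickled.
    import h5netcdf
    return h5netcdf.__version__

blob = cloudpickle.dumps(read_version)   # serializes the function's code, not the h5netcdf package
restored = cloudpickle.loads(blob)       # succeeds on any machine that has cloudpickle
restored()                               # ModuleNotFoundError here if h5netcdf is not installed
```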
j
Hi Sean, Python dependencies with Prefect have two parts:
• Your flow and the tasks within it are usually all written in a single python file. A `Storage` class manages getting the contents of this to an execution environment.
• The Python dependencies your code relies on (for example `h5netcdf`). These need to be handled separately by you using e.g. a docker image, conda environments, etc...
If you want to decouple "execution environment" from flows, I recommend having a superset of all dependencies needed by all flows in an environment - a Prefect storage class has no way of moving around anything except your flow code itself.
In your case you might have a base image with all the dependencies you need, then use the `S3` storage to manage only your flow code. Then all flows share the same image.
Alternatively, flows can specify an image to use (it doesn't need to be configured on the agent alone), so you might have an `S3` storage + image pair to configure a flow. Up to you.
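A rough sketch of the "S3 storage + image pair" setup described above, assuming a Prefect 0.14/1.x-style API (`prefect.storage.S3` and `prefect.run_configs.DockerRun`); the bucket, image tag, and project name are placeholders:

```python
from prefect import Flow, task
from prefect.storage import S3
from prefect.run_configs import DockerRun

@task
def extract():
    import h5netcdf  # must already be installed in the image configured below
    ...

with Flow("example-flow") as flow:
    extract()

# S3 storage only uploads the flow itself at registration time.
flow.storage = S3(bucket="my-prefect-flows")  # placeholder bucket

# The image supplies every third-party package the tasks need.
flow.run_config = DockerRun(image="my-org/prefect-deps:latest")  # placeholder image

flow.register(project_name="example-project")  # placeholder project
```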
s
@Jim Crist-Harif Thanks for the quick response. I suspected my understanding of `cloudpickle`'s serialization was incorrect. Also, we were able to leverage all of your great work on `dask-gateway` on some other projects, so I'm psyched to see you are working with Prefect. Thanks again.
👍 1
j
Nice, glad to hear it!
c
@Marvin archive “How are Flow dependencies pickled when using S3 storage?”