# prefect-community
s
I'm completely new to Prefect and had some questions about pickling dependencies for containerized execution environments. We are working on a project that uses a thin wrapper over Prefect Flows to include some additional input and output information https://github.com/developmentseed/example-pipeline/blob/fargate_test/recipe/pipeline.py. My question is about how `cloudpickle` serializes my Flow's dependencies with S3 storage. Based on some information outlined here https://prefect-community.slack.com/archives/C014Z8DPDSR/p1605200879483400 I've got S3 storage configured and working, but it seems that my Flow's upstream dependencies are not pickled when I register the flow as indicated in https://docs.prefect.io/core/advanced_tutorials/task-guide.html#task-inputs-and-outputs. Specifically, I receive the following dependency error: `Unexpected error: ModuleNotFoundError("No module named 'h5netcdf'")`. I can build an image with the necessary dependencies for use with the `DaskExecutor` and everything works correctly, but our goal is to decouple our execution environment from Flows. Am I misunderstanding how Flow dependencies should be serialized by `cloudpickle`? Is there another approach I should be considering in this case? Thanks in advance.
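For context, the behaviour described here can be reproduced outside Prefect: `cloudpickle` serializes a function's code by value, but any third-party package the function imports is only resolved when the function actually runs. A minimal sketch (independent of Prefect, with `h5netcdf` standing in for any package missing from the execution environment):

```python
import cloudpickle

def read_version():
    # The import only runs when the function is called, not when it is pickled.
    import h5netcdf
    return h5netcdf.__version__

blob = cloudpickle.dumps(read_version)   # serializes the function's code, not the h5netcdf package
restored = cloudpickle.loads(blob)       # succeeds on any machine that has cloudpickle
restored()                               # ModuleNotFoundError here if h5netcdf is not installed
```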
j
Hi Sean, Python dependencies with Prefect have two parts:
• Your flow and the tasks within it are usually all written in a single python file. A `Storage` class manages getting the contents of this to an execution environment.
• The Python dependencies your code relies on (for example `h5netcdf`). These need to be handled separately by you using e.g. a docker image, conda environments, etc...
If you want to decouple "execution environment" from flows, I recommend having a superset of all dependencies needed by all flows in an environment - a Prefect storage class has no way of moving around anything except your flow code itself.
In your case you might have a base image with all the dependencies you need, then use the `S3` storage to manage only your flow code. Then all flows share the same image.
Alternatively, flows can specify an image to use (it doesn't need to be configured on the agent alone), so you might have an `S3` storage + image pair to configure a flow. Up to you.
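A rough sketch of the "S3 storage + image pair" setup described above, assuming a Prefect 0.14/1.x-style API (`prefect.storage.S3` and `prefect.run_configs.DockerRun`); the bucket, image tag, and project name are placeholders:

```python
from prefect import Flow, task
from prefect.storage import S3
from prefect.run_configs import DockerRun

@task
def extract():
    import h5netcdf  # must already be installed in the image configured below
    ...

with Flow("example-flow") as flow:
    extract()

# S3 storage only uploads the flow itself at registration time.
flow.storage = S3(bucket="my-prefect-flows")  # placeholder bucket

# The image supplies every third-party package the tasks need.
flow.run_config = DockerRun(image="my-org/prefect-deps:latest")  # placeholder image

flow.register(project_name="example-project")  # placeholder project
```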
s
@Jim Crist-Harif Thanks for the quick response. I suspected my understanding of `cloudpickle`'s serialization was incorrect. Also, we were able to leverage all of your great work on `dask-gateway` on some other projects, so I'm psyched to see you are working with Prefect. Thanks again.
👍 1
j
Nice, glad to hear it!
c
@Marvin archive “How are Flow dependencies pickled when using S3 storage?”