
Adithya Ramanathan

11/17/2020, 3:07 PM
Hi all, had a quick question trying to better understand the difference/relationship between the object that we store as a `.prefect` file in storage when submitting a flow to the server, versus the serialized variant of the flow that is used to register with the GraphQL API. To motivate the question: we have a need to invoke the GraphQL API directly, and are therefore serializing flows ourselves. As per our current understanding, we also need to build the `.prefect` file in storage and make sure it is referenced in the serialization, but given that the tasks are captured in the serialized flow anyway, we were curious what the `.prefect` file is used for and what other information is captured there. Mainly trying to understand why, rather than build a workaround of any kind. Thanks in advance!
@David Harrington @Payam Vaezi

josh

11/17/2020, 3:18 PM
Hi @Adithya Ramanathan, the `.prefect` file is the flow pickled with cloudpickle. Essentially it is the bytes written by `cloudpickle.dump`. Take a look at the flow's `save` function, which does this. Also worth pointing out that if you're calling the GraphQL API directly to register flows and performing serialization yourself, you may be able to benefit from storing your flow as a script instead of a pickle: https://docs.prefect.io/core/idioms/file-based.html 🙂
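A minimal sketch of what that looks like, assuming the Prefect 0.x API from the time of this thread; the flow name, task, and file path are made up for illustration:

```python
# Rough sketch of what flow.save() does under the hood: cloudpickle the
# flow object into a .prefect file so the run side can load it later.
import cloudpickle
from prefect import Flow, task

@task
def say_hello():
    print("hello")

with Flow("example-flow") as flow:
    say_hello()

# Write the pickled flow to a .prefect file (hypothetical path).
with open("example-flow.prefect", "wb") as f:
    cloudpickle.dump(flow, f)

# The run side loads it back the same way.
with open("example-flow.prefect", "rb") as f:
    loaded_flow = cloudpickle.load(f)
```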

Adithya Ramanathan

11/17/2020, 3:21 PM
Hi Josh! Thanks for the quick response! I think the concept we’re struggling with is that the serialization of the flow seemingly contains almost all of the information needed to construct a flow. Why does `prefect` both serialize the flow and `cloudpickle.dump`/store it as a Python script? Why doesn’t the agent get/use the serialization itself to execute the flow?

josh

11/17/2020, 3:28 PM
Running your flow with a Prefect backend API (Server/Cloud) uses a hybrid model that separates the contents of a flow from the metadata that represents it. This ensures that any actual data, sensitive information, processes, etc. are completely independent of the orchestration layer. The serialization is the metadata representation that the backend uses to represent and orchestrate the flow.

The agent does use the flow’s serialized metadata to execute the flow; however, when the flow goes to run, the structure of the flow is expected to match the structure in the serialization. For example, if I have an ETL flow with three tasks, then when the flow goes to run, the backend orchestrates it based on the three ETL tasks it knows about. If, for some reason, the flow loaded from storage doesn’t match the structure it was serialized with, the backend won’t know what needs to be orchestrated and the flow will need to be reregistered.

When you use something like file-based storage, it becomes a bit easier to update the contents of your tasks without reregistering, but the structure must still remain the same. We have pitched an idea for dynamically registering flows at runtime, but it hasn’t been developed yet.
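To make the split concrete, here is a rough sketch of the two artifacts produced for the same flow, assuming the Prefect 0.x API; the method names exist in that API, but treat the shapes as illustrative:

```python
import cloudpickle

# Metadata representation: a JSON-able dict describing the flow's structure
# (tasks, edges, storage, environment). Registration sends this to the
# backend, and the orchestration layer reasons only about this.
serialized = flow.serialize(build=False)

# Actual flow contents: the pickled flow object that the run side loads
# from storage and executes. The backend never sees these bytes.
flow_bytes = cloudpickle.dumps(flow)
```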

Adithya Ramanathan

11/17/2020, 3:59 PM
Ahhh got it, that makes a lot of sense! Thank you for that background!
Sorry, @josh, one last question: in that scenario, where the registered flow serialization is compared against the pickle file, who does the validation? The agent, or the compute resource in the cluster that the agent has shipped the work to?

josh

11/17/2020, 4:12 PM
The compute resource that actually runs the flow. The agent only cares about some of the information like environment, storage, flow run info, etc.

Adithya Ramanathan

11/17/2020, 4:12 PM
Sweet - thanks again!
Or actually, in our scenario we are running in K8s with a Kubernetes agent but using S3 storage for the flow, and I was trying to better understand where the retrieval of the flow happens. The edit you made to your response confused me a little bit, since you said the agent cares about storage?

josh

11/17/2020, 4:16 PM
Ah haha, ignore my edit. The agent cares about some of the metadata surrounding the storage and environment. For example, if I store my flow in Docker storage, then the k8s agent cares about the image repo/name/tag in order to create the Prefect k8s job. When storing it in S3, the k8s agent cares about the image in the environment metadata.
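A rough sketch of that setup, assuming the Prefect 0.x modules from this era; the bucket, image, and project names are hypothetical, and exact import paths and the metadata key the agent reads may vary by version:

```python
# Hedged sketch: S3 storage for the flow contents, with the image the k8s
# agent should run supplied via the environment's metadata.
from prefect.environments import LocalEnvironment
from prefect.environments.storage import S3

flow.storage = S3(bucket="my-flow-bucket")  # hypothetical bucket
flow.environment = LocalEnvironment(
    metadata={"image": "myrepo/my-flow:latest"}  # hypothetical image
)
flow.register(project_name="my-project")  # hypothetical project
```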

Adithya Ramanathan

11/17/2020, 4:17 PM
Cool - makes sense! Thanks for all your help!