# ask-community
Michael Hadorn
Hi :) I have a question about the flow hash. When I change the signature of a method I use in a task, the hash of the flow does not change. This results in the flow not being re-registered when I use it with S3. More details are in the thread.
This example generates the same hash even if I swap in the commented lines (other_function with and without a parameter):
import prefect
from prefect import Flow, task


@task
def say(text):
    logger = prefect.context['logger']
    rof = other_function()
    # rof = other_function('t')
    <http://logger.info|logger.info>(f"text: {text} and rof: {rof}")


def other_function():
# def other_function(my_p):
    return 'any_string'


with Flow(
    "Test function signature hash"
) as flow:
    t1 = say('my_text')


if __name__ == '__main__':
    flow_hash = flow.serialized_hash()
    print(flow_hash)
It goes a little in the direction of this thread: https://prefect-community.slack.com/archives/CL09KU1K7/p1634800071022600 but it's not the same. I think method signatures should be included in the normal flow hash. Or do I have to calculate it myself?
Anna Geller
@Michael Hadorn a tricky question. I think you need to differentiate here between versioning and storage. When you register your flow and build your flow storage at registration time (the default is build=True, i.e. flow.register(project_name="your_project", build=True)), all the changes you made to the flow are reflected in the storage, so your flow runs as its most up-to-date "version". When it comes to versioning, however, the version is by default incremented only if there is a change in the flow's metadata and structure, i.e. new tasks, edges, or changes in storage, run configuration, or schedule. If you look at your flow, changing the parameters of a method signature doesn't change anything in the flow structure (the with Flow() as flow: block), which is why you see the same hash. So you would have to calculate a hash on your inputs yourself to change this behavior.
Michael Hadorn
@Anna Geller Ok, thanks a lot for your clarification and the really fast response (and I didn't know about the build flag in register()). What confuses me is that my flow is in a Docker image (docker run). There I changed this method. I understand that register will not add a new version on the Prefect server. But when I run the flow, it raises an exception at the line of the method call complaining about the wrong number of parameters. For me that shows that this flow data (incl. code of the tasks) is in the storage, so it should be reflected by the hash. Or why do we not hash the full pickled storage? Or, my question the other way round: for which logic in Prefect is my real Python code executed (inside the Docker image), and which parts can be restored from the storage? If my flow code is not needed there, I could remove the files which build the flow from the image and only add those which are used for my task execution.
Anna Geller
Regarding this:
But when I run the flow, it raises an exception at the line of the method call complaining about the wrong number of parameters.
The exception is raised at runtime based on runtime conditions. The flow's structure and its metadata are registered at build time.
For me that shows that this flow data (incl. code of the tasks) is in the storage.
Yes, if you use the default Docker storage (did I understand correctly, you use Docker storage?), then the flow’s Docker image gets built every time you register a flow, regardless of whether it would result in a new version or not, i.e. even if nothing changed in your flow, the storage still gets built.
Or why do we not hash the full pickled storage?
We hash the serialized flow, not the storage. Hashing pickled storage would be difficult, especially because there are so many different storage options and because many users use script storage rather than pickle storage.
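To make that concrete (rough illustration, assuming Prefect 1.x): the hash is derived from the flow's serialized metadata, which describes tasks and edges but not the tasks' Python source, which is why a signature change leaves it untouched:
serialized = flow.serialize()     # plain dict: name, tasks, edges, storage, run config, ...
print(sorted(serialized.keys()))  # no task source code in here
print(flow.serialized_hash())     # stable as long as that metadata is stable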
For which logic in prefect is my real python code executed (inside the docker image)?
It depends on how you defined your Docker storage:
• this example uses the default behavior, where the flow gets pickled and baked into a Docker image that gets built every time during registration: https://github.com/anna-geller/packaging-prefect-flows/blob/master/flows/docker_pickle_docker_run_local_image.py
• this example uses script storage without building the flow's Docker image at registration; here the user copied all flows directly into the image beforehand and only points to the path inside the image where the flow file is stored: https://github.com/anna-geller/packaging-prefect-flows/blob/master/flows_no_build/docker_script_docker_run_local_image.py
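As a rough sketch of the difference (registry URL, image name, and path are placeholders, not taken from the linked repos):
from prefect.storage import Docker

# 1) Pickle-based (default): the flow object is cloudpickled and baked into
#    an image that is built at registration time.
pickle_storage = Docker(
    registry_url="your-registry.example.com",
    image_name="flows",
)

# 2) Script-based: the image already contains the flow file; Prefect only
#    records the path and re-imports the script at runtime, so no image build
#    is needed at registration (register with build=False).
script_storage = Docker(
    registry_url="your-registry.example.com",
    image_name="flows",
    stored_as_script=True,
    path="/opt/prefect/flows/my_flow.py",
)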
Michael Hadorn
@Anna Geller Thanks a lot again! We do not use the Docker storage, we use a docker run with S3 storage. All my problems explained above happen because of S3, sorry I did not communicate that clearly... 😕 Maybe I should switch to the docker script solution (from your repo). I guess my problem would then be solved, because nothing is pickled (and potentially not up to date).
Anna Geller
Yes, script storage would probably help here. But you can use script storage with S3 as well, here is an example: https://github.com/anna-geller/packaging-prefect-flows/blob/master/flows/s3_kubernetes_run.py#L11
Let's analyze it a bit:
1. Does your flow script change frequently?
2. Do your flow dependencies (custom modules or third-party Python packages) change frequently?
If only #1 is true, then S3 storage is perfectly fine. But if #2 is true, perhaps it's easier to switch to Docker storage so that you can build and push a Docker image every time you register.
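A minimal sketch of that S3 script-storage setup (bucket, key, paths, and image name are placeholders; based on the pattern in the linked example, assuming Prefect 1.x):
from prefect.storage import S3
from prefect.run_configs import DockerRun

# The flow *file* is uploaded to S3 at registration and re-imported from there
# at runtime, so nothing is pickled (and nothing can go stale in the image).
flow.storage = S3(
    bucket="your-flow-bucket",
    key="flows/test_function_signature_hash.py",
    stored_as_script=True,
    local_script_path="flows/test_function_signature_hash.py",
)
flow.run_config = DockerRun(image="your-image:latest")  # image with your task dependencies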
Michael Hadorn
@Anna Geller Ah well, I did not manage to run S3 with script storage, because I couldn't set a valid local_script_path. Now I see how it should be done in your script. And yes, you're absolutely right: in our case I should go with #2. I'll have to test this today. Really, thank you very much for the kind support!!! 😄 It's much, much better than other services we even pay for. Prefect (and its community) is awesome!
🙌 1