Billy McMonagle

    11 months ago
    I am trying to understand the correct way to separate my registration process from my flow code. I'm using S3 storage and KubernetesRun, with a custom docker image. Because the flow dependencies are not available in my build environment, I feel like I'm in a catch-22 situation... my register script is failing on import errors for modules that are not required during the build/register process at all. Could you please offer some conceptual advice here? Happy to share code, although I'm not sure which parts are pertinent.
    The other side of the catch-22 is that I'm trying to keep the register code separate from the flow code so that it can be reused for a bunch of flows (they all have common storage and run configs).
    Here is a simplified project structure. I have directories for each flow (e.g. `test_flow`), each containing a Dockerfile, a `requirements.txt` file, and the flow itself. At the top level is the shared registration code, `register.py`.
    register.py:
    from test_flow import myflow
    
    PROJECT_NAME = "my-project"
    
    if __name__ == "__main__":
        print(f"Registering flow {myflow.name}")
        myflow.register(
            project_name=PROJECT_NAME, labels=["my-label"],
        )
    test_flow/myflow.py:
    from prefect import Flow
    from prefect.utilities.tasks import task
    
    import pandas
    
    
    @task
    def my_task(arg):
        # pretend like this does something with pandas
        print(f"my arg is: {arg}")
    
    
    with Flow("My Test Flow") as flow:
        my_task(arg="foo")
        my_task(arg="bar")
    We can ignore the Dockerfile and requirements.txt; they don't matter. But I do NOT want to have to install requirements.txt in my build environment (AWS CodeBuild, if it matters), because flows can potentially have conflicting dependencies. (Not shown here, but the reasons for using custom docker images are a private package repo plus some flow-specific non-Python stuff like .sql files.)
    Running `register.py` yields the following result:
    python register.py
    Traceback (most recent call last):
      File "register.py", line 1, in <module>
        from test_flow import myflow
      File "[...]/flows/test_flow/myflow.py", line 4, in <module>
        import pandas
    ModuleNotFoundError: No module named 'pandas'
    One possible solution I can think of would be to try to do the registration from within the Dockerfile itself. This seems doable to me, but I'd like to know if it's considered a good or bad practice. I believe I'd need to switch to GitHub storage, because I can't put AWS auth inside the Dockerfile.
    Anyway hope that's enough to get started, thanks in advance 🙏
    Kevin Kho

    11 months ago
    Hey @Billy McMonagle, I think you have two options you can try here. The first is to put the `import` statements inside the tasks. This defers the imports until execution, so you can register without them. The second option is to store your flow as a script (think S3 or GitHub storage, where there is no serialization). I know you are using Docker storage, but you can still set `stored_as_script=True` inside a Docker storage, and this won't serialize the Python script, so those dependencies won't be needed at build time. As to whether it's good practice or not: a lot of users defer their imports because they have CI/CD pipelines with a specified build image, and that pipeline may not have the requirements while the execution environment does, so some people do indeed do this. Does that help?
    emre

    11 months ago
    Hey, Kevin's suggestions seem like the simplest ones to convert to. However, I've used your register-in-Docker idea a couple of times successfully. Here's my 2c:
    • You can serialize your flow inside a docker container, and have it available in AWS CodeBuild if you use a docker volume. From there, uploading to S3 is trivial.
    • Register could also be run from inside the docker container, after the serialized flow is stored in S3.
    • Your register code would have to somehow be in your docker image, but you could make your registration logic a pip-installable library and separate them logically, at least.
    • Since the docker image ends up having all these responsibilities, I usually implement a simple CLI in my docker image to handle these requests.
    Billy McMonagle

    11 months ago
    Thanks @Kevin Kho! I agree that doing `import` inside tasks would work (feels weird, but probably fine). I don't think `stored_as_script=True` solves my problem on its own, because I still have to import the flow object, which would execute the `import` statements in the flow definition file. Maybe you just meant this is a useful option in conjunction with the aforementioned "defer imports" advice.
    @emre Thanks for your thoughts. I'm not sure how to mount docker volumes inside CodeBuild, but I can look into this. I agree that a pip-installable registration library would be nice; I was thinking maybe I would put it in a base image that all of my flow images inherit from. It seems to me that register would run inside the docker container: I could set the S3 storage `key` but not `local_script_path`, meaning that I would upload the file myself rather than let the register script do it (I will not make authenticated AWS SDK calls from inside the docker image as it is being built).
    Curious if you are willing to say more about the register docker CLI you've implemented, that sounds nice.
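What Billy describes here, sketched against Prefect 1.x's S3 storage API (the bucket name and key are illustrative, not from the thread): setting `key` tells Prefect where the script already lives, and leaving `local_script_path` unset means `register()` never tries to upload it, so the pipeline can push the file with its own AWS credentials.

```python
from prefect.storage import S3

# Sketch only: the flow script is uploaded to S3 by the build pipeline,
# outside of Prefect. Storage just points at where it will end up.
flow.storage = S3(
    bucket="my-flow-bucket",            # illustrative bucket name
    key="flows/test_flow/myflow.py",    # where the pipeline uploads the script
    stored_as_script=True,              # no cloudpickle serialization
    # local_script_path deliberately unset: register() does no upload
)
flow.register(project_name="my-project")
```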
    emre

    11 months ago
    It's as simple as it gets, I just have a basic CLI: `main.py serialize` and `main.py register`, which call flow.serialize and flow.register, with almost nothing in terms of configuration. It's just a simple way to make the most out of the docker image that I know can set up my flow, without any of the issues you are facing.
    Billy McMonagle

    11 months ago
    OK nice!
    Yeah I spoke with a devops colleague today and talked through these ideas and I think we are moving 🚀
    Follow up on the above... I've got the basics working now. I'm working with a fully custom docker image, and I ended up copying not only the dependencies but the flow itself into the image so that the flow could be registered at build time. At this point, I decided that S3 storage was not providing any actual utility, so I switched to docker storage. This seems to be working well.
    The end of my Dockerfile now looks like this:
    RUN prefect auth login --key $PREFECT_TOKEN
    RUN python register.py --flow $APP_HOME/flows/myflow.py
    And my storage looks like this (from `register.py`):
    Docker(
        path=flow_path,
        stored_as_script=True,
        image_name="$accountid.dkr.ecr.$region.amazonaws.com/$image",
        image_tag="$tag",
    )
    Kevin Kho

    11 months ago
    Sounds good!
    Billy McMonagle

    11 months ago
    Thanks @Kevin Kho! Was hoping to verify that nothing looks too unorthodox.
    One thing I found strange... it seems like I have to specify the image/tag in both the docker storage and the kubernetes run configuration, which seems like it should be unnecessary.
    Kevin Kho

    11 months ago
    What happens if you don’t specify it?
    Billy McMonagle

    11 months ago
    Docker storage is missing required fields image_name and image_tag
    Interestingly, I think I also have to do `flow.storage.add_flow(flow)`, or else I get this error:
    Failed to load and execute Flow's environment: ValueError('Flow is not contained in this Storage')
    This is because I pass `build=False`, since the image is already built.