# ask-community
b
I am trying to understand the correct way to separate my registration process from my flow code. I'm using S3 storage and KubernetesRun, with a custom docker image. Because the flow dependencies are not available in my build environment, I feel like I'm in a catch-22 situation... my register script is failing on import errors for modules that are not required during the build/register process at all. Could you please offer some conceptual advice here? Happy to share code, although I'm not sure which parts are pertinent.
The other side of the catch-22 is that I'm trying to keep the register code separate from the flow code so that it can be reused for a bunch of flows (they all have common storage and run configs).
Here is a simplified project structure. I have directories for each flow (e.g. test_flow), each containing a Dockerfile, a requirements.txt file, and the flow itself. At the top level is the shared registration code, register.py.
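Roughly like this (the tree is illustrative, reconstructed from the description above):
Copy code
flows/
├── register.py
└── test_flow/
    ├── Dockerfile
    ├── requirements.txt
    └── myflow.py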
register.py:
Copy code
from test_flow import myflow

PROJECT_NAME = "my-project"

if __name__ == "__main__":
    print(f"Registering flow {myflow.name}")
    myflow.register(
        project_name=PROJECT_NAME, labels=["my-label"],
    )
test_flow/myflow.py:
Copy code
from prefect import Flow
from prefect.utilities.tasks import task

import pandas


@task
def my_task(arg):
    # pretend like this does something with pandas
    print(f"my arg is: {arg}")


with Flow("My Test Flow") as flow:
    my_task(arg="foo")
    my_task(arg="bar")
We can ignore the Dockerfile and requirements.txt; they don't matter here. But I do NOT want to have to install requirements.txt in my build environment (AWS CodeBuild, if it matters), because flows can potentially have conflicting dependencies. (Not shown here, but the reasons for using custom docker images are a private package repo plus some flow-specific non-Python stuff like .sql files.)
Running register.py yields the following result...
Copy code
❯ python register.py
Traceback (most recent call last):
  File "register.py", line 1, in <module>
    from test_flow import myflow
  File "[...]/flows/test_flow/myflow.py", line 4, in <module>
    import pandas
ModuleNotFoundError: No module named 'pandas'
One possible solution I can think of would be to do the registration from within the Dockerfile itself. This seems doable to me, but I'd like to know whether it's considered good or bad practice. I believe I'd need to switch to GitHub storage because I can't put AWS auth inside the Dockerfile.
Anyway hope that's enough to get started, thanks in advance 🙏
k
Hey @Billy McMonagle, I think you have two options you can try here. The first is to put the import statements inside the tasks. This defers the imports until execution, so you can register without them. The second option is to store your flow as a script (think S3 or GitHub storage, where there is no serialization). I know you are using Docker storage, but you can still set stored_as_script=True inside a Docker storage, and then the Python script might not be serialized, so those dependencies won't be needed during build time. As to good practice or not, a lot of users defer their imports because they have CI/CD pipelines with a specified build image, and that pipeline may not have the requirements while the execution environment does, so some people do indeed do this. Does that help?
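For example, a minimal sketch of the deferred-import idea, reusing the flow above (assuming Prefect 1.x):
Copy code
from prefect import Flow, task


@task
def my_task(arg):
    # deferred import: pandas is only imported when the task actually runs
    # in the execution environment, not when register.py imports this file
    import pandas  # noqa: F401  (pretend the task uses it)

    print(f"my arg is: {arg}")


with Flow("My Test Flow") as flow:
    my_task(arg="foo")
    my_task(arg="bar")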
👍 1
upvote 2
e
Hey, Kevin's suggestions seem like the simplest ones to convert to. However, I've used your register-in-Docker idea a couple of times successfully. Here's my 2c:
• You can serialize your flow inside a docker container and have it available in AWS CodeBuild if you use a docker volume. From there, uploading to S3 is trivial (rough sketch below).
• Register could also be run from inside the docker container, after the serialized flow is stored in S3.
• Your register code would have to somehow be in your docker image, but you could make your registration logic a pip-installable library and at least separate them logically.
• Since the docker image ends up having all these responsibilities, I usually implement a simple CLI in my docker image to handle these requests.
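For the first bullet, a rough sketch of what serializing inside the container could look like (paths and filenames are hypothetical; flow.serialize() returns the flow's metadata as a dict in Prefect 1.x, and the exact artifact you upload to S3 may differ):
Copy code
# serialize_flow.py -- run inside the flow's image with a volume mounted, e.g.
#   docker run -v "$PWD/out:/out" my-flow-image python serialize_flow.py
import json

from test_flow.myflow import flow

# dump the serialized flow metadata onto the mounted volume so the build
# environment (e.g. CodeBuild) can pick it up and upload it to S3
with open("/out/myflow-serialized.json", "w") as f:
    json.dump(flow.serialize(), f)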
upvote 1
👍 1
b
Thanks @Kevin Kho! I agree that doing import inside tasks would work (feels weird, but probably fine). I don't think stored_as_script=True solves my problem, because I still have to import the flow object, which obviously would execute the import statements in the flow definition file. Maybe you just meant this is a useful option in conjunction with the aforementioned "defer imports" advice.
@emre Thanks for your thoughts. I'm not sure how to mount docker volumes inside CodeBuild, but I can look into this. I agree that a pip-installable registration library would be nice; I was thinking maybe I would put it in a base image that all of my flow images inherit from. It seems to me that register would run inside the docker container: I could set the S3 storage key but not set the local_script_path, meaning that I would simply upload the file myself rather than allow the register script to do it (I will not make authenticated AWS SDK calls from inside the docker image as it is being built).
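In other words, something along these lines (a sketch; the bucket and key values are made up, and local_script_path is deliberately left unset):
Copy code
from prefect.storage import S3

flow.storage = S3(
    bucket="my-flow-bucket",           # hypothetical bucket
    key="flows/test_flow/myflow.py",   # hypothetical key; file uploaded separately
    stored_as_script=True,
    # local_script_path omitted on purpose: register() then won't try to
    # upload the script; we push it to S3 ourselves outside the image build
)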
👍 1
Curious if you are willing to say any more about the register docker CLI you've implemented, that sounds nice.
e
It's as simple as it gets, I just have a basic CLI: main.py serialize and main.py register, which call flow.serialize and flow.register, with almost nothing in terms of configuration. It's just a simple way to make the most out of the docker image that I know can set up my flow, without any of the issues you are facing.
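Presumably something like this (a bare-bones sketch; the import path and project name are placeholders):
Copy code
# main.py -- tiny CLI baked into the flow's Docker image
import argparse
import json

from test_flow.myflow import flow  # placeholder import path

parser = argparse.ArgumentParser()
parser.add_argument("command", choices=["serialize", "register"])
parser.add_argument("--project", default="my-project")
args = parser.parse_args()

if args.command == "serialize":
    # print the serialized flow metadata (Prefect 1.x)
    print(json.dumps(flow.serialize()))
else:
    # register the flow against the given project
    flow.register(project_name=args.project)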
b
OK nice!
Yeah I spoke with a devops colleague today and talked through these ideas and I think we are moving 🚀
🦜 1
Follow up on the above... I've got the basics working now. I'm working with a fully custom docker image, and I ended up copying not only the dependencies but the flow itself into the image so that the flow could be registered at build time. At this point, I decided that S3 storage was not providing any actual utility, so I switched to docker storage. This seems to be working well.
👍 1
The end of my Dockerfile now looks like this:
Copy code
RUN prefect auth login --key $PREFECT_TOKEN
RUN python register.py --flow $APP_HOME/flows/myflow.py
And my storage looks like this (from register.py):
Copy code
Docker(
    path=flow_path,
    stored_as_script=True,
    image_name="$accountid.dkr.ecr.$region.amazonaws.com/$image",
    image_tag="$tag",
)
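Presumably register.py loads the flow object from the path passed via --flow before attaching this storage. A sketch of how that might look (the actual register.py isn't shown; prefect.utilities.storage.extract_flow_from_file is the Prefect 1.x helper for loading a flow from a script):
Copy code
import argparse

from prefect.storage import Docker
from prefect.utilities.storage import extract_flow_from_file

parser = argparse.ArgumentParser()
parser.add_argument("--flow", dest="flow_path", required=True)
args = parser.parse_args()

# execute the flow script and pull the Flow object out of it
flow = extract_flow_from_file(args.flow_path)

flow.storage = Docker(
    path=args.flow_path,          # path of the script inside the image
    stored_as_script=True,
    image_name="$accountid.dkr.ecr.$region.amazonaws.com/$image",  # placeholder
    image_tag="$tag",                                              # placeholder
)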
k
Sounds good!
b
Thanks @Kevin Kho! Was hoping to verify that nothing looks too unorthodox.
One thing I found strange... it seems like I have to specify the image/tag in both the docker storage and the kubernetes run configuration, which seems like it should be unnecessary.
k
What happens if you don’t specify it?
b
Docker storage is missing required fields image_name and image_tag
Interestingly, I think I also have to do flow.storage.add_flow(flow), or else I get this error:
Failed to load and execute Flow's environment: ValueError('Flow is not contained in this Storage')
This is because I pass build=False, since the image is already built.
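Putting those last two details together, the end of such a register script might look roughly like this (a sketch, continuing from a flow that already has the Docker storage above attached; image values are placeholders):
Copy code
from prefect.run_configs import KubernetesRun

# the image currently has to be repeated on the run config as well as the storage
flow.run_config = KubernetesRun(
    image="$accountid.dkr.ecr.$region.amazonaws.com/$image:$tag"
)

# with build=False the storage build step never runs, so the flow has to be
# added to the storage explicitly or runs fail with
# "Flow is not contained in this Storage"
flow.storage.add_flow(flow)

flow.register(
    project_name="my-project",
    labels=["my-label"],
    build=False,  # the image was already built by the Dockerfile
)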