Billy McMonagle

    11 months ago
    I am trying to understand the correct way to separate my registration process from my flow code. I'm using S3 storage and KubernetesRun, with a custom docker image. Because the flow dependencies are not available in my build environment, I feel like I'm in a catch-22 situation... my register script is failing on import errors for modules that are not required during the build/register process at all. Could you please offer some conceptual advice here? Happy to share code, although I'm not sure which parts are pertinent.
    The other side of the catch-22 is that I'm trying to keep the register code separate from the flow code so that it can be reused for a bunch of flows (they all have common storage and run configs).
    Here is a simplified project structure. I have directories for each flow (e.g. `test_flow`), each containing a Dockerfile, a `requirements.txt` file, and the flow itself. At the top level is the shared registration code, `register.py`.
    register.py:
    from test_flow import myflow
    
    PROJECT_NAME = "my-project"
    
    if __name__ == "__main__":
        print(f"Registering flow {myflow.name}")
        myflow.register(
            project_name=PROJECT_NAME, labels=["my-label"],
        )
    test_flow/myflow.py:
    from prefect import Flow
    from prefect.utilities.tasks import task
    
    import pandas
    
    
    @task
    def my_task(arg):
        # pretend like this does something with pandas
        print(f"my arg is: {arg}")
    
    
    with Flow("My Test Flow") as flow:
        my_task(arg="foo")
        my_task(arg="bar")
    We can ignore the Dockerfile and requirements.txt; they don't matter. But I do NOT want to have to install requirements.txt in my build environment (AWS CodeBuild, if it matters), because flows can potentially have conflicting dependencies. (Not shown here, but the reasons for using custom docker images are a private package repo plus some flow-specific non-Python stuff like .sql files.)
    Running `register.py` yields the following result:
    python register.py
    Traceback (most recent call last):
      File "register.py", line 1, in <module>
        from test_flow import myflow
      File "[...]/flows/test_flow/myflow.py", line 4, in <module>
        import pandas
    ModuleNotFoundError: No module named 'pandas'
    One possible solution I can think of would be to try to do the registration from within the Dockerfile itself. This seems doable to me, but I'd like to know if it's considered a good or bad practice. I believe I'd need to switch to GitHub storage, because I can't put AWS auth inside the Dockerfile.
    Anyway hope that's enough to get started, thanks in advance 🙏
    Kevin Kho

    11 months ago
    Hey @Billy McMonagle, I think you have two options you can try here. The first is to put the `import` statements inside the tasks. This defers the imports until execution, so you can register without them. The second option is to store your flow as a script (think S3 or GitHub storage, where there is no serialization). I know you are using Docker storage, but you can still set `stored_as_script=True` inside a Docker storage, and this won't serialize the Python script, so those dependencies won't be needed at build time. As to whether it's good practice or not: a lot of users defer their imports because they have CI/CD pipelines with a specified build image, and that pipeline may not have the requirements while the execution environment does, so some people do indeed do this. Does that help?
    emre

    11 months ago
    Hey, Kevin's suggestions seem like the simplest ones to convert to. However, I've used your register-in-Docker idea a couple of times successfully. Here's my 2c:
    • You can serialize your flow inside a docker container, and have it available in AWS CodeBuild if you use a docker volume. From there, uploading to S3 is trivial.
    • Register could also be run from inside the docker container, after the serialized flow is stored in S3.
    • Your register code would have to somehow be in your docker image, but you could make your registration logic a pip-installable library and separate them logically, at least.
    • Since the docker image ends up having all these responsibilities, I usually implement a simple CLI in my docker image to handle these requests.
    Billy McMonagle

    11 months ago
    Thanks @Kevin Kho! I agree that doing `import` inside tasks would work (feels weird, but probably fine). I don't think `stored_as_script=True` solves my problem on its own, because I still have to import the flow object, which would execute the `import` statements in the flow definition file. Maybe you just meant this is a useful option in conjunction with the aforementioned "defer imports" advice.
    @emre Thanks for your thoughts. I'm not sure how to mount docker volumes inside CodeBuild, but I can look into this. I agree that a pip-installable registration library would be nice; I was thinking maybe I would put it in a base image that all of my flow images inherit from. It seems to me that register would run inside the docker container: I could set the S3 storage `key` but not `local_script_path`, meaning that I would upload the file myself rather than let the register script do it (I will not make authenticated AWS SDK calls from inside the docker image as it is being built).
    Curious if you are willing to say more about the register docker CLI you've implemented, that sounds nice.
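What Billy describes here, sketched against Prefect 1.x's S3 storage API (the bucket name and key are illustrative, not from the thread): setting `key` tells Prefect where the script already lives, and leaving `local_script_path` unset means `register()` never tries to upload it, so the pipeline can push the file with its own AWS credentials.

```python
from prefect.storage import S3

# Sketch only: the flow script is uploaded to S3 by the build pipeline,
# outside of Prefect. Storage just points at where it will end up.
flow.storage = S3(
    bucket="my-flow-bucket",            # illustrative bucket name
    key="flows/test_flow/myflow.py",    # where the pipeline uploads the script
    stored_as_script=True,              # no cloudpickle serialization
    # local_script_path deliberately unset: register() does no upload
)
flow.register(project_name="my-project")
```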
    emre

    11 months ago
    It's as simple as it gets, I just have a basic CLI: `main.py serialize` and `main.py register`, which call flow.serialize and flow.register, with almost nothing in terms of configuration. It's just a simple way to make the most out of the docker image that I know can set up my flow, without any of the issues you are facing.
    Billy McMonagle

    11 months ago
    OK nice!
    Yeah I spoke with a devops colleague today and talked through these ideas and I think we are moving 🚀
    Follow up on the above... I've got the basics working now. I'm working with a fully custom docker image, and I ended up copying not only the dependencies but the flow itself into the image so that the flow could be registered at build time. At this point, I decided that S3 storage was not providing any actual utility, so I switched to docker storage. This seems to be working well.
    The end of my Dockerfile now looks like this:
    RUN prefect auth login --key $PREFECT_TOKEN
    RUN python register.py --flow $APP_HOME/flows/myflow.py
    And my storage looks like this (from `register.py`):
    Docker(
        path=flow_path,
        stored_as_script=True,
        image_name="$accountid.dkr.ecr.$region.amazonaws.com/$image",
        image_tag="$tag",
    )
    Kevin Kho

    11 months ago
    Sounds good!
    Billy McMonagle

    11 months ago
    Thanks @Kevin Kho! Was hoping to verify that nothing looks too unorthodox.
    One thing I found strange... it seems like I have to specify the image/tag in both the docker storage and the kubernetes run configuration, which seems like it should be unnecessary.
    Kevin Kho

    11 months ago
    What happens if you don’t specify it?
    Billy McMonagle

    11 months ago
    Docker storage is missing required fields image_name and image_tag
    Interestingly, I think I also have to do `flow.storage.add_flow(flow)`, or else I get this error:
    Failed to load and execute Flow's environment: ValueError('Flow is not contained in this Storage')
    This is because I pass `build=False`, since the image is already built.