# ask-community
Billy McMonagle
Good morning, I have a question about flow registration. I'd like to register my flows during CI/CD (specifically, AWS CodeBuild). My issue is that I have run into import errors, because my flow dependencies are not installed in the build environment. I am using `S3Storage` and `KubernetesRun`. Thanks in advance for any guidance.
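(For context, here is a minimal sketch of the kind of setup described here, assuming Prefect 1.x with storage from `prefect.storage` and the run config from `prefect.run_configs`; the bucket and image names below are placeholders.)

```python
# Sketch only: a flow whose code lives in S3 and which runs on Kubernetes.
# "my-flow-bucket" and the image name are placeholders, not from the thread.
from prefect import Flow, task
from prefect.run_configs import KubernetesRun
from prefect.storage import S3

@task
def say_hello():
    print("hello")

with Flow("example-flow") as flow:
    say_hello()

# The flow code is uploaded to S3 at registration time; the run config only
# names an image, which must already contain the flow's dependencies.
flow.storage = S3(bucket="my-flow-bucket")
flow.run_config = KubernetesRun(image="my-registry/example-flow:latest")
```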
I can share some code, but I think this is a pretty general problem. I can think of three solutions, but there may be more: (1) Install dependencies in the build environment: I do not think this should be necessary, and different flows have different dependencies which may not be compatible. (2) Use something like the `extract_flow_from_file` utility function, but add a hook of some kind to remove the `import` statements. I haven't tried this; it seems like it should work, but it also feels like a hack. (3) Put the flow registration commands in the Dockerfile for each flow.
Jim Crist-Harif
I'd avoid option 2 if possible (it would work for script-based storage but not pickle-based storage, and is a bit hacky). When registering a flow we currently need to have the full flow structure, so we do need to actually import your flow. If option 1 is too tricky for you, then executing the registration inside the same Docker image they'll run in seems like a good option.
Sean Talia
I'm doing something that's been working pretty nicely for my org thus far; we're using GitHub Actions to register our flows (using S3Storage, DockerRun config) to our cloud instance. For every flow that I want to register, I have a service in a `docker-compose.yml` that looks like this:
Copy code
example_flow:
  image: <EXAMPLE_FLOW_DOCKER_RUN_CONFIG_IMAGE>:latest
  volumes:
    - "./flows/example_flow/flow.py:/app/flow.py"
    - "./register_flow.sh:/app/register_flow.sh"
  command: "./register_flow.sh"
my GitHub Action is looking for changes in the `flows/example_flow/` path (i.e. where the flow code is hosted), and when it sees one, it pulls this compose service and then runs it. I use GitHub secrets in my workflow step to pass my cloud token as an environment variable into the container, and then it's all good to go
👍 2
Billy McMonagle
OK great, thanks @Jim Crist-Harif. I had been building "toy" flows and focusing on the CI/CD process; it looks like I should have built some "real" flows sooner so I could run into and resolve this problem. I got frustrated last night and thought up solution (3) this morning, so I'm glad you think it sounds reasonable.
Sean Talia
The fact that I'm using the run config image as the compose service's image is key, because by definition it has all the flow's dependencies pre-installed in it
Billy McMonagle
That's clever @Sean Talia, I'm going to look into this and see if it works with my setup. Thank you.
👍 1
Sean Talia
so I don't need to worry about any extra considerations at all, I just need to make sure I get the flow code and the appropriate Prefect tokens mounted into it when the service is spun up... and then that `register_flow.sh` script is incredibly lightweight, all it does is `prefect auth login --token $PREFECT_TENANT_TOKEN` and then calls the python script that registers the flow
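(As an aside, a minimal sketch of the Python half of such a script, assuming, as in the compose service above, that the flow module is mounted at `/app/flow.py` and exposes a module-level `flow` object with its storage and run config already attached; the project name is a placeholder.)

```python
# register.py: sketch of the registration step invoked by register_flow.sh
# after `prefect auth login`. Assumes /app/flow.py defines a module-level
# `flow` object; "my-project" is a placeholder project name.
from flow import flow  # i.e. /app/flow.py, mounted by the compose service

flow.register(
    project_name="my-project",
    # Optional in Prefect 1.x: reuse the existing version when nothing
    # about the flow has actually changed.
    idempotency_key=flow.serialized_hash(),
)
```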
Billy McMonagle
Yeah that makes sense.
I'm trying to think through how you are telling the registration service where to look for flows.
Sean Talia
if I'm interpreting your question correctly (sorry, I should have mentioned this), the first step of my GA workflow is using the publicly available `actions/checkout@v2` action, so this makes it incredibly easy because it's actually just checking out my entire repository onto the host that's been provisioned to run the GA workflow
Billy McMonagle
Gotcha. I've never used github actions for anything non-trivial, but I think I understand what you mean. Watching for code changes on a specific path would let you do that.
👍 1
Sean Talia
so when it checks out my whole repo and it comes time to register the flow, I've got my flow code, my `register_flow.sh` script, and my `docker-compose.yml` all available on the machine, and that's all that's needed
Billy McMonagle
Guessing this could be done from the CI/CD itself. I don't necessarily want to give GitHub permission to pull my images from ECR.
Sean Talia
yeah setting this up was actually my first foray into using github actions for anything non-trivial but it's turned out to be very easy to work with 😄
ahhh well I didn't want to delve into unnecessary detail, but I didn't want to do that either, so we're using a self-hosted GA runner (i.e. an EC2 instance that we own, running the GitHub runner application and running workflows that are associated with our repo)
and I've given that EC2 instance an instance profile w/ policies that will let it talk to my container registry, S3 buckets, etc. etc.
Billy McMonagle
oooo interesting. TIL you can do that with GA.
(And I totally understand, there's a ton of detail you have to leave out in order to communicate clearly).
Sean Talia
I am by no means a GA, Prefect, or AWS expert; I'm just wading my way through the muck like many others out there and I've stumbled on something that seems to be working well
👍 1
Billy McMonagle
Ha yeah, I understand.
Jim Crist-Harif
Curious, what's your experience with self-hosting a GA runner? I have a few related setups that might benefit from this.
Sean Talia
I've largely had a good experience with it so far – my one gripe is that the GA runner application by default will auto-update itself to whatever the newest version of the runner is; I think it will wait about a week after the release, and if it notices that it's still running an older version, it will shut down your actively running GA runner application, fetch the latest, install it, and bring up the newest version
there are a couple of threads where a lot of people complain about this behavior, and there doesn't seem to be an easy way to turn it off: https://github.com/actions/runner/issues/485
we're using it for now because I think it still is a net positive (it definitely beats having to use a GitHub-provisioned server and then needing to pass container registry + Prefect credentials to it), but I am open to the possibility of my org saying that the auto-update thing is a no-go
Jim Crist-Harif
Thanks, this is all useful info.
Billy McMonagle
@Sean Talia aside from GA, what CI/CD are you using?
Sean Talia
you mean specifically for this use case with prefect? or just in general
Billy McMonagle
if they are different, I'm interested to know why... but I do mean with Prefect, yea.
Sean Talia
a lot of the rest of my org uses Jenkins for CI/CD stuff – our infra/devops team is pushing to get off of it and use GitHub Actions for everything. I don't think I'd be able to articulate very well why they made that decision, since it's not really my wheelhouse, but the fact that someone like me can do some non-trivial stuff pretty easily might be evidence in and of itself of why that's the direction they want to go in? My specific team (data engineering) has basically moved exclusively to using GA though – I don't think I've logged into our Jenkins UI in a good 4-5 months
but in this use case there aren't any additional tools I'm using specifically for the CI/CD stuff w/ Prefect... there's still some manual setup going on (e.g. I manually provisioned the EC2 instance that the GitHub Actions runner is installed on using Terraform and manually configured it using Ansible before it was ready to go)
Billy McMonagle
that is very interesting. I'm going to look into the GA thing, if only to be more aware of that option.
We have a very small devops team, and a small data eng team, and we pretty much exclusively use AWS CodeBuild/CodePipeline, with assorted GitHub Actions for things like merging staging branches.
of course, everything is just shell scripts when it comes down to it so who cares what system it runs on, I guess.
Sean Talia
it's turtles all the way down
I just have to reiterate my disclaimer that I am not a professional
I don't know how to gauge my setup other than by noting that it's working for our use cases so far 😇
Billy McMonagle
Anyone who says otherwise is lying as far as I'm concerned haha
💯 2
Adam
@Billy McMonagle you should check out https://github.com/PrefectHQ/prefect/discussions/4042
Billy McMonagle
thanks @Adam! I've actually posted in that discussion, although in part due to feedback from others in the community my setup has since evolved 🙂
Hawkar Mahmod
@Billy McMonagle - did you end up going for option 3 from above? I don't fully understand what you mean by "Put the flow registration commands in the Dockerfile for each flow." I have one Dockerfile containing common Python dependencies. I'd prefer to only rebuild from it if one of those dependencies changes. Like you I'm using CodePipeline/CodeBuild. The approach you suggest seems to require a rebuild from that Dockerfile each time we wish to register a flow.
Billy McMonagle
@Hawkar Mahmod I wish I had more detail for you; unfortunately I haven't been making a ton of progress on my Prefect setup. Yes, I am rebuilding each flow's image from its Dockerfile in CI/CD, which re-registers the flow. However, I'm utilizing the latest Docker BuildKit caching functionality, which makes the builds extremely fast if nothing has changed.
Hawkar Mahmod
Ok sweet, that's useful to know. How do you manage to utilise that cache across builds if the environment is ephemeral? Sorry, I'm rather new to the use of CI/CD environments.
Billy McMonagle
A totally reasonable question; it is not super obvious, especially since it's a fairly new (less than 1 year) feature. Let me grab a snippet.
Here is a Makefile I wrote to do the docker builds. This is not heavily battle-tested but does seem to work so far.
Copy code
# requires BuildKit (DOCKER_BUILDKIT=1) for --secret and inline caching
# list all flows here
FLOWS := \
  flow1 \
  flow2 \
  etc

build/%:
	@echo "building flows/${@F}"
	REPO=${REGISTRY}/${APP}/${@F} && \
	docker build --file flows/${@F}/Dockerfile . \
		--cache-from $$REPO:${GIT_SHA} \
		--cache-from $$REPO:${GIT_BRANCH} \
		--build-arg BUILDKIT_INLINE_CACHE=1 \
		--build-arg APP=${APP} \
		--build-arg AWS_ACCOUNT_ID=${AWS_ACCOUNT_ID} \
		--build-arg AWS_REGION=${AWS_REGION} \
		--build-arg BUCKET_NAME=${BUCKET_NAME} \
		--build-arg FLOW_NAME=${@F} \
		--build-arg GIT_BRANCH=${GIT_BRANCH} \
		--build-arg GIT_SHA=${GIT_SHA} \
		--build-arg HELM_CHART=${HELM_CHART} \
		--build-arg HELM_RELEASE=${HELM_RELEASE} \
		--build-arg PREFECT_VERSION=${PREFECT_VERSION} \
		--build-arg PYTHON_RUNTIME=${PYTHON_RUNTIME} \
		--build-arg SSM_ENV=${SSM_ENV} \
		--secret id=prefect_token,src=prefect_token \
		--tag $$REPO:${GIT_SHA} \
		--tag $$REPO:${GIT_BRANCH}

push/%:
	docker push "${REGISTRY}/${APP}/${@F}:${GIT_SHA}"
	docker push "${REGISTRY}/${APP}/${@F}:${GIT_BRANCH}"

## run targets for all flows
all: build push
build: $(foreach flow,${FLOWS},build/${flow})
push: $(foreach flow,${FLOWS},push/${flow})

.PHONY: all build push
there are many ways to configure this, but especially with a single Dockerfile containing all dependencies, you should be able to set `BUILDKIT_INLINE_CACHE=1` and use the `--cache-from` argument pointed at your remote ECR repository and get some good results.
Hawkar Mahmod
Ah, I've never used a Makefile before, but I am aware of them and I'm sure I'll figure it out. This is just what I needed, I think; it puts me in the right direction. Thank you.
Billy McMonagle
awesome, happy to help! My main reason for using the Makefile is that I intend to have many Dockerfiles which I'd like to build in parallel. If this is not your use case you may just want a slightly simpler bash script (Makefile syntax is WEIRD). Fortunately I have a smart devops person I was able to copy much of my build scripting from.
Hawkar Mahmod
A smart devops person, that's a precious and rare resource 😄. Finally looking into this properly and am very satisfied with the solution. I already have a flow registration script that goes through my flows, but I didn't have the building part down. I presume the Dockerfiles for each flow contain the `register` call. But that does leave me a bit puzzled with the choice of Storage. Having read this thread and others, there seems to be a recommendation to move away from the use of Docker Storage in favour of GitHub or S3. I chose to move to S3 Storage because I saw that the registration time was drastically reduced. However, this deployment procedure you've used loses some of that benefit, because you have to rebuild the image upon each deployment. The only reasons I can think of to prefer S3 Storage over Docker Storage now are: a) Flexibility to change and put live flow code without a full deployment. You can just call `register` locally and have the Docker image from the existing deployment be used as your execution environment. b) To avoid the Docker-in-Docker scenario that would require you to build a Docker image, only to build another one upon calling `register` with `Docker` Storage.
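(A rough illustration of the trade-off in points (a) and (b), as a sketch only, assuming Prefect 1.x; the flow body, bucket, registry, image, and project names are all placeholders.)

```python
# Sketch of the S3-vs-Docker storage trade-off discussed above (Prefect 1.x).
# All names below (bucket, registry, image, project) are placeholders.
from prefect import Flow, task
from prefect.run_configs import KubernetesRun
from prefect.storage import S3, Docker

@task
def transform():
    print("transforming")

with Flow("example-flow") as flow:
    transform()

# (a) S3 storage: register() uploads the flow code to S3 and merely records
# the image name in the run config; no docker build happens, so updated flow
# code can be re-registered locally against the already-deployed image.
flow.storage = S3(bucket="my-flow-bucket")
flow.run_config = KubernetesRun(image="my-registry/example-flow:latest")
flow.register(project_name="my-project")

# (b) Docker storage: register() builds and pushes an image as part of
# registration, which is where the Docker-in-Docker concern comes from if
# registration itself runs inside an image build.
flow.storage = Docker(registry_url="my-registry", image_name="example-flow")
flow.run_config = KubernetesRun()  # the job falls back to the storage image
flow.register(project_name="my-project")
```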