# ask-community
Billy McMonagle
Good morning, I have a question about flow registration. I'd like to register my flows during CI/CD (specifically, AWS CodeBuild). My issue is that I have run into import errors, because my flow dependencies are not installed in the build environment. I am using `S3Storage` and `KubernetesRun`. Thanks in advance for any guidance.
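(For context, here is a minimal sketch of the kind of setup described here, assuming Prefect 1.x with storage from `prefect.storage` and the run config from `prefect.run_configs`; the bucket and image names below are placeholders.)

```python
# Sketch only: a flow whose code lives in S3 and which runs on Kubernetes.
# "my-flow-bucket" and the image name are placeholders, not from the thread.
from prefect import Flow, task
from prefect.run_configs import KubernetesRun
from prefect.storage import S3

@task
def say_hello():
    print("hello")

with Flow("example-flow") as flow:
    say_hello()

# The flow code is uploaded to S3 at registration time; the run config only
# names an image, which must already contain the flow's dependencies.
flow.storage = S3(bucket="my-flow-bucket")
flow.run_config = KubernetesRun(image="my-registry/example-flow:latest")
```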
I can share some code, but I think this is a pretty general problem. I can think of three solutions, but there may be more: (1) Install dependencies in the build environment: I do not think this should be necessary, and different flows have different dependencies which may not be compatible. (2) Use something like the `extract_flow_from_file` utility function, but add a hook of some kind to remove the `import` statements. I haven't tried this; it seems like it should work, but it also feels like a hack. (3) Put the flow registration commands in the Dockerfile for each flow.
Jim Crist-Harif
I'd avoid option 2 if possible (it would work for script-based storage but not pickle-based storage, and is a bit hacky). When registering a flow we currently need to have the full flow structure, so we do need to actually import your flow. If option 1 is too tricky for you, then executing the registration inside the same Docker image they'll run in seems like a good option.
Sean Talia
I'm doing something that's been working pretty nicely for my org thus far; we're using GitHub Actions to register our flows (using S3Storage, DockerRun config) to our cloud instance. For every flow that I want to register, I have a service in a `docker-compose.yml` that looks like this:
Copy code
example_flow:
  image: <EXAMPLE_FLOW_DOCKER_RUN_CONFIG_IMAGE>:latest
  volumes:
    - "./flows/example_flow/flow.py:/app/flow.py"
    - "./register_flow.sh:/app/register_flow.sh"
  command: "./register_flow.sh"
my GitHub Action is looking for changes in the `flows/example_flow/` path (i.e. where the flow code is hosted), and when it sees one, it pulls this compose service and then runs it. I use GitHub secrets in my workflow step to pass my cloud token as an environment variable into the container, and then it's all good to go
👍 2
Billy McMonagle
OK great, thanks @Jim Crist-Harif. I had been building "toy" flows and focusing on the CI/CD process; it looks like I should have built some "real" flows sooner so I could run into and resolve this problem. I got frustrated last night and thought up solution (3) this morning, so I'm glad you think it sounds reasonable.
Sean Talia
The fact that I'm using the run config image as the compose service's image is key, because by definition it has all the flow's dependencies pre-installed in it
Billy McMonagle
That's clever @Sean Talia, I'm going to look into this and see if it works with my setup. Thank you.
👍 1
Sean Talia
so I don't need to worry about any extra considerations at all, I just need to make sure I get the flow code and the appropriate Prefect tokens mounted into it when the service is spun up... and then that `register_flow.sh` script is incredibly lightweight, all it does is `prefect auth login --token $PREFECT_TENANT_TOKEN` and then calls the python script that registers the flow
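(As an aside, a minimal sketch of the Python half of such a script, assuming, as in the compose service above, that the flow module is mounted at `/app/flow.py` and exposes a module-level `flow` object with its storage and run config already attached; the project name is a placeholder.)

```python
# register.py: sketch of the registration step invoked by register_flow.sh
# after `prefect auth login`. Assumes /app/flow.py defines a module-level
# `flow` object; "my-project" is a placeholder project name.
from flow import flow  # i.e. /app/flow.py, mounted by the compose service

flow.register(
    project_name="my-project",
    # Optional in Prefect 1.x: reuse the existing version when nothing
    # about the flow has actually changed.
    idempotency_key=flow.serialized_hash(),
)
```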
Billy McMonagle
Yeah that makes sense.
I'm trying to think through how you are telling the registration service where to look for flows.
Sean Talia
if I'm interpreting your question correctly (sorry, I should have mentioned this), the first step of my GA workflow is using the publicly available `actions/checkout@v2` action, so this makes it incredibly easy because it's actually just checking out my entire repository onto the host that's been provisioned to run the GA workflow
Billy McMonagle
Gotcha. I've never used github actions for anything non-trivial, but I think I understand what you mean. Watching for code changes on a specific path would let you do that.
👍 1
Sean Talia
so when it checks out my whole repo and it comes time to register the flow, I've got my flow code, my `register_flow.sh` script, and my `docker-compose.yml` all available on the machine, and that's all that's needed
Billy McMonagle
Guessing this could be done from the CI/CD itself. I don't necessarily want to give GitHub permission to pull my images from ECR.
Sean Talia
yeah setting this up was actually my first foray into using github actions for anything non-trivial but it's turned out to be very easy to work with 😄
ahhh well I didn't want to delve into unnecessary detail, but I didn't want to do that either, so we're using a self-hosted GA runner (i.e. an EC2 instance that we own, running the GitHub runner application and running workflows that are associated with our repo)
and I've given that EC2 instance an instance profile w/ policies that will let it talk to my container registry, S3 buckets, etc. etc.
Billy McMonagle
oooo interesting. TIL you can do that with GA.
(And I totally understand, there's a ton of detail you have to leave out in order to communicate clearly).
Sean Talia
I am by no means a GA, Prefect, or AWS expert; I'm just wading my way through the muck like many others out there and I've stumbled on something that seems to be working well
👍 1
Billy McMonagle
Ha yeah, I understand.
Jim Crist-Harif
Curious, what's your experience with self-hosting a GA runner? I have a few related setups that might benefit from this.
Sean Talia
I've largely had a good experience with it so far – my one gripe is that the GA runner application by default will auto-update itself to whatever the newest version of the runner is; I think it will wait about a week after the release, and if it notices that it's still running an older version, it will shut down your actively running GA runner application, fetch the latest, install it, and bring up the newest version
there are a couple of threads where a lot of people complain about this behavior, and there doesn't seem to be an easy way to turn it off: https://github.com/actions/runner/issues/485
we're using it for now because I think it still is a net positive (it definitely beats having to use a GitHub-provisioned server and then needing to pass container registry + Prefect credentials to it), but I am open to the possibility of my org saying that the auto-update thing is a no-go
Jim Crist-Harif
Thanks, this is all useful info.
Billy McMonagle
@Sean Talia aside from GA, what CI/CD are you using?
Sean Talia
you mean specifically for this use case with prefect? or just in general
Billy McMonagle
if they are different, I'm interested to know why... but I do mean with Prefect, yea.
Sean Talia
a lot of the rest of my org uses Jenkins for CI/CD stuff – our infra/devops team is pushing to get off of it and use GitHub Actions for everything. I don't think I'd be able to articulate very well why they made that decision, since it's not really my wheelhouse, but the fact that someone like me can do some non-trivial stuff pretty easily might be evidence in and of itself of why that's the direction they want to go in? My specific team (data engineering) has basically moved exclusively to using GA though – I don't think I've logged into our Jenkins UI in a good 4-5 months
but in this use case there aren't any additional tools I'm using specifically for the CI/CD stuff w/ Prefect... there's still some manual setup going on (e.g. I manually provisioned the EC2 instance that the GitHub Actions runner is installed on using Terraform and manually configured it using Ansible before it was ready to go)
Billy McMonagle
that is very interesting. I'm going to look into the GA thing, if only to be more aware of that option.
We have a very small devops team, and a small data eng team, and we pretty much exclusively use AWS CodeBuild/CodePipeline, with assorted GitHub Actions for things like merging staging branches.
of course, everything is just shell scripts when it comes down to it so who cares what system it runs on, I guess.
Sean Talia
it's turtles all the way down
I just have to reiterate my disclaimer that I am not a professional
I don't know how to gauge my setup other than by noting that it's working for our use cases so far 😇
Billy McMonagle
Anyone who says otherwise is lying as far as I'm concerned haha
💯 2
Adam
@Billy McMonagle you should check out https://github.com/PrefectHQ/prefect/discussions/4042
Billy McMonagle
thanks @Adam! I've actually posted in that discussion, although in part due to feedback from others in the community my setup has since evolved 🙂
Hawkar Mahmod
@Billy McMonagle - did you end up going for option 3 from above? I don't fully understand what you mean by "Put the flow registration commands in the Dockerfile for each flow." I have one Dockerfile containing common Python dependencies. I'd prefer to only rebuild from it if one of those dependencies changes. Like you I'm using CodePipeline/CodeBuild. The approach you suggest seems to require a rebuild from that Dockerfile each time we wish to register a flow.
Billy McMonagle
@Hawkar Mahmod I wish I had more detail for you; unfortunately I haven't been making a ton of progress on my Prefect setup. Yes, I am rebuilding each flow's image from its Dockerfile in CI/CD, which re-registers the flow. However, I'm utilizing the latest Docker BuildKit caching functionality, which makes the builds extremely fast if nothing has changed.
Hawkar Mahmod
Ok sweet, that's useful to know. How do you manage to utilise that cache across builds if the environment is ephemeral? Sorry, I'm rather new to the use of CI/CD environments.
Billy McMonagle
A totally reasonable question; it is not super obvious, especially since it's a fairly new (less than 1 year) feature. Let me grab a snippet.
Here is a Makefile I wrote to do the docker builds. This is not heavily battle-tested but does seem to work so far.
Copy code
# requires BuildKit (DOCKER_BUILDKIT=1) for --secret and inline caching
# list all flows here
FLOWS := \
  flow1 \
  flow2 \
  etc

build/%:
	@echo "building flows/${@F}"
	REPO=${REGISTRY}/${APP}/${@F} && \
	docker build --file flows/${@F}/Dockerfile . \
		--cache-from $$REPO:${GIT_SHA} \
		--cache-from $$REPO:${GIT_BRANCH} \
		--build-arg BUILDKIT_INLINE_CACHE=1 \
		--build-arg APP=${APP} \
		--build-arg AWS_ACCOUNT_ID=${AWS_ACCOUNT_ID} \
		--build-arg AWS_REGION=${AWS_REGION} \
		--build-arg BUCKET_NAME=${BUCKET_NAME} \
		--build-arg FLOW_NAME=${@F} \
		--build-arg GIT_BRANCH=${GIT_BRANCH} \
		--build-arg GIT_SHA=${GIT_SHA} \
		--build-arg HELM_CHART=${HELM_CHART} \
		--build-arg HELM_RELEASE=${HELM_RELEASE} \
		--build-arg PREFECT_VERSION=${PREFECT_VERSION} \
		--build-arg PYTHON_RUNTIME=${PYTHON_RUNTIME} \
		--build-arg SSM_ENV=${SSM_ENV} \
		--secret id=prefect_token,src=prefect_token \
		--tag $$REPO:${GIT_SHA} \
		--tag $$REPO:${GIT_BRANCH}

push/%:
	docker push "${REGISTRY}/${APP}/${@F}:${GIT_SHA}"
	docker push "${REGISTRY}/${APP}/${@F}:${GIT_BRANCH}"

## run targets for all flows
all: build push
build: $(foreach flow,${FLOWS},build/${flow})
push: $(foreach flow,${FLOWS},push/${flow})

.PHONY: all build push
there are many ways to configure this, but especially with a single Dockerfile containing all dependencies, you should be able to set `BUILDKIT_INLINE_CACHE=1` and use the `--cache-from` argument pointed at your remote ECR repository and get some good results.
Hawkar Mahmod
Ah, I've never used a Makefile before, but I am aware of them and I'm sure I'll figure it out. This is just what I needed, I think; it puts me in the right direction. Thank you.
Billy McMonagle
awesome, happy to help! My main reason for using the Makefile is that I intend to have many Dockerfiles which I'd like to build in parallel. If this is not your use case you may just want a slightly simpler bash script (Makefile syntax is WEIRD). Fortunately I have a smart devops person I was able to copy much of my build scripting from.
Hawkar Mahmod
A smart devops person, that's a precious and rare resource 😄. Finally looking into this properly and am very satisfied with the solution. I already have a flow registration script that goes through my flows, but I didn't have the building part down. I presume the Dockerfiles for each flow contain the `register` call. But that does leave me a bit puzzled with the choice of Storage. Having read this thread and others, there seems to be a recommendation to move away from the use of Docker Storage in favour of GitHub or S3. I chose to move to S3 Storage because I saw that the registration time was drastically reduced. However, this deployment procedure you've used loses some of that benefit, because you have to rebuild the image upon each deployment. The only reasons I can think of to prefer S3 Storage over Docker Storage now are: a) Flexibility to change and put live flow code without a full deployment. You can just call `register` locally and have the Docker image from the existing deployment be used as your execution environment. b) To avoid the Docker-in-Docker scenario that would require you to build a Docker image, only to build another one upon calling `register` with `Docker` Storage.
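(A rough illustration of the trade-off in points (a) and (b), as a sketch only, assuming Prefect 1.x; the flow body, bucket, registry, image, and project names are all placeholders.)

```python
# Sketch of the S3-vs-Docker storage trade-off discussed above (Prefect 1.x).
# All names below (bucket, registry, image, project) are placeholders.
from prefect import Flow, task
from prefect.run_configs import KubernetesRun
from prefect.storage import S3, Docker

@task
def transform():
    print("transforming")

with Flow("example-flow") as flow:
    transform()

# (a) S3 storage: register() uploads the flow code to S3 and merely records
# the image name in the run config; no docker build happens, so updated flow
# code can be re-registered locally against the already-deployed image.
flow.storage = S3(bucket="my-flow-bucket")
flow.run_config = KubernetesRun(image="my-registry/example-flow:latest")
flow.register(project_name="my-project")

# (b) Docker storage: register() builds and pushes an image as part of
# registration, which is where the Docker-in-Docker concern comes from if
# registration itself runs inside an image build.
flow.storage = Docker(registry_url="my-registry", image_name="example-flow")
flow.run_config = KubernetesRun()  # the job falls back to the storage image
flow.register(project_name="my-project")
```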