Thread
#prefect-community
    Billy McMonagle
    1 year ago
    Good morning, I have a question about flow registration. I'd like to register my flows during CI/CD (specifically, AWS CodeBuild). My issue is that I have run into import errors because my flow dependencies are not installed in the build environment. I am using S3Storage and KubernetesRun. Thanks in advance for any guidance.
    I can share some code, but I think this is a pretty general problem. I can think of three solutions, but there may be more: (1) Install dependencies in the build environment: I do not think this should be necessary, and different flows have different dependencies which may not be compatible. (2) Use something like the extract_flow_from_file utility function, but add a hook of some kind to remove the import statements. Haven't tried this; seems like it should work, but also feels like a hack. (3) Put the flow registration commands in the Dockerfile for each flow.
    Jim Crist-Harif
    1 year ago
    I'd avoid option 2 if possible (it would work for script-based storage but not pickle-based storage, and is a bit hacky). When registering a flow we currently need to have the full flow structure, so we do need to actually import your flow. If option 1 is too tricky for you, then executing the registration inside the same docker image they'll run in seems like a good option.
    Sean Talia
    1 year ago
    I'm doing something that's been working pretty nicely for my org thus far; we're using GitHub Actions to register our flows (using S3Storage and a DockerRun config) to our cloud instance. For every flow that I want to register, I have a service in a docker-compose.yml that looks like this:
    example_flow:
      image: <EXAMPLE_FLOW_DOCKER_RUN_CONFIG_IMAGE>:latest
      volumes:
        - "./flows/example_flow/flow.py:/app/flow.py"
        - "./register_flow.sh:/app/register_flow.sh"
      command: "./register_flow.sh"
    my GitHub Action watches for changes in the flows/example_flow/ path (i.e. where the flow code is hosted), and when it sees one, it pulls this compose service and then runs it. I use GitHub secrets in my workflow step to pass my cloud token as an environment variable into the container, and then it's all good to go
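    As a rough illustration (this is not Sean's actual workflow; the workflow name, branch, secret name, and the -e flag are assumptions on my part), a path-filtered trigger like the one described can look something like this:

```yaml
# Hypothetical GitHub Actions workflow sketch: re-register example_flow
# whenever its source changes. All names here are illustrative.
name: register-example-flow
on:
  push:
    branches: [main]
    paths:
      - "flows/example_flow/**"

jobs:
  register:
    runs-on: ubuntu-latest
    steps:
      # Check out the repo so flow.py, register_flow.sh, and
      # docker-compose.yml are all available on the runner.
      - uses: actions/checkout@v2
      # Run the compose service that registers the flow, forwarding the
      # cloud token into the container as an environment variable.
      - name: Register flow
        run: docker-compose run -e PREFECT_TENANT_TOKEN="$PREFECT_TENANT_TOKEN" example_flow
        env:
          PREFECT_TENANT_TOKEN: ${{ secrets.PREFECT_TENANT_TOKEN }}
```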
    Billy McMonagle
    1 year ago
    OK great, thanks @Jim Crist-Harif. I had been building "toy" flows and focusing on the CI/CD process; it looks like I should have built some "real" flows sooner, so I could have run into and resolved this problem earlier. I got frustrated last night and thought up solution (3) this morning, so I'm glad you think it sounds reasonable.
    Sean Talia
    1 year ago
    the fact that i'm using the runconfig image as the compose service's image is key, because by definition it has all the flow's dependencies pre-installed in it
    Billy McMonagle
    1 year ago
    That's clever @Sean Talia, I'm going to look into this and see if it works with my setup. Thank you.
    Sean Talia
    1 year ago
    so I don't need to worry about any extra considerations at all, I just need to make sure I get the flow code and the appropriate Prefect tokens mounted into it when the service is spun up... and then that register_flow.sh script is incredibly lightweight; all it does is run prefect auth login --token $PREFECT_TENANT_TOKEN and then call the python script that registers the flow
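    Pieced together from that description, the script might look roughly like this (not the real file; the /app/flow.py path is an assumption based on the compose volume mount earlier in the thread, and the dry-run guard is my addition so the sketch runs without Prefect installed):

```shell
#!/bin/sh
# register_flow.sh -- hedged sketch based on the description in this thread.
# DRY_RUN defaults to 1 so the script can be exercised without Prefect or the
# flow's dependencies installed; set DRY_RUN=0 inside the real container.
set -eu

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"   # print the command instead of executing it
  else
    "$@"
  fi
}

# Authenticate against Prefect Cloud with the tenant token...
run prefect auth login --token "${PREFECT_TENANT_TOKEN:-dummy-token}"
# ...then run the flow module, whose __main__ block registers the flow.
run python /app/flow.py
```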
    Billy McMonagle
    1 year ago
    Yeah that makes sense.
    I'm trying to think through how you are telling the registration service where to look for flows.
    Sean Talia
    1 year ago
    if I'm interpreting your question correctly (sorry, I should have mentioned this), the first step of my GA workflow uses the publicly available actions/checkout@v2 action, so this makes it incredibly easy because it's actually just checking out my entire repository onto the host that's been provisioned to run the GA workflow
    Billy McMonagle
    1 year ago
    Gotcha. I've never used github actions for anything non-trivial, but I think I understand what you mean. Watching for code changes on a specific path would let you do that.
    Sean Talia
    1 year ago
    so when it checks out my whole repo and it comes time to register the flow, I've got my flow code, my register_flow.sh script, and my docker-compose.yml all available on the machine, and that's all that's needed
    Billy McMonagle
    1 year ago
    Guessing this could be done from the CI/CD itself. I don't necessarily want to give GitHub permission to pull my images from ECR.
    Sean Talia
    1 year ago
    yeah setting this up was actually my first foray into using github actions for anything non-trivial but it's turned out to be very easy to work with 😄
    ahhh well I didn't want to delve into unnecessary detail, but I didn't want to do that either, so we're using a self-hosted GA runner (i.e. an EC2 instance that we own, running the GitHub runner application and running the workflows that are associated with our repo)
    and i've given that EC2 instance an instance profile w/ policies that will let it talk to my container registry, S3 buckets, etc. etc.
    Billy McMonagle
    1 year ago
    oooo interesting. TIL you can do that with GA.
    (And I totally understand, there's a ton of detail you have to leave out in order to communicate clearly).
    Sean Talia
    1 year ago
    i am by no means a GA, Prefect, or AWS expert, I'm just wading my way through the muck like many others out there and I've stumbled on something that seems to be working well
    Billy McMonagle
    1 year ago
    Ha yeah, I understand.
    Jim Crist-Harif
    1 year ago
    Curious, what's your experience with self-hosting a GA runner? I have a few related setups that might benefit from this.
    Sean Talia
    1 year ago
    I've largely had a good experience with it so far – my one gripe is that the GA runner application by default will auto-update itself to whatever the newest version of the runner is; I think it will wait about a week after the release, and if it notices that it's still running an older version, it will shut down your actively running GA runner application, fetch the latest, install it, and bring up the newest version
    there are a couple of threads where a lot of people complain about this behavior, and there doesn't seem to be an easy way to turn it off: https://github.com/actions/runner/issues/485
    we're using it for now because i think it still is a net positive (it definitely beats having to use a github-provisioned server and then needing to pass container registry + prefect credentials to it), but I am open to the possibility of my org saying that the auto-update thing is a no-go
    Jim Crist-Harif
    1 year ago
    Thanks, this is all useful info.
    Billy McMonagle
    1 year ago
    @Sean Talia aside from GA, what CI/CD are you using?
    Sean Talia
    1 year ago
    you mean specifically for this use case with prefect? or just in general
    Billy McMonagle
    1 year ago
    if they are different, i'm interested to know why... but I do mean with prefect, yea.
    Sean Talia
    1 year ago
    a lot of the rest of my org uses jenkins for ci/cd stuff – our infra/devops team is pushing getting off of it and using github actions for everything. I don't think I'd be able to articulate very well why they made that decision, since it's not really my wheelhouse, but the fact that someone like me can do some non-trivial stuff pretty easily might be evidence in and of itself of why that's the direction they want to go in? My specific team (data engineering) has basically moved exclusively to using GA though – I don't think I've logged into our jenkins UI in a good 4-5 months
    but in this use case there aren't any additional tools I'm using specifically for the CI/CD stuff w/ prefect...there's still some manual setup going on (e.g. I manually provisioned the EC2 instance that the github action runner is installed on using terraform and manually configured it using ansible before it was ready to go)
    Billy McMonagle
    1 year ago
    that is very interesting. i'm going to look into the GA thing, if only to be more aware of that option.
    We have a very small devops team, and a small data eng team, and we pretty much exclusively use AWS codebuild/codepipeline, with assorted github actions for things like merging staging branches.
    of course, everything is just shell scripts when it comes down to it so who cares what system it runs on, I guess.
    Sean Talia
    1 year ago
    it's turtles all the way down
    I just have to re-iterate my disclaimer that I am not a professional
    I don't know how to gauge my setup other than by noting that it's working for our use cases so far 😇
    Billy McMonagle
    1 year ago
    Anyone who says otherwise is lying as far as I'm concerned haha
    Adam
    1 year ago
    @Billy McMonagle you should check out https://github.com/PrefectHQ/prefect/discussions/4042
    Billy McMonagle
    1 year ago
    thanks @Adam! I've actually posted in that discussion, although in part due to feedback from others in the community my setup has since evolved 🙂
    Hawkar Mahmod
    1 year ago
    @Billy McMonagle - did you end up going for option 3 from above? I don’t fully understand what you mean by “Put the flow registration commands in the Dockerfile for each flow.” I have one Dockerfile containing common Python dependencies. I’d prefer to only build this file if one of those dependencies changes. Like you, I’m using CodePipeline/CodeBuild. The approach you suggest seems to require a rebuild from that Dockerfile each time we wish to register a flow.
    Billy McMonagle
    1 year ago
    @Hawkar Mahmod I wish I had more detail for you, unfortunately I haven't been making a ton of progress on my prefect setup. Yes, I am rebuilding each Dockerfile in CI/CD, which re-registers the flow. However, I'm utilizing the latest docker buildkit caching functionality, which makes the builds extremely fast if nothing has changed.
    Hawkar Mahmod
    1 year ago
    Ok sweet, that’s useful to know. How do you manage to utilise that across builds if the environment is ephemeral? Sorry, I’m rather new to the use of CI/CD environments.
    Billy McMonagle
    1 year ago
    a totally reasonable question, it is not super obvious, especially since it's a fairly new (less than 1 year) feature. let me grab a snippet
    here is a makefile i wrote to do the docker builds. this is not heavily battle tested but does seem to work so far.
    # list all flows here
    FLOWS := \
      flow1 \
      flow2 \
      etc
    
    build/%:
    	@echo "building flows/${@F}"
    	REPO=${REGISTRY}/${APP}/${@F} && \
    	docker build --file flows/${@F}/Dockerfile . \
    		--cache-from $$REPO:${GIT_SHA} \
    		--cache-from $$REPO:${GIT_BRANCH} \
    		--build-arg BUILDKIT_INLINE_CACHE=1 \
    		--build-arg APP=${APP} \
    		--build-arg AWS_ACCOUNT_ID=${AWS_ACCOUNT_ID} \
    		--build-arg AWS_REGION=${AWS_REGION} \
    		--build-arg BUCKET_NAME=${BUCKET_NAME} \
    		--build-arg FLOW_NAME=${@F} \
    		--build-arg GIT_BRANCH=${GIT_BRANCH} \
    		--build-arg GIT_SHA=${GIT_SHA} \
    		--build-arg HELM_CHART=${HELM_CHART} \
    		--build-arg HELM_RELEASE=${HELM_RELEASE} \
    		--build-arg PREFECT_VERSION=${PREFECT_VERSION} \
    		--build-arg PYTHON_RUNTIME=${PYTHON_RUNTIME} \
    		--build-arg SSM_ENV=${SSM_ENV} \
    		--secret id=prefect_token,src=prefect_token \
    		--tag $$REPO:${GIT_SHA} \
    		--tag $$REPO:${GIT_BRANCH}
    
    push/%:
    	docker push "${REGISTRY}/${APP}/${@F}:${GIT_SHA}"
    	docker push "${REGISTRY}/${APP}/${@F}:${GIT_BRANCH}"
    
    ## run targets for all flows
    all: build push
    build: $(foreach flow,${FLOWS},build/${flow})
    push: $(foreach flow,${FLOWS},push/${flow})
    
    .PHONY: all build push
    there are many ways to configure this, but especially with a single Dockerfile containing all dependencies, you should be able to set BUILDKIT_INLINE_CACHE=1 and use the --cache-from argument pointed at your remote ECR repository and get some good results.
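    As a concrete illustration of that flag pattern (the registry path, sha, and branch below are made-up examples, not values from the thread), the core of the Makefile recipe boils down to something like:

```shell
#!/bin/sh
# Hedged sketch of the BuildKit inline-cache pattern described above.
set -eu

# Assemble (and print, rather than run) a BuildKit-enabled docker build
# command that seeds its layer cache from previously pushed image tags.
cache_build_cmd() {
  repo="$1"; sha="$2"; branch="$3"
  echo "DOCKER_BUILDKIT=1 docker build ." \
    "--build-arg BUILDKIT_INLINE_CACHE=1" \
    "--cache-from $repo:$sha --cache-from $repo:$branch" \
    "--tag $repo:$sha --tag $repo:$branch"
}

# Example invocation with illustrative ECR repo and git refs.
cache_build_cmd "123456789012.dkr.ecr.us-east-1.amazonaws.com/app/flow1" abc1234 main
```

    Pushing both the sha and branch tags after each build is what makes the --cache-from lookups hit on the next ephemeral CI run.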
    Hawkar Mahmod
    1 year ago
    Ah I’ve never used a Makefile before but I am aware of them and I’m sure I’ll figure it out. This is just what I needed I think, puts me in the right direction. Thank you.
    Billy McMonagle
    1 year ago
    awesome, happy to help! my main reason for using the makefile is that I intend to have many Dockerfiles which I'd like to build in parallel. if this is not your use case, you may just want a slightly simpler bash script (makefile syntax is WEIRD). fortunately I have a smart devops person I was able to copy much of my build scripting from.
    Hawkar Mahmod
    1 year ago
    A smart devops person, that’s a precious and rare resource 😄. Finally looking into this properly and am very satisfied with the solution. I already have a flow registration script that goes through my flows, but I didn’t have the building part down. I presume the Dockerfiles for each flow contain the register call. But that does leave me a bit puzzled with the choice of Storage. Having read this thread and others, there seems to be a recommendation to move away from the use of Docker Storage in favour of GitHub or S3. I chose to move to S3 Storage because I saw that the registration time was drastically reduced. However, this deployment procedure you’ve used loses some of that benefit, because you have to rebuild the image upon each deployment. The only reasons I can think of to prefer S3 Storage over Docker Storage now are: a) flexibility to change and put live flow code without a full deployment, since you can just call register locally and have the Docker image from the existing deployment be used as your execution environment; and b) avoiding the Docker-in-Docker scenario that would require you to build a Docker image, only to build another one upon calling register with Docker Storage.