
Jacob Blanco

05/25/2022, 2:29 AM
We currently use GitHub releases (tags, more specifically) and CircleCI to publish flows to AWS ECR. We do this from a monolithic repository so that our data scientists can publish their flows in a standardized way without having to worry about infrastructure, etc. We use a hash of the flow file to determine whether the file has changed since the last release, and we deploy all the flows that have changed. We are running into an issue in Staging whereby a release from one person clobbers the release made by another person (since they are working in different branches). Does anyone have a similar setup? How have you addressed this issue? I suppose some kind of version ordering would address it: if the version registered in Cloud is greater than the version to be deployed, then don’t deploy.
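The two checks described above (a file hash to detect changes, and a version-ordering guard) might be sketched roughly as follows. This is only an illustration under assumptions: the tag format `v1.4.2`, the function names, and the idea that the previously registered tag and hash are stored somewhere the pipeline can read are all hypothetical and not taken from the actual pipeline.

```python
import hashlib
from pathlib import Path
from typing import Optional, Tuple


def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a flow file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def parse_version(tag: str) -> Tuple[int, ...]:
    """Turn a tag such as 'v1.4.2' into (1, 4, 2) so versions compare numerically.

    Assumes purely numeric, dot-separated tags with an optional leading 'v'.
    """
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def should_deploy(
    registered_tag: Optional[str],
    candidate_tag: str,
    registered_hash: Optional[str],
    candidate_hash: str,
) -> bool:
    """Decide whether a flow should be deployed.

    - Nothing registered yet: deploy.
    - Registered version is greater than the candidate: do not deploy.
    - Otherwise deploy only if the flow file's hash has changed.
    """
    if registered_tag is None:
        return True
    if parse_version(registered_tag) > parse_version(candidate_tag):
        return False
    return candidate_hash != registered_hash
```

With a guard like this, a staging release cut from an older tag would be skipped instead of overwriting a newer one, although it would not by itself resolve two branches that both carry legitimately newer changes.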

Samuel Hinton

05/25/2022, 2:54 AM
We’re coming up to a similar decision (about uploading flows on tag using a pipeline), so commenting so I can also see what the recs are for this situation

Anna Geller

05/25/2022, 11:32 AM
There are so many ways you could approach it. I will give you some suggestions as food for thought and you can decide what suits your use case best:
1. Decouple the ECR image build from the flow registration process - you mentioned the ECR build process being used to publish flows - does that mean you are using Docker storage and bundling code dependencies with flow code storage? If so, decoupling the two (i.e. using the ECR build to package code dependencies and something like GitHub or S3 storage for the flow code) may help here.
2. Add multiple workflow definitions, i.e. a different workflow for each team/use case/Git branch, to decentralize the process, make it easier to manage, and reduce the chance of users stepping on each other's toes.
3. You mentioned you use a hash - I assume you mean the serialized flow hash, as in
flow.register(project_name="xxx", idempotency_key=flow.serialized_hash())
- if so, using the CLI command prefect register instead can help here, as it automatically detects flow changes and you can use it to register only the directories relevant to your team - say, if data scientists have their own folder in your shared repo, their CI/CD workflow might include the command:
prefect register --project xxx -p data_science_flows/
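As a rough sketch of the per-team idea in points 2 and 3, a CI step could map the files touched by a change to team folders and build one `prefect register` command per folder. The folder-to-project mapping and the function name below are hypothetical; only the `--project` and `-p` options come from the command quoted above.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, List

# Hypothetical mapping from a top-level folder in the shared repo
# to the Prefect project that team registers into.
TEAM_PROJECTS: Dict[str, str] = {
    "data_science_flows": "data-science",
    "analytics_flows": "analytics",
}


def register_commands(
    changed_paths: Iterable[str],
    team_projects: Dict[str, str] = TEAM_PROJECTS,
) -> List[List[str]]:
    """Build one `prefect register` command per team folder touched by a change."""
    touched = set()
    for raw_path in changed_paths:
        parts = PurePosixPath(raw_path).parts
        if parts:
            touched.add(parts[0])  # top-level folder (or file name at repo root)

    commands = []
    for folder in sorted(touched):
        project = team_projects.get(folder)
        if project is None:
            continue  # not a team flow folder, nothing to register
        commands.append(
            ["prefect", "register", "--project", project, "-p", f"{folder}/"]
        )
    return commands
```

Each returned list could be passed to `subprocess.run(..., check=True)` in the CI job, so a change under one team's folder would not trigger registration for another team.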

Jacob Blanco

05/26/2022, 1:27 AM
1. We do a kind of hybrid where we package most of the dependencies in a base Docker image, and then create a new image containing the flow, which is published to ECR again upon “deployment”. It’s an interesting thought and maybe a change we can consider while moving to 2.0.
2. The multiple workflow definition is definitely an interesting approach. The most straightforward way I can envision doing this is to a) split the repo out into multiple owners, or b) organize the Flows folder structure by Team/Owner rather than by Theme/Database like we do now (this would also help us with issues around code ownership for PR reviews).
3. Actually, our pipeline pre-dates the
idempotency_key
and we don’t calculate our own; instead we use the git diff between the latest staging/production tag and the latest production tag (which is an imperfect solution of course). We are indeed using
flow.register
so your point still stands.
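The tag-to-tag diff in point 3 could look roughly like the sketch below. The `flows/` root, the `.py` filter, and the function names are assumptions for illustration; `git diff --name-only <ref> <ref>` is the standard git invocation for listing changed paths.

```python
import subprocess
from typing import List


def git_changed_files(old_ref: str, new_ref: str) -> str:
    """Return the raw output of `git diff --name-only old_ref new_ref`."""
    result = subprocess.run(
        ["git", "diff", "--name-only", old_ref, new_ref],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def changed_flow_files(diff_output: str, flows_root: str = "flows/") -> List[str]:
    """Keep only Python files under the (assumed) flows root, sorted for stable output."""
    paths = (line.strip() for line in diff_output.splitlines())
    return sorted(
        path for path in paths
        if path.startswith(flows_root) and path.endswith(".py")
    )
```

One possible reason a diff like this is imperfect: a change to a shared module outside the flows root would not mark any flow as changed, even though its behavior may differ.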

Anna Geller

05/30/2022, 2:26 PM
Catching up after holidays - LMK if there is anything you'd like to discuss here, I have no more ideas, probably best if you explore various options and iterate on it and reflect what works and what doesn't

Jacob Blanco

05/31/2022, 2:35 AM
Thank you Anna, you’ve given me a lot of food for thought
🙌 1
Sorry to bring this thread back from the depths of Tartarus, but I’ve had a closer look at your proposed solution using what I assume is a combination of GitHub storage with DockerRun. Is that correct? The benefits I see are:
• Far fewer tags/deployments/registrations, since you only need to register a change to the flow if the git ref has changed or the structure of the flow has changed.
• This will in turn shorten the testing cycle from “make code change, push to branch, create staging tag, wait for re-registration, run flow” to “make code change, push to branch, run flow”.
• We can even have a per-flow branch mapping configuration so we can’t clobber each other’s code changes.
  ◦ We could even, as you suggested, maintain totally parallel branches for each team so re-registration can happen in totally independent flows.
• Also, decoupling the base image from the flow will shorten the cycle of updating flow dependencies, because all we need to do is deploy the new base image with the latest tag and it will get picked up in the next flow run.
• It also reduces the overall number of images stored everywhere, so maintenance is less of a hassle.
The only downside I see is that we need to teach people when the flow has to be re-registered, or is there a way to determine that from the flow object? I’m thinking we can trigger that on merge into the feature branch.
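To make the per-flow branch mapping and the "when do we re-register" rule concrete, here is a minimal sketch. The flow names, branch names, and the `Registration` record are hypothetical; in Prefect 1.x the structure hash could plausibly come from `flow.serialized_hash()` as mentioned earlier in the thread, but that wiring is an assumption and is not shown here.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Registration:
    """What was recorded the last time a flow was registered (hypothetical record)."""
    git_ref: str
    structure_hash: str


# Hypothetical per-flow branch mapping; flows not listed fall back to DEFAULT_REF.
FLOW_REFS: Dict[str, str] = {
    "churn_model": "team-ds",
    "daily_sales": "team-analytics",
}
DEFAULT_REF = "main"


def ref_for(flow_name: str, flow_refs: Dict[str, str] = FLOW_REFS) -> str:
    """Return the git ref a flow's code should be pulled from."""
    return flow_refs.get(flow_name, DEFAULT_REF)


def needs_reregistration(
    previous: Optional[Registration], current: Registration
) -> bool:
    """Re-register only if the git ref or the flow structure has changed."""
    if previous is None:
        return True  # never registered before
    return (
        previous.git_ref != current.git_ref
        or previous.structure_hash != current.structure_hash
    )
```

A check like this could run in CI on merge, so people would not need to remember when re-registration is required; plain code changes inside tasks would leave both fields unchanged and be picked up on the next run from the mapped branch.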

Anna Geller

06/08/2022, 12:15 PM
before we dive deeper here and before you start any process optimizations - what's your upgrade plan to Prefect 2.0? not sure whether optimizing the process for 1.0 makes sense now - it depends on when/how do you plan to migrate
Separating the code dependencies from flow storage is definitely the right first step: it makes redeploying changes to shared modules/base images more independent of flow code changes. Not to give you a disappointing answer, but everything else is something you would need to experiment with in your team to see what works best for them. Education is for sure part of it, but re-registering the flow with the Prefect CLI should work fine from CI - if no change to the flow metadata occurred, the registration is skipped - which makes it all quite straightforward.
👍 1

Jacob Blanco

06/08/2022, 2:11 PM
before we dive deeper here and before you start any process optimizations - what’s your upgrade plan to Prefect 2.0? not sure whether optimizing the process for 1.0 makes sense now - it depends on when/how do you plan to migrate
We are still waiting on confirmation of when Prefect 2.0 will be released so that we can start working backwards towards an upgrade plan. We are in the midst of renewing our contract and are moving to 2.0 as a result. I’ve not had a deep look at 2.0 and how it impacts deployment yet, but I think, philosophically speaking, separating the dependency from the flow definition makes sense regardless. I’ll take a deeper look at 2.0, see what our to-be deployment pipeline should look like, and maybe come back with more specific questions. Thanks.

Anna Geller

06/09/2022, 11:23 AM
separating the dependency from the flow definition makes sense regardless
Couldn't agree more! 💯 That makes sense. Keep us posted!