Alexandru Sicoe

    1 year ago
    Hello everyone,

    We're currently evaluating Prefect as our workflow scheduling solution. So far we love the simple API and architecture, especially with Prefect Cloud, which makes it very easy to get going. We do have a question, however: what is the best practice for project structure, packaging, and CI/CD with Prefect Cloud and the Kubernetes Agent?

    We have multiple jobs across multiple GitHub repos. These are generally various modules in various packages. These packages are mostly added to custom Docker images tailored for other systems where we deploy. They also have complex dependencies from other public and private PyPI repos.

    Thanks,
    Alex

    P.S. Moved further details into the thread.
    Chris White

    1 year ago
    Hi Alex and welcome! Would you mind reducing the size of your question, possibly by moving some paragraphs / code snippets into this thread? It makes it difficult to see the other questions in the channel - thank you!!
    Kyle Moon-Wright

    1 year ago
    Hey @Alexandru Sicoe,

    Overall, I think you have a great plan of attack for structuring your Prefect flows, however there are a few things that are worth thinking about as you embark on this epic quest. :hero:

    First off, there are many project structures and CI/CD setups for Prefect flows that accommodate different needs, so I’m certain you’ll discover nuances suited to your system. Typically, users will encapsulate their entire flow in a single Docker image to package dependencies per Flow, which will simplify scaling on Kubernetes with each module’s custom dependencies.

    Secondly, it looks like you’ll be calling the modules you’ve packaged in as tasks of a flow. If this works for you, then great! However, the real value of Prefect is the ability to author modular components of a greater set of tasks for visibility and configurability. So over time, you may discover you want more insight from Cloud on the status of a task beyond “did the script kick off?”, such as which error was propagated, or retrying a single computation of the module. To take this further, each module could be its own flow, orchestrated in a greater flow-of-flows. This decision will depend on the level of granularity you are seeking.

    Finally, I’d recommend each environment corresponding to a Prefect Agent with its own label (in this case, one with dev and the other with prod), so we can submit Flow Runs to each environment by matching up labels with those corresponding Agents polling our Cloud tenant. If you configure your Flow labels in your Storage or Run_config with the schedules you mentioned, these will be submitted to each environment respectively (assuming you have two Agents, one with each label) with the dynamic values you’ve configured per schedule.
    Hope that makes sense!
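    A minimal sketch of those last two suggestions (a flow-of-flows plus per-environment Agent labels), assuming Prefect 1.x with Cloud; the project name is a placeholder, and the child flow names follow the naming from the example later in this thread:

    from prefect import Flow
    from prefect.run_configs import KubernetesRun
    from prefect.tasks.prefect import StartFlowRun

    # Child flows ("flow_pkg1", "flow_pkg2") are assumed to be registered
    # separately, each packaged in its own Docker image.
    run_pkg1 = StartFlowRun(flow_name="flow_pkg1", project_name="my-project", wait=True)
    run_pkg2 = StartFlowRun(flow_name="flow_pkg2", project_name="my-project", wait=True)

    with Flow("parent-flow") as parent:
        pkg1_run = run_pkg1()
        pkg2_run = run_pkg2()
        # Kick off pkg2's flow only after pkg1's flow run has finished
        pkg2_run.set_upstream(pkg1_run)

    # The "dev" label must match a label on the Kubernetes Agent polling Cloud
    # for the dev environment; a prod deployment would use labels=["prod"].
    parent.run_config = KubernetesRun(labels=["dev"])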
    Alexandru Sicoe

    1 year ago
    Moving details of our setup into the thread as suggested by @Chris White. Our typical structure is something like:
    repo1/
        Dockerfile
        pkg1/
          mod1_1.py
          mod1_2.py
        pkg2/
          mod2_1.py
          mod2_2.py
    
    repo2/
        Dockerfile
        pkg3/
          mod3_1.py
          mod3_2.py
          mod3_3.py
    
    ...
    We have 2 environments: dev and prod. Most of these jobs need to be scheduled on the same schedule for both dev and prod. How would we bring all these jobs into Prefect in a uniform way?

    For starters we were thinking of applying the pattern suggested here: https://docs.prefect.io/orchestration/recipes/configuring_storage.html#configuring-docker-storage

    We were thinking of having another Dockerfile in each repo tailored for Prefect, call it Dockerfile-prefect. We would also have a Python script at the top level in each repo, one for each package, that would create the flow for that package and register it. It would roughly do:
    1. Create a task for every job in every module of that package.
    2. Add them to a Flow.
    3. Have the Flow use Docker Storage pointing to the Dockerfile-prefect file and using the files keyword to point to the module files. Then we would apply the pattern here: https://docs.prefect.io/core/concepts/schedules.html#varying-parameter-values
    4. Give the flow a Schedule with 2 identical Clocks, one for each environment (dev and prod), but these Clocks will obviously have different parameter_defaults configs.
    5. Call register on the flow.

    At step 4 we have a problem: these configs being stuff like database hostnames, connection strings, etc. for either dev or prod, how do we load them dynamically for the various environments? Do we load them from env vars that would live in CI, which ultimately will have to run the prefect_*.py files in the associated repo?

    E.g. steps 1-5 for repo1 for pkg1 above would produce a prefect_pkg1.py script like:
    import datetime
    import os

    from pkg1 import mod1_1, mod1_2
    from prefect import Flow, task
    from prefect.schedules import clocks, Schedule
    from prefect.storage import Docker

    now = datetime.datetime.utcnow()

    # Create our Docker storage object
    storage = Docker(registry_url="gcr.io/dev/",
                     dockerfile="../Dockerfile-prefect")

    # Create our Schedule: one clock per environment, identical except for
    # the parameter defaults (dev vs prod connection details)
    clock1 = clocks.IntervalClock(start_date=now,
                                  interval=datetime.timedelta(hours=1),
                                  parameter_defaults={
                                   "db_server_name": os.getenv("DEV_DB_SERVER_NAME")
                                  }) # there will be more
    clock2 = clocks.IntervalClock(start_date=now,
                                  interval=datetime.timedelta(hours=1),
                                  parameter_defaults={
                                   "db_server_name": os.getenv("PROD_DB_SERVER_NAME")
                                  }) # there will be more
    schedule = Schedule(clocks=[clock1, clock2])

    @task
    def task1():
      mod1_1.execute()

    @task
    def task2():
      mod1_2.execute()

    flow = Flow("flow_pkg1", tasks=[task1, task2], schedule=schedule, storage=storage)

    flow.register()
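    (A rough sketch of the files keyword from step 3, which the script above leaves out; the destination paths inside the image and the PYTHONPATH value are assumptions about the image layout, and the Docker/os imports from the script above are reused.)

    # Sketch only: copying the package modules into the image via the
    # `files` keyword of Docker storage (source paths must be absolute)
    storage = Docker(registry_url="gcr.io/dev/",
                     dockerfile="../Dockerfile-prefect",
                     files={
                         os.path.abspath("pkg1/mod1_1.py"): "/pkg1/mod1_1.py",
                         os.path.abspath("pkg1/mod1_2.py"): "/pkg1/mod1_2.py",
                     },
                     env_vars={"PYTHONPATH": "/"})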
    So the structure for repo1 would become:
    repo1/
        Dockerfile
        Dockerfile-prefect    
        pkg1/
          mod1_1.py
          mod1_2.py
        pkg2/
          mod2_1.py
          mod2_2.py
        prefect_pkg1.py
        prefect_pkg2.py
    And then we would need to configure GitHub Actions to execute each script that starts with "prefect_" at the top of the repo. Does that look OK? Is there a better pattern? I apologise for the very long message ... I would greatly appreciate any feedback!
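    A sketch of what that GitHub Actions step might call: a hypothetical helper (register_flows.py, name made up) that runs every prefect_*.py script at the repo root, with the DEV_*/PROD_* env vars and Prefect Cloud credentials supplied by the CI environment:

    # register_flows.py -- hypothetical CI helper, sketch only.
    # Executes every prefect_*.py registration script at the top of the repo.
    import glob
    import subprocess
    import sys

    for script in sorted(glob.glob("prefect_*.py")):
        print(f"Registering flows from {script}")
        # Env vars (e.g. DEV_DB_SERVER_NAME, PROD_DB_SERVER_NAME) and Prefect
        # Cloud credentials are expected to come from CI secrets.
        subprocess.run([sys.executable, script], check=True)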
    Alexandru Sicoe

    1 year ago
    Thanks for the feedback on our intended plan of action @Kyle Moon-Wright and for the patterns you suggested (flow of flows and label matching per environment), which make sense. We'll keep at it and come back with follow-ups 😉