# ask-community
Alexandru Sicoe
Hello everyone, We're currently evaluating Prefect as our workflow scheduling solution. So far we love the simple API and architecture, especially with Prefect Cloud, which makes it very easy to get going. We do have a question, however: what is the best practice for project structure, packaging and CI/CD with Prefect Cloud and the Kubernetes Agent? We have multiple jobs across multiple GitHub repos. These are generally various modules in various packages. These packages are mostly added to custom Docker images tailored for the other systems where we deploy. They also have complex dependencies from other public and private PyPI repos. Thanks, Alex P.S. Moved further details into the thread
Chris White
Hi Alex and welcome! Would you mind reducing the size of your question, possibly by moving some paragraphs / code snippets into this thread? It makes it difficult to see the other questions in the channel - thank you!!
👍 1
Kyle Moon-Wright
Hey @Alexandru Sicoe, Overall, I think you have a great plan of attack for structuring your Prefect flows; however, there are a few things worth thinking about as you embark on this epic quest.

First off, there are many project structures and CI/CD setups for Prefect flows that accommodate different needs, so I'm certain you'll discover nuances suited to your system. Typically, users encapsulate an entire flow in a single Docker image to package dependencies per flow, which simplifies scaling on Kubernetes with each module's custom dependencies.

Secondly, it looks like you'll be calling the modules you've packaged as tasks of a flow. If this works for you, then great! However, the real value of Prefect is the ability to author modular components of a greater set of tasks for visibility and configurability. So over time, you may find you want more insight from Cloud on the status of a task beyond "did the script kick off?", such as which error was propagated, or the ability to retry a single computation within a module. To take this further, each module could be its own flow, orchestrated in a greater flow-of-flows. This decision will depend on the level of granularity you are seeking.

Finally, I'd recommend having each environment correspond to a Prefect Agent with its own label (in this case, one with `dev` and the other with `prod`), so we can submit flow runs to each environment by matching labels with the corresponding Agents polling our Cloud tenant. If you configure your flow labels in your `Storage` or `run_config` along with the schedules you mentioned, runs will be submitted to each environment respectively (assuming you have two Agents, one per label) with the dynamic values you've configured per schedule.

Hope that makes sense!
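To make that label matching concrete, here is a minimal sketch against the Prefect 1.x API; the flow, task, registry URL, and image name are placeholder assumptions, and it assumes a Kubernetes Agent has been started with a matching label (e.g. `prefect agent kubernetes start --label dev`):
```
from prefect import task, Flow
from prefect.run_configs import KubernetesRun
from prefect.storage import Docker

@task
def say_hello():
    print("hello from the dev environment")

with Flow("labelled-flow") as flow:
    say_hello()

# Package the flow and its dependencies into a dedicated image (placeholder registry).
flow.storage = Docker(registry_url="gcr.io/dev/", image_name="flow-pkg1")

# Runs of this flow carry the "dev" label, so only an Agent polling with a
# matching label will pick them up.
flow.run_config = KubernetesRun(labels=["dev"])
```
Swapping the label to `prod` would route the same flow's runs to the Agent polling for `prod` instead.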
Alexandru Sicoe
Moving details of our setup into the thread as suggested by @Chris White. Our typical structure is something like:
```
repo1/
    Dockerfile
    pkg1/
      mod1_1.py
      mod1_2.py
    pkg2/
      mod2_1.py
      mod2_2.py

repo2/
    Dockerfile
    pkg3/
      mod3_1.py
      mod3_2.py
      mod3_3.py

...
```
We have 2 environments: `dev` and `prod`. Most of these jobs need to be scheduled on the same schedule for both `dev` and `prod`. How would we bring all these jobs into Prefect in a uniform way? For starters we were thinking of applying the pattern suggested here: https://docs.prefect.io/orchestration/recipes/configuring_storage.html#configuring-docker-storage We were thinking of having another Dockerfile in each repo tailored for Prefect, call it `Dockerfile-prefect`.
We would also have a Python script at the top level in each repo, one for each package, that would create the flow for that package and register it. It would roughly do:
1. Create a task for every job in every module of that package.
2. Add them to a Flow.
3. The Flow would use Docker Storage pointing to the `Dockerfile-prefect` file and using the `files` keyword to point to the module files. Then we would apply the pattern here: https://docs.prefect.io/core/concepts/schedules.html#varying-parameter-values
4. The flow would also have a Schedule with 2 identical clocks, one for each environment (`dev` and `prod`), but these Clocks will obviously have different `parameter_defaults` configs.
5. Call register on the flow.
At step 4 we have a problem: these configs are things like database hostnames, connection strings, etc. for either `dev` or `prod`. How do we load them dynamically for the various environments? Do we load them from env vars that would live in CI, which ultimately will have to run the `prefect_*.py` files in the associated repo? E.g. steps 1-5 for `repo1`'s `pkg1` above would produce a `prefect_pkg1.py` script like:
```
import datetime
import os

from pkg1 import mod1_1, mod1_2
from prefect import task, Flow
from prefect.schedules import clocks, Schedule
from prefect.storage import Docker

now = datetime.datetime.utcnow()

# Create our Docker storage object
storage = Docker(registry_url="gcr.io/dev/",
                 dockerfile="../Dockerfile-prefect")

# Create our Schedule: one clock per environment, same interval,
# different parameter defaults
clock1 = clocks.IntervalClock(start_date=now,
                              interval=datetime.timedelta(hours=1),
                              parameter_defaults={
                                  "db_server_name": os.getenv("DEV_DB_SERVER_NAME")
                              })  # there will be more
clock2 = clocks.IntervalClock(start_date=now,
                              interval=datetime.timedelta(hours=1),
                              parameter_defaults={
                                  "db_server_name": os.getenv("PROD_DB_SERVER_NAME")
                              })  # there will be more
schedule = Schedule(clocks=[clock1, clock2])

@task
def task1():
    mod1_1.execute()

@task
def task2():
    mod1_2.execute()

flow = Flow("flow_pkg1", tasks=[task1, task2], schedule=schedule, storage=storage)

flow.register()
```
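One detail to double-check in the script above: the `parameter_defaults` on each clock only take effect if the flow declares a `Parameter` with the same name and the tasks consume it. A hedged variation, reusing the `schedule`, `storage`, and `mod1_1` already defined there (passing the value as `db_server_name=` is an assumption about the module's interface):
```
from prefect import task, Flow, Parameter

@task
def task1(db_server_name):
    # Pass the per-environment value through to the packaged module
    # (keyword name is an assumption about mod1_1's interface).
    mod1_1.execute(db_server_name=db_server_name)

with Flow("flow_pkg1", schedule=schedule, storage=storage) as flow:
    db_server_name = Parameter("db_server_name", default="localhost")
    task1(db_server_name)
```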
So the structure for `repo1` would become:
```
repo1/
    Dockerfile
    Dockerfile-prefect    
    pkg1/
      mod1_1.py
      mod1_2.py
    pkg2/
      mod2_1.py
      mod2_2.py
    prefect_pkg1.py
    prefect_pkg2.py
```
And then we would need to configure GitHub Actions to execute each script that starts with "prefect_" at the top of the repo. Does that look OK? Is there a better pattern? I apologise for the very long message... would greatly appreciate any feedback!
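For that CI step, one option is a small driver script that the GitHub Actions workflow runs from the repo root; a sketch, assuming a made-up file name `register_flows.py` and that the runner has a Prefect Cloud API key and Docker registry credentials configured as secrets:
```
# register_flows.py -- executed by CI (e.g. a GitHub Actions step running
# `python register_flows.py`) from the repo root after installing the package.
import pathlib
import runpy

for script in sorted(pathlib.Path(".").glob("prefect_*.py")):
    print(f"Registering flows from {script} ...")
    # Run each registration script as if it were invoked directly.
    runpy.run_path(str(script), run_name="__main__")
```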
🙏 1
Thanks for the feedback on our intended plan of action, @Kyle Moon-Wright, and for the patterns you suggested (flow of flows and label matching per environment), which make sense. We'll keep at it and come back with follow-ups ;)