Alexandru Sicoe

    1 year ago
    Hello everyone,

    We're currently evaluating Prefect as our workflow scheduling solution. So far we love the simple API and architecture, especially with Prefect Cloud, which makes it very easy to get going. We do have a question, however: what is the best practice for project structure, packaging, and CI/CD with Prefect Cloud and the Kubernetes Agent?

    We have multiple jobs across multiple GitHub repos. These are generally various modules in various packages. These packages are mostly added to custom Docker images tailored for other systems where we deploy. They also have complex dependencies from other public and private PyPI repos.

    Thanks,
    Alex

    P.S. Moved further details into the thread.
    Chris White

    1 year ago
    Hi Alex and welcome! Would you mind reducing the size of your question, possibly by moving some paragraphs / code snippets into this thread? It makes it difficult to see the other questions in the channel - thank you!!
    Kyle Moon-Wright

    1 year ago
    Hey @Alexandru Sicoe,

    Overall, I think you have a great plan of attack for structuring your Prefect flows, however there are a few things that are worth thinking about as you embark on this epic quest. :hero:

    First off, there are many project structures and CI/CD setups for Prefect flows that accommodate different needs, so I’m certain you’ll discover nuances suited to your system. Typically, users will encapsulate their entire flow in a single Docker image to package dependencies per Flow, which will simplify scaling on Kubernetes with each module’s custom dependencies.

    Secondly, it looks like you’ll be calling the modules you’ve packaged in as tasks of a flow. If this works for you, then great! However, the real value of Prefect is the ability to author modular components of a greater set of tasks for visibility and configurability. So over time, you may discover you want more insight from Cloud on the status of a task beyond “did the script kick off?”, such as which error was propagated, or retrying a single computation of the module. To take this further, each module could be its own flow, orchestrated in a greater flow-of-flows. This decision will depend on the level of granularity you are seeking.

    Finally, I’d recommend each environment corresponding to a Prefect Agent with its own label (in this case, one with dev and the other with prod), so we can submit Flow Runs to each environment by matching up labels with those corresponding Agents polling our Cloud tenant. If you configure your Flow labels in your Storage or Run_config with the schedules you mentioned, these will be submitted to each environment respectively (assuming you have two Agents, one with each label) with the dynamic values you’ve configured per schedule.
    Hope that makes sense!
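    A minimal sketch of those last two suggestions (a flow-of-flows plus per-environment Agent labels), assuming Prefect 1.x with Cloud; the project name is a placeholder, and the child flow names follow the naming from the example later in this thread:

    from prefect import Flow
    from prefect.run_configs import KubernetesRun
    from prefect.tasks.prefect import StartFlowRun

    # Child flows ("flow_pkg1", "flow_pkg2") are assumed to be registered
    # separately, each packaged in its own Docker image.
    run_pkg1 = StartFlowRun(flow_name="flow_pkg1", project_name="my-project", wait=True)
    run_pkg2 = StartFlowRun(flow_name="flow_pkg2", project_name="my-project", wait=True)

    with Flow("parent-flow") as parent:
        pkg1_run = run_pkg1()
        pkg2_run = run_pkg2()
        # Kick off pkg2's flow only after pkg1's flow run has finished
        pkg2_run.set_upstream(pkg1_run)

    # The "dev" label must match a label on the Kubernetes Agent polling Cloud
    # for the dev environment; a prod deployment would use labels=["prod"].
    parent.run_config = KubernetesRun(labels=["dev"])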
    Alexandru Sicoe

    1 year ago
    Moving details of our setup into the thread as suggested by @Chris White. Our typical structure is something like:
    repo1/
        Dockerfile
        pkg1/
          mod1_1.py
          mod1_2.py
        pkg2/
          mod2_1.py
          mod2_2.py
    
    repo2/
        Dockerfile
        pkg3/
          mod3_1.py
          mod3_2.py
          mod3_3.py
    
    ...
    We have 2 environments: dev and prod. Most of these jobs need to be scheduled on the same schedule for both dev and prod. How would we bring all these jobs into Prefect in a uniform way?

    For starters we were thinking of applying the pattern suggested here: https://docs.prefect.io/orchestration/recipes/configuring_storage.html#configuring-docker-storage

    We were thinking of having another Dockerfile in each repo tailored for Prefect, call it Dockerfile-prefect. We would also have a Python script at the top level in each repo, one for each package, that would create the flow for that package and register it. It would roughly do:
    1. Create a task for every job in every module of that package.
    2. Add them to a Flow.
    3. Have the Flow use Docker Storage pointing to the Dockerfile-prefect file and using the files keyword to point to the module files. Then we would apply the pattern here: https://docs.prefect.io/core/concepts/schedules.html#varying-parameter-values
    4. Give the flow a Schedule with 2 identical Clocks, one for each environment (dev and prod), but these Clocks will obviously have different parameter_defaults configs.
    5. Call register on the flow.

    At step 4 we have a problem: these configs being stuff like database hostnames, connection strings, etc. for either dev or prod, how do we load them dynamically for the various environments? Do we load them from env vars that would live in CI, which ultimately will have to run the prefect_*.py files in the associated repo?

    E.g. steps 1-5 for repo1 for pkg1 above would produce a prefect_pkg1.py script like:
    import datetime
    import os

    from pkg1 import mod1_1, mod1_2
    from prefect import Flow, task
    from prefect.schedules import clocks, Schedule
    from prefect.storage import Docker

    now = datetime.datetime.utcnow()

    # Create our Docker storage object
    storage = Docker(registry_url="gcr.io/dev/",
                     dockerfile="../Dockerfile-prefect")

    # Create our Schedule: one clock per environment, identical except for
    # the parameter defaults (dev vs prod connection details)
    clock1 = clocks.IntervalClock(start_date=now,
                                  interval=datetime.timedelta(hours=1),
                                  parameter_defaults={
                                   "db_server_name": os.getenv("DEV_DB_SERVER_NAME")
                                  }) # there will be more
    clock2 = clocks.IntervalClock(start_date=now,
                                  interval=datetime.timedelta(hours=1),
                                  parameter_defaults={
                                   "db_server_name": os.getenv("PROD_DB_SERVER_NAME")
                                  }) # there will be more
    schedule = Schedule(clocks=[clock1, clock2])

    @task
    def task1():
      mod1_1.execute()

    @task
    def task2():
      mod1_2.execute()

    flow = Flow("flow_pkg1", tasks=[task1, task2], schedule=schedule, storage=storage)

    flow.register()
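    (A rough sketch of the files keyword from step 3, which the script above leaves out; the destination paths inside the image and the PYTHONPATH value are assumptions about the image layout, and the Docker/os imports from the script above are reused.)

    # Sketch only: copying the package modules into the image via the
    # `files` keyword of Docker storage (source paths must be absolute)
    storage = Docker(registry_url="gcr.io/dev/",
                     dockerfile="../Dockerfile-prefect",
                     files={
                         os.path.abspath("pkg1/mod1_1.py"): "/pkg1/mod1_1.py",
                         os.path.abspath("pkg1/mod1_2.py"): "/pkg1/mod1_2.py",
                     },
                     env_vars={"PYTHONPATH": "/"})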
    So the structure for repo1 would become:
    repo1/
        Dockerfile
        Dockerfile-prefect    
        pkg1/
          mod1_1.py
          mod1_2.py
        pkg2/
          mod2_1.py
          mod2_2.py
        prefect_pkg1.py
        prefect_pkg2.py
    And then we would need to configure GitHub Actions to execute each script that starts with "prefect_" at the top of the repo. Does that look OK? Is there a better pattern? I apologise for the very long message ... I would greatly appreciate any feedback!
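    A sketch of what that GitHub Actions step might call: a hypothetical helper (register_flows.py, name made up) that runs every prefect_*.py script at the repo root, with the DEV_*/PROD_* env vars and Prefect Cloud credentials supplied by the CI environment:

    # register_flows.py -- hypothetical CI helper, sketch only.
    # Executes every prefect_*.py registration script at the top of the repo.
    import glob
    import subprocess
    import sys

    for script in sorted(glob.glob("prefect_*.py")):
        print(f"Registering flows from {script}")
        # Env vars (e.g. DEV_DB_SERVER_NAME, PROD_DB_SERVER_NAME) and Prefect
        # Cloud credentials are expected to come from CI secrets.
        subprocess.run([sys.executable, script], check=True)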
    Alexandru Sicoe

    1 year ago
    Thanks for the feedback on our intended plan of action @Kyle Moon-Wright and for the patterns you suggested (flow of flows and label matching per environment), which make sense. We'll keep at it and come back with follow-ups 😉