
merlin

01/03/2023, 10:52 PM
Hello Prefect folks! I'm searching for a pattern to manage dependencies between deployments. The 'hello world' approach of calling a flow from another flow isn't a good fit, as you'll see from the components of my small data engineering pipeline:
1. Deployments A, B, C: a DDL flow that creates tables in a DB. The flow is parametrized; the parameter is the filename of a DDL SQL script sent to the DB. So I have three deployments of the same flow calling different scripts.
2. Deployment X: data extracts that pull from the tables built by deployments A, B, and C. This is a single flow parametrized by the filenames of several extract SQL scripts.
I want to make sure Deployment X runs only after Deployments A, B, and C have each run that day. Currently I manage the ordering with cron schedules, but this isn't reliable if the agent isn't continuously running, if a DDL job takes a long time, etc. Can I have Deployment X explicitly wait on runs of the other deployments for a given (parametrized) run date?
I did see this thread about `run_deployment`, but I don't see a clean fit with that approach.

Zanie

01/04/2023, 4:09 PM
This is a tricky one; I believe this will be handled by automations in the future. cc @Will Raphaelson

Will Raphaelson

01/04/2023, 4:43 PM
Hmm, so currently I believe you could satisfy this with a simple automation created via the UI if you were okay with the DDLs running sequentially: you'd set up automations so that X runs only after C, C only after B, and B only after A. Because A, B, and C are deployments of the same flow, you'd want to listen for the flow run completions by tag. If you were set on the DDLs running in parallel, I believe you could create one automation that satisfies this via the API by using a threshold, i.e. wait for 3 flow runs with the DDL tag to enter Completed before running X.

merlin

01/04/2023, 5:04 PM
Is there a way to check the state of a deployment run without actually calling the `run_deployment` method?
if previously_run_deployment(parameter1="2022-12-25").state.name == "Completed":
    # go ahead with this dependent flow here
I'm open to refactoring my DDL and extract jobs; I'm looking for a pattern that suits a data engineering setup managing DDL jobs and various extracts, some of which have dependencies.

Zanie

01/04/2023, 6:02 PM
You can read the flow run and look at its state (i.e. with the client).
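A minimal sketch of that client-based check, assuming Prefect 2.x. The `all_completed` helper is illustrative, and the unfiltered `read_flow_runs` call is a simplification: real code would narrow the query by flow name and expected start time, and the client import path varies across Prefect releases.

```python
# Pure helper: given the observed state names of the dependency runs,
# decide whether the dependent deployment may proceed. No Prefect
# import needed, so this part is easy to unit test.
def all_completed(state_names):
    """True only if there is at least one run and every run is Completed."""
    return bool(state_names) and all(s == "Completed" for s in state_names)


async def ddl_runs_completed():
    # Hypothetical query against the Prefect API -- treat this as a
    # sketch, not a drop-in snippet. The import lives inside the
    # function so the file loads even without Prefect installed.
    from prefect.client import get_client

    async with get_client() as client:
        # In practice you'd pass flow/flow-run filters here to select
        # only today's DDL runs.
        runs = await client.read_flow_runs()
        return all_completed([r.state.name for r in runs if r.state])
```

The split keeps the decision logic testable in isolation while the API call stays a thin, replaceable wrapper.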

merlin

01/05/2023, 5:15 AM
Thanks, I'll try to learn how to do that. From here it's more of a data engineer's design question. I guess the pattern starts looking like:
• Create a dependency-checking flow; it takes table names and a run date as parameters.
  ◦ The dependency check sees if the DDL flow run for (table=my_table, rundate="2022-12-31") has state == Completed.
  ◦ If so, proceed; otherwise wait, up to a timeout.
• Use the dependency-checking flow as a subflow in all DDL and extract flows.
• Every new "SQL job" then has three parts:
  ◦ a SQL file to send to the DB,
  ◦ a list of table dependencies (put in the deployment YAML),
  ◦ a deployment created for it.
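The "proceed else wait for timeout" step above can be sketched as a generic polling loop. The `check` callable is a hypothetical stand-in for the Prefect query ("is the DDL run for (table, rundate) Completed?"), and the injectable `sleep`/`clock` parameters exist only so the loop can be tested without real waiting:

```python
import time


def wait_for_dependencies(check, timeout_s=3600.0, poll_s=30.0,
                          sleep=time.sleep, clock=time.monotonic):
    """Poll `check()` until it returns True or `timeout_s` elapses.

    `check` would wrap the Prefect API query for the dependency's flow
    run state; it is left abstract here. Returns True if the dependency
    was satisfied in time, False on timeout.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if check():
            return True
        sleep(poll_s)  # back off between state checks
    return False
```

Inside a Prefect subflow the False branch would raise (or return a failed state) so downstream extract tasks don't run against missing tables.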