Hello, Prefect experts. We are bootstrapping our analytics stack, and I have been evaluating different modern solutions. We have production data in a SQL database, and we want to do some analytics cleanup and transformation and ship the results to Power BI. Based on the research I have done, it looks like the viable solution for us will be dbt + a warehouse (Snowflake) + visualization (Power BI). If that is the case, it sounds like we do not need Prefect to orchestrate Python-based tasks. Do you have a case study that applies Prefect to an analytics pipeline architecture? Hope someone can give me some direction.
02/25/2022, 8:04 PM
@Hui Huang you may get a broader audience posting in #prefect-community
02/25/2022, 8:40 PM
As a person who has done a lot of data warehousing and data modelling in the past, I honestly can't imagine doing that without some orchestrator. Here are just some things to consider:
• How (otherwise) do you want to schedule, orchestrate and monitor all jobs syncing raw data to your staging area?
• How do you want to ensure the proper order of execution so that your dbt models don't run unless your source data is up-to-date?
• How do you want to ensure that extract-load jobs that regularly extract data from various systems and load it into your staging area were successful so that you can be sure that your dbt models are not built based on stale data?
• How do you want to run independent jobs in parallel to ensure that your data warehouse can be filled with fresh data fast? (rather than naively running one cron job after another, which would be extremely unreliable and could take days)
• How will you run, schedule, orchestrate and monitor custom scripts that can't be expressed in SQL, incl. ML model training processes and data science experiments, jobs extracting data from custom APIs, reverse ETL processes (if needed), data validation tests to ensure data quality (e.g. using Great Expectations or Pandera), and custom automations sending you a message when something occurs (e.g. when your data doesn't arrive on time, when some KPIs deviate from expected values, or when your data pipeline doesn't finish by a specific time)?
• How will you audit your processes to ensure data and process quality? Where can you find your logs?
• How will you get notified if something goes wrong?
• How will you even find out that something went wrong in your data pipelines/dbt models?
I could go on for hours on this but you get the idea 😄
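To make the ordering and parallelism points above concrete, here is a minimal sketch in plain Python; the `extract_load` and `run_dbt` names and the `dbt run` call are illustrative, not a real pipeline. An orchestrator like Prefect gives you exactly this shape plus scheduling, retries, logging, and notifications:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def extract_load(source):
    # Placeholder extract-load job: in reality this would pull from a source
    # system (API, database) into the staging area. Raising on failure is what
    # guarantees dbt never builds on top of stale data.
    print(f"syncing {source} into staging")
    return source

def run_dbt():
    # Hypothetical build step: shell out to the dbt CLI once sources are fresh.
    subprocess.run(["dbt", "run"], check=True)

def pipeline(sources, build=run_dbt):
    # Independent extract-load jobs run in parallel instead of as one cron job
    # chained after another; any failure propagates and stops the build step.
    with ThreadPoolExecutor() as pool:
        synced = list(pool.map(extract_load, sources))
    # The dbt build runs only after *all* source syncs succeeded.
    build()
    return synced
```

Even this toy version shows why a home-grown approach grows into an orchestrator: you still need retries, alerting, and a UI for the logs, which is what Prefect provides out of the box.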
I cannot imagine building a reliable data warehousing project without a proper workflow orchestration solution. If you don't pick one and use it right from the beginning, you will find yourself building your own orchestration solution over time. Check out this excellent talk by our CTO Chris White that explains the problem more:
02/26/2022, 6:27 PM
@Chris Reuter Will give it a try in another channel. You guys are awesome. 🙂
@Anna Geller Hey! What a treasure list you threw at me. 🥲 But… I love it!
Also, I read some blog posts about "negative" and "positive" engineering. After watching Chris's talk, I now have a better understanding of the contrast he is outlining. (I think he is being too nice by calling it "negative" engineering. In my eyes, it is a mess, and I hate that kind of technical debt.)
I will create a new post in #prefect-community as Chris suggested. Thank you very much! 🙏