Hello everyone 👋
I’m a data engineer tasked with choosing our new orchestration/monitoring tools and am interested in using Prefect with Redshift.
Is this a design pattern that other Redshifters are using?
1. api call to gather source data
2. schema test
3. write data to s3
4. copy command to redshift temp table (using psycopg2 connection)
5. append/insert/merge to staging table
04/10/2020, 4:35 PM
Our team is in GCP, but there are a lot of parallels to a Flow we have running right now. Our steps are almost identical to yours, except we gather the source data, save it directly to GCS, and in a downstream task we do validation. We made this decision because the source we’re extracting from is flaky at best in terms of reliability, so we wanted minimal errors on our side between api response & downstream processing. This means we made a decision that we’d actively make our processing slower, but it works for us. I don’t see why reversing the order (validate after getting api response) would be a bad practice.
We’re using snowflake, and we have steps 4 and 5 bunched into one task. Our task creates a staging table (unique to this task), copies data in, does the merge into the actual table, then cleans up. We did this because we wanted the entire task to be idempotent, but looking back at it, I think that mentality came from our team being so heavily influenced by Airflow.
Not prefect specific, but one thing I would note is that if you aren’t using a task specific staging table (marked by date, uuid, or something else), having concurrent flow runs could result in some weird race conditions of both flows updating the staging table.
04/10/2020, 5:55 PM
Thanks for the reply Alex! It definitely makes sense to make validation a separate flow from ingestion 👍
We would also implement idempotency for our loads. I believe it is just good practice whenever possible.
04/10/2020, 5:57 PM
No problem! We actually have it just in a separate task, not an entirely different flow.