Hey there, I have a general question about best practices for data transformations, not related to Prefect itself. I hope this channel is appropriate.
We use Prefect to coordinate ingestion of data into our warehouse (BigQuery). From there, we use dbt to transform it as we need.
One of our data imports is rather large (let's say 100GB in total to keep it simple). We use Airbyte to ingest an additional 1GB daily. This daily ingest also creates a lot of duplicates (the 100GB table already contains some of the rows that arrive with the daily insert) - this is due to the underlying data structure, and there's not much we can do about it.
How would you go about deduplicating this data? I would like to avoid reading the full 100GB every day just for deduplication. Any ideas? Thanks in advance 😄 🚀
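(For context: one pattern that avoids rescanning the full table is a dbt incremental model on a partitioned BigQuery table, merging on a unique key and limiting the merge to recent partitions. A rough sketch below, assuming placeholder names (deduped_events, event_id, ingested_at, an airbyte/raw_events source) and an assumed 3-day lookback window; incremental_predicates requires a recent dbt version.)
```sql
-- models/deduped_events.sql  (all model/column names are placeholders)
{{ config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge',
    partition_by={'field': 'ingested_at', 'data_type': 'timestamp', 'granularity': 'day'},
    -- limit the MERGE scan on the destination side to recent partitions instead of
    -- the full 100GB (assumes late-arriving duplicates stay within roughly 3 days)
    incremental_predicates=[
        "DBT_INTERNAL_DEST.ingested_at >= timestamp_sub(current_timestamp(), interval 3 day)"
    ]
) }}

select
    event_id,
    payload,
    ingested_at
from {{ source('airbyte', 'raw_events') }}
where true
{% if is_incremental() %}
  -- only read rows Airbyte landed since the last run, not the full raw table
  and ingested_at > (select max(ingested_at) from {{ this }})
{% endif %}
-- keep one row per key within the incoming batch so the merge matches at most one source row
qualify row_number() over (partition by event_id order by ingested_at desc) = 1
```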
redsquare
08/02/2022, 8:02 AM
You could use ClickHouse as an intermediate step - ClickHouse will merge the duplicates down (you can choose to keep the existing rows), then insert only the new rows into BQ
redsquare
08/02/2022, 8:02 AM
Airbyte has connectors to do this
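(For reference, the "merge the duplicates down" behaviour described above maps to ClickHouse's ReplacingMergeTree engine. A minimal sketch with placeholder table and column names:)
```sql
-- Intermediate staging table in ClickHouse (placeholder names).
-- Rows sharing the same ORDER BY key are collapsed during background merges,
-- keeping the row with the highest `ingested_at` version.
CREATE TABLE staging_events
(
    event_id    String,
    payload     String,
    ingested_at DateTime
)
ENGINE = ReplacingMergeTree(ingested_at)
ORDER BY event_id;

-- Background merges are eventual, so FINAL forces deduplication at read time,
-- e.g. when exporting the cleaned rows onward to BigQuery.
SELECT *
FROM staging_events
FINAL;
```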
Andreas Nigg
08/02/2022, 8:06 AM
Cool thanks for the swift answers 👍
We do use Airbyte; however, we came across some issues with its deduplication feature when the source is highly nested and uses JSON arrays. Do you by chance know how Airbyte implements this deduplication? They also use dbt under the hood, so I guess this would be applicable. Unfortunately, I was not able to find out how they do it.
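(Not speaking to what Airbyte's generated dbt actually emits, but the generic dbt/BigQuery pattern for "keep the latest row per key" is a row_number() window over the primary key, ordered by a cursor column. Placeholder names below:)
```sql
-- Generic dedup pattern; column names are placeholders, and this is not a claim
-- about the models Airbyte generates under the hood.
select * except (row_num)
from (
    select
        *,
        row_number() over (
            partition by event_id        -- primary key of the record
            order by ingested_at desc    -- cursor column: latest version wins
        ) as row_num
    from {{ source('airbyte', 'raw_events') }}
)
where row_num = 1
```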