Andreas Nigg

08/02/2022, 7:37 AM
Hey there, I have a general question about best practices for data transformations, not related to Prefect itself - I hope this channel is appropriate. We use Prefect to coordinate ingestion of data into our warehouse (BigQuery). From there, we use dbt to transform it as we need. One of our data imports is rather huge (let's say 100GB in total to make it easy). We use Airbyte to ingest an additional 1GB daily. This daily ingest also creates a lot of duplicates (the 100GB table already contains some of the rows that arrive with the daily load) - this is due to the underlying data structure and there's not much we can do about it. How would you actually go about deduplicating this data? I would like to avoid reading the full 100GB every day just for deduplication. Any ideas? Thanks already in advance 😄 🚀
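
One way to avoid rescanning the full 100GB is a dbt incremental model that only reads the newly loaded slice and merges it into the big table on a unique key. A minimal sketch (not from the thread), assuming hypothetical `id` (business key) and `_loaded_at` (ingestion timestamp) columns - adjust names and partitioning to your schema:

```sql
-- models/deduped_events.sql  (hypothetical dbt model)
{{
  config(
    materialized = 'incremental',
    unique_key   = 'id',
    partition_by = {'field': '_loaded_at', 'data_type': 'timestamp'}
  )
}}

select
  id,
  payload,
  _loaded_at
from {{ source('airbyte', 'raw_events') }}

{% if is_incremental() %}
  -- only scan source rows loaded since the last run of this model;
  -- the merge on unique_key then updates rows that already exist in the
  -- target instead of inserting duplicates
  where _loaded_at > (select max(_loaded_at) from {{ this }})
{% endif %}
```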

redsquare

08/02/2022, 8:02 AM
You could use ClickHouse as an intermediate step - ClickHouse will merge the duplicates down (you can choose to keep the existing rows), then insert only the new ones into BQ
Airbyte has connectors to do this

Andreas Nigg

08/02/2022, 8:06 AM
Cool, thanks for the swift answers 👍 We do indeed use Airbyte; however, we ran into some issues with its deduplication feature when the source is highly nested and uses JSON arrays. Do you happen to know how Airbyte implements this deduplication? They also use dbt under the hood, so I guess it would be applicable. Unfortunately, I wasn't able to find out how they do it.
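
For reference, the usual dbt-style "keep the latest row per key" dedup uses a window function over the primary key ordered by the emission timestamp. Shown below as a generic BigQuery sketch with hypothetical `id` and `_airbyte_emitted_at` columns, not as a copy of what Airbyte actually generates:

```sql
-- Keep only the most recent row per key (BigQuery syntax).
-- A common dedup pattern, not necessarily Airbyte's generated SQL.
select *
from raw_events
where true  -- BigQuery requires WHERE, GROUP BY, or HAVING alongside QUALIFY
qualify row_number() over (
  partition by id
  order by _airbyte_emitted_at desc
) = 1
```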

redsquare

08/02/2022, 8:09 AM
I was referring to ClickHouse and its native deduplication with the ReplacingMergeTree engine -> https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replacingmergetree/
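
For context, a ReplacingMergeTree table collapses rows that share the same ORDER BY key during background merges, keeping the row with the highest version column. A minimal sketch with hypothetical table and column names:

```sql
-- Rows sharing the ORDER BY key (id) are collapsed during background merges;
-- the row with the highest `version` value survives.
CREATE TABLE events_dedup
(
    id      UInt64,
    payload String,
    version DateTime
)
ENGINE = ReplacingMergeTree(version)
ORDER BY id;

-- Merges run asynchronously, so query with FINAL to read deduplicated results immediately:
SELECT * FROM events_dedup FINAL;
```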
Not sure about Airbyte's deduplication - I haven't really used it a great deal

Anna Geller

08/02/2022, 10:54 AM
Airbyte has a feature to prevent that duplication during ingest - I'd encourage you to ask here: https://discuss.airbyte.io/