Hey there, I have a general question about best practices for data transformations, not related to Prefect itself. I hope this channel is appropriate.
We use Prefect to coordinate ingestion of data into our warehouse (BigQuery). From there, we use dbt to transform it as needed.
One of our data imports is rather large (let's say 100GB in total, to keep it simple). We use Airbyte to ingest an additional ~1GB daily. This daily ingest also creates a lot of duplicates (the 100GB table already contains some of the rows that arrive with the daily insert); that's due to the underlying data structure, and there's not much we can do about it.
How would you actually go about deduplicating this data? I'd like to avoid reading the full 100GB every day just for deduplication. Any ideas? Thanks in advance 😄 🚀
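For reference, the direction I'm currently considering is a dbt incremental model with a merge strategy, so only the new daily slice gets scanned on the source side. Just a sketch: the column names (`id`, `ingested_at`, `payload`) and source names are made up for illustration.

```sql
-- models/staging/stg_big_table.sql
-- Sketch of an incremental dedup model, assuming each row has a
-- unique `id` and an `ingested_at` timestamp (both hypothetical).
{{
  config(
    materialized='incremental',
    unique_key='id',
    incremental_strategy='merge',
    partition_by={'field': 'ingested_at', 'data_type': 'timestamp'}
  )
}}

select
  id,
  ingested_at,
  payload
from {{ source('airbyte', 'raw_big_table') }}

{% if is_incremental() %}
  -- on incremental runs, only read rows newer than what's
  -- already in the target table
  where ingested_at > (select max(ingested_at) from {{ this }})
{% endif %}
```

My worry is that the MERGE on `unique_key` may still scan the whole 100GB target table unless BigQuery can prune partitions, so I'm unsure whether partitioning the target by ingestion date is enough to keep the scanned bytes down.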