
Daniel Sääf

05/20/2022, 5:48 AM
Hi. I'm creating my first flow, a daily ETL flow that reads data from CSV files and writes the data to BigQuery. Now I wonder if there are any recommended ways to safeguard against duplicates being written to BigQuery if the flow is executed twice. I was thinking of using cache_key_fn so the write task isn't rerun, but I feel unsure whether that's how it's supposed to be used. (I would rather have the task be skipped..)
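(For reference, a minimal sketch of that caching approach, assuming Prefect 2's cache_key_fn together with the built-in task_input_hash; the file path and the load logic are placeholders:)
```python
import csv
from datetime import timedelta

from prefect import flow, task
from prefect.tasks import task_input_hash


@task
def read_csv(path: str) -> list:
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


@task(cache_key_fn=task_input_hash, cache_expiration=timedelta(days=1))
def load_to_bigquery(rows: list) -> None:
    # With task_input_hash as the cache key, rerunning the flow with the
    # same rows inside the expiration window reuses the cached result
    # instead of executing this write again.
    print(f"would write {len(rows)} rows")  # placeholder for the real load


@flow
def daily_etl(csv_path: str = "data.csv"):
    rows = read_csv(csv_path)
    load_to_bigquery(rows)
```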

Anna Geller

05/20/2022, 10:58 AM
This is usually something you tackle on the SQL side, e.g. with BigQuery's MERGE statement: https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#merge_statement
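(For concreteness, a rough sketch of that pattern using the google-cloud-bigquery client: load each day's CSV into a staging table, then MERGE into the target so rows that are already present are not re-inserted, which makes reruns idempotent. The project, dataset, table, and column names are all placeholders:)
```python
from google.cloud import bigquery

# Hypothetical table and column names; adjust to your schema.
MERGE_SQL = """
MERGE `my_project.my_dataset.events` AS target
USING `my_project.my_dataset.events_staging` AS source
ON target.event_id = source.event_id
WHEN NOT MATCHED THEN
  INSERT (event_id, payload, loaded_at)
  VALUES (source.event_id, source.payload, source.loaded_at)
"""

client = bigquery.Client()
client.query(MERGE_SQL).result()  # blocks until the merge completes
```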

Kevin Kho

05/20/2022, 2:12 PM
You can use the KV Store to keep track of already processed records
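(A rough sketch of that idea, assuming Prefect 1 Cloud's KV Store via prefect.backend's set_key_value / get_key_value, plus a SKIP signal since you'd rather have the task skipped; the key name and date logic are placeholders:)
```python
from prefect import task
from prefect.backend import get_key_value, set_key_value
from prefect.engine.signals import SKIP

PROCESSED_KEY = "etl_last_processed_date"  # hypothetical key name


@task
def load_to_bigquery(rows: list, csv_date: str) -> None:
    try:
        last_done = get_key_value(PROCESSED_KEY)
    except Exception:
        last_done = None  # key not created yet
    if last_done == csv_date:
        # Mark this task run as Skipped instead of re-writing the rows.
        raise SKIP(f"{csv_date} already loaded")
    # ... write rows to BigQuery here ...
    set_key_value(key=PROCESSED_KEY, value=csv_date)
```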

Daniel Sääf

05/20/2022, 2:30 PM
Thanks for the great advice!
Must say that this is one of the best community forums I've experienced!