What’s the best practice for Data Retention Policy on Prefect deployment runs?
For reference, in Apache Airflow this is commonly implemented as yet another garbage-collector DAG:
I’m sure that Prefect either has a built-in mechanism for this or encourages a common idiom for rotating, archiving, or deleting artifacts from old runs.
We have persistent storage on Azure Blob Storage (Azure’s S3 equivalent) where we store artifacts (e.g. output files and images) from Machine Learning (Kedro) runs.
The space can pile up pretty quickly across runs, and running out of storage would render our Prefect deployments non-operational.
What kind of policies are recommended for evicting data from old runs?
I don’t want to run out of space, and I want the Prefect pipelines to remain operational.
I know that some of you will say: “_It depends_”, so for the sake of this example let’s imagine that I have a dedicated 256 GB of storage.
Should I set a threshold (e.g. 70% full) that acts as a trigger for evicting (deleting) artifacts from old runs?
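To make the threshold idea concrete, here is a minimal sketch of the selection logic I have in mind. Everything in it is an assumption on my part (the watermark values, the `select_evictions` name, and the `(name, last_modified, size)` tuples, which in practice would come from listing blobs with the `azure-storage-blob` SDK); the actual delete calls are left out:

```python
from datetime import datetime, timedelta, timezone

# Assumed budget and watermarks (from the 256 GB example above).
CAPACITY_BYTES = 256 * 1024**3   # total dedicated storage
HIGH_WATERMARK = 0.70            # start evicting at 70% full
LOW_WATERMARK = 0.50             # evict down to 50% to avoid re-triggering every run

def select_evictions(blobs, capacity=CAPACITY_BYTES,
                     high=HIGH_WATERMARK, low=LOW_WATERMARK):
    """Pick the oldest blobs to delete once usage exceeds the high
    watermark, freeing space until usage drops below the low watermark.

    `blobs` is a list of (name, last_modified, size_bytes) tuples —
    in practice built from ContainerClient.list_blobs() properties.
    """
    used = sum(size for _, _, size in blobs)
    if used <= capacity * high:
        return []                # under the trigger threshold: nothing to do
    target = capacity * low
    evictions = []
    # Oldest first, so artifacts from recent runs survive.
    for name, modified, size in sorted(blobs, key=lambda b: b[1]):
        if used <= target:
            break
        evictions.append(name)
        used -= size
    return evictions
```

The two-watermark design (trigger at 70%, evict down to 50%) is deliberate: evicting only back to the trigger level would make the cleanup fire on almost every subsequent run.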
Also, when should this run: as the first (prerequisite) subflow of my bigger flow, or as yet another Prefect deployment on a recurring schedule?
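If the answer is the separate-deployment route, I imagine the body of that cleanup is a simple age-based sweep like the sketch below. The function name and the 30-day window are hypothetical, and the blob listing/deletion against `azure-storage-blob` is only indicated in comments; the same function could presumably also be wrapped as a prerequisite subflow:

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 30  # hypothetical retention window; tune to run cadence

def expired(blobs, now=None, retention_days=RETENTION_DAYS):
    """Return the names of blobs older than the retention window.

    `blobs` is a list of (name, last_modified) pairs — in practice the
    name and last_modified properties from ContainerClient.list_blobs().
    The caller would then call delete_blob() on each returned name.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [name for name, modified in blobs if modified < cutoff]
```

Run as its own scheduled deployment, this keeps cleanup decoupled from the ML pipeline (a failed cleanup doesn’t block a run, and vice versa), which is why I lean toward that option over a prerequisite subflow.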