What would be the best practice in determining, if flow needs to run the ETL or not, depending on source data being stale or not.
Context: I've got a google sheet from which I'd load data into Snowflake, which works great, however I'd like to avoid communicating with Snowflake, if the GS has not been updated since the last successful flow run.
My options the way I see it are
• Store state somewhere externally (would require quite a bit of extra code and moving parts)
• During flow runtime, check previous flow run status and time of run
• Check when the GS was last updated and compare the delta between that and NOW() with the flow's scheduled interval retrieved either during runtime or hardcoded
04/30/2023, 11:37 AM
I don't know much about google sheets, but as I understand it there is a last-modified timestamp? The simple solution is to store the last-modified timestamp in a Prefect variable, check against it and update it if you find it has been modified.
If you load a lot of google sheets, or a lot of other data sources, I would take a look at some more dedicated E/L framework. I use Meltano from within Prefect, and I see Meltano has a google-sheets tap (although I haven't used it). https://hub.meltano.com/loaders/target-gsheet. This might take some more setup though, setting up a meltano project, and possibly use external state storage (S3, postgres or similar) if you use Docker.
04/30/2023, 1:38 PM
You have read my mind!
At first I did consider Meltano (i use it in some other use cases), but for this particular one it is little bit of overkill.
✅ prefect variables sounds perfect for this! Thank you!!!