# ask-community
n
Hi Prefect 👋 Is there a way to not keep any data in the `flows` folder (my folder is about 200 GB in size, from just 2 days' worth of flows)? Ideally I would like to delete the flow data once a flow has completed with 100% success.
k
Hi @Newskooler! Are you persisting results in your flows?
n
I don’t know. How can I tell if that’s the case?
k
I mean like using `flow.result = something` or a `@task(result=...)`
n
I don’t have `flow.result` but I do have `flow.storage = Local(directory=storage_dir)`
k
Storage is just the script that’s being saved (in serialized form). When you look inside that folder, which files are big?
n
I have something like this: `prefect-result-2021-05-16t02-17-03-828171-00-00`, which seems to contain a lot of data from my ETL (hence why it’s taking so much space)
k
You can try setting an environment variable: `PREFECT__FLOWS__CHECKPOINTING=false`
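As a sketch of how that variable is picked up (assuming Prefect 1.x behavior, where `PREFECT__SECTION__KEY` environment variables are read into the config when the library loads), you would set it in the environment of the process that runs the flow:

```python
# Sketch only: Prefect 1.x maps PREFECT__SECTION__KEY environment
# variables onto its config at import time, so the variable must be in
# the environment BEFORE the flow process starts (e.g. exported in the
# agent's environment, or set here before `import prefect`).
import os

os.environ["PREFECT__FLOWS__CHECKPOINTING"] = "false"

# Any Prefect process started from this environment now inherits the
# setting and will skip writing task results to disk.
```

In a shell you would typically `export PREFECT__FLOWS__CHECKPOINTING=false` before starting the agent, to the same effect.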
More info here
n
okay, in this case: if a flow fails partially (some tasks inside it fail), will I still be able to restart them at the end?
k
No because it will look for data from a location that doesn’t exist (assuming your tasks pass the data around)
In general, Prefect doesn’t go in and delete stuff for users at the moment.
n
so the best I can do is to just delete this data 2-3 days later?
once the flow has fully passed, etc?
k
Yes I guess so if you need the restart functionality.
n
I do, because sometimes not all tasks inside pass.
So why does Prefect not just save metadata? That way, if I were to restart, it would restart the job from A to Z (the dependency graph) instead of having to keep all data results. It would have to re-run some upstream dependencies, but it would be much lighter. Does that make sense or not really?
k
Could you give me more details about the failure that happens sometimes? Is it like there’s an API that doesn’t work all the time?
n
So I have a Flow which does a simple ETL and it’s mapped, where each map is naturally independent of the others. Sometimes one of the maps fails (a problem on my end, not on Prefect’s end). In that case I need to reset and re-run the whole map (that’s after I have used an auto-restart on each task and it has failed multiple times). In such cases, out of 1k mapped tasks per flow I would have 1-20 failed, so just restarting them from scratch is fine with me. Which means I don’t need any actual data kept except metadata (i.e. data which tells Prefect which ones passed, which failed, and what depends on what, so that it knows where to restart the run from). Currently I cannot do this; I always need to keep the data we discussed earlier. This data is quite heavy though, because my ETL is extracting data which is many GBs, so my hard disk keeps data I don’t need. I have a cronjob which cleans it once every day or two, but it’s still a lot and, more importantly, unnecessary. That’s my case. : )
k
Have you explored caching? You can cache based on the inputs that went into the task. If the task succeeded with those inputs, then it won’t run them again upon re-run (you specify how long the cache lasts).
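To illustrate the mechanism being described (this is a stdlib-only sketch of input-based caching, not Prefect’s implementation; in Prefect 1.x the corresponding knobs are the task’s `cache_for` and `cache_validator` arguments):

```python
# Sketch: a task result is reused when the same inputs were already
# seen within the cache window, so a successful task is not re-run.
from datetime import datetime, timedelta

cache = {}  # (task_name, inputs) -> (result, expiry)

def run_with_cache(task_name, fn, inputs, cache_for=timedelta(hours=24)):
    key = (task_name, inputs)
    hit = cache.get(key)
    if hit is not None and hit[1] > datetime.now():
        return hit[0]                       # cache hit: skip the task
    result = fn(*inputs)                    # cache miss: run and record
    cache[key] = (result, datetime.now() + cache_for)
    return result

calls = []
def extract(x):
    calls.append(x)
    return x * 2

run_with_cache("extract", extract, (3,))
run_with_cache("extract", extract, (3,))    # second call is a cache hit
```

On a restart, only the mapped children whose inputs never produced a success would actually execute again, which matches the 1-20 failures out of 1k described above.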
n
isn’t that across flows? (caching info from one flow to the next)
in my case I don’t pass any info between flows (so I don’t need caching - that’s my logic). it’s all an issue within a single flow.
k
Caching is for a single flow. We don’t have built-in mechanisms for passing data between flows. The cache is just an indicator that a task has been run already. If you use a `target`, that is a form of caching based on file persistence: future runs will check for the existence of that file. But caching is just metadata about what went into the tasks.
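The file-existence check that `target` performs can be sketched like this (stdlib only, not Prefect’s code; the template placeholder and file layout are made up for illustration):

```python
# Sketch: before running a task, check whether its templated target
# file already exists. If it does, read it back instead of re-running;
# if not, run the task and persist the result to that path.
import os
import tempfile

def run_with_target(task_name, fn, target_template, base_dir):
    target = os.path.join(base_dir, target_template.format(task_name=task_name))
    if os.path.exists(target):              # target hit: skip the task
        with open(target) as f:
            return f.read()
    result = fn()                           # target miss: run and persist
    with open(target, "w") as f:
        f.write(result)
    return result

runs = []
def extract():
    runs.append(1)
    return "payload"

workdir = tempfile.mkdtemp()
run_with_target("extract", extract, "{task_name}.txt", workdir)
run_with_target("extract", extract, "{task_name}.txt", workdir)  # skipped
```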
n
So you are saying I can use caching and get rid of this: `flow.storage = Local(directory=storage_dir)`?
k
No no. Storage is different from the caching and checkpointing. Storage is just where your flow is saved in serialized form. When the flow starts, it goes in there and loads the flow from there. That’s not related to any DATA storage.
Git Storage for example means, “load my file from Git”
n
But that’s what’s causing my problem - the files there are many and huge.
so just for 2 days I have about 12k files and they are a few MB each (some a few dozen MB)
k
Is your flow code small enough to share?
It’s the checkpointing that is causing data to be stored, not the `flow.storage`.
a
Something that's worked for us in a similar situation has been templating result locations. With an appropriate template, your results from subsequent flow runs should overwrite previous results rather than creating new files.
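The overwrite effect comes from the template resolving to the same path on every run. A stdlib-only sketch (the placeholder names mirror Prefect 1.x result-location templating, e.g. as used with `LocalResult(location=...)`; the exact template here is a made-up example):

```python
# Sketch: a result-location template WITHOUT run-specific pieces such
# as timestamps resolves to the same path on every run of the same
# mapped task, so the new result overwrites the old file instead of
# accumulating a fresh timestamped file per run.
template = "{flow_name}/{task_name}/{map_index}.prefect"

# Two different flow RUNS of the same mapped child...
run1 = template.format(flow_name="etl", task_name="extract", map_index=7)
run2 = template.format(flow_name="etl", task_name="extract", map_index=7)

# ...resolve to the same location, so disk usage stays bounded by the
# number of tasks rather than growing with the number of runs.
assert run1 == run2
```

By contrast, the default timestamped names (like the `prefect-result-2021-05-16t02-17-03-...` file mentioned earlier) are unique per run, which is why the folder grows without bound.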
k
This is a good suggestion @Anurag Bajpai!