# ask-community
n
Hi Prefect 👋 Is there a way to not keep any data in the `flows` folder (my folder is about 200 GB in size, from just 2 days' worth of flows)? Ideally I would like to delete the flow data once a flow has completed with 100% success.
k
Hi @Newskooler! Are you persisting results in your flows?
n
I don’t know. How can I tell if that’s the case?
k
I mean like using `flow.result = something` or a `@task(result=...)`
n
I don’t have `flow.result` but I do have `flow.storage = Local(directory=storage_dir)`
k
Storage is just the script that’s being saved (in serialized form). When you look inside that folder, which files are big?
n
I have something like this: `prefect-result-2021-05-16t02-17-03-828171-00-00`, which seems to contain a lot of data from my ETL (hence why it’s taking so much space)
k
You can try setting an environment variable: `PREFECT__FLOWS__CHECKPOINTING=false`
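As a sketch of how that variable is picked up (assuming Prefect 1.x behavior, where `PREFECT__SECTION__KEY` environment variables are read into the config when the library loads), you would set it in the environment of the process that runs the flow:

```python
# Sketch only: Prefect 1.x maps PREFECT__SECTION__KEY environment
# variables onto its config at import time, so the variable must be in
# the environment BEFORE the flow process starts (e.g. exported in the
# agent's environment, or set here before `import prefect`).
import os

os.environ["PREFECT__FLOWS__CHECKPOINTING"] = "false"

# Any Prefect process started from this environment now inherits the
# setting and will skip writing task results to disk.
```

In a shell you would typically `export PREFECT__FLOWS__CHECKPOINTING=false` before starting the agent, to the same effect.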
More info here
n
okay, in this case: if a flow fails partially (some tasks inside it fail), will I still be able to restart them at the end?
k
No because it will look for data from a location that doesn’t exist (assuming your tasks pass the data around)
In general, Prefect doesn’t go in and delete stuff for users at the moment.
n
so the best I can do is to just delete this data 2-3 days later?
once the flow has fully passed, etc?
k
Yes I guess so if you need the restart functionality.
n
I do, because sometimes not all tasks inside pass.
So why does Prefect not just save metadata? That way, if I were to restart, it would restart the job from A to Z (the dependency graph) instead of having to keep all data results. It would have to re-run some upstream dependencies, but it would be much lighter. Does that make sense or not really?
k
Could you give me more details about the failure that happens sometimes? Is it like there’s an API that doesn’t work all the time?
n
So I have a Flow which does a simple ETL and it’s mapped, where each map is naturally independent of the others. Sometimes one of the maps fails (a problem on my end, not on Prefect’s end). In that case I need to reset and re-run the whole map (that’s after I have used an auto-restart on each task and it has failed multiple times). In such cases, out of 1k mapped tasks per flow I would have 1-20 failed, so just restarting them from scratch is fine with me. Which means I don’t need any actual data kept except metadata (i.e. data which tells Prefect which ones passed, which failed, and what depends on what, so that it knows where to restart the run from). Currently I cannot do this; I always need to keep the data we discussed earlier. This data is quite heavy though, because my ETL is extracting data which is many GBs, so my hard disk keeps data I don’t need. I have a cronjob which cleans it once every day or two, but it’s still a lot and, more importantly, unnecessary. That’s my case. : )
k
Have you explored caching? You can cache based on the inputs that went into the task. If the task succeeded with those inputs, then it won’t run them again upon re-run (you specify how long the cache lasts).
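To illustrate the mechanism being described (this is a stdlib-only sketch of input-based caching, not Prefect’s implementation; in Prefect 1.x the corresponding knobs are the task’s `cache_for` and `cache_validator` arguments):

```python
# Sketch: a task result is reused when the same inputs were already
# seen within the cache window, so a successful task is not re-run.
from datetime import datetime, timedelta

cache = {}  # (task_name, inputs) -> (result, expiry)

def run_with_cache(task_name, fn, inputs, cache_for=timedelta(hours=24)):
    key = (task_name, inputs)
    hit = cache.get(key)
    if hit is not None and hit[1] > datetime.now():
        return hit[0]                       # cache hit: skip the task
    result = fn(*inputs)                    # cache miss: run and record
    cache[key] = (result, datetime.now() + cache_for)
    return result

calls = []
def extract(x):
    calls.append(x)
    return x * 2

run_with_cache("extract", extract, (3,))
run_with_cache("extract", extract, (3,))    # second call is a cache hit
```

On a restart, only the mapped children whose inputs never produced a success would actually execute again, which matches the 1-20 failures out of 1k described above.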
n
isn’t that across flows? (caching info from one flow to the next)
in my case I don’t pass any info between flows (so I don’t need caching - that’s my logic). it’s all an issue within a single flow.
k
Caching is for a single flow. We don’t have built-in mechanisms for passing data between flows. The cache is just an indicator that a task has been run already. If you use a `target`, that is a form of caching based on file persistence: future runs will check for the existence of that file. But caching is just metadata about what went into the tasks.
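The file-existence check that `target` performs can be sketched like this (stdlib only, not Prefect’s code; the template placeholder and file layout are made up for illustration):

```python
# Sketch: before running a task, check whether its templated target
# file already exists. If it does, read it back instead of re-running;
# if not, run the task and persist the result to that path.
import os
import tempfile

def run_with_target(task_name, fn, target_template, base_dir):
    target = os.path.join(base_dir, target_template.format(task_name=task_name))
    if os.path.exists(target):              # target hit: skip the task
        with open(target) as f:
            return f.read()
    result = fn()                           # target miss: run and persist
    with open(target, "w") as f:
        f.write(result)
    return result

runs = []
def extract():
    runs.append(1)
    return "payload"

workdir = tempfile.mkdtemp()
run_with_target("extract", extract, "{task_name}.txt", workdir)
run_with_target("extract", extract, "{task_name}.txt", workdir)  # skipped
```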
n
So you are saying I can use caching and get rid of this: `flow.storage = Local(directory=storage_dir)`?
k
No no. Storage is different from the caching and checkpointing. Storage is just where your flow is saved in serialized form. When the flow starts, it goes in there and loads the flow from there. That’s not related to any DATA storage.
Git Storage for example means, “load my file from Git”
n
But that’s what’s causing my problem - the files there are many and huge.
so just for 2 days I have about 12k files and they are a few MB each (some a few dozen MB)
k
Is your flow code small enough to share?
It’s the checkpointing that is causing data to be stored, not the `flow.storage`.
a
Something that's worked for us in a similar situation has been templating result locations. With an appropriate template, your results from subsequent flow runs should overwrite previous results rather than creating new files.
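The overwrite effect comes from the template resolving to the same path on every run. A stdlib-only sketch (the placeholder names mirror Prefect 1.x result-location templating, e.g. as used with `LocalResult(location=...)`; the exact template here is a made-up example):

```python
# Sketch: a result-location template WITHOUT run-specific pieces such
# as timestamps resolves to the same path on every run of the same
# mapped task, so the new result overwrites the old file instead of
# accumulating a fresh timestamped file per run.
template = "{flow_name}/{task_name}/{map_index}.prefect"

# Two different flow RUNS of the same mapped child...
run1 = template.format(flow_name="etl", task_name="extract", map_index=7)
run2 = template.format(flow_name="etl", task_name="extract", map_index=7)

# ...resolve to the same location, so disk usage stays bounded by the
# number of tasks rather than growing with the number of runs.
assert run1 == run2
```

By contrast, the default timestamped names (like the `prefect-result-2021-05-16t02-17-03-...` file mentioned earlier) are unique per run, which is why the folder grows without bound.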
k
This is a good suggestion @Anurag Bajpai!