
Newskooler

12/10/2020, 3:28 PM
👋 What is the best advice for cleaning up `prefect/flows` so that we can keep it at a manageable size? It tends to grow naturally, and quite quickly.
🤔

Zanie

12/10/2020, 3:50 PM
Hi! I’m checking in with the rest of the team about this 🙂

Newskooler

12/10/2020, 3:50 PM
Thank you. As I mentioned in another thread: “I don’t know what’s safe to delete. This is not `--volume-path` but instead it’s the `flow.storage = Local(directory='/this/location')` directory.” Also, it has grown to 100 GB in less than 30 days.

Dylan

12/10/2020, 4:39 PM
Hi @Newskooler is that the same location where you’re storing Flow Results?

Newskooler

12/10/2020, 4:41 PM
I am not sure I get the question, @Dylan. In my mind there are two locations: the volume path and the flow.storage. I am saying I have issues with the flow.storage growing a lot. I guess this is the flow results, right?

Zanie

12/10/2020, 4:42 PM
`flow.storage` is where the pickled flow (the code of your flow) is stored when using `Local`.

Newskooler

12/10/2020, 4:44 PM
So why does this grow so big?

Zanie

12/10/2020, 4:44 PM
When using `Local` storage, the flow is simply pickled with `cloudpickle` and dumped as bytes. Do you have a ton of files there? What’s the smallest/largest file?

Newskooler

12/10/2020, 4:45 PM
I have a flow which runs every minute and that’s the one causing the issue
I do. I will check shortly and tell you. I did delete some of them to get the server running again haha

Zanie

12/10/2020, 4:46 PM
How often are you registering the flow?

Newskooler

12/10/2020, 4:47 PM
I registered it about 10 times, but I don’t need to register it anymore.
For the last 30 days, that is.

Zanie

12/10/2020, 4:49 PM
Hmm. This seems a bit peculiar. Each time it is registered, the storage is “built” which means the flow is written to the directory as “flow-name.prefect”
It should even be overwritten on subsequent registrations as far as I can tell

Newskooler

12/10/2020, 5:25 PM
so I have 100k files inside this dir (for the last 10 days). They look like so:
-rw-r--r-- 1 root root  1436873 Dec 10 17:24 prefect-result-2020-12-10t17-24-10-287987-00-00
-rw-r--r-- 1 root root  1179386 Dec 10 17:24 prefect-result-2020-12-10t17-24-10-629558-00-00
-rw-r--r-- 1 root root    13969 Dec 10 17:24 prefect-result-2020-12-10t17-24-15-435020-00-00
-rw-r--r-- 1 root root    16465 Dec 10 17:24 prefect-result-2020-12-10t17-24-15-821351-00-00
-rw-r--r-- 1 root root       94 Dec 10 17:24 prefect-result-2020-12-10t17-24-16-142481-00-00
This is the largest:
67836	./prefect-result-2020-12-08t18-57-57-583026-00-00
and this is the smallest:
4	./prefect-result-2020-11-30t00-00-15-595208-00-00
For reference, they are orders of magnitude larger than the data they have processed.

Zanie

12/10/2020, 5:28 PM
Looking into this a bit more, to clarify: if a task has checkpointing enabled (which is the default), it has a result which is written somewhere. The default result type for a task is the result type for the flow, which defaults to the result type associated with the flow storage you are using. In the `Local` storage class, the associated result type is defined as `result = LocalResult(self.directory, validate_dir=validate)`, which means that this directory will contain both your serialized flow and all of your flow run results.
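
To see how much of the directory is taken up by results rather than by the serialized flow, a small tally along these lines may help. It is only a sketch using the Python standard library: the `summarize` helper is hypothetical, and it assumes the two naming patterns mentioned in this thread (`flow-name.prefect` for the flow, `prefect-result-*` for results).

```python
from collections import Counter
from pathlib import Path


def summarize(directory):
    """Return ({kind: file count}, {kind: total bytes}) for files in directory."""
    counts, sizes = Counter(), Counter()
    for path in Path(directory).iterdir():
        if not path.is_file():
            continue
        # Classify by the naming patterns described in the thread (an assumption).
        if path.name.startswith("prefect-result-"):
            kind = "result"
        elif path.suffix == ".prefect":
            kind = "flow"
        else:
            kind = "other"
        counts[kind] += 1
        sizes[kind] += path.stat().st_size
    return dict(counts), dict(sizes)
```

Calling `summarize('/this/location')` returns two dictionaries keyed by `"result"`, `"flow"`, and `"other"`, which makes it easy to confirm that the growth comes from the result files.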

Newskooler

12/10/2020, 5:29 PM
Okay, that’s by design, right? And what does this mean in terms of what makes sense for me to do going forward?

Zanie

12/10/2020, 5:30 PM
Anything matching `prefect-result-*` can be safely deleted if you do not need that data anymore. Furthermore, you can disable checkpointing for tasks if you don’t need the ability to resume.

Newskooler

12/10/2020, 5:32 PM
The resume is a super cool feature. I can set up a cron job to delete files (only `prefect-result-*`) older than X days. Does this make sense?

Zanie

12/10/2020, 5:47 PM
Yeah that makes sense.
You could even make a flow 😄

Newskooler

12/10/2020, 5:48 PM
True that. 😄 Thanks for the help!
On your end - do you see that as an issue which needs addressing or is that an expected behaviour?

Zanie

12/10/2020, 5:49 PM
We’d like them not to be stored alongside your flows — we’re looking into that, but I don’t think we want to decide when/what data should be pruned

Newskooler

12/10/2020, 5:50 PM
Okay - as long as it’s on your radar : )
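
The cleanup agreed on in this thread (delete only `prefect-result-*` files older than X days) could be sketched roughly as follows. This is a minimal, standard-library-only illustration, not anything shipped with Prefect: the `prune_results` name, its parameters, and the use of file modification time as the age are all assumptions. It defaults to a dry run so nothing is removed until that is switched off.

```python
import time
from pathlib import Path


def prune_results(directory, max_age_days, dry_run=True):
    """Find prefect-result-* files older than max_age_days; delete unless dry_run.

    Returns the sorted list of stale paths. Age is based on modification time.
    """
    cutoff = time.time() - max_age_days * 86400
    stale = []
    # The pattern does not match a serialized flow such as "my-flow.prefect".
    for path in Path(directory).glob("prefect-result-*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            stale.append(path)
            if not dry_run:
                path.unlink()
    return sorted(stale)
```

A cron job (or a scheduled flow, as suggested above) could then call something like `prune_results('/this/location', max_age_days=7, dry_run=False)`. Note that deleting a result removes the ability to resume any run that still depends on it, so the retention window should cover the period in which runs might be restarted.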