Hi I have some questions related to disk space usa...
# ask-community
d
Hi I have some questions related to disk space usage on prefect database. Mostly related to how block documents are stored. Details will be added soon in the thread.
I've noticed a significant disk usage increase in the Prefect database. I looked into the usage and found that the `block_document` table is growing at an alarming rate: ~1-1.2 GB per day.
Query I ran.
-- Count block documents created per block type and calendar day
select
    bt."name" as block_type,
    extract(month from bd.created) as month,
    extract(day from bd.created) as day,
    count(1)
from block_document bd
join block_type bt on bd.block_type_id = bt.id
group by
    bt."name",
    extract(month from bd.created),
    extract(day from bd.created)
order by block_type, month desc, day desc
Result.
block_type	month	day	count
Azure	6	8	102171
Azure	6	7	252417
Azure	6	6	247796
Azure	6	5	251646
Azure	6	4	252944
Azure	6	3	252332
Azure	6	2	252440
Azure	6	1	253898
Azure	5	31	254765
Azure	5	30	254290
Azure	5	29	254048
Azure	5	28	254570
Azure	5	27	254046
Azure	5	26	254614
Azure	5	25	254965
Azure	5	24	253799
Azure	5	23	248840
Azure	5	22	247909
Azure	5	21	254712
Azure	5	20	255658
Azure	5	19	246496
Azure	5	18	252260
Azure	5	17	252821
Azure	5	16	252160
Azure	5	15	251961
Azure	5	14	252362
Azure	5	13	253300
Azure	5	12	252055
Azure	5	11	245653
Azure	5	10	251813
Azure	5	9	251273
Azure	5	8	231937
Azure	5	7	226121
Azure	5	6	221572
Azure	5	5	59424
Date Time	5	12	1
Date Time	5	11	1
Kubernetes Job	6	6	14
Kubernetes Job	6	5	14
Kubernetes Job	6	1	12
Kubernetes Job	5	31	1
Kubernetes Job	5	30	1
Kubernetes Job	5	26	1
Kubernetes Job	5	25	3
Kubernetes Job	5	24	11
Kubernetes Job	5	23	22
Kubernetes Job	5	19	23
Kubernetes Job	5	16	1
Kubernetes Job	5	15	1
Kubernetes Job	5	10	1
Kubernetes Job	5	8	14
Kubernetes Job	5	5	11
Local File System	6	8	942
Local File System	6	7	2176
Local File System	6	6	2183
Local File System	6	5	2198
Local File System	6	4	2198
Local File System	6	3	2191
Local File System	6	2	2214
Local File System	6	1	2064
Local File System	5	31	1912
Local File System	5	30	1901
Local File System	5	29	1898
Local File System	5	28	1893
Local File System	5	27	1909
Local File System	5	26	1911
Local File System	5	25	1899
Local File System	5	24	1895
Local File System	5	23	1900
Local File System	5	22	1905
Local File System	5	21	1903
Local File System	5	20	1904
Local File System	5	19	1191
Local File System	5	18	304
Local File System	5	17	304
Local File System	5	16	300
Local File System	5	15	301
Local File System	5	14	302
Local File System	5	13	304
Local File System	5	12	399
Local File System	5	11	294
Local File System	5	10	292
Local File System	5	9	297
Local File System	5	8	279
Local File System	5	7	253
Local File System	5	6	255
Local File System	5	5	79
The Azure blocks are being used for storage/caching of tasks.
I have created a storage block using the UI.
But I see new block documents being generated every day.
Does prefect copy over the base block I have supplied and create a new block document for each cached task?
Looking for help / responses from devs @ prefect and ppl who know internal workings of Prefect. Or ppl who can comment on what I am doing wrong.
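As a rough sanity check on the figures above, the reported daily growth can be divided by the daily Azure row count to estimate the footprint of a single block document. This is only a back-of-the-envelope sketch: it assumes nearly all of the growth comes from the Azure rows (roughly 254,000 per day in the result above) and that 1 GB means 1024³ bytes. The variable names are illustrative.

```python
# Back-of-the-envelope estimate of disk usage per block document.
# Assumptions (not measured): all growth is attributed to Azure rows,
# and 1 GB is taken as 1024**3 bytes.
GIB = 1024 ** 3

rows_per_day = 254_000          # typical daily Azure count from the result above
growth_low_bytes = 1.0 * GIB    # lower end of the reported daily growth
growth_high_bytes = 1.2 * GIB   # upper end of the reported daily growth

bytes_per_row_low = growth_low_bytes / rows_per_day
bytes_per_row_high = growth_high_bytes / rows_per_day

print(
    f"~{bytes_per_row_low / 1024:.1f} to "
    f"{bytes_per_row_high / 1024:.1f} KiB per block document"
)
```

Under these assumptions each row costs a few kilobytes, so the table size tracks the row count almost directly and the row count is the thing to bring down.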
n
hey @Deceivious - how are you specifying your result storage with Azure? are you loading the same block each time and passing it into task decorators? or doing something else?
there's an idea of anonymous blocks that are registered when, for example, you instantiate an infra block and pass it to a deployment without explicitly calling `save` on it
d
Yes, I am using the `.load()` method.
n
hmm
d
I'll write a bit more detail on my workflow.
1. A flow sets up all the required blocks and ensures they are present with the correct naming. Basically this is our means of versioning the blocks: blocks are versioned along with the flow version.
2. All the other flows that use the blocks have tasks decorated as `@task(.....result_storage=Azure.load(<NAMEHERE>).....)`
I do call `.save()` in flow #1.
@Nate We use `cached_task` as a decorator - just tasks that are primed with standard parameters.
a
@Deceivious Can you check to see how many block documents in your DB have `is_anonymous` set to `True`? When Prefect creates block documents on the user’s behalf, we set that equal to `True`, so if you have a large number of anonymous blocks, something within Prefect is likely causing your high number of block documents.
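The check suggested above boils down to grouping `block_document` rows by `is_anonymous`. Here is a minimal sketch of that query, run against an in-memory SQLite stand-in so it is self-contained. The sample rows are hypothetical, and the stand-in table has only the one column named in this thread (plus an id); the real table has more.

```python
import sqlite3

# Stand-in for the real database: an in-memory SQLite table holding only
# the column mentioned in this thread. The real schema has more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE block_document (id INTEGER PRIMARY KEY, is_anonymous BOOLEAN)"
)
# Hypothetical sample rows: one named block and three anonymous ones.
conn.executemany(
    "INSERT INTO block_document (is_anonymous) VALUES (?)",
    [(False,), (True,), (True,), (True,)],
)

# The actual check: how many block documents are anonymous vs. named?
COUNT_QUERY = """
    SELECT is_anonymous, COUNT(*) AS n
    FROM block_document
    GROUP BY is_anonymous
    ORDER BY is_anonymous
"""
counts = {bool(flag): n for flag, n in conn.execute(COUNT_QUERY)}
print(counts)
```

The `SELECT` itself is plain SQL and should run as-is against the PostgreSQL database discussed here. If the `True` bucket dominates, the rows are being created by Prefect rather than by explicit `.save()` calls.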
d
I think most of them are True.
[v] is True.
@alex
n
how does the number of task runs compare to that anonymous block number (or total block number, since anon is dominating the number)? have you noticed that they grow in proportion?
d
Not all of our tasks are cached, so a direct count wouldn't give the right data, I guess.
---7338392
@Nate that's the number of tasks that I know are cached.
Also I noticed we are passing the `Azure.load(NAMEHERE)` into a `with_options`. Would using `Task.with_options` cause this issue?
a
If using `result_storage` is causing the overzealous block saving (which is my hunch) then we’ll want to check if the block that you’re passing has `_block_document_id` set. If it doesn’t then that would prompt the results functionality to save a new block document.
d
I am passing the result of `Azure.load("name")` into the `result_storage` parameter. Would there be any cases where `Azure.load` would return an object with no `_block_document_id` attribute?

https://prefect-community.slack.com/files/U03RN8W7DPU/F05B6J0QEB1/image.png

Here's the actual code.
a
Calling `.load` should always set `_block_document_id` on the returned object, but it might be getting taken off somewhere. This is worth checking because the presence of `_block_document_id` determines whether or not a new block document is created.
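One way to run that check is a small helper that reads `_block_document_id` off whatever object is about to be passed as `result_storage`. The sketch below is illustrative only: `get_block_document_id` and `FakeBlock` are hypothetical names, and `FakeBlock` is a stand-in so the example runs without a Prefect server. It is not a Prefect class.

```python
from typing import Any, Optional


def get_block_document_id(block: Any) -> Optional[Any]:
    """Return the block's `_block_document_id`, or None if missing or unset."""
    return getattr(block, "_block_document_id", None)


class FakeBlock:
    """Hypothetical stand-in for a block object; not a Prefect class."""

    def __init__(self, block_document_id: Optional[str] = None) -> None:
        self._block_document_id = block_document_id


loaded = FakeBlock(block_document_id="1234")  # mimics a block returned by .load
unsaved = FakeBlock()                         # mimics a block that was never saved

for label, block in [("loaded", loaded), ("unsaved", unsaved)]:
    doc_id = get_block_document_id(block)
    # Per the explanation above, a missing id is what prompts a new document.
    status = "has a document id" if doc_id is not None else "no document id set"
    print(f"{label}: {status}")
```

In real code the helper would be called on the result of `Azure.load("name")` just before it goes into `result_storage`, and again on the task's options after `with_options`, to see whether the id gets dropped along the way.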
d
@alex and @Nate I've verified that we are using `.load` in all instances in our code. I will test this out with minimal code, and if I see the issue there as well, I'll share the code here.
Here's the minimal code.
import datetime
from datetime import timedelta

import prefect
from prefect import flow, task
from prefect.filesystems import Azure
from prefect.serializers import JSONSerializer
from prefect.tasks import task_input_hash


def get_cache_parameters():
    # Caching options applied to every task; the block is loaded on each call.
    cache_params = {
        "cache_key_fn": task_input_hash,
        "persist_result": True,
        "result_storage": Azure.load("test-block"),
        "result_serializer": JSONSerializer(jsonlib="json"),
        "cache_expiration": timedelta(minutes=1),
    }
    return cache_params


@task
def mock_test(i: int, now):
    prefect.get_run_logger().info(f"{i}{now}")


@flow
def flow():
    now = datetime.datetime.now()
    # Ten task runs, each configured through with_options
    for i in range(10):
        mock_test.with_options(name=f"Name_{i}", **get_cache_parameters())(i, now)


flow()
The `test-block` Azure block must be pre-created. Every time the code is run, a new block gets created in the `block_document` table.
Prefect version 2.10.12
Hi @alex and @Nate is this expected? Or should I create a GitHub issue regarding this?
I've also verified that `_block_document_id` is present.
a
Yes, please create an issue and we’ll look into it further.