# ask-community
t
Hi, I'm working with a larger-than-memory dataset. The operations can easily be parallelized and split into chunks, and I have done that with `.map()`. The problem is that when I run it, the results are all still stored in memory (`bigger_than_mem` runs twice and keeps its results in memory). Is it possible to have the flow write the data to a file and clear the memory? Here is a sample flow:
```python
from prefect import task, Flow

@task
def get_chunks():
    # Split the work into chunks to map over
    return [[1, 2], [3, 4, 5]]

@task
def bigger_than_mem(x):
    # List repetition simulates a result too large to keep in memory
    return x * 100000000

@task
def dump_to_db(x):
    dump(x)  # placeholder for the actual database write

with Flow("my_flow") as flow:
    x = get_chunks()
    x_trans = bigger_than_mem.map(x)
    dump_to_db.map(x_trans)
```
k
Hi @Thomas Hoeck! Unfortunately, Prefect doesn't handle garbage collection for Flows that pass a lot of data between tasks. What users do instead is save the data to a location (like S3) and pass that location to downstream tasks/Flows. You can then couple this with explicit garbage collection.
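A minimal sketch of that pattern, assuming local pickle files stand in for S3 (the paths and the final database write are illustrative):
```python
import gc
import pickle
import tempfile
from prefect import task, Flow

@task
def bigger_than_mem(x):
    # Build the large result, spill it to disk, and return only its path
    data = x * 100000000
    fd, path = tempfile.mkstemp(suffix=".pkl")
    with open(fd, "wb") as f:
        pickle.dump(data, f)
    del data      # drop the in-memory reference...
    gc.collect()  # ...and reclaim it explicitly
    return path   # downstream tasks receive only a small string

@task
def dump_to_db(path):
    # Reload one chunk at a time and write it out
    with open(path, "rb") as f:
        data = pickle.load(f)
    ...  # actual database write goes here
```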
j
Hi @Thomas Hoeck - The docs on result objects should help you here: https://docs.prefect.io/core/concepts/results.html#result-objects
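For instance, a task can be configured to persist its return value through the Results interface (a sketch using `LocalResult`; the target directory is an assumption for the example):
```python
from prefect import task
from prefect.engine.results import LocalResult

# Persist this task's return value to disk rather than only in memory;
# "/mnt/prefect-results" is a placeholder directory
@task(result=LocalResult(dir="/mnt/prefect-results"))
def bigger_than_mem(x):
    return x * 100000000
```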
t
Okay, thank you. Just to be sure: when using results (where you write to disk), is the result still stored in memory, or is the data garbage collected?
k
I actually think it's still in memory, so I would advise explicit garbage collection. You can still use our Results interface to read/write; it might be easier for things like S3 compared to handling boto3 on your own. Another way to look at this is to split your flow into subflows and use the `StartFlowRun` task.
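A sketch of that subflow approach (the flow and project names are made up for the example); each chunk is processed in its own flow run, so its memory is released when that run finishes:
```python
from prefect import Flow
from prefect.tasks.prefect import StartFlowRun

# Kick off a registered subflow that processes one chunk end to end;
# "process-chunk" and "my-project" are placeholder names
process_chunk = StartFlowRun(
    flow_name="process-chunk",
    project_name="my-project",
    wait=True,  # block until the subflow run completes
)

with Flow("parent-flow") as flow:
    process_chunk(parameters={"chunk_id": 0})
    process_chunk(parameters={"chunk_id": 1})
```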
t
I ended up saving to disk and returning the location as a string, where the disk is a mounted Azure file share. Works like a charm.
👍 1