# ask-community
t
Hi, I'm working with a larger-than-memory dataset. The operations can easily be parallelized and split into chunks, and I have done that with `.map()`. The problem is that when I run it, the results are all still stored in memory (`bigger_than_mem` runs twice and keeps its results in memory). Is it possible to have the flow write the data to a file and clear the memory? Here is a sample flow:
```python
from prefect import task, Flow

@task
def get_chunks():
    # Split the work into chunks to map over
    return [[1, 2], [3, 4, 5]]

@task
def bigger_than_mem(x):
    # List repetition simulates a result too large to keep in memory
    return x * 100000000

@task
def dump_to_db(x):
    dump(x)  # placeholder for the actual database write

with Flow("my_flow") as flow:
    x = get_chunks()
    x_trans = bigger_than_mem.map(x)
    dump_to_db.map(x_trans)
```
k
Hi @Thomas Hoeck! Unfortunately, Prefect doesn't handle garbage collection for Flows that pass a lot of data between tasks. What users do instead is save the data to a location (like S3) and pass that location to downstream tasks/Flows. You can then couple this with explicit garbage collection.
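A minimal sketch of that pattern, assuming local pickle files stand in for S3 (the paths and the final database write are illustrative):
```python
import gc
import pickle
import tempfile
from prefect import task, Flow

@task
def bigger_than_mem(x):
    # Build the large result, spill it to disk, and return only its path
    data = x * 100000000
    fd, path = tempfile.mkstemp(suffix=".pkl")
    with open(fd, "wb") as f:
        pickle.dump(data, f)
    del data      # drop the in-memory reference...
    gc.collect()  # ...and reclaim it explicitly
    return path   # downstream tasks receive only a small string

@task
def dump_to_db(path):
    # Reload one chunk at a time and write it out
    with open(path, "rb") as f:
        data = pickle.load(f)
    ...  # actual database write goes here
```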
j
Hi @Thomas Hoeck - The docs on result objects should help you here: https://docs.prefect.io/core/concepts/results.html#result-objects
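For instance, a task can be configured to persist its return value through the Results interface (a sketch using `LocalResult`; the target directory is an assumption for the example):
```python
from prefect import task
from prefect.engine.results import LocalResult

# Persist this task's return value to disk rather than only in memory;
# "/mnt/prefect-results" is a placeholder directory
@task(result=LocalResult(dir="/mnt/prefect-results"))
def bigger_than_mem(x):
    return x * 100000000
```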
t
Okay, thank you. Just to be sure: when using results (where you write to disk), is the result still stored in memory, or is the data garbage collected?
k
I actually think it's still in memory, so I would advise explicit garbage collection. You can still use our Results interface to read/write; it might be easier for things like S3 compared to handling boto3 on your own. Another way to look at this is to split your flow into subflows and use the `StartFlowRun` task.
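A sketch of that subflow approach (the flow and project names are made up for the example); each chunk is processed in its own flow run, so its memory is released when that run finishes:
```python
from prefect import Flow
from prefect.tasks.prefect import StartFlowRun

# Kick off a registered subflow that processes one chunk end to end;
# "process-chunk" and "my-project" are placeholder names
process_chunk = StartFlowRun(
    flow_name="process-chunk",
    project_name="my-project",
    wait=True,  # block until the subflow run completes
)

with Flow("parent-flow") as flow:
    process_chunk(parameters={"chunk_id": 0})
    process_chunk(parameters={"chunk_id": 1})
```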
t
I ended up saving to disk and returning the location as a string, where the disk is a mounted Azure file share. Works like a charm.
👍 1