Michael Kennely
12/17/2020, 8:10 PM
Ping the source system for a dataset -> load the dataset into a pandas DataFrame -> store the dataset as a file in S3 -> push the contents of that file to our datastore.

I'm using the LocalDaskExecutor to get depth-first execution (DFE), and that has helped a ton: the flow used to fail at the first step because it was running breadth-first and holding all of the datasets in memory. I expected that once the task tree completed for each dataset that memory would be released, but that doesn't seem to be the case. Am I doing something wrong/poorly, or is there a way to free up that memory once the terminal task has completed for each branch?
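The flow shape being described is roughly the following (a minimal sketch against the Prefect 0.13-era API quoted later in the thread; the task names, source list, and placeholder bodies are assumptions, not the actual code):

    import pandas as pd
    from prefect import Flow, task
    from prefect.engine.executors import LocalDaskExecutor
    from prefect.environments import LocalEnvironment

    @task
    def fetch_dataset(source: str) -> pd.DataFrame:
        # ping the source system and load the result into a DataFrame
        return pd.DataFrame({"source": [source]})  # placeholder payload

    @task
    def write_to_s3(df: pd.DataFrame, source: str) -> str:
        # persist the DataFrame as a file; stand-in for the real S3 upload
        path = f"/tmp/{source}.csv"
        df.to_csv(path, index=False)
        return path

    @task
    def push_to_datastore(path: str) -> None:
        # push the contents of the stored file into the downstream datastore
        print(f"loaded {path}")

    with Flow("dataset-sync") as flow:
        sources = ["system_a", "system_b"]  # hypothetical source systems
        frames = fetch_dataset.map(sources)
        paths = write_to_s3.map(frames, sources)
        push_to_datastore.map(paths)

    # the Dask-backed executor lets each mapped branch run depth-first
    flow.environment = LocalEnvironment(executor=LocalDaskExecutor())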
Dylan

Michael Kennely
12/17/2020, 8:37 PM
There's a top command running on the machine that I have my agent installed on, and there's a process listed as prefect whose %MEM is continually increasing while the available memory on that machine continually decreases.
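One way to confirm what top is showing is to sample the flow-run process's resident memory directly (a sketch; it assumes the psutil package is installed and that the PID of the process listed as prefect is known):

    import time

    import psutil

    pid = 12345  # hypothetical PID of the process shown as "prefect" in top
    proc = psutil.Process(pid)

    for _ in range(10):
        rss_mb = proc.memory_info().rss / 1024 ** 2  # resident set size, which drives top's %MEM
        print(f"resident memory: {rss_mb:.1f} MiB")
        time.sleep(30)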
Dylan

Michael Kennely
12/17/2020, 8:39 PM

Dylan

Michael Kennely
12/17/2020, 8:55 PM

Dylan

Michael Kennely
12/17/2020, 8:59 PM

    flow.environment = LocalEnvironment(
        executor=LocalDaskExecutor()
    )
    flow.storage = GitHub(redacted)
Zanie
dask is managing the values. We'll continue to think about this in the future, but because of this complexity we can't promise a perfect solution right now. The best recommendation I have for this is to reduce the amount of data that needs to be passed between tasks. Passing large amounts of data from task to task won't scale to a distributed setting anyway. I'd recommend moving your tasks that use the dataframe into a single task. Alternatively, you can write the dataframe to disk and pass the filepath to the next task.
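A sketch of that last suggestion, writing the DataFrame to disk and passing only the filepath between tasks so that no large object travels from task to task (the task names, the /tmp location, and the CSV format are assumptions):

    import pandas as pd
    from prefect import Flow, task

    @task
    def fetch_dataset_to_file(source: str) -> str:
        df = pd.DataFrame({"source": [source]})  # placeholder for the real fetch
        path = f"/tmp/{source}.csv"
        df.to_csv(path, index=False)
        return path  # hand the next task a small string, not the DataFrame

    @task
    def push_to_datastore(path: str) -> None:
        df = pd.read_csv(path)  # re-read the data only inside the task that needs it
        print(f"pushing {len(df)} rows from {path}")

    with Flow("dataset-sync-by-path") as flow:
        paths = fetch_dataset_to_file.map(["system_a", "system_b"])
        push_to_datastore.map(paths)

The same idea works with an S3 key in place of a local path.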
Michael Kennely
12/17/2020, 9:26 PM