# ask-community
c
hello one and all I’m trying to analyse the memory consumption of one of my flows, but I cannot figure out how to add guppy to my flow. My attempt was to create two tasks: one to create the hpy object and another to print it. I added dependencies to make the object creator the first task in the flow and the printer the last task in the flow. The heap is actually printed, but only for the task that created the hpy object. Does anybody know how I can have the hpy() object cover the entire flow? (or if there is an alternative to using guppy, that’d be fine as well)
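Roughly, my attempt looks like this (simplified, the task and flow names are just placeholders):
from guppy import hpy
from prefect import Flow, task

@task
def create_heap_tracker():
    # first task in the flow: create the guppy heap object
    return hpy()

@task
def print_heap(h):
    # last task in the flow: print what the heap object reports
    print(h.heap())

with Flow("sftp-s3-sync") as flow:
    h = create_heap_tracker()
    # ... the actual sync tasks are wired in between via dependencies ...
    print_heap(h)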
d
Hi @Christoph Wiese! Can you share your Flow’s configuration code (Run Config and Executor) and a rough outline of your Flow? I’ve not used guppy before but I’m happy to take a quick look at the docs
c
Hi @Dylan 👋 I’m using UniversalRun and the LocalDask Executor:
flow.run_config = UniversalRun(labels=[f"data-{self._config.env}"])
and
self._executor = LocalDaskExecutor(scheduler="threads", num_workers=15)
The flow is relatively simple: it syncs files between an SFTP server and an S3 bucket. First it gets a secret from AWSSecretsManager, uses that to list files on an SFTP server via a task I wrote using paramiko, a separate task lists all files in an S3 bucket, then I compare both file lists and map all files that are not yet in the S3 bucket to an SFTP download task, which is followed by an S3Upload task
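In rough outline (my own task names, bodies omitted):
from prefect import Flow, unmapped

with Flow("sftp-s3-sync") as flow:
    creds = get_sftp_credentials()                    # secret from AWSSecretsManager
    sftp_files = list_sftp_files(creds)               # custom paramiko task
    s3_files = list_s3_files()                        # keys already in the bucket
    missing = diff_file_lists(sftp_files, s3_files)   # on the SFTP server but not yet in S3
    data = download_from_sftp.map(missing, credentials=unmapped(creds))
    upload_to_s3.map(data=data, key=missing)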
I’d be happy to share more code if it’d be helpful 🙂
d
Ahh. I’m not sure how guppy works, but I do know that your tasks are likely executed in new threads created by the executor. You’ll need to make sure guppy is run inside each task
from guppy import hpy; h=hpy()
I suspect that needs to be called in each task you’re interested in
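Something along these lines (untested, fetch_file is just a stand-in for your download logic):
from guppy import hpy
from prefect import task

@task
def download_from_sftp(path, credentials):
    # create the heap object inside the task so it runs in the executor's worker thread
    h = hpy()
    data = fetch_file(path, credentials)  # stand-in for the actual paramiko download
    print(h.heap())                       # heap usage as seen from within this task
    return data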
c
I tried that as well and it works for the individual tasks, which seem fine individually - but the overall flow keeps accumulating memory until it eventually crashes the system 😬
d
Have you tried adding more memory 😉
Kidding
c
Am I right in assuming that the memory of mapped tasks is freed up once that child and all its dependent children have finished their runs?
d
What agent are you using to run this?
c
Yeah, that is actually what I’m playing with now, switching to ECSRun and trying a bigger instance 😅
The original agent ran on Fargate with 1 vCPU and 3 GB of RAM - my latest attempt was
flow.run_config = ECSRun(
    labels=[f"data-{self._config.env}"], cpu=2048, memory=16384
)
that’s 2 vCPUs in AWS parlance
d
Gotcha
c
and just for reference, I’m syncing about 25 GB of data in about 2500 files
i.e. that’s about 5k children in the flow
one child for the download, one for the upload
d
So Task results from all tasks (mapped or not) remain in memory for the duration of the flow
c
!
woah, that explains it
my understanding was that flows should be designed such that we could easily transition to running them on a cluster, i.e. directly pass data between tasks instead of using local files
did I get that wrong?
should I be doing a flow of flows here, where I map chunks of the workload to child flows? (Though I’d find that a bit excessive, since effectively I’d be mapping my workload twice, once on the flow level and then on the task level)
d
When I’m dealing with a large volume of data, I often put everything into cloud storage and pass references to it between tasks, so the flow only ever holds the data it’s actively working on in memory and tasks hand each other references rather than the data itself
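For example, something like this (rough sketch, the names are made up):
import boto3
from prefect import task

@task
def produce(bucket, key):
    data = build_large_payload()   # made-up helper for whatever creates the big object
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=data)
    return key                     # only this small reference lives in the flow run

@task
def consume(bucket, key):
    # load the data only where it's actually needed, inside the task
    data = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    return summarize(data)         # made-up helper; return something small, not the data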
c
hm, I really misunderstood the point of the task result snapshots then
I thought they should contain the data and the S3 result handler would upload them so that I could later resume if need be
but understood, in that case I actually need to do the download and upload in one task
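i.e. something like this (sketch; open_sftp would be a small paramiko helper I still need to write):
import boto3
from prefect import task

@task
def sync_one_file(sftp_path, credentials, bucket):
    # download from SFTP and upload to S3 inside the same task,
    # returning only the key so nothing large outlives the task
    with open_sftp(credentials) as sftp:
        data = sftp.open(sftp_path).read()
    key = sftp_path.lstrip("/")
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=data)
    return key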
d
Hi Christoph, I’m juggling a few conversations at the moment but I’m also trying to dig in further. My understanding of how task results do or don’t get released from memory isn’t stellar, and I want to make sure I’m giving you the best information that I can
Let me discuss with the team a bit more and try to provide some more insight
c
that would be great, I’ll wait for that before I begin re-writing my flows 🙂
👍 1
Hi @Dylan, just wondering if you had a chance to talk to the team about the way memory is managed for a flow?
d
Not just yet, will get back to you 👍
👌🏻 1
c
@Dylan just wondering if there is an update? I don’t mean to rush you, just have a ticket blocked by this and I’m wondering how to proceed
d
Hey @Christoph Wiese
Got a little more clarity here
The short answer is that Prefect doesn’t do memory management of its own, so you’ll see the same behavior as if you wrote the script yourself. Variables are garbage collected once Python determines there are no more references to them, and all task return values are referenced via state objects, so they live in memory for the duration of the flow run. Running a workflow on a dask cluster helps because any intermediate memory used within your task (anything that isn’t referenced by your final return value) lives on a dask worker and is cleaned up more aggressively, but the final return value still gets collected into memory back in the main flow runner process
If you’re using a LocalDaskExecutor, using process workers will clean up intermediate memory but thread workers will not
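e.g. swapping your executor line for something like:
self._executor = LocalDaskExecutor(scheduler="processes", num_workers=15)
(the caveat being that task inputs and outputs then have to be serializable, since they cross process boundaries)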
c
That is very helpful, thanks @Dylan!
d
Anytime!