Any idea how to prevent prefect from building up l...
# prefect-community
s
Any idea how to prevent prefect from building up large amounts of working memory? This is seen on a flow with 7 Tasks (6 of which are mapped over a list of ~40,000 elements), and running on a kubernetes agent
a
If I understand correctly, the recommended approach for such cases is to save the intended memory-intensive result out-of-band (e.g., to the flow/result storage) instead of returning it from the task, and return a reference to where the result was saved as the result of the task. Downstream tasks can then read the result as needed instead of it being kept in memory as the task result value for the duration of the flow run.
s
Thanks! I thought it might be something along those lines. Do you have any resources/examples in mind that can show this in more detail?
a
Not yet. I'm waiting for a compression-related pull request of mine to be approved and present in the next prefect release to try this myself. What I have been looking at is the docs on persisting Results, particularly the subsection on user-created Results: https://docs.prefect.io/core/concepts/results.html#persisting-results
b
We've been using prefect internally for a few months now for vanilla ETL work which is where this project shines. However, when kicking off large cross validation scenarios memory management has become a debilitating problem. This is not necessarily a bad thing as it has forced us to reevaluate our modelling architecture and process work flow to become single responsibility, i.e. Break modeling tasks into discrete nodes to release memory as soon as possible. This experiment marks the last push in our attempt to make the task match the tool. If the serialisation and memory management is not resolved we will have to resort to other tools to suite our needs.
On that note, I have read old discussions in the dask community about using arrow as a serialization layer. However this was abandoned at some point. From what I gather prefect uses pickle/cloud pickle for this. Would it make sense to explore arrow in this context, especially with respect to their in memory object store, or is this logic mostly based on the dask architecture?
s
The results from tasks are maintained in the running Flow state so downstream tasks can access results that they depend on (you can also inspect the results of each task separately later if desired; so it has to stick around somewhere). So, if a task returns a value with a large memory footprint, that memory usage will persist throughout the Flow run. I think this can be mitigated by using a
Result
class, but I've not bothered with `Result`s and just passed minimal pointers to external data sources.
s
thanks for the answers, everyone 🙂. @Spencer, at least in my case I only have one task which returns something (the first tasks, which builds a list of ~40k items). The other 6 tasks each are mapped over this list (round'a'bout 240,000 things to do in total), and do not return anything. So when only one task actually has a return, would you still expect the memory usage to climb so high?
s
What are the items? If each is a string, I wouldn't expect that much.
s
yes, strings
also, each string should be 12-14 characters. So the one list should only be ~300 kB 🤷
j
What is your executor? I’ve seen some weird memory issue when using DaskExecutor on Kubernetes. I actually opened a ticket for it, but forgot to do some more troubleshootting.. (Sorry Jim) Maybe it is related?
b
Running DaskExecutor on kubernetes with adaptive scaling. I apologize if my comments earlier seemed scathing. Just bureaucratic pressure to get things done at this point .
s
@Joël Luijmes, yes, I'm also using a KubernetesRun run_config, a DaskExecutor executor. Comparing the code you mention in your issue with mine (below), I can see quite a few similarities 🙂