Prefect Community

Hello, I’m trying a simple flow that does linear regression in batches. The flow works when run sequentially, but when I run it with the Dask backend it runs into memory problems. What confuses me is that there is ample memory per worker. Can someone help me identify the problem? Am I using a Dask anti-pattern somewhere?

Hi An,

A few questions:
• Are you running the above as a script (`python your_code.py`)?
• What OS are you on?
• What version of Python are you using?
• How large is the input data approximately? How much RAM is available for the workers?
• Can you describe a bit more about how it fails?

One common gotcha with pandas and Dask is the pandas `"mode.chained_assignment"` option. By default it uses a lot of memory, but it can be changed to reduce the footprint.
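For reference, here is a minimal sketch of changing that option. The reply above does not say which value to use, so setting it to `None` (which turns the chained-assignment check off) is an assumption, and the helper name `disable_chained_assignment_check` is hypothetical. The option may not exist in every pandas version, so the sketch falls back gracefully if it is missing.

```python
import pandas as pd

OPTION = "mode.chained_assignment"


def disable_chained_assignment_check() -> bool:
    """Try to turn off pandas' chained-assignment check.

    Returns True if the option was set to None, False if this pandas
    version does not expose the option. (Hypothetical helper name;
    choosing None as the value is an assumption.)
    """
    try:
        pd.set_option(OPTION, None)
    except (KeyError, AttributeError):
        # pandas raises OptionError (a KeyError/AttributeError subclass)
        # when the option name is unknown.
        return False
    return True


if __name__ == "__main__":
    if disable_chained_assignment_check():
        print(OPTION, "is now", pd.get_option(OPTION))
    else:
        print(OPTION, "is not available in this pandas version")
```

pandas options are per process, so with a distributed Dask cluster the setting would probably need to be applied on each worker as well, for example by calling the helper at the start of the task or through `Client.run` from `dask.distributed`.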