An Hoang

04/21/2020, 2:37 PM
Hello, I’m trying a simple flow that does linear regression in batches. The flow works when run sequentially, but when I try it with the Dask backend it causes memory problems. What’s confusing is that there is ample memory per worker. Can someone help me identify the problem? Am I using a Dask anti-pattern somewhere?

Jim Crist-Harif

04/21/2020, 2:44 PM
Hi An, a few questions:
• Are you running the above as a script ( )?
• What OS are you on?
• What version of Python are you using?
• Approximately how large is the input data? How much RAM is available for the workers?
• Can you describe a bit more about how it fails?

David Ojeda

04/21/2020, 3:22 PM
One common gotcha with pandas and Dask is the
option of pandas. By default it uses a lot of memory, but it can be changed to reduce the footprint.