    Stephen Lloyd

    2 months ago
    I need to scale out a Flow such that I significantly increase the parallelization. I’m already mapping tasks and adding memory to machines to be able to handle more threads. Supposing I didn’t want to keep scaling a single machine, is there a way to map tasks out to multiple machines? What do I need to be thinking about?
    Anna Geller

    2 months ago
    You may consider Dask or Ray - in 2.0 we have integrations for both, but if you want a slightly longer answer:
    • Make sure you need a distributed execution environment.
    • Try to determine how much compute capacity you need. Do you expect the need to remain static, or do you expect it to grow over time?
    • Make sure you have time to develop the expertise needed to set up and maintain a distributed computing environment.
        ◦ Do you know how to set up and run a compute cluster, e.g. using Kubernetes? If not, do you have time to learn?
        ◦ Do you have access to (or the budget to acquire) the VMs or servers you'll need to run your compute nodes?
        ◦ If you don't have the expertise and don't have time to learn, does your organization have the expertise, and are they willing to use it on this project?
    • Determine whether you need to process data on-premises or if you can do it in the cloud.
    • Determine if your code is amenable to distributed computing. If you're doing multiprocessing using shared memory, or if your code assumes it will have easy access to resources stored on the local file system, you may need to refactor it to work in a distributed environment.
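    To make the last point about code being amenable to distributed computing a bit more concrete, here is a minimal sketch of the kind of refactor it implies. The names (`count_words_shared`, `count_words`, `run_mapped`, `documents`) are hypothetical, and the local thread pool is only a stand-in for a real Dask or Ray cluster, so treat this as an illustration of the idea rather than a recipe for any particular setup.

```python
from concurrent.futures import ThreadPoolExecutor

# Pattern that tends to break on a cluster: the task writes into
# module-level state, which only works if every task shares one process.
shared_totals = {}


def count_words_shared(name, text):
    shared_totals[name] = len(text.split())


# Distribution-friendly version: everything the task needs arrives as
# arguments, and everything it produces leaves as the return value.
def count_words(name, text):
    return name, len(text.split())


def run_mapped(documents, max_workers=4):
    """Map count_words over (name, text) pairs and collect the results.

    The thread pool here is an assumption made for illustration; it
    stands in for workers that could live on other machines.
    """
    names = list(documents)
    texts = [documents[n] for n in names]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(count_words, names, texts))


# Hypothetical in-memory inputs; on a real cluster these would more likely
# be read from storage that every worker can reach, not a local disk.
documents = {"a.txt": "one two three", "b.txt": "four five"}
word_counts = run_mapped(documents)
```

    A function shaped like `count_words` does not care which machine it runs on, which is generally the property you want before handing mapped tasks to a distributed backend. The shared-dictionary version would probably appear to work on one machine and then quietly lose results once the workers stop sharing memory.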