
simone

09/17/2020, 2:02 PM
Hi #prefect-community, I have a task mapped over a list of images (~12000):
`out = func.map([A1, A2, A3, A4, A5, B1, B2, B3, B4, B5])`
In the next step I would like to partially reduce the output and combine only the matching subgroups, e.g. `A = out[0:5]` and `B = out[5::]`, and then process `A` and `B` in parallel. I have three questions: (1) If I understood correctly, order matters for mapping in Prefect, so input and output have the same order, correct? (2) I am running the code on an HPC. If I proceed this way, will the entire `out` be collected in memory, or will the different output groups be dispatched to the specific worker where the reduce is happening? (3) Is there a more efficient way to do this? Thanks a lot!
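The map-then-partial-reduce pattern asked about here can be sketched without any workflow framework. The snippet below is only an illustration in plain Python, not Prefect code: `func`, `combine`, and the image names are hypothetical stand-ins, and a standard-library thread pool takes the place of the real executor. It relies on `Executor.map` returning results in input order, which is the property question (1) asks about.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby


def func(name):
    # Hypothetical stand-in for the per-image task: returns (group label, result).
    return name[0], name.lower()


def combine(results):
    # Hypothetical reduce step applied to one subgroup only.
    return "+".join(results)


names = ["A1", "A2", "A3", "A4", "A5", "B1", "B2", "B3", "B4", "B5"]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Executor.map yields results in input order, whatever the completion order.
    out = list(pool.map(func, names))

# groupby merges adjacent items only, which suffices because the input
# is already ordered by group label.
groups = {
    label: [result for _, result in items]
    for label, items in groupby(out, key=lambda pair: pair[0])
}

with ThreadPoolExecutor(max_workers=2) as pool:
    # Reduce each subgroup independently and in parallel.
    reduced = dict(zip(groups, pool.map(combine, groups.values())))
```

Grouping by a label carried with each result avoids hard-coded slice indices such as `out[0:5]`, which break as soon as a group changes size.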

Dylan

09/17/2020, 2:35 PM
Hi Simone, here are some control flow resources: https://docs.prefect.io/core/task_library/control_flow.html
I don’t think it’s possible to take `out[5::]`, because the result of the map isn’t instantiated until runtime.
You might be able to use `filter` from the control flow utilities instead.
Usually when I run into memory problems in this way, I store the data in the cloud (S3 or GCS) and then pass around a list of references to data in cloud storage
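The "pass references, not data" approach can be sketched as follows. This is a minimal, framework-agnostic illustration rather than Prefect code: `process_and_store` and `reduce_group` are hypothetical names, a local temporary directory stands in for S3 or GCS, and a few bytes stand in for real image data.

```python
import tempfile
from pathlib import Path


def process_and_store(name, out_dir):
    # Hypothetical per-image step: write the result to storage and
    # return only its path, so the payload never travels between steps.
    path = Path(out_dir) / f"{name}.bin"
    path.write_bytes(name.encode() * 4)  # placeholder for real image data
    return str(path)


def reduce_group(paths):
    # Load only the files of one subgroup; the full output is never
    # held in memory at once.
    return sum(len(Path(p).read_bytes()) for p in paths)


with tempfile.TemporaryDirectory() as out_dir:
    names = ["A1", "A2", "B1", "B2", "B3"]
    refs = [process_and_store(n, out_dir) for n in names]

    # Split the lightweight references by group, not the data itself.
    a_refs = [p for p in refs if Path(p).name.startswith("A")]
    b_refs = [p for p in refs if Path(p).name.startswith("B")]

    totals = {"A": reduce_group(a_refs), "B": reduce_group(b_refs)}
# Leaving the `with` block deletes the temporary files.
```

Only short path strings move between the map and reduce steps, so memory use is bounded by the largest subgroup, at the cost of one write and one read per image.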

simone

09/17/2020, 5:39 PM
Thanks a lot! I will look into filtering. I guess in the worst-case scenario I will have ~12000 files.

Dylan

09/17/2020, 5:53 PM
You can always clean them up as part of your flow!

simone

09/17/2020, 6:26 PM
So, just out of curiosity, you do not think that this much I/O will affect the speed of the processing? The flow runs on an SSD drive.

Dylan

09/17/2020, 6:27 PM
It definitely will
You’re making a tradeoff between i/o, storage, and memory
If you need this to run under a certain time, then increasing the available memory and passing the whole list of images will be faster
If you can spare some time & disk, then writing the images to disk and passing references decreases the total memory use but increases i/o time
You can get even cleverer with Dask & arrays and whatnot
The DaskExecutor will help with some of this out-of-the-box, but ultimately the whole array will be in memory at some point I believe

simone

09/17/2020, 7:09 PM
great! thanks a lot for the thorough explanation! really appreciated! I will play around and see what is the best solution for my application.