b
Hi, I have a huge list of latitudes and longitudes (over 1MM) and I am trying to download images corresponding to each of these lat-lons and do a bunch of downstream processing on them. I am using a `map` to parallelise over the lat-lon pairs, but I have a couple of questions here:
1. I know that if I use a `DaskExecutor` or `LocalDaskExecutor` the flow is distributed, but is there any limit to applying `map` over such a large collection?
2. Instead of using threads to run the computation, is it possible to make use of `async`, since most of the tasks are heavily IO bound? What are some considerations I should make here?
Thanks!
e
1. I think Prefect can handle maps over millions of items well, as long as the underlying Dask cluster is sufficiently sized.
2. Yes, but you need to use async strictly inside your tasks. Split your lat-lon pairs into, say, batches of size 100; then every task can download its own batch of 100 asynchronously. This might help with point 1 too, since you are reducing the task count. On point 2, Prefect's unit of work is the task, and Dask won't run those tasks asynchronously, so async isn't a viable option for concurrently executing a lot of Prefect tasks. But inside a task, async can be used with no issues.
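To make point 2 concrete, here is a minimal sketch of that pattern, assuming Prefect 1.x-style tasks and aiohttp for the downloads; the URL and the `download_image_batch` name are placeholders for your actual image endpoint and task:
```python
import asyncio
import aiohttp
from prefect import task

@task
def download_image_batch(batch):
    """Download every image for one batch of (lat, lon) pairs concurrently."""

    async def fetch(session, lat, lon):
        # placeholder URL -- swap in whatever image service you are actually hitting
        url = f"https://example.com/image?lat={lat}&lon={lon}"
        async with session.get(url) as resp:
            resp.raise_for_status()
            return await resp.read()

    async def fetch_all():
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(
                *(fetch(session, lat, lon) for lat, lon in batch)
            )

    # The task stays synchronous from Prefect/Dask's point of view;
    # the event loop only exists for the duration of this one task run.
    return asyncio.run(fetch_all())
```
Each mapped task then runs its ~100 downloads concurrently inside the task, while Dask handles the parallelism across tasks.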
b
@emre thanks for the reply. Are you proposing I take 100 lat-lon pairs as one atomic unit (i.e. one task) and have those 100 pairs leverage async inside the task?
e
Yeah, 1M / 100 = 10k, so you would have 10k mapped Prefect tasks in this case. 100 was a completely arbitrary number btw; try different batch sizes to see if performance improves or degrades.
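For reference, a rough sketch of the batching itself, again assuming the Prefect 1.x `Flow`/`DaskExecutor` API discussed in this thread and reusing the `download_image_batch` task sketched above; `all_lat_lons` stands in for your actual list of 1MM pairs:
```python
from prefect import Flow, task
from prefect.executors import DaskExecutor, LocalDaskExecutor

BATCH_SIZE = 100  # arbitrary starting point -- tune it and compare

@task
def make_batches(pairs, batch_size=BATCH_SIZE):
    # chunk the flat list of (lat, lon) tuples into lists of batch_size pairs
    return [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]

all_lat_lons = [(37.77, -122.42), (40.71, -74.01)]  # replace with your full 1MM-pair list

with Flow("download-images") as flow:
    batches = make_batches(all_lat_lons)
    images = download_image_batch.map(batches)  # ~10k mapped tasks at batch size 100

flow.run(executor=DaskExecutor())  # or LocalDaskExecutor() on a single machine
```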
b
yeah got that part. thanks for the help!
k
Hey @Bishwarup B, what Emre said is right. Just note that task Results are held in memory after flows complete, and having a high number of futures in Dask can indeed lead to memory blowing up, so batching would help with that.