Hey, I’m passing a fairly large object from one task and then mapping a following task over an iterable, with that big object passed as unmapped.
Dask’s advice for this situation is:
Consider scattering large objects ahead of time with client.scatter to reduce scheduler burden and keep data on workers:
future = client.submit(func, big_data) # bad
big_future = client.scatter(big_data) # good
future = client.submit(func, big_future) # good
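Here’s roughly what my flow looks like right now (Prefect 1.x style; the task names and data are just placeholders):
from prefect import Flow, task, unmapped

@task
def load_big_object():
    # builds the large object that every mapped run needs
    return {"weights": list(range(1_000_000))}

@task
def get_items():
    return [1, 2, 3]

@task
def process(item, big_object):
    # every mapped call receives the same big object via unmapped()
    return item

with Flow("map-with-big-object") as flow:
    big = load_big_object()
    items = get_items()
    results = process.map(items, unmapped(big))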
I was wondering, what would be the Prefect pattern here to scatter the object ahead of time?
Kevin Kho
01/19/2022, 12:30 AM
Even if it worked, it’s not best practice, since your client/scheduler needs to upload this large object to each of your workers. It would be better if this object were somewhere like S3 and the workers independently pulled it from there.
Also, scatter specifically fails with autoscaling
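Roughly something like this; the bucket/key names and the use of boto3 + cloudpickle here are just illustrative assumptions, not a Prefect-specific API:
import boto3
import cloudpickle
from prefect import task

@task
def process(item, bucket="my-bucket", key="big_object.pkl"):
    # each worker pulls the big object from S3 on its own,
    # instead of receiving it from the client/scheduler
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    big_object = cloudpickle.loads(body)
    # ... use big_object together with item ...
    return item
Then you’d map that task over your iterable and drop the big object from the flow entirely.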
Philipp Eisen
01/19/2022, 9:36 AM
Thanks for the reply.
Philipp Eisen
01/19/2022, 9:37 AM
Would I then have a step in the task that pulls the object from a bucket?