Yaron Levi
06/26/2024, 7:53 AM
@flow(task_runner=EcsTaskRunner())
def my_weekly_huge_files_job(names):
    files_on_s3 = get_list_of_huge_files()
    for path, file_size in files_on_s3:
        if file_size > xxx:  # xxx = some size threshold
            process_file.submit(path, memory_size='4gb')
        else:
            process_file.submit(path, memory_size='1gb')
The code above would be very robust: if processing one huge file crashes, it wouldn’t affect the other instances.
I believe such an EcsTaskRunner() would be adopted by the community very quickly, since many people already use ECS for their push work pools.
This opens up many possibilities for distributed compute on remote, separate machines, without bringing in big guns like Ray or Dask.
As a side note, maybe Dask and Ray are already much more accessible these days? I could swipe a credit card and use Coiled.io (a managed Dask cluster) to get a very similar experience to what I’ve described in the code above (a rough sketch of that alternative follows below). But still, ECS would be much more common and affordable.
Am I missing something here?
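For context, here is a minimal sketch of what that Dask route could look like today with the existing prefect-dask integration. DaskTaskRunner, dask.annotate and .submit() are real Prefect/Dask APIs; get_list_of_huge_files, SIZE_THRESHOLD and the "memory" resource values are hypothetical placeholders standing in for the details elided in the snippet above, and the Dask workers would have to declare a matching resource themselves.

import dask
from prefect import flow, task
from prefect_dask import DaskTaskRunner

# Hypothetical cut-off standing in for "xxx" in the snippet above (here: 2 GB).
SIZE_THRESHOLD = 2 * 1024**3


def get_list_of_huge_files():
    """Placeholder for the S3 listing used in the snippet above."""
    return []  # e.g. [("s3://bucket/big.parquet", 5 * 1024**3), ...]


@task
def process_file(path):
    ...  # download the file from S3 and process it


@flow(task_runner=DaskTaskRunner())  # could also point at a Coiled or Fargate cluster
def my_weekly_huge_files_job():
    for path, file_size in get_list_of_huge_files():
        # Dask annotations let each submission request different abstract worker
        # resources, approximating the per-task memory_size idea from the ECS snippet.
        # Workers must declare a matching resource, e.g. `dask worker --resources "memory=4"`.
        if file_size > SIZE_THRESHOLD:
            with dask.annotate(resources={"memory": 4}):
                process_file.submit(path)
        else:
            with dask.annotate(resources={"memory": 1}):
                process_file.submit(path)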
Will Raphaelson
06/26/2024, 3:51 PM

Yaron Levi
06/26/2024, 3:52 PM