Hey everyone - I’m trying to figure out what would be a good architecture on AWS for running a large number of flows in a burst (ex: X00,000 flows run once a week, broken into smaller batches). Ideally I’d want the backing infrastructure that these flows run on to be ephemeral, so it seems like I could use any of the following to do this:
• Spinning up more agents temporarily (current plan)
• Kubernetes jobs
• ECS (?)
• Dask + Fargate (?)
◦ I know Dask parallelism operates at task level rather than flow level
I’m wondering if anyone here has similar use cases. If so, what works well for you?
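For concreteness, the fan-out I have in mind is roughly the sketch below (deployment name, parameter, and batch size are made up, and it assumes a recent Prefect 2 with `run_deployment`):
```python
# Rough sketch: fan out flow runs against a deployment in smaller batches.
# The deployment name, parameter, and batch size are placeholders.
from prefect.deployments import run_deployment

BATCH_SIZE = 500  # illustrative

def submit_batch(items):
    for item in items:
        # timeout=0 returns as soon as the run is created instead of
        # waiting for it to finish
        run_deployment(
            name="my-flow/my-deployment",
            parameters={"item_id": item},
            timeout=0,
        )

def submit_all(all_items):
    # submit in batches rather than all at once
    for start in range(0, len(all_items), BATCH_SIZE):
        submit_batch(all_items[start:start + BATCH_SIZE])
```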
✅ 1
Anna Geller
10/21/2022, 3:43 PM
ECS Fargate using the ECSTask block might be a really good and painless option
Example repo with a blog post and video demo
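Roughly, the setup is just an ECSTask infrastructure block plus a deployment that uses it. A minimal sketch, assuming the prefect-aws collection is installed (image, block, and deployment names below are placeholders):
```python
# Minimal sketch: run flows on ECS Fargate via the ECSTask infrastructure block.
# Image, block name, and deployment name are placeholders.
from prefect import flow
from prefect.deployments import Deployment
from prefect_aws.ecs import ECSTask


@flow
def my_flow():
    print("running on Fargate")


ecs = ECSTask(
    image="my-registry/my-flow-image:latest",  # placeholder image with your flow code
    cpu=1024,      # 1 vCPU (ECS CPU units)
    memory=2048,   # 2 GB (MiB)
    launch_type="FARGATE",
)
ecs.save("burst-ecs", overwrite=True)

deployment = Deployment.build_from_flow(
    flow=my_flow,
    name="weekly-burst",
    infrastructure=ecs,
)
deployment.apply()
```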
Anna Geller
10/21/2022, 3:44 PM
Fargate now supports up to 120 GB of memory for a single container, which may obviate the need to move to distributed compute with Dask Cloud Provider
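On the ECSTask block that could look something like this (a sketch; the image and exact values are illustrative):
```python
# Sketch: size a single large Fargate task instead of running a Dask cluster.
# 16 vCPU / 120 GB is the current Fargate maximum; the image is a placeholder.
from prefect_aws.ecs import ECSTask

big = ECSTask(
    image="my-registry/my-flow-image:latest",  # placeholder
    cpu=16384,       # 16 vCPU in ECS CPU units
    memory=122880,   # 120 GB expressed in MiB
    launch_type="FARGATE",
)
big.save("big-fargate", overwrite=True)
```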
Krishnan Chandra
10/21/2022, 3:46 PM
Thanks Anna! I actually had your repo open while making this thread 🙂
I’m curious about the memory point though - in my case I’d mainly be going distributed to parallelize compute more than anything else
🙌 1
Anna Geller
10/21/2022, 5:49 PM
I mean that earlier, when Fargate supported only small amounts of memory per container, you had no choice but to go distributed. Now you have the choice to run things on a single, more powerful container (support added just a couple of weeks ago), all serverless (no ops) and without the cost of the distributed coordination that Dask requires
Krishnan Chandra
10/21/2022, 5:51 PM
Ah gotcha. That’s helpful too in case I need any super large jobs in the future
🙌 1
Anna Geller
10/21/2022, 5:54 PM
In case you're interested, this thread discusses the challenges of running Dask on Fargate -- it might be easier to avoid unless really necessary