
Ralph Willgoss

10/26/2020, 2:32 PM
Hi, thanks for all the help over the past few weeks and months. I've been evaluating Prefect for a use case, and right now it looks like it doesn't add much for what we do. I wanted to check with you guys to see if I'm missing something or if I'm on the right track. I have a Python model that is very disk intensive. We generate and move around lots of intermediate data, about 180 GB, and that is going to increase. I currently run the model using LocalDaskExecutor on a single AWS EC2 instance with about 92 cores and 196 GB RAM. While Prefect gives me the ability to scale horizontally, if I were to spread the work across instances, moving the intermediate data between them would be slower than reading it off local disk. So in summary: I can scale vertically, but our model seems limited by disk access, so using something like Prefect to go horizontal appears to add overhead. Thoughts, questions, opinions all welcome.

Raphaël Riel

10/26/2020, 3:00 PM
Could your data live in centralized storage (S3, in the case of AWS)? Every EC2 instance launched would then read from there. Are you able to reference data by pointer instead of moving it around?

Ralph Willgoss

10/26/2020, 3:07 PM
Thanks @Raphaël Riel. Yes, I could, but then I'd lose speed moving the data. I can read off NVMe SSDs at more than 1.5 GB/s. Doubt I'd get that from S3. Right now I use pointers; they point to disk.

Raphaël Riel

10/26/2020, 3:38 PM
I see. You’ll definitely pay a penalty going over the network. But as far as I can see, you could get up to 3GB/s in some conditions. https://aws.amazon.com/premiumsupport/knowledge-center/s3-maximum-transfer-speed-ec2/
Otherwise, maybe a shared volume (SAN) could be attached across your EC2 instances?
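For a rough sense of scale, here is a back-of-the-envelope sketch using only the figures quoted in this thread (about 180 GB of intermediate data, about 1.5 GB/s from local NVMe, up to 3 GB/s from S3 under favourable conditions). The helper name `transfer_seconds` is made up for illustration, and the estimate assumes one sequential pass at sustained throughput; it ignores latency, request overhead, and contention, so real numbers may differ a lot.

```python
def transfer_seconds(size_gb: float, throughput_gb_per_s: float) -> float:
    """Time to move size_gb once at a sustained throughput (idealised)."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_gb / throughput_gb_per_s


# Figures taken from the messages above; they are quoted estimates, not measurements.
DATA_GB = 180.0
SCENARIOS = {
    "local NVMe (about 1.5 GB/s)": 1.5,
    "S3, best case quoted (about 3 GB/s)": 3.0,
}

for label, rate in SCENARIOS.items():
    seconds = transfer_seconds(DATA_GB, rate)
    print(f"{label}: {seconds:.0f} s per full pass over the data")
```

The point is less the exact numbers than the shape of the trade-off: every extra pass over the data on a remote store costs another full transfer, while local disk only has to be read.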

Ralph Willgoss

10/26/2020, 3:54 PM
Thanks @Raphaël Riel, good food for thought. Do you have much experience with similar setups?

Raphaël Riel

10/26/2020, 3:56 PM
I’m pretty good with AWS, but not at such a big scale (EC2 instances that big, or that much data).

Avi A

10/26/2020, 4:03 PM
Passing data between nodes in the same subnet (e.g. if they’re in the same Kubernetes cluster) might be faster than you think, and you may not need EFS for that. Also, I’m pretty sure you can get a single AWS machine with more RAM (on GCP I have a machine with 32 cores and >600GB RAM).
In any case, this is not a problem that Prefect aims to solve (as far as I can see).

Ralph Willgoss

10/26/2020, 4:06 PM
@Avi A I've noticed that there appears to be an overhead in Prefect when passing around large objects. I saw lots of warnings about graph sizes. Hence, we moved to passing pointers to the data around instead. We are evaluating Prefect for our case, seeing where it fits in.
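A minimal sketch of the "pass pointers, not data" pattern described here, written as plain Python functions rather than actual Prefect tasks. The function names and the file layout are hypothetical; in a real flow each function would presumably be wrapped as a task, and only the small `Path` value would travel through the task graph.

```python
import os
import tempfile
from pathlib import Path


def generate_intermediate(out_dir: Path, n_bytes: int) -> Path:
    """Write intermediate data to disk and return only its location."""
    path = out_dir / "intermediate.bin"
    path.write_bytes(os.urandom(n_bytes))
    return path  # a small pointer, not the payload itself


def summarise(path: Path) -> int:
    """Downstream step: open the file by path and stream it in chunks."""
    total = 0
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(64 * 1024), b""):
            total += len(chunk)
    return total


with tempfile.TemporaryDirectory() as tmp:
    pointer = generate_intermediate(Path(tmp), 1_000_000)
    print(f"passed between steps: {pointer.name}")
    print(f"bytes read downstream: {summarise(pointer)}")
```

The catch, as the thread points out, is that a path only means something on a machine that can see that disk, which is why this works well on one large instance and gets harder once work is spread across several.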

Avi A

10/26/2020, 4:07 PM
You’re right, but from what I can tell it’s a Dask issue

Ralph Willgoss

10/26/2020, 4:16 PM
@Avi A yeah, exactly. We've worked around it, but that means I've been challenged on what exactly we are getting with Prefect, considering that it's something that needs to be learned and some aspects of it aren't always easy.