# prefect-community
j
hey all, i have a question about infrastructure for prefect 2.

my use case is that in general, we run a small number of flows daily that process relatively low volumes of data. they take anywhere from minutes to just under an hour to complete. occasionally, but not very often, our data volume spikes, and they take hours to complete.

right now we are on prefect 1, deployed on ECS. we have one agent running as an ECS task that we've provisioned with low compute. it is slow; we've throttled its number of concurrent flow executions to keep it from freezing; but it works. we need the ability to run our pipelines faster, and the imminent migration to prefect 2 is the right moment to figure this out.

the docs recommend we jack up the ECS task's compute. but 1) we don't know our volume in advance, so a spike could require more than we expected, and 2) it doesn't make sense to provision a crazy amount of compute when we only need it infrequently.

i have the aws lambda model in my head, where scaling horizontally is a piece of cake. does either the docker or k8s infra implementation support this type of thing, e.g. supports N containers concurrently; spins up and kills a container for each flow run? or perhaps some combo of ecs with fargate? what's the best approach to horizontal scaling with prefect 2? or is this not the right approach at all?

worth mentioning here is that we use two different flow patterns:
1. one flow that does all processing << takes a long time
2. subflow pattern, where parent flows delegate chunks of work to child flows << seems to be quicker, but we quickly ran into concurrency issues with multiple flows running side by side

the plan is to migrate what we can to approach #2.
c
Personally, I’m a fan of Kubernetes, so my answer is slightly biased here, but I think it also addresses the issue. You can run a small, scalable nodepool: when you don’t need the compute, it can scale down to 1. When you need the compute, it can scale up as required. When you submit many jobs, they will be scheduled and executed, as many as your nodepool has the resources to schedule, and any others will go PENDING until it’s their turn. Once your jobs complete, the pods terminate, and the nodepool can scale back down.
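To make the scheduling behaviour described above concrete, here is a toy model in plain Python. It is only a sketch: `NodePool`, `slots_per_node`, and the scale-up and scale-down rules are assumptions made for illustration, not a Kubernetes or Prefect API, and it ignores how pods are actually packed onto specific nodes.

```python
import math


class NodePool:
    """Toy autoscaling node pool (hypothetical, for illustration only)."""

    def __init__(self, min_nodes=1, max_nodes=4, slots_per_node=2):
        self.min_nodes = min_nodes
        self.max_nodes = max_nodes
        self.slots_per_node = slots_per_node  # assumed: flow-run pods per node
        self.nodes = min_nodes
        self.running = []
        self.pending = []

    def submit(self, job):
        """Queue a job, then try to place it."""
        self.pending.append(job)
        self._schedule()

    def complete(self, job):
        """A job finished: its pod terminates and frees a slot."""
        self.running.remove(job)
        self._schedule()

    def _schedule(self):
        # Place pending jobs while there is room, adding nodes up to the cap.
        while self.pending:
            if len(self.running) < self.nodes * self.slots_per_node:
                self.running.append(self.pending.pop(0))
            elif self.nodes < self.max_nodes:
                self.nodes += 1
            else:
                break  # pool is full: remaining jobs stay PENDING
        # Scale back down to what the running jobs need, never below the minimum.
        needed = math.ceil(len(self.running) / self.slots_per_node)
        self.nodes = max(self.min_nodes, needed)


pool = NodePool(min_nodes=1, max_nodes=2, slots_per_node=2)
for i in range(5):
    pool.submit(f"flow-run-{i}")
print(pool.nodes, pool.running, pool.pending)
```

With a cap of 2 nodes and 2 slots per node, four of the five runs start and the fifth waits until one of them completes.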
j
When you need the compute, it can scale up as required.
how does k8s know when to scale up? is this something we can set in prefect? or do we need to set it in a k8s config?
how would this be different than autoscaling the ECS task for the prefect agent?
k
Hi Jon, I'm evaluating Prefect as an orchestration framework and found that, infrastructure-wise, there is not much documentation on scaling the agent. I face the same issue as you: there are spikes in activity, and we really want the agent to scale up within seconds (think AWS Lambda, as we found EKS did not scale fast enough). I wonder if you've reached any conclusion from your discussion above?
c
Generally speaking, you either have available capacity, which costs money, or you need to scale up, which takes time. The agents are very lightweight, so scaling horizontally on existing infra is trivial. Yes, if you have to scale in new nodes of any kind of compute, it will take some time. Regarding the “when” to scale, that’s decided by you at two levels: the node (EKS) level, where the node pool scales based on usage, and the agent level, which can scale based on CPU usage.
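As a rough illustration of "scale based on CPU usage": the Kubernetes Horizontal Pod Autoscaler documents its rule as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`. The sketch below applies that formula in plain Python. The function name, the replica bounds, and the 10% tolerance band are assumptions chosen for the example, and whether CPU is a useful signal for agent pods depends on your workload.

```python
import math


def desired_replicas(current_replicas, current_cpu, target_cpu,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Apply the HPA-style scaling rule, clamped to [min_replicas, max_replicas].

    current_cpu and target_cpu are average utilisation in the same unit
    (for example percent of the CPU request).
    """
    ratio = current_cpu / target_cpu
    # Close enough to the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(2, current_cpu=90, target_cpu=50))  # scale up
print(desired_replicas(4, current_cpu=20, target_cpu=50))  # scale down
```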
j
@Christopher Boyd can you provide an example infra where this is trivial:
The agents are very lightweight, so scaling horizontally on existing infra is trivial.
@Kam-ting Tsoi we have not implemented this, but an option is to run an agent as an ECS task that spins up a container on fargate for each flow: https://towardsdatascience.com/prefect-aws-ecs-fargate-github-actions-make-serverless-dataflows-as-easy-as-py-f6025335effc#:~:text=The[…]ECS%20tasks
downsides to this that i anticipate:
1. how to pass data from flow to flow in the subflow pattern
2. startup times for fargate
c
@jon on a VM you can just run a bunch of agents as services. In Kubernetes, you can run any number of agents as pods, just by deploying the helm chart or manifest with your queue. If you want to scale, you can scale the ReplicaSet. I’m not as knowledgeable about ECS specifically, but it shouldn’t be much different from just spinning up the number of task definitions you need to run.
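The idea of several agents serving one queue can be sketched with the standard library alone. This is not Prefect code: real agents poll the Prefect API for scheduled runs rather than sharing an in-memory queue, and `run_agents` and the agent names are hypothetical. It only shows that adding workers to a shared queue spreads the runs without any run being claimed twice.

```python
import queue
import threading


def run_agents(flow_runs, n_agents):
    """Let n_agents worker threads drain one shared queue of flow runs.

    Returns a dict mapping each flow run to the agent that claimed it.
    """
    work_queue = queue.Queue()
    for run in flow_runs:
        work_queue.put(run)

    picked_up = {}
    lock = threading.Lock()

    def agent(name):
        while True:
            try:
                run = work_queue.get_nowait()  # claim the next run, if any
            except queue.Empty:
                return  # nothing left to do
            with lock:
                picked_up[run] = name

    threads = [
        threading.Thread(target=agent, args=(f"agent-{i}",))
        for i in range(n_agents)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return picked_up


claimed = run_agents([f"flow-run-{i}" for i in range(20)], n_agents=3)
print(len(claimed), "runs claimed by", sorted(set(claimed.values())))
```

Which agent claims which run varies between executions, so the example prints only the totals.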
I don’t really see or understand the situation where time is critical but you want the flexible option to scale down
j
on a VM you can just run a bunch of agents as services.
ah, so on prefect 2 an acceptable approach is to spin up more agents? on prefect 1 this does not work