# ask-community
d
I have a quick question: I was thinking about using preemptible machines. If I'm using checkpointing and caching the results, I guess it would just retry the flow?
k
Hi @Damien Ramunno-Johnson, I guess if you use LocalResult, the rerun won't be able to get the results (they live on the preempted machine's disk), but for a setup like this, S3Result or something similar is probably an option.
d
Yeah, we are normally using GCS for the results/cache
Potentially big cost savings
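A minimal sketch of why the result location matters for a rerun, assuming a simplified checkpoint model rather than Prefect's real Result classes: `run_flow`, `Preempted`, `transform`, and the dict-based `store` are hypothetical stand-ins, with the dict playing the role of a GCS or S3 bucket. Each task's output is checkpointed as soon as it finishes, and a rerun reuses anything already in the store.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical task signature: receives upstream results, returns a value.
TaskFn = Callable[[Dict[str, int]], int]


class Preempted(Exception):
    """Simulates the VM being reclaimed in the middle of a task."""


def run_flow(tasks: List[Tuple[str, TaskFn]],
             store: Dict[str, int],
             log: List[str]) -> Dict[str, int]:
    """Run tasks in order, reusing any result already checkpointed in `store`."""
    results: Dict[str, int] = {}
    for name, fn in tasks:
        if name in store:
            results[name] = store[name]  # cache hit: skip the work
            log.append(f"cached {name}")
            continue
        value = fn(results)              # may raise Preempted
        store[name] = value              # checkpoint as soon as the task succeeds
        results[name] = value
        log.append(f"ran {name}")
    return results


preempt_next = {"armed": True}


def transform(upstream: Dict[str, int]) -> int:
    # Hypothetical task that gets "preempted" on its first attempt only.
    if preempt_next["armed"]:
        preempt_next["armed"] = False
        raise Preempted("node reclaimed")
    return upstream["extract"] * 2


tasks: List[Tuple[str, TaskFn]] = [
    ("extract", lambda upstream: 21),
    ("transform", transform),
]

shared_store: Dict[str, int] = {}  # outlives any single VM, like a bucket
log: List[str] = []
try:
    run_flow(tasks, shared_store, log)      # first attempt dies mid-flow
except Preempted:
    pass
final = run_flow(tasks, shared_store, log)  # rerun on a fresh node
print(log)  # ['ran extract', 'cached extract', 'ran transform']
```

With LocalResult-style storage, the dict would live on the preempted machine and disappear with it, so the rerun on a fresh node would start from an empty store and run `extract` again.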
d
Just be careful to exempt your agent node from preemption
d
Okay, in our case the flows would run on the agent machines themselves, via the Docker agent. I was thinking we would have several agents running so that a flow could be retried on a different agent.
But I guess retrying the flow on a new agent like that isn't supported? That also raises my second question: if an agent fails (let's say it goes OOM, or something like that), would flows be retried on a new agent?
I was thinking we could use a managed instance group that would make sure there was always one agent running (and autoscale when needed).
k
I think the flows will not retry on a new agent automatically, but if the flow were re-run, a new agent could pick it up, yeah.
Agents with shared settings are assigned the same ID, so you can have two instances registered as the same agent.
d
Any agent that can pick up a flow run (including one that was previously in a Failed state and is now in a Scheduled state) will attempt to do so. The way you configure which agents attempt to pull work for which flows is through labels.
You can have multiple agents in this scenario and put them on preemptible nodes
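A toy model of the label matching described above, assuming the simple rule that an agent only picks up a Scheduled run whose labels are all present on the agent. The `Agent`/`FlowRun` classes, agent names, and labels are hypothetical, and this is not Prefect's own code. It shows a run that fails with its preempted agent, is rescheduled by a re-run, and then lands on a surviving agent with matching labels.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass
class Agent:
    name: str
    labels: FrozenSet[str]
    alive: bool = True  # False once its preemptible node is reclaimed


@dataclass
class FlowRun:
    flow: str
    labels: FrozenSet[str]
    state: str = "Scheduled"


def can_run(agent: Agent, run: FlowRun) -> bool:
    # Assumed matching rule: every label on the run must also be on the agent.
    return agent.alive and run.labels <= agent.labels


def pick_agent(agents: List[Agent], run: FlowRun) -> Optional[Agent]:
    """Return the first live, label-compatible agent for a Scheduled run."""
    if run.state != "Scheduled":
        return None  # a Failed run is not picked up until it is rescheduled
    for agent in agents:
        if can_run(agent, run):
            return agent
    return None


agents = [
    Agent("docker-a", frozenset({"gcp", "preemptible"})),
    Agent("docker-b", frozenset({"gcp", "preemptible"})),
    Agent("docker-gpu", frozenset({"gcp", "gpu"})),
]
run = FlowRun("etl", frozenset({"preemptible"}))

first = pick_agent(agents, run)         # docker-a picks it up
agents[0].alive = False                 # docker-a's node is preempted...
run.state = "Failed"                    # ...and the run ends up Failed
while_failed = pick_agent(agents, run)  # nobody picks up a Failed run
run.state = "Scheduled"                 # the flow run is re-run (rescheduled)
second = pick_agent(agents, run)        # a surviving matching agent takes it
print(first.name, while_failed, second.name)  # docker-a None docker-b
```

The `docker-gpu` agent never matches because the run's `preemptible` label is missing from its label set, which is how labels keep work on the intended pool of nodes in this model.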
t
Sounds like a sound architecture, Damien. This is a good use of Prefect retries. There are some trip-ups doing this at scale with k8s (because k8s will also try to retry your jobs, less smartly), but if you're not using k8s, it should be straightforward.
You may not need 24/7 uptime on your agent; that's up to you. If you don't mind a little disruption or some late flows, you should be able to run without multiple agents.
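A minimal, generic sketch of task-level retries, loosely modeled on the `max_retries` and `retry_delay` settings that Prefect 1.x tasks accept. `run_with_retries` and `flaky` are hypothetical helpers, not Prefect's implementation, and the injectable `sleep` only exists so the example runs instantly. In this sketch the retry loop runs in the same process as the task, so it helps with transient errors but would not survive the node itself being reclaimed.

```python
import time
from datetime import timedelta
from typing import Callable, List, TypeVar

T = TypeVar("T")


def run_with_retries(fn: Callable[[], T],
                     max_retries: int,
                     retry_delay: timedelta,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying up to max_retries times with a fixed delay between attempts."""
    retries_used = 0
    while True:
        try:
            return fn()
        except Exception:
            if retries_used >= max_retries:
                raise  # out of retries: surface the last error
            retries_used += 1
            sleep(retry_delay.total_seconds())


calls = {"n": 0}


def flaky() -> str:
    # Hypothetical task that fails twice with a transient error, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "done"


slept: List[float] = []
result = run_with_retries(flaky, max_retries=3,
                          retry_delay=timedelta(seconds=30), sleep=slept.append)
print(result, calls["n"], slept)  # done 3 [30.0, 30.0]
```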
d
Ah, interesting. I have a managed instance group with autoscaling set up so that if a node goes down, it will spin up again (if needed). I will try it with the retries. Thanks!