Hey Dylan, thanks for the response! We have found ourselves using a similar sort of "Router Flow" pattern in our app as well 🙂
We might be able to apply that here, but it seems somewhat complicated. The problem is that our videos vary a lot in length: some are 10-15 minutes, some are 10-12 hours. So we don't necessarily want our "Router" to decide ahead of time which of our GPU servers a flow should run on. We don't know how long each processing job will take to finish, and resource constraints limit each server to 1-2 concurrent jobs. We want to avoid a situation where one server's work pool ends up with a batch of long videos while another server's work pool has only short ones.
One option I can see is to calculate, from within our Router Flow, the total video hours we have placed in each server's work pool, and then guess which work pool is the "best" place for the next video. That doesn't seem ideal, though: if one of our servers goes down for whatever reason, its jobs end up stranded in that server's work pool instead of getting picked up by another server's worker.
It is also difficult to predict, from video length alone, when the next GPU will become available. For example, if a processing job goes through retry attempts, that throws off our guess about the next best server and can leave some work pools backed up while others sit empty.
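For what it's worth, the duration-sum heuristic described above would look roughly like this (pool names and durations are made up, and this greedy guessing is exactly what we'd rather avoid):

```python
from collections import defaultdict

def pick_work_pool(pending_hours: dict[str, float]) -> str:
    """Return the work pool with the smallest total backlog of video hours."""
    return min(pending_hours, key=pending_hours.get)

def route_videos(videos: list[float], pools: list[str]) -> dict[str, list[float]]:
    """Greedily assign each video (duration in hours) to the least-loaded pool.

    This only tracks queued hours; it can't account for server outages or
    retries, which is why the estimate drifts in practice.
    """
    pending = {pool: 0.0 for pool in pools}
    assignments = defaultdict(list)
    for duration in videos:
        pool = pick_work_pool(pending)
        assignments[pool].append(duration)
        pending[pool] += duration
    return dict(assignments)

# Example: one 12-hour video immediately skews the routing, so the
# other pool absorbs everything else.
result = route_videos([12.0, 0.25, 10.0, 0.2], ["server-1-pool", "server-2-pool"])
# → {"server-1-pool": [12.0], "server-2-pool": [0.25, 10.0, 0.2]}
```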
It would be ideal if we could send all of our GPU jobs for all servers to a single work pool, and then have each worker on each server inform the Prefect API about its queue preferences when it polls for a new job. I was hoping for something like:
```
prefect worker start --name server-1 --pool gpu_jobs --work-queue-priority-config '{"server-1-queue": 1, "server-2-queue": 2}'
prefect worker start --name server-2 --pool gpu_jobs --work-queue-priority-config '{"server-1-queue": 2, "server-2-queue": 1}'
```
Then, when we submit a flow run to the "gpu_jobs" pool, we'd set the queue to whichever server has the file stored locally.