Hello! I am wondering if there is any way to set a...
# ask-community
j
Hello! I am wondering if there is any way to set a job priority at the worker level? For background, our situation is this:
• We use Prefect to orchestrate an ML pipeline which is triggered by a consumer-facing application. The pipeline starts whenever a user initiates a video upload to our site.
• Flow 1 copies the user's video onto one of our GPU servers for processing. We have many servers, so we choose a random one. Each GPU server has its own local drive, which is NFS-mounted on all of the other GPU servers.
• After the upload is complete, we kick off Flow 2, which runs our series of ML models on the video.
We would like to send all of our GPU processing jobs to a single work pool and have a single worker on each GPU server accepting jobs from that pool. However, we would like the workers to prefer jobs where the video is already stored on their local drive. If there are no jobs available for local files, then the worker should start processing a video from one of the other servers over NFS.
I can see that it is possible to create work queues within each work pool and set a global priority for each work queue. I also see that it is possible to filter the work queues that a worker will pull from. I can't seem to find a way to make this fit our use case though. We would like one work queue within our GPU processing pool for each GPU server, but we would need each worker to set its own priority for the different queues. Maybe there is some other way to accomplish this that I am missing?
d
Hey Jacob! At the risk of getting too meta, whenever I encounter some more complex scheduling logic (similar to what you've laid out here), I often end up writing another flow to accomplish it. You have a set of steps here that make perfect sense in a new flow, the Router Flow:
• Flow 1 kicks off the new Router Flow
• The Router Flow accepts parameters that detail which server Flow 1 ran on and whether it stored files on that server
• The Router Flow determines which work queue should run Flow 2, giving certain queues/servers affinity or choosing randomly as appropriate
• The Router Flow runs create_flow_run_from_deployment and passes the work queue it chose
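The queue-selection step of the Router Flow could look roughly like the sketch below. It is a minimal illustration only: the function name choose_work_queue, its parameters, and the "<server>-queue" naming convention are all hypothetical, and the actual submission to Prefect is left out.

```python
import random
from typing import Optional, Sequence


def choose_work_queue(
    upload_server: str,
    files_stored_locally: bool,
    available_queues: Sequence[str],
    rng: Optional[random.Random] = None,
) -> str:
    """Pick the work queue that should run Flow 2.

    Assumes (hypothetically) one queue per GPU server, named "<server>-queue".
    """
    if not available_queues:
        raise ValueError("no work queues to choose from")

    # Affinity: prefer the queue of the server that already holds the file.
    preferred = f"{upload_server}-queue"
    if files_stored_locally and preferred in available_queues:
        return preferred

    # Otherwise fall back to a random queue.
    rng = rng or random.Random()
    return rng.choice(list(available_queues))
```

The Router Flow would then pass the returned name along when it creates the Flow 2 run. How the queue name is passed depends on the Prefect version in use, so check the signature of the flow-run creation call you rely on.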
j
Hey Dylan, thanks for the response! We have found ourselves using a similar sort of "Router Flow" pattern in our app as well 🙂 We might be able to apply that here, but it seems somewhat complicated.
The problem we have is that the videos vary a lot in length. Some are 10-15 minutes, some are 10-12 hours. So we don't necessarily want our "Router" to decide ahead of time which of our GPU servers the flow should run on. We don't know how long each processing job will take to finish, and we can only run 1-2 jobs at a time on each server due to resource constraints. So we don't want to run into a situation where we put a bunch of long videos into a single server's work-pool while another server's work-pool holds only shorter videos.
One option I can see is to calculate the total sum of video hours that we have placed in each server's work-pool from within our Router Flow, and then guess which server's work-pool is the "best" one to place the next video into. This seems not ideal though. If one of our servers goes down for whatever reason, its jobs will end up stranded in that particular server's work pool instead of getting picked up by another server's worker. It is also difficult to predict ahead of time when the next GPU will become available based on the length of the video alone. For example, if any processing job has to go through retry attempts, it can throw off our guesses about the next best server and lead to certain work-pools backing up while others are empty.
It would be ideal if we could send all of our GPU jobs for all servers to a single work pool, and then have each worker on each particular server inform the Prefect API about its job preferences when it polls for a new job. I was hoping for something like:
prefect worker start --name server-1 --pool gpu_jobs --work-queue-priority-config '{"server-1-queue": 1, "server-2-queue": 2}'
prefect worker start --name server-2 --pool gpu_jobs --work-queue-priority-config '{"server-1-queue": 2, "server-2-queue": 1}'
Then when we submit a flow run to the "gpu_jobs" pool, we set the queue to whichever server has the file stored locally.
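To make the desired behavior concrete, here is a minimal Python sketch of the polling logic described above. It is only a simulation with in-memory queues, not an existing Prefect feature as far as I can tell; next_job, the queue names, and the job names are all hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Optional


def next_job(
    queues: Dict[str, Deque[str]],
    priorities: Dict[str, int],
) -> Optional[str]:
    """Return the next job for one worker, or None if every queue is empty.

    A lower number means a stronger preference, mirroring the
    hypothetical --work-queue-priority-config option.
    """
    # Walk this worker's queues from most to least preferred.
    for name in sorted(priorities, key=priorities.get):
        pending = queues.get(name)
        if pending:
            return pending.popleft()
    return None


# One shared pool with one queue per server (hypothetical names).
queues = {
    "server-1-queue": deque(["video-a"]),
    "server-2-queue": deque(["video-b", "video-c"]),
}
server_1_prefs = {"server-1-queue": 1, "server-2-queue": 2}

first = next_job(queues, server_1_prefs)   # local file is taken first
second = next_job(queues, server_1_prefs)  # then falls back to server 2's queue
```

With this scheme nothing is assigned to a server ahead of time, so a job whose home server is busy or down would still be picked up by the next worker that polls.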
d
So I don't think what you're hoping for is possible at the moment because I don't think we support that level of preference-setting. I suspect we would probably use some sort of tag affinity or preference as the mechanism for saying "we prefer to run here but will run somewhere else" if we were going to implement this. Would you mind writing up an issue on https://github.com/PrefectHQ/prefect and we'll move the discussion there?
j
Sure thing! Just filled one out - https://github.com/PrefectHQ/prefect/issues/13826
d
Thanks!