We are close to getting our initial orchestration pipeline ported 🙂 but we're a bit confused about how to get long jobs running. tips appreciated.
setup:
-- server 'ui': running the ui container
-- server 'gpu': running a prefect agent; it registers with 'ui' so it can pick up gpu jobs.
-- server 'nb': jupyter notebooks we use to submit jobs. has a local prefect agent installed that points to 'ui'. these notebooks often die
we can do quick one-offs fine. hurray!
tricky case 1: long historic job
we want to run a ~3 day job that processes 200 files, one at a time sequentially in sorted order. the problem is that the notebook server that runs the job will periodically stop, so we really want to submit a job like
seq([ task_1(file_1), task_2(file_2), ... task_n(file_n)])
As soon as the meta-task is submitted, the notebook (and its local agent) can stop. For the next 3 days, however, we want those tasks to run one at a time, and we want to see status in the ui (incl. fails/retries). if we ever want to, we can rerun the flow to add/swap tasks.
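To make the semantics concrete, here is a minimal pure-Python sketch of what `seq(...)` means to us: process files one at a time in sorted order, retry each a fixed number of times, and stop the chain if a file permanently fails (mirroring the default "only run if upstream succeeded" behavior). `process_file` and `max_retries` are illustrative names, not Prefect API:

```python
def run_sequential(files, process_file, max_retries=2):
    """Process files one at a time, in sorted order, with per-file retries.

    Returns a dict mapping each attempted file to either
    ("success", result) or ("failed", error_repr).
    """
    results = {}
    for f in sorted(files):
        for attempt in range(max_retries + 1):
            try:
                results[f] = ("success", process_file(f))
                break
            except Exception as exc:
                if attempt == max_retries:
                    # out of retries: record the failure
                    results[f] = ("failed", repr(exc))
        if results[f][0] == "failed":
            # downstream files never run, like a default all_successful trigger
            break
    return results
```

If we understand correctly, in Prefect 1.x the same chain could be expressed inside a `Flow` by calling `set_upstream()` on each task against the previous one, then registering the flow so the agent on 'gpu' picks it up after the notebook dies; we'd appreciate confirmation that this is the intended pattern.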