Hello! I am trying to evaluate whether Prefect will handle our use case and have a couple of (hopefully quick) questions.
A little background:
We currently run long-running simulation executables (Windows) that produce output files. The setup for each simulation is a directory of Python files that the simulation knows how to run. Previously we would glob the directory and submit tasks that call subprocesses to run the executable across Dask workers. Usually we create a directory full of these tasks and submit a directory (or a glob path covering multiple directories) as Dask futures. Our current setup is a single machine running the Dask scheduler and multiple machines each running their own Dask worker, with the CPU count controlling how many tasks run concurrently per machine. Jobs are started through a CLI, passing a glob path, from any Dask client that can connect to the scheduler.
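For reference, the current submission path looks roughly like this sketch (the `simulate.exe` name and glob pattern are placeholders for our actual executable and layout; the `dask.distributed` calls are shown in comments since they need a live scheduler):

```python
import glob
import subprocess
from pathlib import Path

# Placeholder for the actual Windows simulation executable.
SIM_EXE = "simulate.exe"

def collect_setup_files(glob_path: str) -> list[Path]:
    """Glob one directory (or a pattern covering several) of simulation setup files."""
    return sorted(Path(p) for p in glob.glob(glob_path))

def run_simulation(setup_file: Path) -> int:
    """Run one simulation as a subprocess; return its exit code."""
    result = subprocess.run([SIM_EXE, str(setup_file)])
    return result.returncode

# With a dask.distributed Client connected to the scheduler, each setup file
# becomes a future executing run_simulation on some worker:
#
#   from dask.distributed import Client
#   client = Client("tcp://scheduler:8786")
#   futures = client.map(run_simulation, collect_setup_files("sims/*/setup_*.py"))
#   client.gather(futures)
```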
This worked reasonably well until a Dask upgrade made the scheduler misbehave, leaving many workers idle for hours on end. After about a week of fighting with it, I'm reassessing our current approach of Dask workers executing subprocesses.
https://dask.discourse.group/t/scheduler-not-saturating-workers/2076/2
From a quick glance at the documentation, it looks like I could create a single Prefect worker process on each machine with a concurrency limit tied to that machine's CPU count, have the Prefect server replace the Dask scheduler, and turn our current CLI (glob directories, then submit subprocesses to Dask) into a parameterized flow that does the globbing and launches the subprocesses.
Is there a way to run a flow only once? Our simulations have no concept of a schedule and just need to run when submitted.
Our post-processing of these simulations polls the output files for a while, but once the task is over it no longer needs to look at those files. How would that best be handled?
Is there a better way to handle each worker machine's CPU/memory limits?
Do I need to start multiple work queues/agents to fully parallelize the subprocesses, or will the subprocess concurrency behave fine with a single one? (If I do need multiple work queues, can I create them from a Python script, or only via CLI calls?)
If I want these workers to also run non-simulation tasks, would that require additional agents/workers?
Sorry for the wall of text! And thank you for your help!