ll
06/04/2021, 1:43 PM
# task 1
./my_cpp_executable file1.xyz
...
# task 1000
./my_cpp_executable file1000.xyz
Each task takes about 4-8 compute hours on 4 CPUs / ~32 GB of memory, and our scheduled workloads add up to about 20,000-40,000+ compute hours per day.
From what I can tell, the only supported strategy for running a large batch of embarrassingly parallel tasks right now is to use Dask.
We have it working, but I feel Dask is more oriented towards (i) interactive analysis workloads, (ii) pure Python tasks, and (iii) small jobs that fit onto each Dask node's local disk. It feels awkward to spin up a Dask executor just to run a one-line shell command in a high-throughput, long-running, queued (num_tasks >> num_cluster_nodes) workload. We'd also prefer not to have to support Dask on our infrastructure, as it adds a whole other set of things our sysengs have to maintain.
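For context, our current glue is roughly the sketch below: a thin Python wrapper that shells out to the binary and fans the 1000 files out as Dask futures (the function name and scheduler address are just placeholders, not our real setup):

import subprocess
from dask.distributed import Client

def run_task(path):
    # One task = one invocation of the C++ binary on one input file.
    subprocess.run(["./my_cpp_executable", path], check=True)
    return path

client = Client("tcp://dask-scheduler:8786")   # placeholder scheduler address
paths = [f"file{i}.xyz" for i in range(1, 1001)]
futures = client.map(run_task, paths)          # queue all 1000 tasks
client.gather(futures)                         # block until everything finishes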
It seems like a better fit if you supported the job queueing systems typically found in HPC environments, like SGE, Slurm, or HTCondor. I figure many of your target users in the fintech, scientific computing, and meteorological space will already have an SGE or Mesos cluster set up in their environment, but not a Dask cluster.
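To make it concrete, the same batch maps naturally onto a Slurm job array; something like the sketch below (per-task resources mirror what I quoted above, the script name and time limit are just illustrative):

import subprocess

# Illustrative only: express the 1000 tasks as one Slurm job array,
# 4 CPUs / 32 GB per task, each array index picking its own input file.
sbatch_script = """#!/bin/bash
#SBATCH --array=1-1000
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=08:00:00
./my_cpp_executable file${SLURM_ARRAY_TASK_ID}.xyz
"""

with open("run_batch.sbatch", "w") as f:
    f.write(sbatch_script)

subprocess.run(["sbatch", "run_batch.sbatch"], check=True)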