# ask-community
j
Hi all! I'm pretty new to distributed computing and have never used Dask so please excuse me if this is a simple question. Is there a way to run Prefect (Prefect 1, not the new Orion) on a SLURM cluster where each task is submitted as a separate job? Is this something that could be achieved via the Dask executor or is this different functionality? I'm pretty comfortable with Python and OOP so if there's a way to create a custom executor that could do this, I would appreciate guidance on that as well. Thank you!
k
Hey @John Jacoby, my very limited understanding is that it won't work out of the box, because SLURM is really a job queue, which is different from a Dask cluster. Here is what we can try though (I'm very unsure this will work): you can create a DaskExecutor like the following:
Copy code
import os

from dask_kubernetes import make_pod_spec
from prefect.executors import DaskExecutor

# Ephemeral Dask cluster spun up on Kubernetes for the flow run
executor = DaskExecutor(
    cluster_class="dask_kubernetes.KubeCluster",
    cluster_kwargs={
        "pod_template": make_pod_spec(
            image=os.environ["AZURE_BAKERY_IMAGE"],
            labels={"flow": flow_name},
            memory_limit=None,
            memory_request=None,
        )
    },
    adapt_kwargs={"maximum": 10},  # adaptively scale up to 10 workers
)
So I am thinking the best bet is to use the dask_jobqueue.SLURMCluster in the same way with
Copy code
from prefect.executors import DaskExecutor

executor = DaskExecutor(
    cluster_class="dask_jobqueue.SLURMCluster",
    cluster_kwargs=...,  # whatever arguments your SLURMCluster needs
)
and then attach this to your flow and see if it works? I suspect it won't though, because the SLURMCluster signature is quite different from the other Dask cluster classes.
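For reference, attaching the executor to a flow in Prefect 1 would look roughly like this (just a sketch; the flow and task names are placeholders):
Copy code
from prefect import Flow, task
from prefect.executors import DaskExecutor

@task
def my_task():
    ...

with Flow("slurm-flow") as flow:
    my_task()

# The flow run will farm tasks out to the SLURM-backed Dask cluster
flow.executor = DaskExecutor(
    cluster_class="dask_jobqueue.SLURMCluster",
    cluster_kwargs=...,  # fill in your SLURMCluster arguments
)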
Actually maybe it will. The JobQueueCluster inherits from SpecCluster, which the KubeCluster does as well, so we'll see.
j
Hi @Kevin Kho, thanks for the response. I'll try this out. What is the 'pod_template' arg? Is that something I should include?
k
That is specifically something the KubeCluster takes. I put it in as an example to give an idea of how to pass the kwargs that your SLURMCluster would take. Do you spin the cluster up on the fly, or is it long-running and you want to connect to it by IP address?
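For illustration, the kwargs for a SLURMCluster would be something along these lines (untested, and the queue/cores/memory/walltime values are just placeholders for whatever your site uses):
Copy code
from prefect.executors import DaskExecutor

executor = DaskExecutor(
    cluster_class="dask_jobqueue.SLURMCluster",
    cluster_kwargs={
        "queue": "general",      # SLURM partition for the Dask worker jobs
        "cores": 4,              # cores per worker job
        "memory": "16GB",        # memory per worker job
        "walltime": "01:00:00",  # walltime per worker job
    },
    adapt_kwargs={"minimum": 1, "maximum": 10},  # scale worker jobs adaptively
)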
j
The cluster is always up and running. We connect to a login node through a password-protected ssh and submit jobs from there, either by hand or via a script. Not sure if that's exactly what you were asking, again this is my first rodeo with using an HPC cluster.
k
That might be harder. The syntax I gave you is for spinning up the cluster on the fly. It sounds like Prefect may need to run on that machine. You can connect to an existing cluster using its address like:
Copy code
executor = DaskExecutor(address="192.0.2.255:8786")
but I guess not if it's password protected. How do you connect to the cluster when you aren't using Prefect? Just plain Python? I think if you can give me that, we can find a way to make this work. I'm not familiar with SLURM.
j
Our sysadmins are fine with us running long-running scripts on the login node to do things like monitor the status of long-running jobs. The cluster and our personal workstations both have access to the same shared disk space. So without a workflow engine I would, for instance, put a bunch of job submission commands in a loop and run them with Python's subprocess module, then run that Python script from the login node.
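Roughly something like this (simplified; the script name and inputs are made up):
Copy code
import subprocess

# Submit one sbatch job per input from the login node
subjects = ["sub-01", "sub-02", "sub-03"]
for subject in subjects:
    subprocess.run(["sbatch", "run_analysis.sh", subject], check=True)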
k
Gotcha. In that case you may not need SLURM as the executor at all; maybe you can just use Prefect tasks to submit jobs to the SLURM cluster, know what I mean?
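Something like this, maybe (just a sketch; the batch script names are placeholders):
Copy code
import subprocess

from prefect import Flow, task

@task
def submit_job(script: str) -> str:
    # Submit a batch script to SLURM and return sbatch's confirmation line
    result = subprocess.run(
        ["sbatch", script], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()

with Flow("submit-slurm-jobs") as flow:
    submit_job.map(["job_a.sh", "job_b.sh"])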
j
Right, I forgot to mention that the motivation for this question is that I want to make use of Prefect's ability to run tasks in parallel, but threading is disabled on cluster machines (apparently this is HPC best practice, according to our sysadmins). So submitting a job is blocked until the job before it is complete, at which point it's just the same as running locally. I could use multiprocessing, but then the number of concurrent jobs is limited to the number of processors I use, which really shouldn't be a lot on a login node that's not meant for analysis. I guess now that I type this out it sort of sounds impossible. I was hoping there would be a way to use the login node as a coordinator for the different jobs.
Like I said, I'm new to distributed computing. I couldn't tell if I'm in a strange niche situation or if there was an obvious answer that I just didn't know about. It's sounding more like the former though.
z
You can run more processes than the number of processors, so I'm not sure that should limit you. With the LocalDaskExecutor, something like
Copy code
scheduler="processes", num_workers=1000
or similar might unblock you?
I've also never heard of disabling OS threads on an HPC cluster. I've heard of disabling hyper-threads (in which your number of CPUs appears to be doubled because they each have two execution threads), but you should be able to use OS threads still.
j
Oh that's interesting. I'm using logging to monitor how my tasks are running. When I used the LocalDaskExecutor on my machine, it was easy to see a group of tasks start together, run for a while, and then finish together. But running the same code on the cluster, I saw them running sequentially. If I could figure out threading, that would render the rest of this unnecessary.
I'll try the processes with more workers too.
z
(It might be bold to start 1000 process workers; it could be slow to create that pool unless they're created on demand.)
Perhaps we failed to infer the number of processors correctly. You may want to specify more workers while still using scheduler="threads" and see if that clears up your issue.
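i.e. something like this (the worker count is just an example number):
Copy code
from prefect.executors import LocalDaskExecutor

flow.executor = LocalDaskExecutor(scheduler="threads", num_workers=8)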
j
Sorry for the late reply, but just so you know, explicitly setting the number of workers worked like a charm. Thank you!