# prefect-community
j
Hi, I've been trying to scale out my flow with a Dask Executor object. I worked through some initial pickling errors, but now I'm getting an error with no explanation. Is there a way to see where my flow is failing and get some more diagnostic info? Here's all that I see:
```
[2020-06-23 16:56:03] INFO - prefect.FlowRunner | Beginning Flow run for 'run_model'
[2020-06-23 16:56:03] INFO - prefect.FlowRunner | Starting flow run.
distributed.scheduler - INFO - Receive client connection: Client-6c8e597e-b572-11ea-ace9-ac1f6b4e4528
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Remove client Client-6c8e597e-b572-11ea-ace9-ac1f6b4e4528
distributed.scheduler - INFO - Remove client Client-6c8e597e-b572-11ea-ace9-ac1f6b4e4528
distributed.scheduler - INFO - Close client connection: Client-6c8e597e-b572-11ea-ace9-ac1f6b4e4528
[2020-06-23 16:56:05] INFO - prefect.FlowRunner | Flow run FAILED: some reference tasks failed.
```
It's hard to know where to start with only `Flow run FAILED: some reference tasks failed.`
z
Hi @jeff sadler! Are you using server or cloud? If so, I'd expect to see some state history to indicate where your flow is failing.
j
Hi @Zachary Hughes - I am using Prefect on an on-prem HPC. Does that answer your question?
So I don't think I'm using "server." I know I'm not using "cloud"
j
By server/cloud we mean have you registered flows somewhere, or are you calling `flow.run()`? It sounds like you're doing the latter.
z
It does! My goal was to figure out if you were using any sort of orchestration layer. 🙂
j
What is your `DaskExecutor` configuration?
j
I am calling `flow.run()`
j
Can you share the kwargs (if any) you used to create a `DaskExecutor`?
j
Not sure what you mean by "`DaskExecutor` configuration". I just give it a tcp address of the scheduler
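For reference, a minimal sketch of that kind of setup, assuming the Prefect 0.12-era import path and a placeholder scheduler address:

```python
from prefect import Flow, task
from prefect.engine.executors import DaskExecutor

# Placeholder address: substitute the tcp address of your running dask scheduler.
executor = DaskExecutor(address="tcp://10.0.0.1:8786")

@task
def train():
    return "trained"

with Flow("run_model") as flow:
    train()

# Run the flow on the existing dask cluster instead of the default local executor.
state = flow.run(executor=executor)
```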
j
That's what I meant! How did you start that dask cluster?
j
hm. good question 😁
we are approaching the borders of my understanding
j
That's fine! I'm mostly trying to help you get logs here - the error that's occurring was likely logged to the dask workers.
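If the scheduler is still reachable, one way to pull those worker logs back to the client is `Client.get_worker_logs()`. A sketch, assuming the same placeholder scheduler address as above:

```python
from dask.distributed import Client

# Placeholder address: use the same tcp address you pass to DaskExecutor.
client = Client("tcp://10.0.0.1:8786")

# Returns a dict mapping worker address -> list of (level, message) log records.
logs = client.get_worker_logs()
for worker, records in logs.items():
    print(worker)
    for level, message in records:
        print(f"  [{level}] {message}")
```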
j
lemme see if I can find the logs
To start the cluster, I ask the HPC for 5 nodes (1 scheduler, 1 interactive, and 3 workers), then I ask it to start the nodes in a singularity container. There is a `dask-worker-space` directory that gets created, where I see what looks like a subdirectory for each of my workers. Each of those has a `storage` directory, but those are all empty. Does that help at all?
j
So you're manually creating the cluster with custom batch scripts?
You might be interested in jobqueue.dask.org, which is a project for doing this for you in a consistent, simple way.
(provided your cluster type is one of the supported ones).
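As an illustration, a minimal dask-jobqueue sketch, assuming a SLURM scheduler; the queue name and resource values below are placeholders:

```python
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# Placeholder resources: match these to your HPC queue and node sizes.
cluster = SLURMCluster(cores=8, memory="16GB", queue="regular", walltime="02:00:00")
cluster.scale(jobs=3)  # submit 3 worker jobs to the batch system

client = Client(cluster)
# This scheduler address is what you'd hand to DaskExecutor.
print(cluster.scheduler_address)
```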
j
From what I understand, our HPC ppl created the singularity container which, when it starts, creates the cluster based on the allocated resources.
I'm not sure what the script on the container is using to do that, but sounds like that's where jobqueue would fit in
thanks for sharing that
I will ping our HPC people to see if they can help me track down the worker log files
j
Ah gotcha. dask-jobqueue wants to submit jobs for you programmatically as part of your client process. It provides a convenient Python API for setting things up, so you don't have to write and submit batch scripts yourself.
My guess here is that the issue is an environment setup issue, but I'd hope we'd provide better logs to the local client in this case.
One other option: you do get an output object from `flow.run()` (the state object), which has a `.result` attribute. If you iterate through that, you may be able to inspect the errors on the failed tasks to learn what went wrong.
j
dask-jobqueue seems really powerful. I will need to look into that more. Before trying Prefect, I was using Snakemake to submit my batch scripts. Seems like dask-jobqueue fills that role.
Do dask-jobqueue and Prefect work together?
And thanks for the tip about `result`. I'll look into that
Looking at the `result` gave me what I needed. Thanks! I don't know if it's true for every case, but it could be useful to print out the individual task error beyond just the flow error. That's what helped me get back on my way.
j
That's a good point, we may want to add that. Thanks! Glad you got things working :)