# prefect-community
j
Hi, I've been trying to scale out my flow with a Dask Executor object. I worked through some initial pickling errors, but now I'm getting an error with no explanation. Is there a way to see where my flow is failing and get some more diagnostic info? Here's all that I see:
```
[2020-06-23 16:56:03] INFO - prefect.FlowRunner | Beginning Flow run for 'run_model'
[2020-06-23 16:56:03] INFO - prefect.FlowRunner | Starting flow run.
distributed.scheduler - INFO - Receive client connection: Client-6c8e597e-b572-11ea-ace9-ac1f6b4e4528
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Remove client Client-6c8e597e-b572-11ea-ace9-ac1f6b4e4528
distributed.scheduler - INFO - Remove client Client-6c8e597e-b572-11ea-ace9-ac1f6b4e4528
distributed.scheduler - INFO - Close client connection: Client-6c8e597e-b572-11ea-ace9-ac1f6b4e4528
[2020-06-23 16:56:05] INFO - prefect.FlowRunner | Flow run FAILED: some reference tasks failed.
```
It's hard to know where to start with only `Flow run FAILED: some reference tasks failed.`
z
Hi @jeff sadler! Are you using server or cloud? If so, I'd expect to see some state history to indicate where your flow is failing.
j
Hi @Zachary Hughes - I am using Prefect on an on-prem HPC. Does that answer your question?
So I don't think I'm using "server." I know I'm not using "cloud"
j
By server/cloud we mean have you registered flows somewhere, or are you calling `flow.run()`? It sounds like you're doing the latter.
z
It does! My goal was to figure out if you were using any sort of orchestration layer. 🙂
j
What is your `DaskExecutor` configuration?
j
I am calling `flow.run()`
j
Can you share the kwargs (if any) you used to create a `DaskExecutor`?
j
Not sure what you mean by "`DaskExecutor` configuration". I just give it a tcp address of the scheduler
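For reference, a minimal sketch of that kind of setup, assuming the Prefect 0.12-era import path and a placeholder scheduler address:

```python
from prefect import Flow, task
from prefect.engine.executors import DaskExecutor

# Placeholder address: substitute the tcp address of your running dask scheduler.
executor = DaskExecutor(address="tcp://10.0.0.1:8786")

@task
def train():
    return "trained"

with Flow("run_model") as flow:
    train()

# Run the flow on the existing dask cluster instead of the default local executor.
state = flow.run(executor=executor)
```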
j
That's what I meant! How did you start that dask cluster?
j
hm. good question 😁
we are approaching the borders of my understanding
j
That's fine! I'm mostly trying to help you get logs here - the error that's occurring was likely logged to the dask workers.
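If the scheduler is still reachable, one way to pull those worker logs back to the client is `Client.get_worker_logs()`. A sketch, assuming the same placeholder scheduler address as above:

```python
from dask.distributed import Client

# Placeholder address: use the same tcp address you pass to DaskExecutor.
client = Client("tcp://10.0.0.1:8786")

# Returns a dict mapping worker address -> list of (level, message) log records.
logs = client.get_worker_logs()
for worker, records in logs.items():
    print(worker)
    for level, message in records:
        print(f"  [{level}] {message}")
```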
j
lemme see if I can find the logs
To start the cluster, I ask the HPC for 5 nodes (1 scheduler, 1 interactive, and 3 workers), then I ask it to start the nodes in a singularity container. There is a `dask-worker-space` directory that gets created, where I see what looks like a subdirectory for each of my workers. Each of those has a `storage` directory, but those are all empty. Does that help at all?
j
So you're manually creating the cluster with custom batch scripts?
You might be interested in jobqueue.dask.org, which is a project for doing this for you in a consistent, simple way.
(provided your cluster type is one of the supported ones).
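As an illustration, a minimal dask-jobqueue sketch, assuming a SLURM scheduler; the queue name and resource values below are placeholders:

```python
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# Placeholder resources: match these to your HPC queue and node sizes.
cluster = SLURMCluster(cores=8, memory="16GB", queue="regular", walltime="02:00:00")
cluster.scale(jobs=3)  # submit 3 worker jobs to the batch system

client = Client(cluster)
# This scheduler address is what you'd hand to DaskExecutor.
print(cluster.scheduler_address)
```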
j
From what I understand, our HPC ppl created the singularity container which, when it starts, creates the cluster based on the allocated resources.
I'm not sure what the script on the container is using to do that, but sounds like that's where jobqueue would fit in
thanks for sharing that
I will ping our HPC people to see if they can help me track down the worker log files
j
Ah gotcha. dask-jobqueue wants to submit jobs for you programmatically as part of your client process. It provides a convenient Python API for setting things up, so you don't have to write and submit batch scripts yourself.
My guess here is that the issue is an environment setup issue, but I'd hope we'd provide better logs to the local client in this case.
One other option: you do get an output object from `flow.run()` (the state object), which has a `.result` attribute. If you iterate through that, you may be able to inspect the errors on the failed tasks to learn what went wrong.
j
dask-jobqueue seems really powerful. I will need to look into that more. Before trying Prefect, I was using Snakemake to submit my batch scripts. Seems like dask-jobqueue fills that role.
Do dask-jobqueue and Prefect work together?
And thanks for the tip about `result`. I'll look into that
Looking at the `result` gave me what I needed. Thanks! I don't know if it's true for every case, but it could be useful to print out the individual task error beyond just the flow error. That's what helped me get back on my way.
j
That's a good point, we may want to add that. Thanks! Glad you got things working :)