Thread
#prefect-community
    jeff sadler

    2 years ago
    Hi, I've been trying to scale out my flow with a DaskExecutor object. I worked through some initial pickling errors, but now I'm getting an error with no explanation. Is there a way to see where my flow is failing and get some more diagnostic info? Here's all that I see:
    [2020-06-23 16:56:03] INFO - prefect.FlowRunner | Beginning Flow run for 'run_model'
    [2020-06-23 16:56:03] INFO - prefect.FlowRunner | Starting flow run.
    distributed.scheduler - INFO - Receive client connection: Client-6c8e597e-b572-11ea-ace9-ac1f6b4e4528
    distributed.core - INFO - Starting established connection
    distributed.scheduler - INFO - Remove client Client-6c8e597e-b572-11ea-ace9-ac1f6b4e4528
    distributed.scheduler - INFO - Remove client Client-6c8e597e-b572-11ea-ace9-ac1f6b4e4528
    distributed.scheduler - INFO - Close client connection: Client-6c8e597e-b572-11ea-ace9-ac1f6b4e4528
    [2020-06-23 16:56:05] INFO - prefect.FlowRunner | Flow run FAILED: some reference tasks failed.
    It's hard to know where to start with only
    Flow run FAILED: some reference tasks failed.
    Zachary Hughes

    2 years ago
    Hi @jeff sadler! Are you using server or cloud? If so, I'd expect to see some state history to indicate where your flow is failing.
    jeff sadler

    2 years ago
    Hi @Zachary Hughes - I am using prefect on an on-prem HPC. Does that answer your question?
    So I don't think I'm using "server." I know I'm not using "cloud"
    Jim Crist-Harif

    2 years ago
    By server/cloud we mean have you registered flows somewhere, or are you calling
    flow.run()
    . It sounds like you're doing the latter.
    Zachary Hughes

    2 years ago
    It does! My goal was to figure out if you were using any sort of orchestration layer. 🙂
    Jim Crist-Harif

    2 years ago
    What is your
    DaskExecutor
    configuration?
    jeff sadler

    2 years ago
    I am calling
    flow.run()
    Jim Crist-Harif

    2 years ago
    Can you share the kwargs (if any) you used to create a
    DaskExecutor
    ?
    jeff sadler

    2 years ago
    Not sure what you mean by "DaskExecutor configuration". I just give it a tcp address of the scheduler
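    For context, wiring a scheduler address into an executor looks roughly like this. This is a sketch against the Prefect 0.x API that was current at the time of this thread; the flow, task, and scheduler address are placeholders, not jeff's actual setup:

    ```python
    # Sketch (Prefect 0.x API): point a DaskExecutor at an already-running
    # Dask scheduler. The task and tcp address below are placeholders.
    from prefect import Flow, task
    from prefect.engine.executors import DaskExecutor

    @task
    def train():
        return "trained"

    with Flow("run_model") as flow:
        train()

    # The executor connects to the external scheduler at run time.
    executor = DaskExecutor(address="tcp://scheduler-host:8786")
    state = flow.run(executor=executor)
    ```

    With only an address supplied, Prefect delegates all task scheduling to that external cluster, so task-level errors surface in the Dask worker logs rather than locally.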
    Jim Crist-Harif

    2 years ago
    That's what I meant! How did you start that dask cluster?
    jeff sadler

    2 years ago
    hm. good question 😁
    we are approaching the borders of my understanding
    Jim Crist-Harif

    2 years ago
    That's fine! I'm mostly trying to help get logs here - the error that's occurring here likely was logged to the dask workers.
    jeff sadler

    2 years ago
    lemme see if I can find the logs
    To start the cluster, I ask the HPC for 5 nodes (1 scheduler, 1 interactive, and 3 workers) then I ask it to start the nodes in a singularity container. There is a
    dask-worker-space
    directory that gets created where I see what looks like a subdirectory for each of my workers. Each of those has a
    storage
    directory. But those are all empty. Does that help at all?
    Jim Crist-Harif

    2 years ago
    So you're manually creating the cluster with custom batch scripts?
    You might be interested in jobqueue.dask.org, a project that does this for you in a consistent, simple way (provided your cluster type is one of the supported ones).
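    As a sketch, dask-jobqueue usage on a SLURM-based HPC looks like this (PBS, SGE, LSF, and others have analogous cluster classes); the queue name and resource values are placeholders:

    ```python
    # Sketch: dask-jobqueue builds and submits the batch scripts itself.
    # Queue name and resources below are placeholders.
    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    cluster = SLURMCluster(
        queue="normal",        # placeholder partition name
        cores=8,
        memory="16GB",
        walltime="01:00:00",
    )
    cluster.scale(jobs=3)      # submits 3 worker jobs to the batch system
    client = Client(cluster)   # scheduler address: cluster.scheduler_address
    ```

    The resulting scheduler address could then be handed to a DaskExecutor the same way as a manually started cluster.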
    jeff sadler

    2 years ago
    From what I understand, our HPC ppl created the singularity container which, when it starts, creates the cluster based on the allocated resources.
    I'm not sure what the script on the container is using to do that, but sounds like that's where jobqueue would fit in
    thanks for sharing that
    I will ping our HPC people to see if they can help me track down the worker log files
    Jim Crist-Harif

    2 years ago
    Ah gotcha. dask-jobqueue wants to submit jobs for you programmatically as part of your client process. It provides a convenient Python api for setting things up, so you don't have to write and submit batch scripts yourself.
    My guess here is that the issue is an environment setup issue, but I'd hope we'd provide better logs to the local client in this case.
    One other option, you do get an output object from
    flow.run()
    (the state object), which has a
    .result
    attribute. If you iterate through that you may be able to inspect the errors on the failed tasks to learn what went wrong.
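    Iterating that state object looks roughly like this. A sketch against the Prefect 0.x API, with a deliberately failing placeholder task standing in for whatever raised inside the real flow:

    ```python
    # Sketch (Prefect 0.x): flow.run() returns a State whose .result is a
    # dict mapping each task to its final State. The failing task here is
    # a stand-in so the inspection pattern can be shown end to end.
    from prefect import Flow, task

    @task
    def boom():
        raise ValueError("bad input")

    with Flow("demo") as flow:
        boom()

    state = flow.run()  # the default local executor suffices for the demo
    if state.is_failed():
        for t, task_state in state.result.items():
            if task_state.is_failed():
                print(t.name, "->", task_state.message)
                print("  underlying exception:", repr(task_state.result))
    ```

    For a failed task, the task state's .result holds the underlying exception and .message its summary, which is often enough to pinpoint the bug without the worker logs.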
    jeff sadler

    2 years ago
    dask-jobqueue seems really powerful. I will need to look into that more. Before trying Prefect, I was using Snakemake to submit my batch scripts. Seems like dask-jobqueue fills that role.
    Do dask-jobqueue and Prefect work together?
    And thanks for the tip about
    result
    I'll look into that
    looking at the
    result
    gave me what I needed. Thanks! Idk if it's true for every case, but it could be useful to print out the individual task error beyond just the flow error. That's what helped me get back on my way.
    Jim Crist-Harif

    2 years ago
    That's a good point, we may want to add that. Thanks! Glad you got things working 😃