    Matt Alhonte

    4 months ago
    Hrm, looks like the Client is dying shortly after launching the Cluster? It runs a few tasks, then right before mapping a big-ish one, the Client dies (the CloudWatch logs just say `Killed`) and the mapping stays in a `mapped` state
    Anna Geller

    4 months ago
    Can you share a bit more information about your use case?
    1. Are you on Prefect Cloud or Server?
    2. Can you share the output of `prefect diagnostics`?
    3. I assume you mean a Dask Cloud Provider Fargate cluster, given you mentioned CloudWatch?
    4. Why are you using Client directly - do you use it to make some API calls to the Prefect backend before doing mapping? Can you share some flow code?
    Matt Alhonte

    4 months ago
    By client I just mean the container that launches the Dask cluster. As in like, what gets launched by `ECSRun`. I'm on Cloud. `prefect diagnostics` output:
    {
      "config_overrides": {},
      "env_vars": [],
      "system_information": {
        "platform": "Linux-4.14.193-149.317.amzn2.x86_64-x86_64-with-glibc2.10",
        "prefect_backend": "cloud",
        "prefect_version": "1.2.0",
        "python_version": "3.8.8"
      }
    }
    Anna Geller

    4 months ago
    Can you share the flow code that gives you the behavior that doesn't match your expectations? When an ECS task ends with a red message saying something about exiting and all containers having finished, it may look like an error message, but it's just ECS's way of saying that all the work finished.
    Matt Alhonte

    4 months ago
    @Anna Geller It doesn't fail, it just stays in a `mapped` or `pending` state. And I see in the logs for the Scheduler that it says
    distributed.scheduler - INFO - Close client connection: Client-67d0a242-cb2b-11ec-8013-06a62f0d58e7
    Anna Geller

    4 months ago
    This looks like a Dask client log. Which logs do you see in the Prefect Cloud UI? Could you share the flow run ID?
    Kevin Kho

    4 months ago
    How big is big-ish? It looks like the scheduler is dying? Do you have access to the Dask dashboard?
    Matt Alhonte

    4 months ago
    ecd087e5-821c-4f0c-a5b7-b6ea708acfbb
    Not sure how big! Just canceled my latest try, but I'll look at the Dask Dashboard when I try my next one (I think I should have access)
    Anna Geller

    4 months ago
    My understanding is that the Dask cluster couldn't be created due to some misconfiguration (perhaps missing IAM roles?), and therefore task runs couldn't be submitted to the Dask executor. Because the flow run had no submitted or running task runs as a result of this cluster-creation issue, Lazarus rescheduled the flow run and resubmitted everything for execution.
    Matt Alhonte

    4 months ago
    Interesting!
    It worked okay on a single node with `LocalDaskExecutor`?
    Anna Geller

    4 months ago
    Is it strictly necessary for you to use Dask Cloud Provider? Perhaps you could try running your flow with LocalDaskExecutor first? Otherwise, can you cross-check the Dask cluster config and share it with us if you can't find the issue in the cluster configuration on your own?
    Yup, exactly
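    A minimal sketch of what that executor swap could look like on Prefect 1.x. The flow body and the FargateCluster kwargs are placeholders, since the actual flow code and cluster config aren't shown in this thread:

    from prefect import Flow, task
    from prefect.executors import DaskExecutor, LocalDaskExecutor

    @task
    def transform(x):
        return x * 2

    with Flow("executor-test") as flow:
        transform.map(list(range(10)))

    # 1) Single-node sanity check: no distributed scheduler involved.
    flow.run(executor=LocalDaskExecutor(scheduler="threads"))

    # 2) The Dask Cloud Provider setup under test - attach it before
    #    registering; the cluster_kwargs below are placeholders, not the
    #    actual configuration from this thread.
    flow.executor = DaskExecutor(
        cluster_class="dask_cloudprovider.aws.FargateCluster",
        cluster_kwargs={"image": "my-image:latest", "n_workers": 4},
    )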
    Kevin Kho

    4 months ago
    LocalDaskExecutor is dask alone, though, and DaskExecutor specifically uses `distributed`, so it's not a good test
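    A middle ground, sketched here as an assumption rather than anything from the thread, is DaskExecutor with no arguments: it creates a temporary local distributed.LocalCluster, so the distributed scheduler is still exercised without any AWS networking or IAM involved.

    from prefect import Flow, task
    from prefect.executors import DaskExecutor

    @task
    def transform(x):
        return x * 2

    with Flow("local-distributed-test") as flow:
        transform.map(list(range(10)))

    # No address or cluster_class given: DaskExecutor spins up a temporary
    # local distributed.LocalCluster, so task submission and argument
    # serialization go through the distributed scheduler - unlike
    # LocalDaskExecutor, which uses dask's plain threaded/multiprocessing
    # schedulers.
    flow.run(executor=DaskExecutor())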
    Matt Alhonte

    4 months ago
    Huzzah! So, there was a 20 MB NumPy array that I thought would be fine to pass to the tasks, but I just spilled it to a `.npz` file and now it all works fine!
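    A hedged sketch of what that ".npz spill" pattern can look like: write the array out once and map over only a path, so the array itself never travels through the scheduler with every mapped task. The local /tmp path, array shape, and task body are illustrative; on a Fargate cluster the file would need to live somewhere all workers can read, such as S3.

    import numpy as np
    from prefect import Flow, task, unmapped

    # Write the ~20 MB array out once; on Fargate this would need to be
    # shared storage (e.g. S3) rather than a local path.
    big_array = np.random.rand(2500, 1000)  # ~20 MB of float64
    np.savez("/tmp/big_array.npz", data=big_array)

    @task
    def process(row, array_path):
        arr = np.load(array_path)["data"]  # each task loads the array itself
        return float(arr[row].sum())

    with Flow("npz-spill") as flow:
        process.map(list(range(10)), array_path=unmapped("/tmp/big_array.npz"))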
    Kevin Kho

    4 months ago
    20 MB using a scatter or future?
    Matt Alhonte

    4 months ago
    passed as part of a dictionary with some other stuff
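    For reference, the "scatter" alternative Kevin mentions looks roughly like this in plain dask.distributed (shown outside Prefect for clarity; the local Client, array, and functions are all illustrative): the array is shipped to the workers once and only the resulting future is passed around, instead of packing the array into each task's argument dictionary.

    import numpy as np
    from dask.distributed import Client

    client = Client()  # local cluster, purely for illustration

    big_array = np.random.rand(2500, 1000)  # ~20 MB
    # Ship the array to the workers once; tasks receive a lightweight future
    # that the scheduler resolves worker-side, rather than re-serializing
    # the array inside every task's arguments.
    array_future = client.scatter(big_array, broadcast=True)

    def row_sum(arr, i):
        return float(arr[i].sum())

    futures = [client.submit(row_sum, array_future, i) for i in range(10)]
    print(client.gather(futures))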