    Matt Alhonte

    4 months ago
    Hrm, looks like the Client is dying shortly after launching the Cluster? It runs a few tasks, then right before mapping a big-ish one, the Client dies (the CloudWatch logs just say `Killed`) and the mapping stays in a `mapped` state
    Anna Geller

    4 months ago
    Can you share a bit more information about your use case?
    1. Are you on Prefect Cloud or Server?
    2. Can you share the output of `prefect diagnostics`?
    3. I assume you mean a Dask Cloud Provider Fargate cluster, given you mentioned CloudWatch?
    4. Why are you using Client directly - do you use it to make some API calls to the Prefect backend before doing mapping? Can you share some flow code?
    Matt Alhonte

    4 months ago
    By client I just mean the container that launches the Dask cluster. As in like, what gets launched by `ECSRun`. I'm on Cloud. `prefect diagnostics` output:
    {
      "config_overrides": {},
      "env_vars": [],
      "system_information": {
        "platform": "Linux-4.14.193-149.317.amzn2.x86_64-x86_64-with-glibc2.10",
        "prefect_backend": "cloud",
        "prefect_version": "1.2.0",
        "python_version": "3.8.8"
      }
    }
    Anna Geller

    4 months ago
    Can you share the flow code that gives you the behavior that doesn't match your expectations? When an ECS task ends with a red message saying something about exiting and all containers having finished, it may look like an error message, but it's just ECS's way of saying that all the work finished.
    Matt Alhonte

    4 months ago
    @Anna Geller It doesn't fail, it just stays in a `mapped` or `pending` state. And I see in the logs for the Scheduler that it says
    distributed.scheduler - INFO - Close client connection: Client-67d0a242-cb2b-11ec-8013-06a62f0d58e7
    Anna Geller

    4 months ago
    This looks like a Dask client log. Which logs do you see in the Prefect Cloud UI? Could you share the flow run ID?
    Kevin Kho

    4 months ago
    How big is big-ish? It looks like the scheduler is dying? Do you have access to the Dask dashboard?
    Matt Alhonte

    4 months ago
    ecd087e5-821c-4f0c-a5b7-b6ea708acfbb
    Not sure how big! Just canceled my latest try, but I'll look at the Dask Dashboard when I try my next one (I think I should have access)
    Anna Geller

    4 months ago
    My understanding is that the Dask cluster couldn't be created due to some misconfiguration (perhaps missing IAM roles?), and therefore task runs couldn't be submitted to the Dask executor. Because the flow run had no submitted or running task runs as a result of this cluster-creation issue, Lazarus rescheduled the flow run and resubmitted everything for execution.
    Matt Alhonte

    4 months ago
    Interesting!
    It worked okay on a single node with `LocalDaskExecutor`?
    Anna Geller

    4 months ago
    Is it strictly necessary for you to use Dask Cloud Provider? Perhaps you could try running your flow with LocalDaskExecutor first? Otherwise, can you cross-check the Dask cluster config and share it with us if you can't find the issue in the cluster configuration on your own?
    Yup, exactly
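    A minimal sketch of what that executor swap could look like on Prefect 1.x. The flow body and the FargateCluster kwargs are placeholders, since the actual flow code and cluster config aren't shown in this thread:

    from prefect import Flow, task
    from prefect.executors import DaskExecutor, LocalDaskExecutor

    @task
    def transform(x):
        return x * 2

    with Flow("executor-test") as flow:
        transform.map(list(range(10)))

    # 1) Single-node sanity check: no distributed scheduler involved.
    flow.run(executor=LocalDaskExecutor(scheduler="threads"))

    # 2) The Dask Cloud Provider setup under test - attach it before
    #    registering; the cluster_kwargs below are placeholders, not the
    #    actual configuration from this thread.
    flow.executor = DaskExecutor(
        cluster_class="dask_cloudprovider.aws.FargateCluster",
        cluster_kwargs={"image": "my-image:latest", "n_workers": 4},
    )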
    Kevin Kho

    4 months ago
    LocalDaskExecutor is dask alone, though, and DaskExecutor specifically uses `distributed`, so it's not a good test
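    A middle ground, sketched here as an assumption rather than anything from the thread, is DaskExecutor with no arguments: it creates a temporary local distributed.LocalCluster, so the distributed scheduler is still exercised without any AWS networking or IAM involved.

    from prefect import Flow, task
    from prefect.executors import DaskExecutor

    @task
    def transform(x):
        return x * 2

    with Flow("local-distributed-test") as flow:
        transform.map(list(range(10)))

    # No address or cluster_class given: DaskExecutor spins up a temporary
    # local distributed.LocalCluster, so task submission and argument
    # serialization go through the distributed scheduler - unlike
    # LocalDaskExecutor, which uses dask's plain threaded/multiprocessing
    # schedulers.
    flow.run(executor=DaskExecutor())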
    Matt Alhonte

    4 months ago
    Huzzah! So, there was a 20 MB NumPy array that I thought would be fine to pass to the tasks, but I just spilled it to a `.npz` file and now it all works fine!
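    A hedged sketch of what that ".npz spill" pattern can look like: write the array out once and map over only a path, so the array itself never travels through the scheduler with every mapped task. The local /tmp path, array shape, and task body are illustrative; on a Fargate cluster the file would need to live somewhere all workers can read, such as S3.

    import numpy as np
    from prefect import Flow, task, unmapped

    # Write the ~20 MB array out once; on Fargate this would need to be
    # shared storage (e.g. S3) rather than a local path.
    big_array = np.random.rand(2500, 1000)  # ~20 MB of float64
    np.savez("/tmp/big_array.npz", data=big_array)

    @task
    def process(row, array_path):
        arr = np.load(array_path)["data"]  # each task loads the array itself
        return float(arr[row].sum())

    with Flow("npz-spill") as flow:
        process.map(list(range(10)), array_path=unmapped("/tmp/big_array.npz"))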
    Kevin Kho

    4 months ago
    20 MB using a scatter or future?
    Matt Alhonte

    4 months ago
    passed as part of a dictionary with some other stuff
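    For reference, the "scatter" alternative Kevin mentions looks roughly like this in plain dask.distributed (shown outside Prefect for clarity; the local Client, array, and functions are all illustrative): the array is shipped to the workers once and only the resulting future is passed around, instead of packing the array into each task's argument dictionary.

    import numpy as np
    from dask.distributed import Client

    client = Client()  # local cluster, purely for illustration

    big_array = np.random.rand(2500, 1000)  # ~20 MB
    # Ship the array to the workers once; tasks receive a lightweight future
    # that the scheduler resolves worker-side, rather than re-serializing
    # the array inside every task's arguments.
    array_future = client.scatter(big_array, broadcast=True)

    def row_sum(arr, i):
        return float(arr[i].sum())

    futures = [client.submit(row_sum, array_future, i) for i in range(10)]
    print(client.gather(futures))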