Hrm, looks like the Client is dying shortly after ...
# prefect-community
m
Hrm, looks like the Client is dying shortly after launching the Cluster? It runs a few tasks, then right before mapping a big-ish one, the Client dies (CloudWatch logs just say `Killed`) and the mapping stays in a `mapped` state
a
Can you share a bit more information about your use case?
1. Are you on Prefect Cloud or Server?
2. Can you share the output of `prefect diagnostics`?
3. I assume you mean a Dask Cloud Provider Fargate cluster, given you mentioned CloudWatch?
4. Why are you using the Client directly - do you use it to make some API calls to the Prefect backend before doing the mapping? Can you share some flow code?
m
By client I just mean the container that launches the Dask cluster. As in, what gets launched by `ECSRun`. I'm on Cloud.
`prefect diagnostics` output:
{
  "config_overrides": {},
  "env_vars": [],
  "system_information": {
    "platform": "Linux-4.14.193-149.317.amzn2.x86_64-x86_64-with-glibc2.10",
    "prefect_backend": "cloud",
    "prefect_version": "1.2.0",
    "python_version": "3.8.8"
  }
}
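For context, a minimal sketch of the kind of setup being described here - an `ECSRun`-launched flow that spins up a Fargate Dask cluster. This assumes Prefect 1.x with dask-cloudprovider; the flow name, image name, and cluster kwargs are illustrative, not taken from the thread:
```python
from prefect import Flow, task
from prefect.run_configs import ECSRun
from prefect.executors import DaskExecutor


@task
def inc(x):
    return x + 1


with Flow("ecs-dask-flow") as flow:  # illustrative flow name
    inc.map(list(range(10)))

# The "client" container here is the ECS task started by ECSRun;
# it in turn creates the Fargate Dask cluster via dask-cloudprovider.
flow.run_config = ECSRun(image="my-registry/my-flow:latest")  # illustrative image
flow.executor = DaskExecutor(
    cluster_class="dask_cloudprovider.aws.FargateCluster",
    cluster_kwargs={"image": "my-registry/my-flow:latest", "n_workers": 4},
)
```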
a
Can you share the flow code that gives you the behavior that doesn't match your expectations? When an ECS task ends with a red message that says something about exiting because all containers finished, it may look like an error message, but it's just ECS's way of saying that all work finished.
m
@Anna Geller It doesn't fail, it just stays in a `mapped` or `pending` state. And I see in the logs for the Scheduler that it says
`distributed.scheduler - INFO - Close client connection: Client-67d0a242-cb2b-11ec-8013-06a62f0d58e7`
a
This looks like a Dask client log. Which logs do you see in the Prefect Cloud UI? Could you share the flow run ID?
k
How big is big-ish? It looks like the scheduler is dying. Do you have access to the Dask dashboard?
m
ecd087e5-821c-4f0c-a5b7-b6ea708acfbb
Not sure how big! Just canceled my latest try, but I'll look at the Dask Dashboard when I try my next one (I think I should have access)
a
My understanding is that the Dask cluster couldn't be created due to some misconfiguration (perhaps missing IAM roles?), and therefore task runs couldn't be submitted to the Dask executor. Given that the flow run had no submitted or running task runs due to this Dask-cluster-creation issue, Lazarus rescheduled the flow run and resubmitted everything for execution.
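If it were a cluster-creation problem, the place to look would be the cluster kwargs passed to the executor. A hedged sketch of where the IAM roles and networking would be specified (all values below are illustrative placeholders, not taken from this thread):
```python
from prefect.executors import DaskExecutor

executor = DaskExecutor(
    cluster_class="dask_cloudprovider.aws.FargateCluster",
    cluster_kwargs={
        "image": "my-registry/my-flow:latest",  # illustrative image
        "n_workers": 4,
        # IAM roles the scheduler/worker tasks run with; missing or
        # under-privileged roles are a common cause of cluster start failures
        "task_role_arn": "arn:aws:iam::123456789012:role/dask-task-role",
        "execution_role_arn": "arn:aws:iam::123456789012:role/dask-execution-role",
        # networking must allow client <-> scheduler <-> worker traffic
        "vpc": "vpc-xxxxxxxx",
        "subnets": ["subnet-xxxxxxxx"],
        "security_groups": ["sg-xxxxxxxx"],
    },
)
```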
m
Interesting!
It worked okay on a single node with `LocalDaskExecutor`?
a
Is it strictly necessary for you to use Dask Cloud Provider? Perhaps you could try running your flow first with `LocalDaskExecutor`? Otherwise, can you cross-check the Dask cluster config and share it with us if you can't find the issue in the cluster configuration on your own?
yup exactly
k
LocalDaskExecutor is dask alone though, and DaskExecutor specifically uses `distributed`, so it's not a good test
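A quick sketch of the two executor options being compared, assuming Prefect 1.x (flow name and kwargs are illustrative):
```python
from prefect import Flow, task
from prefect.executors import LocalDaskExecutor, DaskExecutor


@task
def inc(x):
    return x + 1


with Flow("executor-comparison") as flow:  # illustrative flow name
    inc.map(list(range(10)))

# Plain dask (threads/processes scheduler) - no distributed scheduler or workers
flow.executor = LocalDaskExecutor(scheduler="threads", num_workers=8)
flow.run()

# DaskExecutor with no address spins up a temporary local distributed cluster,
# exercising the same scheduler/worker machinery a FargateCluster would
flow.executor = DaskExecutor(cluster_kwargs={"n_workers": 2, "threads_per_worker": 2})
flow.run()
```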
m
Huzzah! So, there was a 20 MB NumPy array that I thought would be fine to pass to the tasks, but I just spilled it to a `.npz` file and now it all works fine!
🚀 1
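For reference, a rough sketch of that pattern - saving the array to disk and passing only the path to the mapped tasks. Paths, shapes, and names are illustrative; on a multi-node Fargate cluster the file would need to live on shared storage such as S3 rather than a local path:
```python
import numpy as np
from prefect import Flow, task, unmapped

# Roughly 20 MB of float64 data
big_array = np.random.rand(2500, 1000)

# Spill it to disk so only a small string travels through the task graph,
# instead of serializing the array into every mapped task's payload
array_path = "/tmp/big_array.npz"  # illustrative; use shared storage on a real cluster
np.savez(array_path, data=big_array)


@task
def process_row(i: int, path: str) -> float:
    data = np.load(path)["data"]
    return float(data[i].sum())


with Flow("spill-to-npz") as flow:  # illustrative flow name
    process_row.map(list(range(10)), path=unmapped(array_path))
```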
k
20 MB using a scatter or a future?
m
passed as part of a dictionary with some other stuff
๐Ÿ‘ 1
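For context on the scatter/future question, a rough sketch in plain dask.distributed of the difference (names and sizes are illustrative): passing the array inside a dict embeds the ~20 MB payload into every task's serialized arguments, while scattering sends it to the workers once so only a lightweight future travels through the graph.
```python
import numpy as np
from dask.distributed import Client

client = Client()  # local cluster for illustration
big_array = np.random.rand(2500, 1000)  # ~20 MB


def row_sum(i, arr):
    return float(arr[i].sum())


# Scatter once, then pass the resulting future to each task
array_future = client.scatter(big_array, broadcast=True)
futures = client.map(row_sum, range(10), arr=array_future)
print(client.gather(futures))
```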