# prefect-server
Ahmed Ezzat
Hi, I'm seeing unusual errors while using Prefect with a Dask cluster. Here is some more information: https://github.com/PrefectHQ/prefect/issues/5252
    import os

    import prefect
    from dask_kubernetes import KubeCluster, make_pod_spec

    # min_workers / max_workers are defined elsewhere in the flow module
    flow.executor = prefect.executors.DaskExecutor(
        cluster_class=lambda: KubeCluster(
            pod_template=make_pod_spec(
                memory_request="64M",
                memory_limit="4G",
                cpu_request="0.5",
                cpu_limit="8",
                threads_per_worker=24,
                image=prefect.context.image,
            ),
            deploy_mode="remote",
            idle_timeout="0",
            scheduler_service_wait_timeout="0",
            # merge the current environment with extra Dask settings (Python 3.9+ dict merge)
            env=dict(os.environ)
            | {
                "DASK_DISTRIBUTED__WORKER__MULTIPROCESSING_METHOD": "fork",
                "DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES": "100",
            },
        ),
        adapt_kwargs={"minimum": min_workers, "maximum": max_workers},
    )
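For context, a minimal sketch of how an executor configured this way might be attached to a flow and run under Prefect 1.x; the flow name and task below are placeholders, not taken from this thread:

    import prefect
    from prefect import Flow, task
    from prefect.executors import DaskExecutor

    @task
    def say_hello():
        # Prefect 1.x exposes a logger through the task run context
        logger = prefect.context.get("logger")
        logger.info("hello from a Dask worker")

    with Flow("dask-executor-demo") as flow:
        say_hello()

    # the executor is assigned after the flow is defined, as in the snippet above
    flow.executor = DaskExecutor()

    if __name__ == "__main__":
        flow.run()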
Anna Geller
@Ahmed Ezzat thanks for the detailed information in the issue. I don't see offhand why this may occur; I'll ask the team. Meanwhile, do you get the same error when using `LocalDaskExecutor`, or does it only happen with the distributed `DaskExecutor`?
Also: do the tasks that you flatten return something? Based on this, Prefect seems to have difficulty inferring the state of the flattened task.
Ahmed Ezzat
1. Yes, it also occurs with `LocalDaskExecutor`.
2. Yes, my flattened tasks return a list. I wouldn't call it a big list: it's only around 1,000-2,000 items, and all of them are numbers between 1 and 100.
Anna Geller
@Ahmed Ezzat could you build an example that would reproduce the issue? Otherwise it's hard to tell what the problem is.
Ahmed Ezzat
@Anna Geller I'm really trying to recreate this problem on purpose, but it's really hard to do. I believe my workflow code is OK, since it completes without any problems about 80% of the time. Are there any known cases that would cause this problem? My tasks are both CPU- and network-heavy, so I suspect this may be caused by a connectivity issue.
@Anna Geller I could provide access to the source code if needed; however, I'm not really comfortable sharing this information publicly, so if you don't mind, can I DM you?
Anna Geller
I think even if you share your code, I won't be able to run it, right? Reproducing it with just plain Python data structures would be easier.
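A minimal sketch of what such a plain-Python reproduction might look like, based only on the description above; the task names, batch count, and flow name are assumptions, not the actual workflow:

    import random

    from prefect import Flow, task, flatten
    from prefect.executors import LocalDaskExecutor

    @task
    def generate_batches():
        # a handful of batches to map over
        return list(range(10))

    @task
    def produce_numbers(batch):
        # each mapped task returns a list of roughly 1,000-2,000 small integers,
        # mirroring the shape described above
        return [random.randint(1, 100) for _ in range(random.randint(1000, 2000))]

    @task
    def consume(number):
        return number * 2

    with Flow("flatten-repro") as flow:
        batches = generate_batches()
        nested = produce_numbers.map(batches)
        results = consume.map(flatten(nested))

    flow.executor = LocalDaskExecutor()

    if __name__ == "__main__":
        flow.run()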
Ahmed Ezzat
Apologies, I know how insufficient my information is, especially when I'm unable to provide reproducible steps. I'll keep trying to make the issue reproducible; meanwhile, if you have any updates, please let me know.
Anna Geller
This sounds reasonable 👍 I'll keep you posted if I hear any ideas from the team.
Ahmed Ezzat
Hi @Anna Geller, just a quick update. After some inspection, I found that both the Hasura and GraphQL endpoints become unresponsive sometimes. Could this be the root of the problem? Currently, I'm testing with `"DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING": "false"`, and it seems to produce more stable runs. The current run will take around 2-3 hours; I'll keep you posted.
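For reference, a sketch of how that setting could be added to the env mapping from the executor snippet earlier in the thread, assuming Python 3.9+ for the dict merge operator; the other values are copied from that snippet:

    import os

    # environment passed to the Dask scheduler/worker pods, with work stealing disabled
    env = dict(os.environ) | {
        "DASK_DISTRIBUTED__WORKER__MULTIPROCESSING_METHOD": "fork",
        "DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES": "100",
        "DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING": "false",
    }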
Anna Geller
Nice work! Yup, this is definitely related, because when the API is not accessible, the task runs can't update their states, and that will cause issues. Btw, perhaps you could try the same with Prefect Cloud? We have a free tier with 20,000 free task runs every month - no credit card required to get started.
Ahmed Ezzat
Good news! It seems like this was the root of the problem. I scaled up the staging cluster and deployments, and everything is working just fine. Maybe this should throw a more user-friendly error; it was really hard to find the root of the problem. I'm thinking of logging an error with something like "couldn't access run x result" rather than crashing everything. Actually, we have a paid account; however, we'd also love to have a self-contained environment. Thanks, Anna, for your help. Hope you're having a wonderful day :)
Anna Geller
Gotcha, I only suggested trying out Cloud because I thought it may help with the GraphQL availability. So glad to hear you figured it out! Have a great day (and a great Christmas break)!