# prefect-server
Ahmed Ezzat
Hi, I'm seeing unusual errors while using Prefect with a Dask cluster. Here is some more information: https://github.com/PrefectHQ/prefect/issues/5252
    import os

    import prefect
    from dask_kubernetes import KubeCluster, make_pod_spec

    # min_workers / max_workers are defined elsewhere in the flow module
    flow.executor = prefect.executors.DaskExecutor(
        cluster_class=lambda: KubeCluster(
            pod_template=make_pod_spec(
                memory_request="64M",
                memory_limit="4G",
                cpu_request="0.5",
                cpu_limit="8",
                threads_per_worker=24,
                image=prefect.context.image,
            ),
            deploy_mode="remote",
            idle_timeout="0",
            scheduler_service_wait_timeout="0",
            # merge the current environment with extra Dask settings (Python 3.9+ dict merge)
            env=dict(os.environ)
            | {
                "DASK_DISTRIBUTED__WORKER__MULTIPROCESSING_METHOD": "fork",
                "DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES": "100",
            },
        ),
        adapt_kwargs={"minimum": min_workers, "maximum": max_workers},
    )
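For context, a minimal sketch of how an executor configured this way might be attached to a flow and run under Prefect 1.x; the flow name and task below are placeholders, not taken from this thread:

    import prefect
    from prefect import Flow, task
    from prefect.executors import DaskExecutor

    @task
    def say_hello():
        # Prefect 1.x exposes a logger through the task run context
        logger = prefect.context.get("logger")
        logger.info("hello from a Dask worker")

    with Flow("dask-executor-demo") as flow:
        say_hello()

    # the executor is assigned after the flow is defined, as in the snippet above
    flow.executor = DaskExecutor()

    if __name__ == "__main__":
        flow.run()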
Anna Geller
@Ahmed Ezzat thanks for the detailed information in the issue. I don't see offhand why this may occur; I'll ask the team. Meanwhile, do you get the same error when using `LocalDaskExecutor`, or does it only happen with the distributed `DaskExecutor`?
Also: do the tasks that you flatten return something? Based on this, Prefect seems to have difficulty inferring the state of the flattened task.
Ahmed Ezzat
1. Yes, it also occurs with `LocalDaskExecutor`.
2. Yes, my flattened tasks return a list. I wouldn't call it a big list: it's only around 1,000-2,000 items, and all of them are numbers between 1 and 100.
Anna Geller
@Ahmed Ezzat could you build an example that would reproduce the issue? Otherwise it's hard to tell what the problem is.
Ahmed Ezzat
@Anna Geller I'm really trying to recreate this problem on purpose, but it's really hard to do. I believe my workflow code is OK, since it completes without any problems about 80% of the time. Are there any known cases that would cause this problem? My tasks are both CPU- and network-heavy, so I suspect this may be caused by a connectivity issue.
@Anna Geller I could provide access to the source code if needed; however, I'm not really comfortable sharing this information publicly, so if you don't mind, can I DM you?
Anna Geller
I think even if you share your code, I won't be able to run it, right? Reproducing it with just plain Python data structures would be easier.
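A minimal sketch of what such a plain-Python reproduction might look like, based only on the description above; the task names, batch count, and flow name are assumptions, not the actual workflow:

    import random

    from prefect import Flow, task, flatten
    from prefect.executors import LocalDaskExecutor

    @task
    def generate_batches():
        # a handful of batches to map over
        return list(range(10))

    @task
    def produce_numbers(batch):
        # each mapped task returns a list of roughly 1,000-2,000 small integers,
        # mirroring the shape described above
        return [random.randint(1, 100) for _ in range(random.randint(1000, 2000))]

    @task
    def consume(number):
        return number * 2

    with Flow("flatten-repro") as flow:
        batches = generate_batches()
        nested = produce_numbers.map(batches)
        results = consume.map(flatten(nested))

    flow.executor = LocalDaskExecutor()

    if __name__ == "__main__":
        flow.run()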
Ahmed Ezzat
Apologies, I know how insufficient my information is, especially when I'm unable to provide reproducible steps. I'll keep trying to make the issue reproducible; meanwhile, if you have any updates, please let me know.
Anna Geller
This sounds reasonable 👍 I'll keep you posted if I hear any ideas from the team.
Ahmed Ezzat
Hi @Anna Geller, just a quick update. After some inspection, I found that both the Hasura and GraphQL endpoints become unresponsive sometimes. Could this be the root of the problem? Currently, I'm testing with `"DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING": "false"`, and it seems to produce more stable runs. The current run will take around 2-3 hours; I'll keep you posted.
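For reference, a sketch of how that setting could be added to the env mapping from the executor snippet earlier in the thread, assuming Python 3.9+ for the dict merge operator; the other values are copied from that snippet:

    import os

    # environment passed to the Dask scheduler/worker pods, with work stealing disabled
    env = dict(os.environ) | {
        "DASK_DISTRIBUTED__WORKER__MULTIPROCESSING_METHOD": "fork",
        "DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES": "100",
        "DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING": "false",
    }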
Anna Geller
Nice work! Yup, this is definitely related, because when the API is not accessible, the task runs can't update their states, and that will cause issues. Btw, perhaps you could try the same with Prefect Cloud? We have a free tier with 20,000 free task runs every month - no credit card required to get started.
Ahmed Ezzat
Good news! It seems like this was the root of the problem. I scaled up the staging cluster and deployments, and everything is working just fine. Maybe this should throw a more user-friendly error; it was really hard to find the root of the problem. I'm thinking of logging an error with something like "couldn't access run x result" rather than crashing everything. Actually, we have a paid account; however, we'd also love to have a self-contained environment. Thanks, Anna, for your help. Hope you're having a wonderful day :)
Anna Geller
Gotcha, I only suggested trying out Cloud because I thought it may help with the GraphQL availability. So glad to hear you figured it out! Have a great day (and a great Christmas break)!