Thread
#prefect-server
    Ahmed Ezzat

    9 months ago
    Hi, I'm seeing unusual errors while using Prefect with a Dask cluster. Here is more information: https://github.com/PrefectHQ/prefect/issues/5252
    import os

    import prefect
    from dask_kubernetes import KubeCluster, make_pod_spec

    # min_workers / max_workers are defined elsewhere in the flow setup
    flow.executor = prefect.executors.DaskExecutor(
        cluster_class=lambda: KubeCluster(
            pod_template=make_pod_spec(
                memory_request="64M",
                memory_limit="4G",
                cpu_request="0.5",
                cpu_limit="8",
                threads_per_worker=24,
                image=prefect.context.image,
            ),
            deploy_mode="remote",
            idle_timeout="0",
            scheduler_service_wait_timeout="0",
            # dict merge with `|` requires Python 3.9+
            env=dict(os.environ)
            | {
                "DASK_DISTRIBUTED__WORKER__MULTIPROCESSING_METHOD": "fork",
                "DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES": "100",
            },
        ),
        adapt_kwargs={"minimum": min_workers, "maximum": max_workers},
    )
    Anna Geller

    9 months ago
    @Ahmed Ezzat thanks for the detailed information in the issue. I don’t see why this would occur; I’ll ask the team. Meanwhile, do you get the same error when using LocalDaskExecutor, or does it only happen with the Dask executor?
    Also: do the tasks that you flatten return something? Based on this, Prefect seems to have difficulty inferring the state of the flattened task.
    Ahmed Ezzat

    9 months ago
    1. Yes, it also occurs with LocalDaskExecutor.
    2. Yes, my flattened tasks return a list. I wouldn't call it a big list: it's only around 1,000-2,000 items, and all of them are numbers between 1 and 100.
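    [Editor's note] For context, a minimal sketch of the data shape described above. The batch count and sizes are assumptions for illustration, not Ahmed's real task output; plain `itertools.chain` stands in for what Prefect's `flatten` does to mapped results before a downstream `.map()`:

    ```python
    from itertools import chain

    # Hypothetical stand-in for the mapped-task results described above:
    # several outputs, each a list of ~1,500 small integers between 1 and 100.
    batches = [[n % 100 + 1 for n in range(1500)] for _ in range(3)]

    # Prefect's flatten() effectively concatenates mapped results into one
    # flat list, so a downstream .map() iterates over individual items.
    flat = list(chain.from_iterable(batches))

    print(len(flat))             # 4500
    print(min(flat), max(flat))  # 1 100
    ```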
    Anna Geller

    9 months ago
    @Ahmed Ezzat could you build an example that reproduces the issue? Otherwise it’s hard to tell what the issue is.
    Ahmed Ezzat

    9 months ago
    @Anna Geller I'm really trying to recreate this problem on purpose, but it's really hard to do. I believe my workflow code is OK, since it completes without any problems 80% of the time. Are there any known cases that would cause this problem? My tasks are both CPU- and network-heavy, so I believe this may be caused by a connectivity issue.
    @Anna Geller I could provide access to the source code if needed; however, I'm not really comfortable sharing this information publicly, so if you don't mind, can I DM you?
    Anna Geller

    9 months ago
    I think even if you share your code, I won't be able to run it, right? Reproducing it just with plain Python data structures would be easier.
    Ahmed Ezzat

    9 months ago
    Apologies, I know how insufficient my information is, especially since I'm unable to provide reproduction steps. I'll keep trying to make the issue reproducible; meanwhile, if you have any updates, please keep me posted.
    Anna Geller

    9 months ago
    this sounds reasonable 👍 I’ll keep you posted if I hear any ideas from the team
    Ahmed Ezzat

    9 months ago
    Hi @Anna Geller, just a quick update. After some inspection, I found that both the Hasura and GraphQL endpoints sometimes become unresponsive. Could this be the root of the problem? Currently, I'm testing with
    "DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING": "false"
    and it seems to produce more stable runs. The current run will take around 2-3 hours; I'll keep you posted.
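    [Editor's note] The tweak Ahmed describes can be sketched as follows, assuming the same `dict(os.environ) | {...}` merge pattern as the executor config earlier in the thread. dask.distributed maps `DASK_DISTRIBUTED__*` environment variables onto nested config keys, so this one corresponds to `distributed.scheduler.work-stealing`:

    ```python
    import os

    # Env passed to the scheduler/worker pods; dask reads nested config keys
    # from DASK_DISTRIBUTED__* variable names (double underscore = nesting).
    env = dict(os.environ) | {
        "DASK_DISTRIBUTED__WORKER__MULTIPROCESSING_METHOD": "fork",
        "DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES": "100",
        # Disable work stealing so tasks are not rescheduled between workers.
        "DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING": "false",
    }

    print(env["DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING"])  # false
    ```

    This `env` dict would then be passed to `KubeCluster` in place of the one in the original snippet.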
    Anna Geller

    9 months ago
    nice work! Yup, this is definitely related, because when the API is not accessible, the task runs can’t update their states, which causes issues. Btw, perhaps you could try the same with Prefect Cloud? We have a free tier with 20,000 free task runs every month; no credit card required to get started.
    Ahmed Ezzat

    9 months ago
    Good news! It seems like this was the root of the problem: I scaled up the staging cluster and deployments, and everything is working just fine. Maybe this should throw a more user-friendly error; it was really hard to find the root of the problem. I'm thinking a logged error like "couldn't access run x result" would be better than crashing everything. Actually, we have a paid account, but we'd also love to have a self-contained environment. Thanks, Anna, for your help. Hope you're having a wonderful day 😃
    Anna Geller

    9 months ago
    Gotcha, I only suggested trying out Cloud because I thought it might help with the GraphQL availability. So glad to hear you figured it out! Have a great day (and a great Christmas break)!