Hi, I am using prefect to run multiple spark jobs ...
# prefect-server
s
Hi, I am using Prefect to run multiple Spark jobs on k8s. Each of these jobs can be long-running, with some of them having execution times of more than an hour. The jobs are mapped and executed. However, in some instances I can see that only a few of the jobs in the map are executing. I am using the `LocalExecutor`, with which these mapped jobs run sequentially. Are there instances where the `LocalExecutor` is not able to track the Spark job running on k8s and thus not able to trigger the subsequent tasks in the map?
Below is the sample flow
with Flow("Master", schedule=schedule, run_config=master_run_config) as flow:

  jobs = ["x1","x2", "xn"]

  task_compute_bots_markers = task_spark_submit_compute.map(
        run_id=jobs
    )
Each task is a `spark-submit` executed as a `ShellTask`.
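For reference, one way such a mapped `spark-submit` task could be wired up is sketched below; this is not the actual code, and the helper task, script path, Spark master URL, and container image are placeholders:
```python
from prefect import Flow, task
from prefect.tasks.shell import ShellTask

# A single ShellTask instance reused for every mapped spark-submit
spark_submit = ShellTask(name="spark-submit", stream_output=True)

@task
def build_spark_submit_command(run_id: str) -> str:
    # Placeholder Spark master URL, image, and application script
    return (
        "spark-submit --master k8s://https://<k8s-api-server>:443 "
        "--deploy-mode cluster "
        "--conf spark.kubernetes.container.image=<image> "
        f"local:///opt/app/compute.py --run-id {run_id}"
    )

with Flow("Master") as flow:
    jobs = ["x1", "x2", "xn"]
    commands = build_spark_submit_command.map(run_id=jobs)
    task_compute_bots_markers = spark_submit.map(command=commands)
```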
a
If you get heartbeat issues because of these long-running jobs, check this Discourse topic. Your code example looks fine. To run those mapped tasks in parallel, you can attach a `LocalDaskExecutor` to your flow.
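A minimal sketch of attaching it (the scheduler and number of workers below are illustrative, not required values):
```python
from prefect import Flow
from prefect.executors import LocalDaskExecutor

with Flow("Master") as flow:
    # ... mapped spark-submit tasks as in the snippet above ...
    pass

# "threads" with 4 workers is just an example; "processes" is another option
flow.executor = LocalDaskExecutor(scheduler="threads", num_workers=4)
```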
> However, in some instances I can see that only a few of the jobs in the map are executing.
Can you share a full code example and some screenshots or logs of what you see? Where are your Prefect agent and Spark cluster running?
s
The Prefect agent and Spark cluster are running on an Azure k8s cluster.
I tried today with the `LocalDaskExecutor`. I can see that the jobs are running in parallel, but even here the long-running jobs fail to get any status and to be marked as completed. Correspondingly, on the Spark cluster the same jobs are in the `Succeeded` phase.
a
I understand your frustration, but we don't have any solution for that currently. It's just super challenging to manage such long-running jobs, especially when triggered from Azure AKS. You can read more in this section.
s
Ok, what could be the workaround for this? Is it possible to have another flow which keeps monitoring the tasks in a loop and then, based on some condition, marks the specific task as `SUCCESS` or `FAIL`? Will the executor then proceed to the next items in the map?
a
Here are some ideas:
• You can run your subprocess with a local agent rather than in a Kubernetes job.
• You can offload long-running jobs that cause heartbeat issues into a separate flow (subflow):
  ◦ this subflow can run on some local machine while your dependency-heavy containerized child flows run on Kubernetes,
  ◦ you may optionally turn off flow heartbeats for such a flow using the `disable_flow_heartbeat` mutation or from the UI,
  ◦ you can orchestrate those child flows from a parent flow (i.e. build a flow-of-flows), as sketched below.
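A rough flow-of-flows sketch, assuming a child flow named "spark-compute-child" registered in a project "my-project" (both names are placeholders):
```python
from prefect import Flow
from prefect.tasks.prefect import create_flow_run, wait_for_flow_run

# The child flow would contain the long-running spark-submit tasks and can run
# on a local agent via its own run_config
with Flow("parent-flow") as parent_flow:
    child_run_id = create_flow_run(
        flow_name="spark-compute-child", project_name="my-project"
    )
    child_state = wait_for_flow_run(
        child_run_id, raise_final_state=True, stream_logs=True
    )
```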