Hi, I am using prefect to run multiple spark jobs ...
# prefect-server
s
Hi, I am using Prefect to run multiple Spark jobs on k8s. Each of these jobs can be long-running, with some of them having execution times of more than an hour. The jobs are mapped and executed. However, in some instances I can see that only a few of the jobs in the map are executing. I am using the `LocalExecutor`, with which these mapped jobs run sequentially. Are there instances where the `LocalExecutor` is not able to track the Spark job running on k8s and thus not able to trigger the subsequent tasks in the map?
Below is the sample flow
with Flow("Master", schedule=schedule, run_config=master_run_config) as flow:

  jobs = ["x1","x2", "xn"]

  task_compute_bots_markers = task_spark_submit_compute.map(
        run_id=jobs
    )
Each task is a `spark-submit` executed as a `ShellTask`.
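For reference, one way such a mapped `spark-submit` task could be wired up is sketched below; this is not the actual code, and the helper task, script path, Spark master URL, and container image are placeholders:
```python
from prefect import Flow, task
from prefect.tasks.shell import ShellTask

# A single ShellTask instance reused for every mapped spark-submit
spark_submit = ShellTask(name="spark-submit", stream_output=True)

@task
def build_spark_submit_command(run_id: str) -> str:
    # Placeholder Spark master URL, image, and application script
    return (
        "spark-submit --master k8s://https://<k8s-api-server>:443 "
        "--deploy-mode cluster "
        "--conf spark.kubernetes.container.image=<image> "
        f"local:///opt/app/compute.py --run-id {run_id}"
    )

with Flow("Master") as flow:
    jobs = ["x1", "x2", "xn"]
    commands = build_spark_submit_command.map(run_id=jobs)
    task_compute_bots_markers = spark_submit.map(command=commands)
```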
a
If you get heartbeat issues because of these long-running jobs, check this Discourse topic. Your code example looks fine. To run those mapped tasks in parallel, you can attach a `LocalDaskExecutor` to your flow.
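A minimal sketch of attaching it (the scheduler and number of workers below are illustrative, not required values):
```python
from prefect import Flow
from prefect.executors import LocalDaskExecutor

with Flow("Master") as flow:
    # ... mapped spark-submit tasks as in the snippet above ...
    pass

# "threads" with 4 workers is just an example; "processes" is another option
flow.executor = LocalDaskExecutor(scheduler="threads", num_workers=4)
```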
> However, in some instances I can see that only a few of the jobs in the map are executing.
Can you share a full code example and some screenshots or logs of what you see? Where are your Prefect agent and Spark cluster running?
s
The Prefect agent and Spark cluster are running on an Azure k8s cluster.
I tried today with the `LocalDaskExecutor`. I can see that the jobs are running in parallel, but even here the long-running jobs fail to get any status and to be marked as completed. Correspondingly, on the Spark cluster the same jobs are in the `Succeeded` phase.
a
I understand your frustration, but we don't have any solution for that currently. It's just super challenging to manage such long-running jobs, especially when triggered from Azure AKS. You can read more in this section.
s
Ok, what could be the workaround for this? Is it possible to have another flow which keeps monitoring the tasks in a loop and then, based on some condition, marks the specific task as `SUCCESS` or `FAIL`? Will the executor then proceed to the next items in the map?
a
Here are some ideas:
• You can run your subprocess with a local agent rather than in a Kubernetes job.
• You can offload long-running jobs that cause heartbeat issues into a separate flow (subflow):
  ◦ this subflow can run on some local machine while your dependency-heavy containerized child flows run on Kubernetes,
  ◦ you may optionally turn off flow heartbeats for such a flow using the `disable_flow_heartbeat` mutation or from the UI,
  ◦ you can orchestrate those child flows from a parent flow (i.e. build a flow-of-flows), as sketched below.
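A rough flow-of-flows sketch, assuming a child flow named "spark-compute-child" registered in a project "my-project" (both names are placeholders):
```python
from prefect import Flow
from prefect.tasks.prefect import create_flow_run, wait_for_flow_run

# The child flow would contain the long-running spark-submit tasks and can run
# on a local agent via its own run_config
with Flow("parent-flow") as parent_flow:
    child_run_id = create_flow_run(
        flow_name="spark-compute-child", project_name="my-project"
    )
    child_state = wait_for_flow_run(
        child_run_id, raise_final_state=True, stream_logs=True
    )
```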