# ask-community
Is there any plan to create a task-library task for running Spark on kubernetes (as opposed to Databricks)?
n
Hi @Ryan Sattler - no plans for that internally at the moment but much of the task library is user-contributed... if that's something you'd like to see we welcome PRs 😄
g
I was thinking about this as well, and I wonder if we could create a class that initializes the SparkContext inside the flow (we would need a Docker image with Spark installed) and points the Spark config at the Kubernetes Spark cluster.
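For illustration, here is a minimal sketch of what such a class could look like as a Prefect `Task`, assuming Spark's client mode against Kubernetes (Spark 2.4+); the master URL, container image, and the trivial `range`/`count` job are all placeholders, not a tested recipe:

```python
from prefect import Task
from pyspark.sql import SparkSession

class KubernetesSparkTask(Task):
    """Hypothetical task that opens a SparkSession against a k8s master."""

    def __init__(self, master_url, image, **kwargs):
        self.master_url = master_url  # e.g. "k8s://https://<apiserver-host>:6443"
        self.image = image            # executor container image with Spark installed
        super().__init__(**kwargs)

    def run(self):
        # In client mode the driver runs here (inside the flow's container);
        # executors are launched as pods on the cluster.
        spark = (
            SparkSession.builder
            .master(self.master_url)
            .appName("prefect-spark-on-k8s")
            .config("spark.kubernetes.container.image", self.image)
            .config("spark.executor.instances", "2")
            .getOrCreate()
        )
        try:
            return spark.range(100).count()  # trivial job to prove the session works
        finally:
            spark.stop()
```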
k
I have used Databricks a lot before but don’t have as much experience with Kubernetes Spark. Would the task run `spark-submit`, and is that how you would connect to the Kubernetes Spark cluster? If your Spark is already configured, wouldn’t just instantiating a SparkSession inside a flow work? And to echo what Nicholas said, we’d surely welcome PRs for this.
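As a sketch of the "just instantiate it" approach, assuming the image already has `pyspark` installed and a `spark-defaults.conf` (or environment) pointing at the cluster; everything here is illustrative:

```python
from prefect import task, Flow
from pyspark.sql import SparkSession

@task
def count_rows():
    # getOrCreate() picks up whatever master/config is already set up
    # (spark-defaults.conf, env vars); pass .master(...) to override.
    spark = SparkSession.builder.appName("inline-session").getOrCreate()
    try:
        return spark.range(1000).count()
    finally:
        spark.stop()

with Flow("spark-inline") as flow:
    count_rows()
```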
g
We could use a ShellTask to run the `spark-submit` command, but I would say it might be possible to instantiate a SparkSession inside the Flow (will have to give it a try). I did not look, but is that how the Databricks one works (via a session)?
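A ShellTask version might look like the sketch below; the API server host, image name, and example script path are placeholders following the usual Spark-on-Kubernetes `spark-submit` pattern, not a verified setup:

```python
from prefect import Flow
from prefect.tasks.shell import ShellTask

spark_submit = ShellTask(name="spark-submit")

# All hosts, images, and paths below are placeholders.
command = " ".join([
    "spark-submit",
    "--master k8s://https://<apiserver-host>:6443",
    "--deploy-mode cluster",
    "--name prefect-pi",
    "--conf spark.executor.instances=2",
    "--conf spark.kubernetes.container.image=<spark-image>",
    "local:///opt/spark/examples/src/main/python/pi.py",
])

with Flow("spark-submit-on-k8s") as flow:
    spark_submit(command=command)
```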
k
Databricks has a `databricks-connect` library that hijacks your Spark installation, so `import pyspark` and creating the `SparkSession` compiles the DAG locally, then sends it to the configured cluster when there is an action.
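To make that lazy-evaluation point concrete, a small example (nothing here is Databricks-specific code; with `databricks-connect` installed and configured, these same calls target the remote cluster):

```python
from pyspark.sql import SparkSession

# With databricks-connect configured, this "local" session is actually
# wired to the remote Databricks cluster.
spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000)             # transformation: builds a plan, runs nothing
doubled = df.selectExpr("id * 2 AS x")  # still only extending the local DAG
total = doubled.count()                 # action: plan is shipped to the cluster and run
print(total)
```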
g
Hummm, I guess it would be possible to instantiate a SparkSession inside the Flow if we run it inside an image with `pyspark` installed.
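If it helps, a sketch of wiring that up in Prefect 1.x with a `KubernetesRun` run config, where the image name is a placeholder for an image that bundles `pyspark` (and a JVM):

```python
from prefect import Flow, task
from prefect.run_configs import KubernetesRun

@task
def spark_count():
    # This import succeeds only because the flow's image includes pyspark.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("in-image").getOrCreate()
    try:
        return spark.range(10).count()
    finally:
        spark.stop()

with Flow("pyspark-image-flow") as flow:
    spark_count()

# Execute the flow in a pod based on an image with pyspark installed.
flow.run_config = KubernetesRun(image="<registry>/prefect-pyspark:latest")
```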