# ask-community
f
Hi everyone, I’m looking for suggestions for how to use Prefect together with a Spark cluster running on AWS. Is there an equivalent to Airflow’s SparkSubmitOperator? I’ve only found a DataBricksTask in the docs, but I’m not interested in getting a Databricks subscription at this point.
a
Hi @Fina Silva-Santisteban, happy New Year! I think you could use awswrangler for that. This notebook shows how to create an EMR cluster, upload a PySpark script, and submit that script to the EMR cluster in a few commands that you could put together into Prefect tasks.
🌟 2
k
If you use spark-submit, then I think it would be through the ShellTask
🌟 1
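To make the spark-submit suggestion concrete, here is a minimal sketch of assembling the command string that a shell-based task would run. The helper name `build_spark_submit_command`, the S3 path, and the config values are made up for illustration; only the `spark-submit` flags `--master`, `--deploy-mode`, and `--conf` are standard. It is not a Prefect API.

```python
import shlex
from typing import Dict, List, Optional


def build_spark_submit_command(
    application: str,
    master: str = "yarn",
    deploy_mode: str = "cluster",
    conf: Optional[Dict[str, str]] = None,
    application_args: Optional[List[str]] = None,
) -> str:
    """Return a shell-safe spark-submit command line as one string."""
    parts = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    # Sort the config keys so the generated command is deterministic.
    for key, value in sorted((conf or {}).items()):
        parts += ["--conf", f"{key}={value}"]
    parts.append(application)
    parts += application_args or []
    # shlex.join (Python 3.8+) quotes any argument that needs it.
    return shlex.join(parts)


# Hypothetical script location and arguments, for illustration only.
command = build_spark_submit_command(
    "s3://my-bucket/jobs/etl.py",
    conf={"spark.executor.memory": "4g"},
    application_args=["--date", "2022-01-03"],
)
print(command)
```

Assuming Prefect 1.x, the resulting string could then be passed as the `command` argument when calling a `ShellTask` inside a flow, on a machine where `spark-submit` is installed and can reach the cluster.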
f
That looks like a useful library!! Thank you for sharing @Anna Geller! @Kevin Kho oh I see, so basically replicate whatever I’d do in the shell by using the ShellTask, sounds like a good workaround! Would you happen to know why Prefect doesn’t have a dedicated Spark task?? Spark seems to be one of the standard tools in the data engineering toolbox 🤔
👍 1
k
I guess just no one has gotten around to contributing. Do you think it would just be a subclass of the ShellTask and then fill in the appropriate arguments?
💡 1
f
@Kevin Kho something like the db tasks, e.g. PostgresExecute or SnowflakeQuery Task, where you set up the connection and then point at the file to deploy would be nice 😅 It’s basically what Airflow’s SparkSubmitOperator does
Actually, that sounds like it could be just a subclass of the ShellTask!
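The "set up the connection, then point at the file" idea could look roughly like the sketch below: cluster settings are fixed at construction time and the application file is supplied at run time. `SparkSubmitJob` and its `runner` hook are hypothetical names, not part of Prefect or Airflow. In a real Prefect 1.x task the same split would presumably map onto the task's `__init__` and `run` methods, but that wiring is left out here so the example runs without Prefect or a cluster.

```python
import shlex
import subprocess
from typing import Callable, Dict, List, Optional


class SparkSubmitJob:
    """Hypothetical helper: cluster settings up front, application at run time."""

    def __init__(
        self,
        master: str = "yarn",
        deploy_mode: str = "cluster",
        conf: Optional[Dict[str, str]] = None,
        runner: Optional[Callable[[List[str]], int]] = None,
    ) -> None:
        self.master = master
        self.deploy_mode = deploy_mode
        self.conf = dict(conf or {})
        # The runner takes an argv list and returns an exit code.
        # By default it really executes spark-submit via subprocess.
        self.runner = runner or self._run_subprocess

    @staticmethod
    def _run_subprocess(argv: List[str]) -> int:
        return subprocess.run(argv, check=True).returncode

    def build_argv(
        self, application: str, application_args: Optional[List[str]] = None
    ) -> List[str]:
        argv = ["spark-submit", "--master", self.master,
                "--deploy-mode", self.deploy_mode]
        for key, value in sorted(self.conf.items()):
            argv += ["--conf", f"{key}={value}"]
        argv.append(application)
        argv += application_args or []
        return argv

    def run(
        self, application: str, application_args: Optional[List[str]] = None
    ) -> int:
        return self.runner(self.build_argv(application, application_args))


# Dry run: record the argv instead of calling spark-submit.
recorded: List[List[str]] = []


def dry_run(argv: List[str]) -> int:
    recorded.append(argv)
    return 0


job = SparkSubmitJob(conf={"spark.executor.memory": "4g"}, runner=dry_run)
exit_code = job.run("s3://my-bucket/jobs/etl.py", ["--date", "2022-01-03"])
print(shlex.join(recorded[0]))
```

Injecting the runner keeps the command construction testable on its own. Dropping the `runner` argument falls back to `subprocess.run`, which requires `spark-submit` on the PATH.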
k
If you end up coding it, I’m sure we’d take a PR for it
🤩 1
🙌 1