Hi everyone, I’m looking for suggestions for how to use Prefect together with a Spark cluster running on AWS. Is there an equivalent to Airflow’s SparkSubmitOperator? I’ve only found a Databricks task in the docs, but I’m not interested in getting a Databricks subscription at this point.
Anna Geller
01/02/2022, 11:39 PM
Hi @Fina Silva-Santisteban, happy New Year!
I think you could use awswrangler for that. This notebook shows how to create an EMR cluster, upload a PySpark script, and submit that script to the EMR cluster in a few commands that you could put together into Prefect tasks.
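The awswrangler approach can be sketched as a handful of functions that you would each wrap in a Prefect `@task` and compose in a Flow. This is an untested sketch: the function names and placeholder arguments are mine, and the awswrangler calls (`wr.emr.create_cluster`, `wr.emr.submit_spark_step`, `wr.emr.get_step_state`, `wr.emr.terminate_cluster`) come from awswrangler’s EMR module — check the linked notebook for the exact cluster-sizing arguments.

```python
# Hedged sketch of the awswrangler-on-EMR flow. The awswrangler import
# lives inside each function (a common pattern for Prefect tasks, so the
# flow file imports cleanly even where awswrangler isn't installed).
# Each function would be decorated with @prefect.task in a real flow.

def create_emr_cluster(subnet_id: str) -> str:
    """Spin up an EMR cluster and return its cluster id."""
    import awswrangler as wr
    return wr.emr.create_cluster(subnet_id=subnet_id)

def submit_pyspark_script(cluster_id: str, script_s3_path: str) -> str:
    """Submit a PySpark script (already uploaded to S3) as a Spark step."""
    import awswrangler as wr
    return wr.emr.submit_spark_step(cluster_id=cluster_id, path=script_s3_path)

def wait_for_step(cluster_id: str, step_id: str) -> str:
    """Poll the step until it leaves the PENDING/RUNNING states."""
    import time
    import awswrangler as wr
    state = wr.emr.get_step_state(cluster_id=cluster_id, step_id=step_id)
    while state in ("PENDING", "RUNNING"):
        time.sleep(30)
        state = wr.emr.get_step_state(cluster_id=cluster_id, step_id=step_id)
    return state

def terminate_emr_cluster(cluster_id: str) -> None:
    """Tear the cluster down once the step has finished."""
    import awswrangler as wr
    wr.emr.terminate_cluster(cluster_id=cluster_id)
```

Chaining these as Prefect tasks gives you create → submit → wait → terminate as one flow, with Prefect handling retries and state.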
🌟 2
Kevin Kho
01/03/2022, 2:43 PM
If you use `spark-submit`, then I think it would be through the ShellTask
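Kevin’s suggestion amounts to assembling a `spark-submit` command line and handing it to a ShellTask. A small hypothetical helper (the name and defaults are mine) makes that concrete:

```python
# Hypothetical helper: build the spark-submit command string that a
# Prefect ShellTask would execute. Defaults (yarn, cluster mode) are
# assumptions for an EMR-style setup, not Prefect conventions.
from typing import Dict, List, Optional

def build_spark_submit_command(
    app_path: str,
    master: str = "yarn",
    deploy_mode: str = "cluster",
    conf: Optional[Dict[str, str]] = None,
    app_args: Optional[List[str]] = None,
) -> str:
    """Assemble a spark-submit invocation as a single shell command."""
    parts = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    for key, value in (conf or {}).items():
        parts += ["--conf", f"{key}={value}"]
    parts.append(app_path)
    parts += app_args or []
    return " ".join(parts)
```

In Prefect 1.x you would then run it inside a Flow with something like `ShellTask()(command=build_spark_submit_command("s3://my-bucket/job.py"))` (hedged — check `prefect.tasks.shell.ShellTask` for the exact call signature).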
🌟 1
Fina Silva-Santisteban
01/03/2022, 3:40 PM
That looks like a useful library!! Thank you for sharing @Anna Geller!
@Kevin Kho oh I see, so basically replicate whatever I’d do in the shell by using the ShellTask, sounds like a good workaround! Would you happen to know why Prefect doesn’t have a dedicated Spark task?? Spark seems to be one of the standard tools in the data engineering toolbox 🤔
👍 1
Kevin Kho
01/03/2022, 3:44 PM
I guess just no one has gotten around to contributing. Do you think it would just be a subclass of the ShellTask that fills in the appropriate arguments?
💡 1
Fina Silva-Santisteban
01/03/2022, 3:59 PM
@Kevin Kho something like the db tasks, e.g. the PostgresExecute or SnowflakeQuery tasks, where you set up the connection and then point at the file to deploy, would be nice 😅 It’s basically what Airflow’s SparkSubmitOperator does
Fina Silva-Santisteban
01/03/2022, 4:00 PM
Actually that sounds like it could be just a subclass of the ShellTask!
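That subclass idea can be sketched in a few lines. This is a hypothetical `SparkSubmitTask`, not something that exists in Prefect; the `ShellTask` below is a minimal stand-in so the snippet runs on its own, and a real contribution would instead subclass `prefect.tasks.shell.ShellTask` and pass the assembled command to its constructor.

```python
# Minimal stand-in so this sketch runs without Prefect installed.
# In a real contribution, delete this and use:
#   from prefect.tasks.shell import ShellTask
class ShellTask:
    def __init__(self, command=None, **kwargs):
        self.command = command

class SparkSubmitTask(ShellTask):
    """Hypothetical task that runs a PySpark script via spark-submit,
    analogous to Airflow's SparkSubmitOperator."""

    def __init__(self, app_path, master="yarn", deploy_mode="cluster",
                 conf=None, **kwargs):
        # Assemble the spark-submit command line from the task arguments.
        command = ["spark-submit", "--master", master,
                   "--deploy-mode", deploy_mode]
        for key, value in (conf or {}).items():
            command += ["--conf", f"{key}={value}"]
        command.append(app_path)
        # Delegate execution to ShellTask with the finished command.
        super().__init__(command=" ".join(command), **kwargs)
```

The design choice here mirrors the db tasks mentioned above: configuration goes into the constructor, and the parent ShellTask handles actually running the command.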
Kevin Kho
01/03/2022, 4:10 PM
If you end up coding it, I’m sure we’d take a PR for it