Hi guys! I was wondering what would be the best wa...
# prefect-community
s
Hi guys! I was wondering what would be the best way to run spark jobs on AWS EMR with Prefect? I thought about creating an execution environment, but I'm not sure if that is really the best solution here. I am especially curious in how to manage step failures (we have a partial solution which runs a step monitor every once in a while and restarts steps, but that won't show up in Prefect), also cluster instance failures where the entire cluster needs to be restarted, but the finished steps do not.
j
Since prefect doesn't run on spark, your prefect flow itself would need to run elsewhere (perhaps on ECS or locally, depdending on your infrastructure). You'd then create prefect tasks to do all the spark things you're looking for. • If you only want to submit a spark job, you might have a task that creates the cluster, submits the job, then waits for the job to finish. • If you want to do something more complicated you might have a task that creates an EMR spark cluster, then one or more tasks that submit work to it, then a task that shuts down the cluster. To do this nicely, you might look into creating a
resource_manager
task (see https://docs.prefect.io/core/idioms/resource-manager.html) for creating/destroying an emr cluster and use that within your flow. I'm not familiar enough with EMR to provide more information than that, but hopefully that's enough to get you started.
s
Yeah, I searched through the channel and looked into the resource manger class, that sounds like a good solution for cluster startup / shutdown, but I'm not sure how it would handle a cluster failure. I am assuming that the remaining tasks would be failed? Can I then resume the flow from a specific task and have the resource manager start automatically as well?
j
I am assuming that the remaining tasks would be failed?
This is correct, unless you explicitly configure your tasks to still run on upstream task failure.
Can I then resume the flow from a specific task and have the resource manager start automatically as well?
Currently this won't work nicely, as the cluster won't be restarted. However, your question prompted me to fix this :), see https://github.com/PrefectHQ/prefect/pull/3689. I'd expect this to be fixed and out in the next release (thanks for the reminder!).
s
Thanks, that clears up a lot - maybe failure handling would a good thing to add to the resource manager documentation?