Hi everyone. What is the recommended pattern to implement a groups of tasks that must be retried together"?
Let's say I have a task that checks the status of a cluster (for example EMR) and starts it if the cluster is not running. There are some downstream tasks that run on this cluster.
Is there a pattern that would ensure that before a given downstream task runs (including the case when the task needs to be retried), it will first run the task that checks/starts the cluster?
05/16/2020, 7:25 PM
Hi Pedro - from your description it sounds like you want to set up a “state dependency” between two tasks (where one task only runs after the first task has confirmed it ran without error). Check out this github issue for some more info on how to configure this: https://github.com/PrefectHQ/prefect/issues/2528
05/16/2020, 10:51 PM
Thanks, Chris. My question goes beyond a simple dependency.
I currently have an airflow DAG that 1) downloads some files, 2) starts an EMR cluster and 3) runs a pyspark script on each of the files (in parallel). See image below.
The pyspark script can take a while and sometimes the cluster dies or becomes unavailable before all the files are processed.
In this case the pyspark tasks are retied but there is no cluster to run them so they eventually fail.
One option is to make sure that each pyspark task checks for the cluster and starts one if it's not available before it runs.
I was wondering if there was a Prefect pattern that would help implement this approach with two separate tasks or if I'd have to combine the cluster check/start task with the pyspark task of each file.
05/16/2020, 10:55 PM
Ohhhh I see, that’s very interesting! There is not currently a first-class pattern for that other than combining the two operations into a single Prefect task as you mention, but it’s an interesting idea. The biggest issue I see is that it would break the DAG model which could result in some bizarre edge cases, but that shouldn’t stop us from thinking more about this — there are a few possibilities I could imagine but they’ll require some work
05/16/2020, 11:06 PM
Got it. I was wondering if it would be possible to say ... before a given task is retried, rerun its direct parents (or ideally, specify which upstream task to rerun).
Is it possible for a task to change the state of an upstream task so that it is rerun?
05/16/2020, 11:09 PM
If you run against a prefect backend you can hit the GraphQL API to achieve that, but it introduces a lot of complexity once you break the DAG model in that way