Feliks Krawczyk (08/29/2019, 1:57 AM)
Sorry, this might be a silly question, but I'm still coming from the Airflow world. In your documentation:
from prefect import task, Flow

@task
def create_cluster():
    cluster = create_spark_cluster()
    return cluster

@task
def run_spark_job(cluster):
    submit_job(cluster)

@task
def tear_down_cluster(cluster):
    tear_down(cluster)


with Flow("Spark") as flow:
    # define data dependencies
    cluster = create_cluster()
    submitted = run_spark_job(cluster)
    tear_down_cluster(cluster)

    # wait for the job to finish before tearing down the cluster
    tear_down_cluster.set_upstream(submitted)
Wouldn’t you also need to add submitted.set_upstream(cluster) to ensure this works? Or is there some hidden logic that means, when you define a flow, it runs the lines in order, so submitted runs only after cluster? If that is the case, how do you get tasks to run in parallel?

Chris White (08/29/2019, 1:58 AM)
no worries -> Prefect can actually infer your dependencies based on how you call things in the functional API. So in this case, submitted = run_spark_job(cluster) sets cluster as an upstream dependency of submitted automatically.
it doesn’t run the lines in order, it runs your DAG in topological order, similar to Airflow -> so no task is allowed to start until all of its upstream tasks have finished
check out
flow.visualize()
on the flow above to see what it looks like and how the dependencies have been set
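For instance, here is a minimal Prefect 1.x-style sketch (not taken from this thread; the LocalDaskExecutor import assumes a release newer than this 2019 conversation, and the task names are made up) showing that tasks with no edge between them can run concurrently, while the inferred edges can be printed without rendering a graph:

from prefect import task, Flow
from prefect.executors import LocalDaskExecutor  # assumes a 1.x release with this module path

@task
def extract_a():
    return [1, 2, 3]

@task
def extract_b():
    return [4, 5, 6]

@task
def combine(a, b):
    return a + b

with Flow("parallel-example") as flow:
    a = extract_a()        # no upstream tasks
    b = extract_b()        # no upstream tasks, so independent of extract_a
    total = combine(a, b)  # these calls infer the edges a -> combine and b -> combine

# Print the inferred edges as a text alternative to flow.visualize()
for edge in flow.edges:
    print(edge.upstream_task.name, "->", edge.downstream_task.name)

# With a parallel executor, extract_a and extract_b can run at the same time;
# combine still waits for both of its upstream tasks to finish.
flow.run(executor=LocalDaskExecutor())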

Feliks Krawczyk (08/29/2019, 2:02 AM)
Oh gotcha! Sorry, I totally missed the functional part of run_spark_job(cluster).
Makes sense 👍

Chris White (08/29/2019, 2:02 AM)
yup no worries!

Feliks Krawczyk (08/29/2019, 2:02 AM)
Thank you for clarifying!

Chris White (08/29/2019, 2:02 AM)
anytime!