Thread
#prefect-community

    agonen

    2 years ago
    Hi folks, I would first like to say that I'm very impressed with the project and the extensive documentation and information on the channel. Good work. I have a few questions that I hope will clarify a few things for me:
    1. Where does Prefect Core keep the state of long-running tasks, and what happens if the instance running Prefect Core fails?
    2. If I have a long-running task, for example a boto call that starts an EMR cluster and then submits a step to it (say, do some ETL and write to S3), and I then want to add another step based on the outcome of the first step once the cluster is ready, does that mean I need to write a downstream task that keeps making API calls to monitor the step result? Or should I run Prefect Core inside the PySpark job?
    3. Does Prefect have something similar to an Airflow sensor, like GoogleCloudStoragePrefixSensor?
    Chris White

    2 years ago
    Hi @agonen!
    1. Prefect Core is explicitly stateless. Prefect Cloud (both the free and paid versions) is our recommended stateful backend for highly available / persistent deployments of Prefect Core.
    2. I'm sorry, I don't think I'm following here; as long as you structure your Flow / DAG so that no downstream task begins before its upstream dependencies finish, there are many options for what your individual tasks do and how you design them.
    3. At this exact moment, no; we are working on a similar "listener" concept in Prefect (https://docs.prefect.io/core/PINs/PIN-08-Listener-Flows.html), but until then we typically recommend Cloud's GraphQL API as a way of orchestrating event-driven flows; see https://medium.com/the-prefect-blog/event-driven-workflows-with-aws-lambda-2ef9d8cc8f1a for an example.

    agonen

    2 years ago
    Thanks, @Chris White. Let me try to clarify point 2 a little. I've been given a task to build a dataflow that starts with data extraction from GCP BigQuery to GCS, then trains a model on the data with Spark on GCP (a.k.a. Dataproc), and so on. My concern is that the different parts of the code run on different machines. So the only option to orchestrate this is using Prefect Cloud, unless I have a task that monitors the progress of the upstream task.
    Did I get that right?
    Chris White

    2 years ago
    Hmmm I think I see what you mean - generally I’d recommend waiting for job completion in the same task that submits the work, if you rely on completion of the remote work before proceeding with your tasks.
    I don’t think this situation is unique to Prefect Core or Prefect at all in fact
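    The "wait for completion in the same task that submits the work" pattern can be sketched as a small polling helper. This is an illustrative sketch, not Prefect API: the `wait_for_completion` helper, the state names, and the EMR usage in the comments are all assumptions for the example.

```python
import time

def wait_for_completion(get_status, success_states=("COMPLETED",),
                        failure_states=("FAILED", "CANCELLED"),
                        poll_interval=30, timeout=3600, sleep=time.sleep):
    """Poll get_status() until it reaches a terminal state or the timeout expires."""
    waited = 0
    while waited <= timeout:
        state = get_status()
        if state in success_states:
            return state
        if state in failure_states:
            raise RuntimeError(f"remote job ended in state {state!r}")
        sleep(poll_interval)
        waited += poll_interval
    raise TimeoutError("remote job did not finish within the timeout")

# Inside a Prefect task, this helper would wrap a boto3 status call,
# roughly like the following (cluster_id / step_id are hypothetical):
#
#     @task
#     def wait_for_emr_step(cluster_id, step_id):
#         emr = boto3.client("emr")
#         return wait_for_completion(
#             lambda: emr.describe_step(
#                 ClusterId=cluster_id, StepId=step_id
#             )["Step"]["Status"]["State"])
```

    Because the submit-and-wait logic lives in one task, any downstream task that depends on it will only start after the remote work has actually finished.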