Prefect Community

Hi :wave:

<https://towardsdatascience.com/apache-airflow-in-2022-10-rules-to-make-it-work-b5ed130a51ad>
Reading through this article about Airflow pitfalls, I’ve found this:

*“Airflow is an orchestration framework, not an execution framework”*

I’m wondering if this applies to Prefect’s philosophy as well, meaning you should not do the actual computation inside Prefect (be it some Pandas code with data frames, some CSV processing, or any other computation). Instead, you should trigger some other compute service (like an ECS job, a Render job, a Snowflake query…).

Thanks!

If so, I am a great sinner.

It really depends on scale. If the data you're orchestrating is complex but never > 10GB, why bother getting overly sophisticated with a separate Lambda or ECS task? If it's in your database already then yeah, use SQL.

<@U04F5T1M6NQ> Yes, makes sense. I’m just wondering if even working with something like a 5GB data set could somehow make Prefect buggy or interfere with the orchestration mechanics, like what sometimes happens in Airflow?

If you're using dataframes you should be okay. When I do processing inside of Prefect, I usually don't return a large object from a task or pass a large object to one. This specifically applies to any JSON serializable object like a list of strings, since those will get hung up as Prefect tries to make a record of them.
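
A minimal sketch of that habit, in plain Python rather than real Prefect tasks: the step that produces the data writes it to disk and returns only the path, so anything that records task inputs and outputs sees a short string instead of the data itself. The function names and file layout here are hypothetical, purely for illustration.

```python
import csv
import tempfile
from pathlib import Path


def extract_rows(out_dir: Path, n_rows: int) -> Path:
    """Write the (potentially large) data to disk and return only its path."""
    out_path = out_dir / "extract.csv"
    with out_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "value"])
        for i in range(n_rows):
            writer.writerow([i, i * 2])
    return out_path


def summarize(csv_path: Path) -> int:
    """Stream the file row by row and return a small result (sum of 'value')."""
    total = 0
    with csv_path.open(newline="") as f:
        for row in csv.DictReader(f):
            total += int(row["value"])
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = extract_rows(Path(tmp), n_rows=1000)  # only a path crosses the boundary
        print(summarize(path))
```

Only the path and the final integer cross the step boundary, so the size of what gets passed around stays constant as the data set grows.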

Ok, yeah, this is what I was referring to. So maybe it’s a good idea to forbid this from the get-go, to prevent hang-ups like this from happening. Meaning, the same code that worked 3 months ago can suddenly break because the data set is now larger.

Yeah, if you expect the dataset to continually grow or at least have a lot of size variation, external execution is a good idea. There are still cases where execution within Prefect is totally fine.

I would highly recommend separating orchestration and execution. We containerize all our execution and Prefect merely kicks them off. It follows the single responsibility principle.

Even if you don’t use an orchestrator and write your own code to do this, you would quickly wrap that into a library and import “execution” libraries.

<@U0538R6R0JY> Ok, so just to illustrate, let’s say you have containerized two parts of a flow, so each part runs inside its own compute, and Prefect just executes those containers as two separate Tasks.
The first Task downloads a big CSV and counts the rows in it.
The second Task needs that row count to do its job.
But now that those two Tasks run in different contexts/compute, how would you pass that row count between them?

Persist the CSV file somewhere and make the second task count it. If you pass that result, then you are technically breaking idempotency.
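
A rough sketch of that suggestion, with plain functions standing in for the two containers and a local file standing in for shared storage (all names are hypothetical): the first step only persists the CSV, and the second step derives the row count from the file itself, so it can be rerun on its own.

```python
import csv
import tempfile
from pathlib import Path


def download_csv(dest: Path) -> Path:
    """Stand-in for the first container: persist the CSV at a known location."""
    with dest.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "name"])
        writer.writerows([[1, "a"], [2, "b"], [3, "c"]])
    return dest


def count_rows(csv_path: Path) -> int:
    """Count data rows (header excluded) by streaming the persisted file."""
    with csv_path.open(newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


def second_step(csv_path: Path) -> str:
    """Stand-in for the second container: recompute the count from the file."""
    return f"processing {count_rows(csv_path)} rows"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = download_csv(Path(tmp) / "big.csv")
        print(second_step(path))
```

The trade-off is that the file is read twice, in exchange for the second step depending only on the persisted CSV and not on a value handed over by a previous run.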

But if there are scenarios where this approach is valid you can persist the result of that task and reference it within the flow. You could also leverage artifacts but that is quite a new concept so I’m not too familiar with it.
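
For the case where passing the result is acceptable, a minimal sketch could look like the following. It assumes a small JSON file in shared storage as the hand-off and uses hypothetical helper names; it is not Prefect's own result persistence or its artifacts feature.

```python
import json
import tempfile
from pathlib import Path


def save_result(result_path: Path, row_count: int) -> None:
    """First step: persist the small result where the next step can find it."""
    result_path.write_text(json.dumps({"row_count": row_count}))


def load_result(result_path: Path) -> int:
    """Second step: read the persisted result by its known location."""
    return json.loads(result_path.read_text())["row_count"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        result_file = Path(tmp) / "row_count.json"
        save_result(result_file, row_count=12345)
        print(load_result(result_file))
```

The flow then only needs to pass the location of the result between the two steps, not the result itself.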