# prefect-community
Alex Cano
Hi all, not necessarily a question, but hopefully some guidance. I’ve been using Airflow for about 6 months now, and I feel like I’ve “bent” to the framework a ton. For example, manually saving data to intermediate storage, then passing the location through to the next step of the pipeline, since you can’t pass data directly between tasks. I was hoping someone had some best practices/tips on what to try to control in Prefect vs. letting Prefect control it for you. Thanks!
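For anyone unfamiliar with the pattern being described, here is a rough sketch of that Airflow-style workaround (the paths, DataFrame, and task bodies are illustrative, not Alex's actual code):

```python
import pandas as pd

# Airflow can't pass data directly between tasks, so each step
# writes to intermediate storage and pushes only the *location*
# for the next step to pull (via XCom).

def extract(**context):
    df = pd.DataFrame({"x": [1, 2, 3]})
    path = "/tmp/extract.parquet"                    # illustrative path
    df.to_parquet(path)                              # manually save...
    context["ti"].xcom_push(key="path", value=path)  # ...then hand off the location

def transform(**context):
    path = context["ti"].xcom_pull(key="path")       # fetch the location
    df = pd.read_parquet(path)                       # reload the data yourself
    return int(df["x"].sum())
```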
Jeremiah
Hey @Alex Cano, that’s a fantastic question — we should really write something up! I definitely know what you’re describing with regard to Airflow (frankly, it’s why Prefect exists). So our goal is that you shouldn’t “bend” to Prefect at all — quite the opposite, in fact. It should “just work” with your code. So what we usually suggest (without knowing any specifics) is to think of all the logical steps in your workflow and how they relate — those are going to become individual Prefect tasks. Some people like to think of each function in their script; for other people we suggest imagining that you’re going to literally draw what happens in your workflow on a whiteboard. You’ll probably draw a box for each “thing to do” and an arrow connecting the boxes in order. Think of each box as a Prefect task, and the arrows as the Prefect flow. Most of the time, this simple exercise will help you describe your system nicely. It seems a little obvious, but it will help you escape some of the limitations of other systems (for example, no need to worry about temporary data storage, because Prefect will handle that for you).
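A minimal sketch of that whiteboard exercise using Prefect's functional API (the task names and flow here are made up for illustration):

```python
from prefect import task, Flow

@task
def extract():
    # each "box" on the whiteboard becomes a task
    return [1, 2, 3]

@task
def transform(data):
    # data flows directly between tasks; no manual intermediate storage
    return [x * 10 for x in data]

@task
def load(result):
    print(f"got: {result}")

with Flow("whiteboard-example") as flow:
    # the arrows between boxes become the flow's dependency graph
    load(transform(extract()))

flow.run()
```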
Alex Cano
In my PoC, I’ve definitely felt the “just work” mindset in how things work. One of the things I’ve needed a bit more clarity on is where things would start to break down. For example, with the Airflow scenario I gave above, the code would work regardless of whether the file was 1MB or 1TB (exaggeration, but still). Are these extremes handled in Prefect? Or are there limits that I should be cognizant of? Are there configuration options I can review to avoid that problem? Many thanks for the response!
Jeremiah
As a general rule, Prefect will be more performant than Airflow if you simply want to port your code directly. Being more specific depends on your execution engine: if you’re using a `LocalExecutor`, you’re bounded by available RAM; if you’re using a Dask cluster, you’ll be bounded by the amount of data you need to serialize over the wire at once. 1TB will almost certainly need to go in cloud storage 🙂 but 1GB should be totally fine in either case on a reasonably provisioned machine. If you want to ping us (cc @Chris White) privately with a specific concern, we’ll be happy to guide you.
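To make the two bounds concrete, a hedged sketch (the flow is a toy and the scheduler address is a placeholder; import paths follow Prefect Core's executor module):

```python
from prefect import task, Flow
from prefect.engine.executors import LocalExecutor, DaskExecutor

@task
def double(x):
    return x * 2

with Flow("executor-demo") as flow:
    double(21)

# LocalExecutor: everything runs in a single process, so the
# working set is bounded by that machine's available RAM
flow.run(executor=LocalExecutor())

# DaskExecutor: task inputs/outputs are serialized and shipped
# over the network to workers, so the practical bound is how much
# data you ship at once (address below is a placeholder)
flow.run(executor=DaskExecutor(address="tcp://dask-scheduler:8786"))
```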
Chris White
Yea, seconding what @Jeremiah said, I tend to tell people that with Prefect, you are responsible for controlling:
- what your tasks do / the code they run
- where your flows execute (this includes all of the concerns Jeremiah refers to w.r.t. available RAM / the constraints of a Dask cluster / etc.)

and Prefect takes care of the rest.
Alex Cano
That sounds great. The questions primarily come from one overarching question: if I need to scale, which knobs do I need to tweak? Specifically for Dask, it sounds like it’s I/O over the network? Even if that’s the constraint, it doesn’t sound like Prefect would break, just go slow. Am I understanding that correctly? I don’t think I’m anywhere near a use case that needs to scale very big, but I just wanted to make sure I had a decent understanding of what would need to happen, as I’m not an expert on Dask.
Chris White
Yea, that’s essentially correct. Funnily enough, I’m working with an early partner on a very large-scale flow right now. We’re running a Dask cluster in Kubernetes, and Prefect is doing “all the right things”, but we’re finding that tasks can fail if Kubernetes evicts one of the Dask workers --> in our particular case, network I/O isn’t a problem, but available memory is the bottleneck.
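One knob worth knowing about for that memory bottleneck: Dask workers accept an explicit memory limit, so they spill to disk or pause before the process gets killed. A sketch with a local cluster (the values are made-up examples; on Kubernetes the same idea applies to the worker pods):

```python
from dask.distributed import Client, LocalCluster

# cap each worker's memory so Dask starts spilling and pausing
# work before the OS (or the kubelet) kills the process for
# running out of memory; 4GB is an arbitrary example value
cluster = LocalCluster(n_workers=4, memory_limit="4GB")
client = Client(cluster)
```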
Alex Cano
Ah, gotcha. Is Dask doing something in the background that’s loading data into memory and causing k8s to evict the workers? Or is that just how the customer has written the code? I’m curious about the performance hit you’d take by creating a mapping of tasks that operate row by row on generators. Do you know offhand if the generator itself gets pickled and sent over the wire, or if each row has to get loaded into memory and then pickled? Either way, it sounds like a decent strategy to avoid memory issues.
Chris White
That’s just how the customer has written the code --> there’s a particular task that requires much, much more memory than we had anticipated. Unfortunately, generators are not pickleable, and thus can’t be sent over the wire, so I’m 99% sure Dask will load each row into memory first before doing the mapping.
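You can see the pickling limitation directly:

```python
import pickle

gen = (row * 2 for row in range(10))  # a lazy, row-by-row generator

try:
    pickle.dumps(gen)
except TypeError as err:
    # generators hold interpreter state that can't be serialized,
    # so Dask has no way to ship one to a worker
    print(err)  # e.g. "cannot pickle 'generator' object"
```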
Jeremiah
@Alex Cano don’t tell @Chris White, but I have a dream of a fully generator-based execution model, and he will murder me if I publicize it
💯 1
😑 1
Alex Cano
Ah, that’s frustrating that they aren’t pickleable. It seems the workaround is to micro-batch your way through a workflow if you aren’t sure the data will fit into memory. All of this has definitely helped inspire confidence in Prefect as a platform to build on top of! I’m gonna start building away and see what happens. Thanks for the help 🙌🙌🙌
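For reference, that micro-batching idea might look something like this with Prefect's task mapping (the batch size and data are illustrative):

```python
from prefect import task, Flow

@task
def make_batches(batch_size=10):
    # materialize small, pickleable chunks instead of one big
    # object (or an unpicklable generator)
    rows = list(range(100))
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

@task
def process(batch):
    # each mapped task only ever holds one batch in memory
    return sum(batch)

with Flow("micro-batch") as flow:
    # map spawns one `process` task per batch
    results = process.map(make_batches())

flow.run()
```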
Jeremiah
👊 we’re here if you need any help at all
⬆️ 2