Nova Westlake

07/29/2022, 8:51 PM
Hey all, I'm hoping I can get a little guidance on Prefect and whether it would be the right tool for what my team needs to do. We are designing a data processing workflow for the cloud. It currently exists on-prem, and it works with a lot of Python libraries and tools on Windows. I was planning on building each step of the process into a Docker container, then using AWS ECS (Elastic Container Service, an orchestration service like Kubernetes) to manage those containers and the dataflow, and using a custom-built web app, message queue, etc. to manage the workflow. Someone on my team is recommending we use Prefect instead tho, and I'm trying to figure out if Prefect is a more appropriate solution for our problem. I think what I'm most stuck on is: is Prefect made to manage multiple separate tools that are built as Docker containers? I want to sandbox these data processing steps from each other in terms of their dependencies, library versions, etc. But it feels like Prefect is more intended to be a monolithic app? How would you normally manage running tools that need to use Python libraries with different versions in Prefect? Sorry if this is a bit newbish in terms of a question. My background is as a fullstack webapp developer, and not with Python, so figuring out how to properly architect a complex data processing flow with Python is a stretch for me!


07/30/2022, 5:49 AM
Well first of all, what you need is exactly a workflow orchestration engine like Prefect! This will handle scheduling, task/step retries, observability, flow visualization, etc. Do note however that there are several other alternatives on the market as well, most notably Airflow and Argo Workflows. The former is one of the first of its kind, so it is established, but in my opinion it is also becoming legacy (it was built in the era of long-running Hadoop jobs and it seriously lacks flexibility compared to the others). Argo Workflows, on the other hand, is a newer-generation, k8s-native workflow engine and does exactly what you want to do: run each task as a separate container. We use it at our company and even though I really like it, I think there are two major drawbacks: 1. Because it is k8s native, workflows are defined using YAML manifests, so maintaining and deploying these requires a certain skillset that not a lot of devs have. There is a Python API too, but imo that doesn't change things too much if you have a large number of workflows to maintain. 2. Each task/step runs as a separate pod. Considering pod startup time (pulling a container image from the registry and starting it), you are forced to bundle tasks together into bigger chunks to avoid too much overhead. But you can only benefit from e.g. retries at the pod level. So it's a constant balancing act… Prefect imo solves both of these problems with their excellent flow-of-flows pattern and the fact that it's Python native.
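To make the balancing act in drawback 2 a bit more concrete, here is a rough back-of-the-envelope sketch in plain Python. It is not Argo or Prefect code; the function name `bundling_tradeoff` and all the numbers (60 tasks, 30 s pod startup, 10 s per task) are made up purely for illustration, and it assumes every task takes the same time and a retry reruns the whole pod.

```python
import math


def bundling_tradeoff(n_tasks, tasks_per_pod, pod_startup_s, task_runtime_s):
    """Return (total startup overhead, cost of retrying one pod), both in seconds.

    Hypothetical model: every pod pays the same startup cost, every task
    takes the same time, and a retry reruns the entire pod.
    """
    n_pods = math.ceil(n_tasks / tasks_per_pod)
    startup_overhead_s = n_pods * pod_startup_s
    # Retries happen at pod level, so one failing task reruns the whole bundle.
    retry_cost_s = pod_startup_s + tasks_per_pod * task_runtime_s
    return startup_overhead_s, retry_cost_s


# Assumed numbers, for illustration only.
for tasks_per_pod in (1, 6, 60):
    overhead, retry = bundling_tradeoff(60, tasks_per_pod, 30, 10)
    print(f"{tasks_per_pod:>2} tasks/pod: startup overhead {overhead}s, one retry costs {retry}s")
```

With these assumed numbers, one task per pod gives cheap retries but a lot of startup overhead, while bundling everything into one pod flips that around. Where the sweet spot lies depends entirely on your real startup and task times.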