I’m interested to hear how you’ve handled scaling up Prefect and dealing with a large code-base.
We have around 40 data pipelines to manage, about 15 of them using Prefect. We’re building new pipelines all the time, and old pipelines are being migrated too.
Currently these pipelines run the ELT process end to end and we have helper libraries that are reused across pipelines.
We’re considering how best to architect these pipelines to encourage code-reuse and effectively maintain them. We’ll have 3 teams maintaining them so code ownership is a consideration too.
One idea is to split up the pipelines so they don’t run the full process end to end. We could use an event-driven architecture, where smaller flows are triggered by an external event handler. That gives us more choices for team ownership and could make it a little easier to replace or add steps in the ELT process.
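To make the event-driven idea concrete, here is a minimal sketch in plain Python. It does not use any Prefect API: the event names, the `on_event`/`emit` helpers and the three step functions are all hypothetical, and in practice each step would be its own flow started by whatever event handler is in place.

```python
from typing import Callable, Dict, List

# Hypothetical in-process registry: event name -> small flows subscribed to it.
_SUBSCRIBERS: Dict[str, List[Callable[[dict], None]]] = {}
RUN_LOG: List[str] = []    # names of flows, in the order they ran
PUBLISHED: List[str] = []  # tables made available to consumers


def on_event(event_name: str):
    """Decorator that subscribes a small flow to an event name."""
    def register(fn: Callable[[dict], None]) -> Callable[[dict], None]:
        _SUBSCRIBERS.setdefault(event_name, []).append(fn)
        return fn
    return register


def emit(event_name: str, payload: dict) -> None:
    """Stand-in for the external event handler: run every subscribed flow."""
    for flow in _SUBSCRIBERS.get(event_name, []):
        RUN_LOG.append(flow.__name__)
        flow(payload)


@on_event("file_landed")
def extract_and_load(payload: dict) -> None:
    # Could be owned by one team; knows nothing about later steps.
    emit("raw_loaded", {"table": "raw_" + payload["dataset"]})


@on_event("raw_loaded")
def transform(payload: dict) -> None:
    # Could be owned by another team; swapping this step only needs
    # a different subscriber for "raw_loaded".
    clean = payload["table"].replace("raw_", "clean_", 1)
    emit("table_transformed", {"table": clean})


@on_event("table_transformed")
def publish(payload: dict) -> None:
    PUBLISHED.append(payload["table"])


if __name__ == "__main__":
    emit("file_landed", {"dataset": "orders"})
    print(RUN_LOG, PUBLISHED)
```

Each step depends only on the event it listens for, so adding or replacing a step means registering a different subscriber rather than editing an end-to-end pipeline.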
The alternative is to keep doing what we’re doing and make the most of shared libraries to encourage code reuse.
In either case, we’ll do more to use configuration so similar datasets are processed using the same data pipeline.
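As a rough illustration of the configuration idea, the sketch below runs two similar datasets through one pipeline function. Everything in it is an assumption made for the example (the `DatasetConfig` fields, the table names, the in-memory rows), and it is plain Python rather than Prefect code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DatasetConfig:
    """Hypothetical per-dataset settings; only the config differs per dataset."""
    name: str
    target_table: str
    required_columns: Tuple[str, ...]
    rename: Dict[str, str] = field(default_factory=dict)


def validate(config: DatasetConfig, rows: List[dict]) -> None:
    """Fail early if any row lacks a column the dataset requires."""
    for i, row in enumerate(rows):
        missing = [c for c in config.required_columns if c not in row]
        if missing:
            raise ValueError(f"{config.name}: row {i} is missing {missing}")


def transform(config: DatasetConfig, rows: List[dict]) -> List[dict]:
    """Rename columns according to the config; other columns pass through."""
    return [{config.rename.get(k, k): v for k, v in row.items()} for row in rows]


def run_pipeline(config: DatasetConfig, rows: List[dict]) -> dict:
    """One shared pipeline; the load step is reduced to returning the result."""
    validate(config, rows)
    return {"table": config.target_table, "rows": transform(config, rows)}


CONFIGS = {
    "orders": DatasetConfig(
        name="orders",
        target_table="analytics.orders",
        required_columns=("id", "amt"),
        rename={"id": "order_id", "amt": "amount"},
    ),
    "customers": DatasetConfig(
        name="customers",
        target_table="analytics.customers",
        required_columns=("id",),
        rename={"id": "customer_id"},
    ),
}

if __name__ == "__main__":
    print(run_pipeline(CONFIGS["orders"], [{"id": 1, "amt": 9.5}]))
    print(run_pipeline(CONFIGS["customers"], [{"id": 7, "city": "Leeds"}]))
```

Onboarding another similar dataset would then mean adding an entry to `CONFIGS` rather than writing a new pipeline.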
I’d be keen to hear your approach to handling large code-bases.
05/13/2022, 10:50 AM
About scale: if you use Prefect Cloud, you don't need to worry about scaling the orchestration API itself. You only need to ensure your execution layer scales, which is relatively straightforward with, e.g., a horizontally scaled Kubernetes cluster or even a serverless/autopilot Kubernetes cluster. For Prefect Server, check this topic and the related topic linked there.
For repository structure, I understand that it's an important and not easy decision, but you need to consider (based on your use case and team needs):
• whether a monorepo or one repo per project makes more sense
• what the code dependencies of specific flows/projects are - those might be easier to manage in a single repo, e.g. to reuse shared utility modules and shared Docker images
For some repository examples and packaging dependencies, check this one
Also, this discussion may help
And re event-driven workflows: that's totally supported, and this page dives deeper into it
05/13/2022, 12:53 PM
Thanks very much @Anna Geller. Will check out those links.