Hi, I would like to fish for a conceptual clarification and best practices around CI/CD in ML. It seems to me that there is a functional overlap between GitLab CI/CD and Prefect, and I have to design some sort of Continuous Integration and Continuous Delivery for machine learning, which I could put into Prefect dataflows.
• As I understand it, data is better passed between Prefect tasks.
◦ This would make Prefect a better candidate for running data and model validation tests.
• GitLab CI/CD is designed to test code.
◦ I am not sure if I should use it to run data and model validation tests.
◦ I think it has its place in integrating and delivering Prefect code.
I am slightly confused about whether
1. GitLab CI/CD would end up testing the same things as Prefect would at some point
2. I can do without GitLab CI/CD
It is not clear to me when to use one or the other.
04/29/2022, 11:37 AM
Good question! In fact, there are some customers using Prefect to build CI/CD pipelines. If you think about it, CI/CD is also a workflow (even a DAG!). So building a CI DAG to deploy a Prefect flow (which, imagine, triggers a dbt DAG) is super meta: you build a DAG to deploy a DAG running another DAG! 😁
I would approach it this way:
• you can build your actual workflow including your business logic and your data + model validation tests using Prefect,
• the part that packages your code dependencies into a Python package and/or a Docker image can be performed from your CI.
If you need some examples of that, check this Discourse tag. This part of the dbt blog post also shows how you could build a super simple one with CircleCI.
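To make the first bullet a bit more concrete, here is a rough sketch of what data and model validation steps inside the workflow could look like. It is plain Python, so it runs without Prefect installed; in a real flow each function would presumably be wrapped as a Prefect task. All names here (`validate_data`, `validate_model`, `run_pipeline`, the 0.8 accuracy threshold) are made up for illustration and are not part of any Prefect API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[float]]


@dataclass
class ValidationResult:
    passed: bool
    reason: str


def validate_data(rows: List[Row], required: List[str]) -> ValidationResult:
    """Data validation: every row must carry a non-null value for each required column."""
    if not rows:
        return ValidationResult(False, "no rows")
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) is None:
                return ValidationResult(False, f"row {i} is missing '{col}'")
    return ValidationResult(True, "ok")


def validate_model(accuracy: float, min_accuracy: float = 0.8) -> ValidationResult:
    """Model validation: reject a model whose accuracy is below a (hypothetical) threshold."""
    if accuracy < min_accuracy:
        return ValidationResult(False, f"accuracy {accuracy:.2f} < {min_accuracy:.2f}")
    return ValidationResult(True, "ok")


def run_pipeline(
    rows: List[Row],
    required: List[str],
    train_fn: Callable[[List[Row]], float],
) -> str:
    """Validate the data, train, then validate the model; stop at the first failed check."""
    data_check = validate_data(rows, required)
    if not data_check.passed:
        return f"stopped before training: {data_check.reason}"
    accuracy = train_fn(rows)  # train_fn stands in for real training and returns an accuracy
    model_check = validate_model(accuracy)
    if not model_check.passed:
        return f"model rejected: {model_check.reason}"
    return "model accepted"
```

The point of the sketch is that these checks depend on the data that arrives at run time, which is why they sit naturally inside the workflow rather than in the CI job.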
04/29/2022, 11:52 AM
I completely agree with Anna here. Typically, you would use GitLab CI/CD to run unit/integration tests, package your code (the CI part), and deploy your flow to a staging/prod environment (the CD part). You would then use Prefect for your data flow (incl. model building and data/model validation tests).
But, there is no right or wrong here. Some people also use GitLab CI/CD to trigger an ML training pipeline (which is also a data workflow).
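For contrast, the kind of test that fits the GitLab CI/CD side checks code against fixed inputs and needs no production data. A minimal sketch, assuming a hypothetical helper `scale_to_unit_range` from your flow's code base and pytest-style test functions that a CI job could run:

```python
from typing import List


def scale_to_unit_range(values: List[float]) -> List[float]:
    """Min-max scale values into [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Unit tests: deterministic inputs, no external data, so they can run on every commit.
def test_scales_to_bounds() -> None:
    assert scale_to_unit_range([2, 4, 6]) == [0.0, 0.5, 1.0]


def test_constant_input() -> None:
    assert scale_to_unit_range([3, 3, 3]) == [0.0, 0.0, 0.0]
```

These tests answer "is the code correct?", while the validation steps in the flow answer "is today's data and model good enough?", so the two do not really test the same thing.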