# ask-community
Hey! I have a pipeline that monitors different things set up by our customers (can be websites, can be dbs, ...). All of them keep a cursor to the last item, and when they detect something new they trigger an action downstream, most of the time resulting in some form of data being gathered, processed, and stored. The monitoring is a poll, so we can use the same approach across customers.

1. At the moment the monitoring is completely decoupled. We have the configs stored in a db and a job that checks whether a monitor has to be run, which then triggers the right monitor. I would love to hear whether you have seen or know a better approach for this.
2. The gathering is often quite complex. When customers are not able to expose an API, we often have to log into their webapp and gather the data from there, so the gathering code is quite heavy. I would like to keep it decoupled from the actual DAG, since it is also written in JavaScript. Is there a common way of integrating this? And how would you lay out the monorepo? I am playing around with different structures at the moment but have not found one that makes sense to me. If you have seen any general posts or open source repos that are good examples for project layouts, I would love to see them.
3. After the data is stored, completely different DAGs are triggered to run some post-processing and prepare the data for analytics. How would you model the dependencies? The data gathering runs on a more frequent schedule, whereas the analytics are only needed at most once a day, or once a week, depending on the customer (which is why we decoupled them). If the analytics run once per day, we would like to wait for the previous (or nearest) collection DAG to finish before starting the analytics.
4. We are also playing with the idea of having a few workloads run end to end. So basically, instead of having one DAG for gathering, one for processing, and one for analytics prep, we would have one running end to end and determine, based on the customer id, which path it takes. If anyone has best practices or ideas on this, I would really appreciate any hints.

I numbered the points for ease. Sorry for the long message, I just tried to give as much context as possible.
hi @Nicolay! these docs may be useful for you
> examples for project layouts
I'd point you here for an opinionated layout of mine, but really anywhere you can write python should be fine, since for the python deployment interface it's normally just
`python file_containing_flow_to_deploy.py`
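for illustration, a deploy script like that might look roughly like the following sketch (the repo URL, entrypoint, work pool name, and cron schedule are made-up placeholders):

```python
# file_containing_flow_to_deploy.py  (run with: python file_containing_flow_to_deploy.py)
from prefect import flow

if __name__ == "__main__":
    # pull the flow code from source and register a deployment on a work pool;
    # every name / path / schedule below is a hypothetical placeholder
    flow.from_source(
        source="https://github.com/your-org/your-monorepo",
        entrypoint="pipelines/gathering/flow.py:gather_customer_data",
    ).deploy(
        name="gather-customer-data",
        work_pool_name="default-pool",
        cron="*/15 * * * *",  # poll-style schedule; adjust per customer
    )
```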
> At the moment the monitoring is completely decoupled. We have the configs stored in a db and a job that checks whether a monitor has to be run, which then triggers the right monitor. I would love to hear whether you have seen or know a better approach for this.
I'm not sure I could make a concrete recommendation based on only this information, but I will say that if you have the ability to curl a webhook from inside your existing app, then you can trigger arbitrary downstream work in python. Here's an example that may be close to what you're looking for, where I simulate some code elsewhere that gathers and writes some result to s3, then emits an event to trigger downstream stuff, passing the customer id and whatever else.
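roughly a sketch like this (the event name, s3 path, and deployment name are all made up for illustration):

```python
from prefect import flow
from prefect.events import DeploymentEventTrigger, emit_event


def gather_and_store(customer_id: str) -> None:
    """Stand-in for the code that actually collects data elsewhere (e.g. your JS scrapers)."""
    s3_key = f"s3://my-bucket/{customer_id}/latest.json"  # pretend we wrote results here
    # announce that new data landed; downstream deployments can react to this event
    emit_event(
        event="customer.data.gathered",  # arbitrary event name, pick your own
        resource={"prefect.resource.id": f"customer.{customer_id}"},
        payload={"customer_id": customer_id, "s3_key": s3_key},
    )


@flow(log_prints=True)
def post_process(customer_id: str, s3_key: str):
    print(f"post-processing {s3_key} for customer {customer_id}")


if __name__ == "__main__":
    # serve the downstream flow with a trigger so each "customer.data.gathered"
    # event kicks off a run, with parameters templated from the event payload
    post_process.serve(
        name="post-process-on-new-data",
        triggers=[
            DeploymentEventTrigger(
                expect={"customer.data.gathered"},
                parameters={
                    "customer_id": "{{ event.payload.customer_id }}",
                    "s3_key": "{{ event.payload.s3_key }}",
                },
            )
        ],
    )
```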
let me know if this is directionally helpful!
Hey Nate! Thanks for the input. Yeah, I've thought a lot about the issue. The fact that we have an event-driven workload for parts of the pipeline makes a DAG not the most useful abstraction, tbh. Might have to break out of them.
makes sense. a lot of folks start using us incrementally before they fully move off a tool like Airflow; they can start by having their DAGs send events to trigger work in prefect. just to be extra clear, prefect doesn't require any definition of a DAG: the MVP for a prefect deployment is a python function with a decorator on top
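for example, something this small already works as a deployment (names here are just placeholders):

```python
from prefect import flow


@flow(log_prints=True)
def my_pipeline(customer_id: str = "demo"):
    print(f"doing the thing for {customer_id}")


if __name__ == "__main__":
    # creates a deployment and starts a long-running process that
    # listens for scheduled / triggered runs of it
    my_pipeline.serve(name="my-pipeline-deployment")
```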