# ask-community
h
Hi -- is anyone using Prometheus to monitor your flows? We're trying to use Pushgateway, but we're having issues with concurrent flow runs overwriting each other's metrics. After a bit more reading, Pushgateway doesn't seem to handle distributed workloads well. The prom-aggregation-gateway also gave us strange behavior: metrics were never going away, so they kept being re-scraped (probably a config issue on our end). Also wondering whether setting the instance label to something identifying the flow run would be a best practice for Pushgateway. Will try that out, but figured I'd ask here first, as I suspect we're not the first trying to do this. :)
a
Interested! Quick question: what metrics are you trying to measure in general?
h
Yeah, I should have given more examples. A key metric for this data processing pipeline is how long it takes us to process a batch. (We use a Summary for this.) We also track error records (a Counter) and a general ingested-records Counter.
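For concreteness, here's roughly how those three metrics might look with the official `prometheus_client` library -- metric names and values here are made up, and the per-flow-run `CollectorRegistry` is just one way to set it up:

```python
# Sketch of the three metrics described above, using prometheus_client.
# Metric names are hypothetical.
from prometheus_client import CollectorRegistry, Counter, Summary

registry = CollectorRegistry()  # a fresh registry per flow run keeps pushes isolated

batch_duration = Summary(
    "batch_processing_seconds",
    "Time taken to process one batch",
    registry=registry,
)
error_records = Counter(
    "error_records", "Records that failed processing", registry=registry
)
ingested_records = Counter(
    "ingested_records", "Records ingested", registry=registry
)

# During processing (illustrative values):
batch_duration.observe(12.5)  # seconds for one batch
error_records.inc(3)
ingested_records.inc(1000)
```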
I'm clearly not a Prometheus expert, but my understanding is that when we push up new metrics with the same labels, we overwrite the ones already there. So the counters from different concurrent jobs (flow runs) step on each other.
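That matches Pushgateway's documented behavior: it stores metrics per group (job plus grouping key), and each push replaces everything in that group. A toy model of the semantics (pure Python, not the real implementation) shows why a per-run key helps:

```python
# Toy model of Pushgateway storage semantics (NOT the real implementation):
# metrics live in groups keyed by (job, grouping_key); a push REPLACES the group.
store = {}

def push(job, grouping_key, metrics):
    store[(job, tuple(sorted(grouping_key.items())))] = dict(metrics)

# Two concurrent flow runs pushing under the same job with no distinguishing
# key: the second push overwrites the first.
push("pipeline", {}, {"ingested_records_total": 1000})
push("pipeline", {}, {"ingested_records_total": 250})
assert len(store) == 1  # only the last push survives

# Adding a per-run grouping key (e.g. the flow run ID) keeps them separate.
push("pipeline", {"flow_run_id": "run-a"}, {"ingested_records_total": 1000})
push("pipeline", {"flow_run_id": "run-b"}, {"ingested_records_total": 250})
assert len(store) == 3  # run-a and run-b now coexist
```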
a
Is a batch 1:1 with a flow run here? So like aggregate metrics on flow run duration and success/failure etc?
h
Well, not 1:1, but a flow run will process many batches.
We basically parallelize by splitting large input files and executing tasks for each file.
a
Oh I see. Is each batch in that flow wrapped in a sub flow or a task or something? (That might make it easier for some things I have in mind)
(I don’t have a solution yet but you’re helping me so thanks!)
h
Well our first stab had each batch wrapped in a task, but currently we actually have tasks sequentially iterating over many batches (we keep trying different approaches to see what scales best)
We have so many files to process and each file is many batches, so it felt like it was a little simpler and ultimately sufficiently parallelizable to just give each task a large file and let it split it up and iterate over the batches. (Always open to suggestions on what others have done that works better, though.)
The data itself is very large so shipping that between tasks was not practical. We were passing references to split batch files to the tasks, but this introduced some cleanup challenges. Anyway, that's just background.
a
Thanks! I’ll give it a think. Don’t have something immediate yet.
h
Sure, appreciate any ideas! I feel like choosing a good "instance" value might be the key we were missing, but I wasn't sure whether it can be high-cardinality (e.g. a flow run ID) or whether that will make the metrics balloon over time. Anyway, appreciate any thoughts. Thanks!
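One pattern worth trying: pass the flow run ID as a Pushgateway grouping key rather than as a regular label, so concurrent runs push to separate groups, and delete the group once Prometheus has scraped the run's final values so per-run series don't pile up forever. A sketch with `prometheus_client` (the `flow_run_id` key name and gateway address are assumptions; the custom handler is only there to show what would be sent without needing a live gateway -- drop `handler=` to actually push):

```python
# Sketch, assuming prometheus_client and a Pushgateway at localhost:9091.
# "flow_run_id" is an illustrative grouping-key name, not a built-in.
from prometheus_client import CollectorRegistry, Counter, push_to_gateway

GATEWAY = "localhost:9091"

registry = CollectorRegistry()
ingested = Counter("ingested_records", "Records ingested", registry=registry)
ingested.inc(500)

# Capture the request instead of sending it, so this runs without a gateway.
captured = {}

def capture_handler(url, method, timeout, headers, data):
    def handle():
        captured.update(url=url, method=method, data=data)
    return handle

# The grouping key becomes part of the push URL, so each flow run gets its
# own metric group on the gateway instead of overwriting a shared one.
push_to_gateway(
    GATEWAY,
    job="pipeline",
    registry=registry,
    grouping_key={"flow_run_id": "run-123"},
    handler=capture_handler,
)

# When a run finishes and its final values have been scraped,
# prometheus_client's delete_from_gateway(GATEWAY, job="pipeline",
# grouping_key={"flow_run_id": ...}) removes the group, which keeps the
# high-cardinality run IDs from accumulating on the gateway.
```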