# ask-community
h
Hi -- is anyone using Prometheus to monitor your flows? We're trying to use Pushgateway, but we're having issues with concurrent flow runs overwriting each other's metrics. After a bit more reading, Pushgateway doesn't seem to handle distributed workloads well. The prom-aggregation-gateway also gave us strange behavior: metrics were never going away, so they kept being re-scraped (probably a config issue on our end). Also wondering whether setting the instance label to something identifying the flow run would be a best practice for Pushgateway. Will try that out, but figured I'd ask here first, as I suspect we're not the first trying to do this. :)
a
Interested! Quick question: what metrics are you trying to measure in general?
h
Yeah, I should have given more examples. A key metric for this data processing pipeline is how long it takes us to process a batch. (We use a Summary for this.) We also track error records (a Counter) and a general ingested-records Counter.
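For concreteness, here's roughly how those three metrics might look with the official `prometheus_client` library -- metric names and values here are made up, and the per-flow-run `CollectorRegistry` is just one way to set it up:

```python
# Sketch of the three metrics described above, using prometheus_client.
# Metric names are hypothetical.
from prometheus_client import CollectorRegistry, Counter, Summary

registry = CollectorRegistry()  # a fresh registry per flow run keeps pushes isolated

batch_duration = Summary(
    "batch_processing_seconds",
    "Time taken to process one batch",
    registry=registry,
)
error_records = Counter(
    "error_records", "Records that failed processing", registry=registry
)
ingested_records = Counter(
    "ingested_records", "Records ingested", registry=registry
)

# During processing (illustrative values):
batch_duration.observe(12.5)  # seconds for one batch
error_records.inc(3)
ingested_records.inc(1000)
```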
I'm clearly not a Prometheus expert, but my understanding is that when we push up new metrics with the same labels, we overwrite the ones already there. So the counters from different concurrent jobs (flow runs) step on each other.
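That matches Pushgateway's documented behavior: it stores metrics per group (job plus grouping key), and each push replaces everything in that group. A toy model of the semantics (pure Python, not the real implementation) shows why a per-run key helps:

```python
# Toy model of Pushgateway storage semantics (NOT the real implementation):
# metrics live in groups keyed by (job, grouping_key); a push REPLACES the group.
store = {}

def push(job, grouping_key, metrics):
    store[(job, tuple(sorted(grouping_key.items())))] = dict(metrics)

# Two concurrent flow runs pushing under the same job with no distinguishing
# key: the second push overwrites the first.
push("pipeline", {}, {"ingested_records_total": 1000})
push("pipeline", {}, {"ingested_records_total": 250})
assert len(store) == 1  # only the last push survives

# Adding a per-run grouping key (e.g. the flow run ID) keeps them separate.
push("pipeline", {"flow_run_id": "run-a"}, {"ingested_records_total": 1000})
push("pipeline", {"flow_run_id": "run-b"}, {"ingested_records_total": 250})
assert len(store) == 3  # run-a and run-b now coexist
```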
a
Is a batch 1:1 with a flow run here? So like aggregate metrics on flow run duration and success/failure etc?
h
Well, not 1:1, but a flow run will process many batches.
We basically parallelize by splitting large input files and executing tasks for each file.
a
Oh I see. Is each batch in that flow wrapped in a sub flow or a task or something? (That might make it easier for some things I have in mind)
(I don’t have a solution yet but you’re helping me so thanks!)
h
Well our first stab had each batch wrapped in a task, but currently we actually have tasks sequentially iterating over many batches (we keep trying different approaches to see what scales best)
We have so many files to process and each file is many batches, so it felt like it was a little simpler and ultimately sufficiently parallelizable to just give each task a large file and let it split it up and iterate over the batches. (Always open to suggestions on what others have done that works better, though.)
The data itself is very large so shipping that between tasks was not practical. We were passing references to split batch files to the tasks, but this introduced some cleanup challenges. Anyway, that's just background.
a
Thanks! I’ll give it a think. Don’t have something immediate yet.
h
Sure, appreciate any ideas! I feel like choosing a good "instance" value might be the key we were missing, but I wasn't sure whether it can be high-cardinality (e.g. a flow run ID) or whether that will make the metrics balloon over time. Anyway, appreciate any thoughts. Thanks!
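One pattern worth trying: pass the flow run ID as a Pushgateway grouping key rather than as a regular label, so concurrent runs push to separate groups, and delete the group once Prometheus has scraped the run's final values so per-run series don't pile up forever. A sketch with `prometheus_client` (the `flow_run_id` key name and gateway address are assumptions; the custom handler is only there to show what would be sent without needing a live gateway -- drop `handler=` to actually push):

```python
# Sketch, assuming prometheus_client and a Pushgateway at localhost:9091.
# "flow_run_id" is an illustrative grouping-key name, not a built-in.
from prometheus_client import CollectorRegistry, Counter, push_to_gateway

GATEWAY = "localhost:9091"

registry = CollectorRegistry()
ingested = Counter("ingested_records", "Records ingested", registry=registry)
ingested.inc(500)

# Capture the request instead of sending it, so this runs without a gateway.
captured = {}

def capture_handler(url, method, timeout, headers, data):
    def handle():
        captured.update(url=url, method=method, data=data)
    return handle

# The grouping key becomes part of the push URL, so each flow run gets its
# own metric group on the gateway instead of overwriting a shared one.
push_to_gateway(
    GATEWAY,
    job="pipeline",
    registry=registry,
    grouping_key={"flow_run_id": "run-123"},
    handler=capture_handler,
)

# When a run finishes and its final values have been scraped,
# prometheus_client's delete_from_gateway(GATEWAY, job="pipeline",
# grouping_key={"flow_run_id": ...}) removes the group, which keeps the
# high-cardinality run IDs from accumulating on the gateway.
```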