Requesting input as we think through a new feature!
What sorts of stats or reliability benchmarks do you track for stakeholders on a regular basis? What sorts of SLA promises do you provide to clients/stakeholders? What would help you better communicate system performance/uptime/incident responsiveness? Could be things you currently track, would like to track, or have to look up, etc.
We're looking into building an incident management feature for better documenting and collaborating on issues, as well as sharing updates on those issues with stakeholders. This could also help us more easily show that a failed run was later recovered (for example, by a fresh manual re-run). We're hoping to build this in a way that feeds into these sorts of service reliability stats, so it's easier for folks to share them.