Prefect Community

*Requesting input as we think through a new feature!*

What sorts of stats or reliability benchmarks do you track for stakeholders on a regular basis? What sorts of SLA promises do you provide to clients/stakeholders? What would help you better communicate system performance, uptime, and incident responsiveness? These could be things you currently track, would like to track, or have to look up.

We're looking into building an incident-management feature for better documenting and collaborating on issues, as well as sharing updates on those issues with stakeholders. This could also help us more easily show that a failed run was recovered somehow (for example, by a fresh manual re-run). We're hoping to build this in a way that feeds into these sorts of service reliability stats, so it's easier for folks to share them.

We use Datadog to monitor the infrastructure.

As for the data ingestion pipelines, we set SLAs on the data itself. But it would be great to have SLAs at the flow level, especially for flows that run periodically.

Average, P50, and P90 stats at the flow level / deployment level would be great, for both "run" time and "late" time.
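
To make the request concrete, here is a rough sketch of how those numbers could be computed from a list of run records. Everything in it is an assumption for illustration: `FlowRun` and its fields are hypothetical and not Prefect's API, "late" time is taken to mean how long after its expected start a run actually began (clamped at zero for early runs), and percentiles use the simple nearest-rank method.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class FlowRun:
    """Hypothetical run record; these field names are assumptions, not Prefect's API."""
    expected_start: datetime  # when the run was scheduled to begin
    start: datetime           # when it actually began
    end: datetime             # when it finished


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[rank - 1]


def summarize(values):
    """Average, P50, and P90 of a non-empty list of durations in seconds."""
    return {
        "avg": mean(values),
        "p50": percentile(values, 50),
        "p90": percentile(values, 90),
    }


def run_stats(runs):
    """Summarize "run" time and "late" time for one flow or deployment."""
    run_seconds = [(r.end - r.start).total_seconds() for r in runs]
    # Assumption: a run that starts early counts as 0 seconds late.
    late_seconds = [
        max(0.0, (r.start - r.expected_start).total_seconds()) for r in runs
    ]
    return {"run": summarize(run_seconds), "late": summarize(late_seconds)}
```

Grouping the runs by flow or by deployment before calling `run_stats` would give the per-flow and per-deployment views, and comparing the P90 against an agreed threshold is one possible way to phrase a flow-level SLA.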