Incident history, MTTR, MTBF

Statistics answer questions such as "how often do we go down", "how long does it take to recover", and "when did we have the worst week". You can find them on the Statistics tab of the monitor detail.

Key metrics

Uptime %

The ratio of UP time to total time, expressed as a percentage. A classic metric. ePulz.io calculates it across 24h / 7d / 30d / 90d / 365d windows.
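
To make the ratio concrete, here is a minimal Python sketch of an uptime calculation over one window. It is not ePulz.io's implementation: the function name uptime_percent and the (down_start, down_end) pair format are assumptions for illustration, and the incidents are assumed not to overlap.

```python
from datetime import datetime, timedelta

def uptime_percent(incidents, window_start, window_end):
    """Uptime % for a window, given (down_start, down_end) pairs.

    Hypothetical helper; assumes the incidents do not overlap each other.
    """
    total = (window_end - window_start).total_seconds()
    down = 0.0
    for start, end in incidents:
        # Clip each incident to the window so partial overlaps count correctly.
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            down += (clipped_end - clipped_start).total_seconds()
    return 100.0 * (total - down) / total

# Example: five 8-minute outages inside a 30-day window.
window_end = datetime(2024, 1, 31)
window_start = window_end - timedelta(days=30)
incidents = [
    (datetime(2024, 1, day, 12, 0), datetime(2024, 1, day, 12, 8))
    for day in (3, 9, 15, 21, 27)
]
print(round(uptime_percent(incidents, window_start, window_end), 3))  # 99.907
```

Forty minutes of downtime in 30 days (43,200 minutes) leaves roughly 99.907 % uptime, which shows how little downtime a "three nines" target tolerates.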

MTTR (Mean Time To Recovery)

Average time from DOWN detection to return to UP. If you have 5 incidents and each lasted 8 minutes, MTTR = 8 min. Goal: reduce via better alerting, auto-restart, on-call rotations.
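
The MTTR arithmetic above can be sketched in a few lines of Python. The function name mttr and the use of timedelta durations are assumptions for illustration, not part of ePulz.io.

```python
from datetime import timedelta

def mttr(durations):
    """Mean time to recovery: the average of the incident durations.

    Hypothetical helper; returns None when there are no incidents.
    """
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)

# Example from the text: 5 incidents of 8 minutes each.
print(mttr([timedelta(minutes=8)] * 5))  # 0:08:00
```

Because it is a plain average, one long outage can dominate the result, so it is worth checking the individual durations as well.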

MTBF (Mean Time Between Failures)

Average time between outages. For example, 5 outages in 30 days give an MTBF of about 6 days. Goal: increase it via redundancy, better testing, and postmortem action items.
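
As a rough sketch of the MTBF arithmetic, the Python below divides the observed time by the number of failures. The function name mtbf is hypothetical, and the optional total_downtime argument reflects a common stricter definition (operating time divided by failures); which variant ePulz.io uses is not stated here.

```python
from datetime import timedelta

def mtbf(window, failures, total_downtime=timedelta()):
    """Mean time between failures over a window.

    Hypothetical helper. With total_downtime left at zero this is the
    simple window / failures estimate; passing the downtime gives the
    stricter (window - downtime) / failures variant.
    """
    if failures == 0:
        return None  # no failures observed, MTBF is undefined
    return (window - total_downtime) / failures

# Example from the text: 5 outages in 30 days.
print(mtbf(timedelta(days=30), 5))  # 6 days, 0:00:00
# Same window, subtracting 40 minutes of total downtime.
print(mtbf(timedelta(days=30), 5, timedelta(minutes=40)))  # 5 days, 23:52:00
```

For short outages the two variants differ by only a few minutes, so the simple estimate is usually good enough for trend tracking.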

Incident frequency

Number of incidents per week / month. Watch the trend: you should see a decline after your SRE initiatives.
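
If you want to reproduce the weekly counts yourself, a minimal Python sketch could group incident start times by ISO week. The function name incidents_per_week and the sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime

def incidents_per_week(down_starts):
    """Count incidents per (ISO year, ISO week), sorted chronologically.

    Hypothetical helper; takes the DOWN start timestamps of the incidents.
    """
    counts = Counter()
    for ts in down_starts:
        iso = ts.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[(iso[0], iso[1])] += 1
    return dict(sorted(counts.items()))

# Illustrative timestamps: two incidents in ISO week 1, one in week 2.
starts = [
    datetime(2024, 1, 1, 9, 30),
    datetime(2024, 1, 3, 22, 5),
    datetime(2024, 1, 8, 4, 45),
]
print(incidents_per_week(starts))  # {(2024, 1): 2, (2024, 2): 1}
```

Grouping by ISO week avoids splitting a week across a month or year boundary, which keeps week-over-week comparisons consistent.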

Incident table

The last 50 incidents with columns:

DOWN start (timestamp)
End / active (timestamp or "-> active")
Duration (HH:MM:SS)
Reason (HTTP 502, SSL expired, DNS timeout, keyword missing, ...)
Region consensus (if multi-region: which regions confirmed)

Export

The "Export CSV" button downloads the incident table for import into Excel or a BI tool. For a PDF SLA report, see SLA reports.
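
Once exported, the file can also be processed with a short script. The Python sketch below is only an illustration: the header names (down_start, end, duration, reason) and the sample rows are assumptions, so check the first line of your actual export and adjust the keys accordingly.

```python
import csv
import io
from datetime import timedelta

# Hypothetical sample; the real export's column names may differ.
SAMPLE = """down_start,end,duration,reason
2024-01-10 12:00:00,2024-01-10 12:08:00,00:08:00,HTTP 502
2024-01-12 03:15:00,-> active,00:02:30,DNS timeout
"""

def parse_duration(text):
    """Convert an HH:MM:SS string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

def resolved_durations(csv_text):
    """Return durations of finished incidents, skipping still-active ones."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        parse_duration(row["duration"])
        for row in rows
        if row["end"] != "-> active"
    ]

print(resolved_durations(SAMPLE))  # [datetime.timedelta(seconds=480)]
```

Skipping active incidents keeps their still-growing durations from skewing any averages you compute from the export.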