A brutal outage once took us 90 minutes to diagnose.
We instrumented everything, added Grafana dashboards, and pruned noisy alerts.
Months later a similar bug surfaced; this time we fixed it in under 10 minutes.
Observability turns chaos into a checklist.