From 90 Minutes to 10: Monitoring Overhaul

January 22, 2025

A brutal outage once took us 90 minutes to diagnose.
We instrumented everything, added Grafana dashboards, and pruned noisy alerts.

Months later a similar bug surfaced; this time we fixed it in under 10 minutes.
Observability turns chaos into a checklist.