It started as a routine schema migration: add a column, backfill, carry on.
Instead, our ALTER TABLE requested an ACCESS EXCLUSIVE lock on the rides table and sat waiting for it, and every subsequent read and write queued up behind that lock request. For three full hours, every query against the rides table stalled, effectively freezing our ride-hailing platform.
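In hindsight, the pile-up was visible from inside the database the whole time: every blocked backend shows up in pg_stat_activity, and the built-in pg_blocking_pids() function names exactly what it is waiting on. A sketch of the kind of query that would have surfaced it (not something we had ready that day):

    -- List every backend that is currently blocked and the PIDs blocking it.
    SELECT pid,
           state,
           wait_event_type,
           pg_blocking_pids(pid) AS blocked_by,
           left(query, 80) AS query
    FROM pg_stat_activity
    WHERE cardinality(pg_blocking_pids(pid)) > 0;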
What went wrong
No lock timeout
We hadn't set lock_timeout or statement_timeout, so the offending transaction waited forever, blocking everyone else.
No alert on autovacuum / dead tuples
The lock halted autovacuum. Dead tuples piled up, but without monitoring on pg_stat_user_tables we stayed blind until connections maxed out (a query like the one at the end of this section was all we needed).
Migration ran in peak traffic
The migration script opened its lock during the lunch rush. With thousands of active rides, the backlog ballooned in minutes.
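For the record, the visibility we were missing comes down to a single query against the standard pg_stat_user_tables view; the threshold here is an arbitrary example rather than our production alert rule:

    -- Tables with the most dead tuples and their last autovacuum run.
    SELECT relname,
           n_dead_tup,
           n_live_tup,
           last_autovacuum,
           last_autoanalyze
    FROM pg_stat_user_tables
    WHERE n_dead_tup > 100000
    ORDER BY n_dead_tup DESC;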
What we changed
Added safe defaults: lock_timeout = '10s' and statement_timeout = '60s' on every service connection (first snippet below).
Alerted on autovacuum stalls and dead-tuple thresholds via Grafana dashboards.
Adopted online migration patterns, rolling out schema changes in phases (ADD NULLABLE → batched backfill → SET NOT NULL); a sketch follows this list.
Introduced a migration window: all DDL changes happen during the lowest-traffic hour, with a traffic guard that aborts if concurrency spikes (last snippet below).
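The safe defaults are a one-time change per application role; app_service is an illustrative role name, and the same settings can be applied per database or overridden per session:

    -- Applied once; picked up by every new connection made as this role.
    ALTER ROLE app_service SET lock_timeout = '10s';
    ALTER ROLE app_service SET statement_timeout = '60s';

    -- A session that knows better (e.g. a supervised migration) can still override:
    SET lock_timeout = '3s';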
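A sketch of the phased migration pattern against a hypothetical surge_multiplier column (column name, backfill value, and batch size are all illustrative, and the table is assumed to have an id primary key):

    -- Phase 1: add the column as nullable, with no default. This is a
    -- catalog-only change; the ACCESS EXCLUSIVE lock is held only momentarily,
    -- and lock_timeout makes us bail out instead of queueing behind long
    -- transactions.
    ALTER TABLE rides ADD COLUMN surge_multiplier numeric;

    -- Phase 2: backfill in small batches, each in its own short transaction,
    -- repeated by the application until zero rows are updated.
    UPDATE rides
    SET surge_multiplier = 1.0
    WHERE id IN (
        SELECT id
        FROM rides
        WHERE surge_multiplier IS NULL
        LIMIT 10000
    );

    -- Phase 3: enforce the constraint once the backfill is done. This still
    -- takes a brief ACCESS EXCLUSIVE lock and scans the table; on Postgres 12+
    -- a pre-validated CHECK (surge_multiplier IS NOT NULL) constraint lets the
    -- scan be skipped.
    ALTER TABLE rides ALTER COLUMN surge_multiplier SET NOT NULL;

No single phase holds a table-level lock while a long-running backfill is in flight, which is the whole point.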
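The traffic guard is just a pre-flight query our migration wrapper runs before (and between) DDL statements; the threshold logic lives in the wrapper, and the numbers are illustrative:

    -- Count backends actively running queries against this database.
    -- The wrapper refuses to start, or aborts between batches, if the count
    -- exceeds an agreed threshold (e.g. 50).
    SELECT count(*) AS active_backends
    FROM pg_stat_activity
    WHERE datname = current_database()
      AND state = 'active'
      AND pid <> pg_backend_pid();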
Cultural lesson: Blameless post‑mortems
Instead of pointing fingers, we gathered everyone involved for a blameless post-mortem.
We asked “What in our process allowed this to happen?” rather than “Who caused it?”
This mindset kept the team open and honest, surfaced hidden assumptions, and turned a painful outage into a shared learning experience.
From then on, “fix the system, not the person” became a standing principle in our incident reviews.
Three hours of silence from the database felt like an eternity, but the incident burned these habits into muscle memory: time‑box your locks, monitor what matters, treat migrations like production releases, and learn without blame.