Our Gojek mobile app talks to a single BFF, and that BFF fans out to half a dozen downstream services—Driver‑Location, Pricing, Promotions, Payments, you name it.
When even one of those services sneezed, the whole ride‑booking flow caught a cold: loaders spun, customers retried, and on‑call phones buzzed.
The “why didn’t we do this sooner?” fix
Retries with jittered back‑off
Most failures were momentary: GC pauses, brief network drops. A second attempt, spaced out with a bit of random delay, succeeded 80‑90% of the time.
Circuit breaker per dependency
After N consecutive failures, we opened the breaker, stopped hammering the sick service, and served a graceful fallback (e.g., “Promo unavailable, try again soon”).
Healthy services kept answering; the whole app no longer froze because one spoke in the wheel locked up.
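For readers who want to see the shape of it, here is a minimal sketch of a per-dependency breaker. It is illustrative rather than our production code: the class name, the `reset_timeout_s` cool-down, the single trial call after the cool-down, and the `fetch_promos` stand-in are all assumptions made for the example. It is also not thread-safe as written.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures, then fails fast.

    Hypothetical sketch: the names and defaults are assumptions for illustration.
    """

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None  # None means the breaker is closed

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                # Open: skip the sick service entirely and serve the fallback.
                return fallback()
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self._consecutive_failures += 1
            trial_failed = self._opened_at is not None
            if trial_failed or self._consecutive_failures >= self.failure_threshold:
                # Open the breaker, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            return fallback()
        # Success: close the breaker and reset the failure count.
        self._consecutive_failures = 0
        self._opened_at = None
        return result


# One breaker per dependency, so a sick Promotions service cannot trip Pricing.
breakers = {name: CircuitBreaker(failure_threshold=3) for name in ("promotions", "pricing")}


def fetch_promos() -> str:
    # Hypothetical stand-in for the real Promotions call.
    raise TimeoutError("promotions service timed out")


message = breakers["promotions"].call(fetch_promos, lambda: "Promo unavailable, try again soon")
```

The `clock` parameter exists only so the cool-down can be tested without real waiting; in a real service the defaults would need tuning against observed failure patterns.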
Lessons etched in muscle memory
- Retries turn flakes into non‑events. Two or three attempts rescue most transient blips.
- Circuit breakers protect the herd. Fail fast, fall back, let downstreams recover.
- Observe and tune. We tracked breaker opens and closes and tweaked thresholds instead of guessing.
- User trust is fragile. A single spinner feels like an eternity; silent resilience feels like magic.
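To make the first lesson concrete, here is a minimal sketch of a retry with jittered back-off. The function name, the default delays, the exponential growth of the delay cap, and the choice of which exceptions count as transient are all assumptions for illustration, not a record of what we shipped.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_jitter(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.1,
    max_delay_s: float = 2.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying transient errors with a random, growing delay.

    Hypothetical helper: names and defaults are assumptions for illustration.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Upper bound for this wait: base, 2x base, 4x base, capped at max.
            cap = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Sleep a random amount in [0, cap] so that many clients retrying
            # at once do not hit the recovering service in lockstep.
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

A retry like this is only safe around calls that can be repeated without side effects; a read from Pricing is a natural fit, while a Payments charge would need an idempotency guarantee first.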
A handful of resilience patterns—less code than the promo banner widget—gave millions of riders a smoother experience and gave on‑call engineers their weekends back.