Retry & Circuit Breakers: Keeping the BFF Breathing

Our Gojek mobile app talks to a single backend‑for‑frontend (BFF), and that BFF fans out to half a dozen downstream services: Driver‑Location, Pricing, Promotions, Payments, you name it.
When even one of those services sneezed, the whole ride‑booking flow caught a cold: loaders spun, customers retried, and on‑call phones buzzed.

The “why didn’t we do this sooner?” fix

Retries with jittered back‑off
Most failures were momentary: GC pauses, brief network drops. A second attempt, spaced out with a bit of random delay so that retrying clients did not all hit the service at the same instant, succeeded 80‑90% of the time.
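
To make the retry idea concrete, here is a minimal Python sketch, not our production code. It assumes a hypothetical TransientError exception that marks failures worth retrying, and the names call_with_retry, base_delay, and max_delay are made up for illustration. The delay uses "full jitter": a random wait between zero and a cap that doubles with each attempt.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, dropped connections)."""


def call_with_retry(fn, max_attempts=3, base_delay=0.1, max_delay=1.0,
                    sleep=time.sleep, rng=random.random):
    """Call fn(); on TransientError, wait a random, growing delay and try again.

    sleep and rng are injectable so the behaviour can be tested without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Full jitter: pick a random delay in [0, cap), where the cap
            # doubles with each attempt and never exceeds max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * cap)
```

Only errors flagged as transient are retried here; anything else propagates immediately, since retrying a request that fails for a permanent reason just adds load. Retrying is also only safe for calls that can be repeated without side effects.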
Circuit breaker per dependency
After N consecutive failures, we opened the breaker, stopped hammering the sick service, and served a graceful fallback (e.g., “Promo unavailable, try again soon”).
Healthy services kept answering; the whole app no longer froze because one spoke in the wheel locked up.
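
The breaker logic can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code we shipped: the class name CircuitBreaker, the failure_threshold and reset_timeout parameters, and the default values are all hypothetical. The cool‑down followed by a single trial call is one common way to let a breaker close again and is an assumption here. The sketch is also not thread‑safe.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures and serves a fallback while open."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # the "N" consecutive failures
        self.reset_timeout = reset_timeout          # assumed cool-down, in seconds
        self._clock = clock                         # injectable for testing
        self._failures = 0
        self._opened_at = None                      # None means the breaker is closed

    def call(self, fn, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                return fallback()  # fail fast: do not touch the sick service
            # Cool-down elapsed: let this one call through as a trial.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Open the breaker, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            return fallback()
        # Success: reset the failure count and close the breaker.
        self._failures = 0
        self._opened_at = None
        return result


# One breaker per dependency, so a sick Promotions service cannot trip Pricing.
breakers = {name: CircuitBreaker() for name in ("promotions", "pricing", "payments")}
```

Keeping a separate breaker instance per dependency is what lets the healthy services keep answering while one of them is shut off.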

Lessons etched in muscle memory

Retries turn flakes into non‑events. Two or three attempts rescue most transient blips.
Circuit breakers protect the herd. Fail fast, fall back, let downstreams recover.
Observe and tune. We tracked breaker opens and closes and tweaked thresholds based on that data instead of guessing.
User trust is fragile. A single spinner feels like an eternity; silent resilience feels like magic.

A handful of resilience patterns—less code than the promo banner widget—gave millions of riders a smoother experience and gave on‑call engineers their weekends back.