Retry & Circuit Breakers: Keeping the BFF Breathing

April 7, 2021

Our Gojek mobile app talks to a single BFF, and that BFF fans out to half a dozen downstream services—Driver‑Location, Pricing, Promotions, Payments, you name it.
When even one of those services sneezed, the whole ride‑booking flow caught a cold: loaders spun, customers retried, and on‑call phones buzzed.

The “why didn’t we do this sooner?” fix

  1. Retries with jittered back‑off
    Most failures were momentary: GC pauses, brief network drops. A second attempt, spaced out with a bit of random delay so that retrying clients did not all hit the recovering service at the same instant, succeeded 80‑90% of the time.

  2. Circuit breaker per dependency
    After N consecutive failures, we opened the breaker, stopped hammering the sick service, and served a graceful fallback (e.g., “Promo unavailable, try again soon”).
    Healthy services kept answering; the whole app no longer froze because one spoke in the wheel locked up.

Lessons etched in muscle memory

  • Retries turn flakes into non‑events. Two or three attempts rescue most transient blips.
  • Circuit breakers protect the herd. Fail fast, fall back, let downstreams recover.
  • Observe and tune. We tracked breaker opens and closes and tweaked thresholds instead of guessing.
  • User trust is fragile. A single spinner feels like an eternity; silent resilience feels like magic.

A handful of resilience patterns—less code than the promo banner widget—gave millions of riders a smoother experience and gave on‑call engineers their weekends back.