The visible problem was payment failure rate. The real failure was timeout behavior: application workers gave up before the gateway finished, then retries increased pressure exactly when the dependency was slowest.

Detection

The first useful signal was not the total error count. It was the combination of gateway latency, duplicate authorization attempts, and queue age rising together. That showed a retry storm instead of isolated provider errors.

Mitigation

The team reduced worker concurrency, paused the noisiest producer, and made retries check durable payment state before calling the gateway again. That restored forward progress without double-charging users.

Prevention

The follow-up added per-provider timeout budgets, idempotency keys for all external calls, reconciliation alerts, and a runbook for switching payment flows into degraded mode.