Payment Gateway Timeout During High Traffic
How a timeout configuration caused failures and what we changed to fix it.
The visible symptom was a rising payment failure rate. The real failure was timeout behavior: application workers timed out before the gateway finished processing, and their retries then added load exactly when the dependency was slowest.
Detection
The first useful signal was not the total error count. It was the combination of gateway latency, duplicate authorization attempts, and queue age rising together. That pattern pointed to a retry storm rather than isolated provider errors.
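The combined signal can be sketched as a check that fires only when all three metrics rise over the same window. This is a minimal illustration, not the alert the team ran: the Sample fields, the is_rising helper, and the 1.5x growth ratio are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One metrics snapshot; the field names are hypothetical."""
    gateway_p95_ms: float
    duplicate_auth_attempts: int
    oldest_queue_item_age_s: float


def is_rising(values: Sequence[float], min_ratio: float = 1.5) -> bool:
    """True if the series never decreases and ends at least min_ratio times its start."""
    if len(values) < 2:
        return False
    non_decreasing = all(b >= a for a, b in zip(values, values[1:]))
    first, last = values[0], values[-1]
    if first > 0:
        return non_decreasing and last >= first * min_ratio
    # A series that starts at zero counts as rising if it ends above zero.
    return non_decreasing and last > 0


def looks_like_retry_storm(window: Sequence[Sample]) -> bool:
    """Flag only when latency, duplicate attempts, and queue age all rise together."""
    return (
        is_rising([s.gateway_p95_ms for s in window])
        and is_rising([s.duplicate_auth_attempts for s in window])
        and is_rising([s.oldest_queue_item_age_s for s in window])
    )
```

Requiring all three conditions is the point of the sketch: a latency spike with flat duplicate attempts and flat queue age looks like a provider problem, while all three rising together suggests the application is amplifying it.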
Mitigation
The team reduced worker concurrency, paused the noisiest producer, and made retries check durable payment state before calling the gateway again. That restored forward progress without double-charging users.
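A minimal sketch of the state check before a retry is shown below. It assumes a durable store that records each payment's state; PaymentStore here is an in-memory stand-in, and PaymentState, attempt_authorization, and call_gateway are hypothetical names rather than the team's actual code.

```python
from enum import Enum
from typing import Callable, Dict


class PaymentState(Enum):
    PENDING = "pending"
    AUTHORIZED = "authorized"
    FAILED = "failed"


class PaymentStore:
    """In-memory stand-in for a durable payment store (hypothetical interface)."""

    def __init__(self) -> None:
        self._states: Dict[str, PaymentState] = {}

    def get(self, payment_id: str) -> PaymentState:
        # Payments with no recorded outcome are treated as still pending.
        return self._states.get(payment_id, PaymentState.PENDING)

    def set(self, payment_id: str, state: PaymentState) -> None:
        self._states[payment_id] = state


def attempt_authorization(
    payment_id: str,
    store: PaymentStore,
    call_gateway: Callable[[str], bool],
) -> PaymentState:
    """Call the gateway only if the store shows no recorded outcome yet."""
    state = store.get(payment_id)
    if state is not PaymentState.PENDING:
        # An earlier attempt already reached an outcome; do not call again.
        return state
    authorized = call_gateway(payment_id)
    new_state = PaymentState.AUTHORIZED if authorized else PaymentState.FAILED
    store.set(payment_id, new_state)
    return new_state
```

The sketch leaves out two things a real system would need: a guard against two workers passing the check at the same time, and a way for the store to learn the outcome of a call whose worker timed out (for example a provider callback or a reconciliation job). It also treats a failed payment as final, which is a simplification.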
Prevention
The follow-up added per-provider timeout budgets, idempotency keys for all external calls, reconciliation alerts, and a runbook for switching payment flows into degraded mode.
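Two of these items, timeout budgets and idempotency keys, can be sketched briefly. Everything named here is an assumption for illustration: the provider names, the budget values, the TimeoutBudget class, and the key format are hypothetical and not taken from the actual follow-up.

```python
import time
import uuid
from dataclasses import dataclass, field

# Hypothetical per-provider budgets in seconds; real values would come
# from measured provider latency.
PROVIDER_BUDGETS_S = {"provider_a": 8.0, "provider_b": 12.0}
DEFAULT_BUDGET_S = 5.0


@dataclass
class TimeoutBudget:
    """Total time allowed for one payment operation, across all attempts."""
    total_s: float
    started_at: float = field(default_factory=time.monotonic)

    def remaining(self) -> float:
        elapsed = time.monotonic() - self.started_at
        return max(0.0, self.total_s - elapsed)

    def attempt_timeout(self, per_attempt_cap_s: float) -> float:
        """Timeout for the next attempt; 0.0 means the budget is spent."""
        return min(per_attempt_cap_s, self.remaining())


def budget_for(provider: str) -> TimeoutBudget:
    return TimeoutBudget(PROVIDER_BUDGETS_S.get(provider, DEFAULT_BUDGET_S))


def idempotency_key(payment_id: str, operation: str) -> str:
    """Deterministic key, so every retry of the same operation sends the same value."""
    name = f"payments/{operation}/{payment_id}"
    return str(uuid.uuid5(uuid.NAMESPACE_URL, name))
```

A caller would stop retrying once attempt_timeout returns 0.0 and would send the same idempotency_key value on every attempt. Deriving the key from the payment and the operation, rather than generating a random one per request, is what lets a provider that supports idempotency keys recognize a retry as a duplicate.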