Scalable systems are not built by adding more servers after the first outage. They are built by making every critical path explicit: where requests enter, where state changes, where work can be retried, and where operators can safely stop the line.

Start with the pressure points

The first design pass should identify write hot spots, external dependencies, queue boundaries, cache invalidation, and the few database records that many workers may compete to update. Those pressure points decide the architecture more than the framework does.

Make retries boring

Every background job should be safe to run twice. That means durable idempotency keys, clear state transitions, and audit events that describe the business action, not just the function name. When a worker crashes, the next attempt should read the source of truth before doing anything irreversible.
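
A sketch of that pattern, added here as an illustration and not taken from any particular system: a credit job keyed by an idempotency key, where the state transition, the balance change and the audit event commit in one transaction, so a crashed attempt leaves nothing behind and a retry applies the change exactly once. The table names, the `req-42` key and the `account_credited` event are assumptions made for the example, and SQLite stands in for whatever durable store holds the source of truth.

```python
import sqlite3

SCHEMA = """
CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL);
CREATE TABLE jobs (idem_key TEXT PRIMARY KEY, account TEXT, amount INTEGER,
                   state TEXT CHECK (state IN ('pending', 'done')));
CREATE TABLE audit (seq INTEGER PRIMARY KEY, idem_key TEXT, event TEXT, detail TEXT);
"""

def open_store():
    # In-memory for the demo; isolation_level=None means transactions are opened explicitly.
    db = sqlite3.connect(":memory:", isolation_level=None)
    db.executescript(SCHEMA)
    return db

def enqueue(db, idem_key, account, amount):
    # Durable idempotency key: a duplicate delivery of the same request is a no-op.
    db.execute("INSERT OR IGNORE INTO jobs VALUES (?, ?, ?, 'pending')",
               (idem_key, account, amount))

def run_credit_job(db, idem_key, crash_before_commit=False):
    """Apply the credit at most once; returns 'applied' or 'skipped'."""
    with db:  # commits on normal exit, rolls back if an exception escapes
        db.execute("BEGIN IMMEDIATE")  # take the write lock before reading state
        # Guarded transition pending -> done: only one attempt can ever claim the job.
        claimed = db.execute("UPDATE jobs SET state = 'done' WHERE idem_key = ? "
                             "AND state = 'pending'", (idem_key,)).rowcount
        if not claimed:
            return "skipped"  # already done (or unknown key): nothing to redo
        account, amount = db.execute(
            "SELECT account, amount FROM jobs WHERE idem_key = ?", (idem_key,)).fetchone()
        db.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, account))
        # Audit the business action, in the same transaction as the effect.
        db.execute("INSERT INTO audit (idem_key, event, detail) VALUES (?, 'account_credited', ?)",
                   (idem_key, f"{account} +{amount}"))
        if crash_before_commit:
            raise RuntimeError("simulated worker crash")
    return "applied"

def demo():
    db = open_store()
    db.execute("INSERT INTO accounts VALUES ('acct-1', 100)")  # constructed sample data
    for _ in range(2):  # the same request delivered twice
        enqueue(db, "req-42", "acct-1", 25)
    try:
        run_credit_job(db, "req-42", crash_before_commit=True)
    except RuntimeError:
        pass  # worker died mid-job; the transaction was rolled back
    outcomes = [run_credit_job(db, "req-42") for _ in range(2)]
    balance = db.execute("SELECT balance FROM accounts").fetchone()[0]
    return outcomes, balance, db.execute("SELECT event, detail FROM audit").fetchall()
```

When the irreversible step is a call to an external service rather than a row in the same database, it cannot share the transaction; the same idempotency key then has to be passed to that service so it can deduplicate on its side.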

Design for incident operation

A production-ready system gives operators levers: throttle producers, reduce consumer concurrency, disable noncritical flows, roll back a release, and inspect a small number of trustworthy metrics. Scalability is not only throughput; it is the ability to recover without guessing.