System design
A senior-level tour of backend system design, mapped onto NestJS and Node.js. This is the “how do you think about building a backend at scale” material — decisions before code, for the worst query, the worst network, and the worst traffic spike.
System design is decision-making before code. A structured flow keeps you calm under pressure:
| Step | What you do | Time |
|---|---|---|
| C — Clarify | Functional vs non-functional requirements, scale (RPS, DAU), read/write ratio, consistency needs. Most candidates rush this and design the wrong system. | ~15% |
| R — Rough HLD | Draw the high-level diagram: clients, gateway, services, datastores, queues. A city map, not a street map. | ~20% |
| D — Deep dive | Zoom into one component (the data model, the rate limiter, the queue) — interfaces, edge cases. | ~35% |
| D — Discuss trade-offs | Name alternatives and why you chose one. The senior signal. | ~20% |
| S — Summarize | Recap, call out risks, handle follow-ups. | ~10% |
Almost every Nest service shares the same skeleton — memorize it as your HLD template:
| Layer | Responsibility | Nest equivalent |
|---|---|---|
| Presentation | HTTP/GraphQL/WS edge: parse, validate, serialize. | Controllers, resolvers, gateways, DTOs, pipes |
| Application / Domain | Use-cases and business rules — the most testable layer. | Services, use-case classes, domain models |
| Data / Repository | Combines data sources and caching; maps rows → domain models. | Repository providers over TypeORM/Prisma/Mongoose |
| Integration | External APIs, message brokers, cache. | HTTP clients, ClientProxy, cache-manager |
For complex, long-lived domains, push to hexagonal architecture: the domain depends on ports (interfaces + DI tokens), and adapters bind concrete implementations — so infrastructure is swappable and the domain is pure. Don't pay that indirection cost on thin CRUD.
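As a rough illustration of ports and adapters, the sketch below wires a use-case to a port by hand. Every name in it (`UserRepository`, `USER_REPOSITORY`, `GreetUser`, `InMemoryUserRepository`) is hypothetical, and the manual wiring stands in for what a Nest module's providers would normally do.

```typescript
// Port: the only thing the domain knows about persistence.
interface UserRepository {
  findEmailById(id: string): string | undefined;
}

// DI token that a container could bind an adapter to (hypothetical name).
const USER_REPOSITORY = Symbol("USER_REPOSITORY");

// Domain use-case: depends on the port, never on a concrete database client.
class GreetUser {
  private readonly users: UserRepository;

  constructor(users: UserRepository) {
    this.users = users;
  }

  execute(id: string): string {
    const email = this.users.findEmailById(id);
    return email ? `Hello, ${email}` : "Unknown user";
  }
}

// Adapter: one concrete implementation. An ORM-backed adapter would implement the same port.
class InMemoryUserRepository implements UserRepository {
  private readonly rows: Map<string, string>;

  constructor(rows: Map<string, string>) {
    this.rows = rows;
  }

  findEmailById(id: string): string | undefined {
    return this.rows.get(id);
  }
}

// Composition root, hand-wired for the sketch.
const bindings = new Map<symbol, UserRepository>([
  [USER_REPOSITORY, new InMemoryUserRepository(new Map([["u1", "ada@example.com"]]))],
]);
const greetUser = new GreetUser(bindings.get(USER_REPOSITORY)!);
```

Swapping the in-memory adapter for a database-backed one changes only the binding; `GreetUser` and its tests stay untouched, which is the indirection cost the paragraph above says is not worth paying on thin CRUD.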
A production client to another service needs more than a bare fetch: timeouts (never wait forever), retries with exponential backoff + jitter (only for idempotent operations and transient failures such as network errors and 5xx, not 4xx), a circuit breaker (stop hammering a downed dependency), and a bulkhead (cap concurrent calls so one slow dependency can't exhaust your pool).
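The retry rule above can be sketched as a small helper. This is a minimal sketch under assumptions: the error shape (`status` present for HTTP responses, absent for network failures), the "full jitter" formula, and the default numbers are illustrative choices, and the helper assumes the wrapped call is idempotent.

```typescript
// Retry only transient failures: network errors (no status) and 5xx responses.
function isTransient(err: unknown): boolean {
  const status = (err as { status?: number } | null)?.status;
  return status === undefined || status >= 500;
}

// Full jitter: a random delay in [0, min(cap, base * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs = 100,
  capMs = 5000,
  random: () => number = Math.random,
): number {
  return Math.floor(random() * Math.min(capMs, baseMs * 2 ** attempt));
}

// `sleep` is injectable so tests do not have to wait for real timers.
async function withRetry<T>(
  call: () => Promise<T>,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const lastAttempt = attempt + 1 >= maxAttempts;
      if (lastAttempt || !isTransient(err)) throw err; // give up: rethrow to the caller
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

Jitter matters because many clients that failed at the same moment would otherwise retry at the same moment too, turning a brief outage into a synchronized wave of load.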
Wrap calls in a result/Observable with timeout(), propagate a correlation id (traceparent) for tracing, and map foreign DTOs to your domain at the boundary (an anti-corruption layer) so their schema changes don't ripple into your code.
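To make the anti-corruption layer concrete, here is a minimal sketch of a boundary mapper. The foreign DTO (`ProviderInvoiceDto`) and the domain model (`Invoice`) are invented for illustration; the point is only that translation and validation happen in one place.

```typescript
// Hypothetical foreign DTO from an external billing provider (their naming, their units).
interface ProviderInvoiceDto {
  inv_id: string;
  amount_cents: number;
  cur: string;
  paid_at: string | null; // ISO-8601 timestamp, or null while unpaid
}

// Domain model: the only shape the rest of the codebase sees.
interface Invoice {
  id: string;
  amount: number; // major units, e.g. 12.5
  currency: string;
  paidAt: Date | null;
}

// Anti-corruption layer: translate and validate at the boundary.
function toInvoice(dto: ProviderInvoiceDto): Invoice {
  if (!Number.isInteger(dto.amount_cents) || dto.amount_cents < 0) {
    throw new Error(`Invalid amount for invoice ${dto.inv_id}`);
  }
  return {
    id: dto.inv_id,
    amount: dto.amount_cents / 100,
    currency: dto.cur.toUpperCase(),
    paidAt: dto.paid_at === null ? null : new Date(dto.paid_at),
  };
}
```

If the provider renames `cur` or changes units, only `toInvoice` and its DTO type change.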
Pick the store for the access pattern: relational (Postgres) for transactional integrity + joins; document (Mongo) for flexible schemas; key-value (Redis) for hot/ephemeral data; wide-column (Cassandra/Dynamo) for huge write-heavy time-series. Replication scales reads + HA (mind replica lag); sharding scales writes (pick a good shard key; cross-shard joins hurt).
Caching layers: in-memory (fast, per-instance, lost on restart) and shared Redis (survives app restarts, consistent across replicas). Strategies: cache-aside (default), read/write-through, write-behind. Invalidate via TTL, delete-on-write, or versioned keys; guard against stampedes with a lock/single-flight. Worth naming: stale-while-revalidate (serve the stale value while refreshing it in the background).
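Cache-aside with single-flight stampede protection might look like the sketch below. It is an in-process illustration only: `SingleFlightCache` is a hypothetical class, the clock is injectable for testing, and coalescing loads across replicas would need a shared lock (for example in Redis) that is not shown here.

```typescript
class SingleFlightCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly inFlight = new Map<string, Promise<V>>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Cache-aside read: return a fresh hit, otherwise load and store.
  async get(key: string, load: () => Promise<V>): Promise<V> {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > this.now()) return hit.value;

    // Single-flight: concurrent misses for the same key share one load.
    const pending = this.inFlight.get(key);
    if (pending) return pending;

    const promise = load()
      .then((value) => {
        this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
        return value;
      })
      .finally(() => this.inFlight.delete(key));
    this.inFlight.set(key, promise);
    return promise;
  }

  // Delete-on-write invalidation: call this after updating the source of truth.
  invalidate(key: string): void {
    this.entries.delete(key);
  }
}
```

A failed load is not cached: the in-flight entry is cleared in `finally`, so the next caller retries the loader.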
Node scales by staying non-blocking and stateless. Keep CPU work off the event loop (worker threads / queues). Scale horizontally: run one process per core (cluster) or, more commonly, N container replicas behind a load balancer. Externalize all state (sessions, cache, uploads) to Redis/DB/object storage so any replica can serve any request.
Decouple work with asynchrony. Queues (BullMQ/SQS/RabbitMQ) distribute work to competing consumers (consume-once); streams/logs (Kafka) are durable, replayable, ordered-per-partition, multi-consumer. Use queues for jobs (email, image processing); use streams for event sourcing and fan-out to many consumers.
Delivery is at-least-once (visibility timeout → redelivery on crash), so consumers must be idempotent. Retry with backoff + jitter; once retries are exhausted, move the message to a dead-letter queue (DLQ) with alerting. Back-pressure: autoscale workers on queue depth.
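An idempotent consumer can be sketched as follows. This is a simplified in-memory model: `Message`, `IdempotentConsumer`, and the returned outcome strings are hypothetical, and in a real service the processed-id set would live in a shared store (for instance a unique constraint in the database) rather than in process memory.

```typescript
interface Message {
  id: string;
  body: string;
  attempts: number; // deliveries that already failed before this one
}

type Outcome = "ack" | "retry" | "dead-lettered";

class IdempotentConsumer {
  readonly deadLetters: Message[] = [];
  private readonly processed = new Set<string>();
  private readonly handler: (body: string) => void;
  private readonly maxAttempts: number;

  constructor(handler: (body: string) => void, maxAttempts = 3) {
    this.handler = handler;
    this.maxAttempts = maxAttempts;
  }

  // Returns what the broker should do with the message next.
  handle(msg: Message): Outcome {
    // Redelivery of already-finished work: acknowledge without repeating the side effect.
    if (this.processed.has(msg.id)) return "ack";
    try {
      this.handler(msg.body);
      this.processed.add(msg.id);
      return "ack";
    } catch {
      if (msg.attempts + 1 >= this.maxAttempts) {
        this.deadLetters.push(msg); // retries exhausted: park it for inspection and alerting
        return "dead-lettered";
      }
      return "retry"; // the broker redelivers later, ideally after backoff + jitter
    }
  }
}
```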
Start with a modular monolith: one module per bounded context, each owning its tables, cross-module calls through a small public API + adapter. One deploy, atomic refactors, in-process transactional calls — enforce boundaries in CI with dependency-cruiser.
Extract a microservice only for a concrete need: independent scaling, deployment, fault isolation, or team ownership. If boundaries were clean, you swap the in-process public service for a transport client implementing the same interface. Cross-service consistency uses sagas (compensating actions) + the transactional outbox, not 2PC.
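The transactional outbox can be illustrated with an in-memory stand-in for a database. Everything here is hypothetical scaffolding (`InMemoryDb`, its naive snapshot-based `transaction`, `placeOrder`, `relayOutbox`); the idea it shows is that the business row and the event row commit together, and a separate relay publishes afterwards.

```typescript
interface Order { id: string; total: number }
interface OutboxRow { id: number; type: string; payload: string; publishedAt: number | null }

// Stand-in for one database: both tables share a store, so one transaction covers both.
class InMemoryDb {
  orders: Order[] = [];
  outbox: OutboxRow[] = [];

  // Naive transaction for the sketch: snapshot, run, roll back on error.
  transaction(work: (db: InMemoryDb) => void): void {
    const snapshot = { orders: [...this.orders], outbox: [...this.outbox] };
    try {
      work(this);
    } catch (err) {
      this.orders = snapshot.orders;
      this.outbox = snapshot.outbox;
      throw err;
    }
  }
}

// The order and its event are written atomically: both or neither.
function placeOrder(db: InMemoryDb, order: Order): void {
  db.transaction((tx) => {
    tx.orders.push(order);
    tx.outbox.push({
      id: tx.outbox.length + 1,
      type: "order.placed",
      payload: JSON.stringify(order),
      publishedAt: null,
    });
  });
}

// Relay: publish unpublished rows, then mark them. A crash between the two steps
// causes a re-publish, so delivery is at-least-once and consumers must dedupe.
function relayOutbox(
  db: InMemoryDb,
  publish: (type: string, payload: string) => void,
  now: () => number = Date.now,
): number {
  let published = 0;
  for (const row of db.outbox) {
    if (row.publishedAt !== null) continue;
    publish(row.type, row.payload);
    row.publishedAt = now();
    published++;
  }
  return published;
}
```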
Design for failure as the default. The toolkit: timeouts everywhere, retries (idempotent only, backoff + jitter), circuit breakers, bulkheads, graceful degradation (serve stale cache / a reduced response instead of an error), and idempotency keys so retried side effects run at-most-once.
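A circuit breaker from this toolkit can be sketched as a small state machine. The class below is illustrative, not a library API: the thresholds are arbitrary defaults, the clock is injectable for testing, and this simplified version does not limit how many trial calls run concurrently while half-open.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetAfterMs: number;
  private readonly now: () => number;

  constructor(failureThreshold = 5, resetAfterMs = 10000, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
  }

  getState(): BreakerState {
    return this.state;
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.resetAfterMs) {
        throw new Error("Circuit open: failing fast"); // do not touch the dependency
      }
      this.state = "half-open"; // cool-down elapsed: let a trial call through
    }
    try {
      const result = await fn();
      this.state = "closed"; // success resets the breaker
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      // A failed trial call, or too many consecutive failures, opens the circuit.
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

Callers can pair the fail-fast error with graceful degradation, for example by serving a stale cached value while the circuit is open.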
Fail fast and visibly: validate at the edge, cap request size and concurrency, and propagate cancellation (AbortSignal). Decide fail-open vs fail-closed per feature (a downed rate-limiter Redis: fail-open for availability, fail-closed for abuse-sensitive endpoints).
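The fail-open versus fail-closed decision can be expressed as a per-feature policy. In this sketch `check` is a hypothetical call into the shared limiter that rejects when its store is unreachable; the function name and the mode strings are assumptions.

```typescript
type FailureMode = "fail-open" | "fail-closed";

async function isAllowed(
  check: () => Promise<boolean>,
  onStoreFailure: FailureMode,
): Promise<boolean> {
  try {
    return await check();
  } catch {
    // fail-open: let traffic through and keep the endpoint available.
    // fail-closed: reject until the limiter is reachable again.
    return onStoreFailure === "fail-open";
  }
}
```

A read-only catalogue endpoint might pass `"fail-open"`, while a login or payment endpoint would pass `"fail-closed"`.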
Choose an isolation model early (hard to reverse): Silo (DB per tenant — strongest isolation/compliance, priciest), Pool (shared tables + tenant_id — cheapest, noisy-neighbor risk), Bridge (schema per tenant). Tiers often mix (enterprise = silo, SMB = pool).
Resolve the tenant from subdomain/JWT claim, set it in request context (CLS), and enforce it everywhere — Postgres RLS makes a forgotten filter unable to leak rows. Scope cache keys by tenant; add per-tenant quotas/rate limits for noisy neighbors.
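Two of these steps, resolving the tenant from a subdomain and scoping cache keys, can be sketched as pure functions. The `<tenant>.example.com` convention, the function names, and the key format are assumptions for illustration; storing the result in request context and enforcing it in the database are not shown.

```typescript
// Hypothetical convention: tenants live at <tenant>.example.com.
function tenantFromHost(host: string, baseDomain = "example.com"): string | null {
  const hostname = (host.split(":")[0] ?? "").toLowerCase(); // drop any port
  const suffix = `.${baseDomain}`;
  if (!hostname.endsWith(suffix)) return null;
  const sub = hostname.slice(0, -suffix.length);
  // Accept a single label only, so "a.b.example.com" is rejected.
  return /^[a-z0-9-]+$/.test(sub) ? sub : null;
}

// Prefix every cache key with the tenant so entries cannot collide across tenants.
function tenantCacheKey(tenantId: string, key: string): string {
  return `tenant:${tenantId}:${key}`;
}
```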
You can't operate what you can't see. Instrument four signals: health (liveness/readiness probes), metrics (RED — rate/errors/duration — plus Node internals: event-loop lag, heap, GC), traces (OpenTelemetry, propagated via traceparent), and structured logs with correlation ids.
Define SLOs (e.g. p99 latency < 300ms, 99.9% availability), track an error budget, and alert on SLO breaches and budget burn rather than on raw CPU. Treat a deploy as “done” only when crash/error rates hold across the rollout.
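The error-budget arithmetic is simple enough to sketch. The function names and the 30-day window are illustrative choices: a 99.9% availability target over 30 days leaves about 43.2 minutes of allowed downtime.

```typescript
// Allowed "bad" minutes in the window for a given availability target.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  return windowDays * 24 * 60 * (1 - sloTarget);
}

// Fraction of a request-based budget still unspent; negative means the SLO is blown.
function budgetRemaining(sloTarget: number, totalRequests: number, failedRequests: number): number {
  const allowedFailures = totalRequests * (1 - sloTarget);
  if (allowedFailures === 0) return failedRequests === 0 ? 1 : 0;
  return 1 - failedRequests / allowedFailures;
}

const monthlyBudget = errorBudgetMinutes(0.999, 30); // about 43.2 minutes
const remaining = budgetRemaining(0.999, 1_000_000, 400); // about 0.6, i.e. 60% left
```

Alerting on how fast `remaining` is shrinking ties pages to user-visible impact instead of to resource graphs.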
Six classic backend system-design prompts, each as Concept → Example → Gotcha → Senior answer. These are the ones interviewers reach for — rehearse the trade-offs out loud.
Design a distributed rate limiter: token bucket. Keep {tokens, lastRefill} per key; on each request, refill by elapsed × rate (capped at the bucket size), allow the request if tokens ≥ 1, then decrement. Return 429 with Retry-After and X-RateLimit-* headers.
pipeline(readStream → transform → res); multipart upload for very large files.
pipeline (auto backpressure + cleanup); validate size + magic-number type with ParseFilePipe; process derivatives (thumbnails, virus scan) async via a queue.
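The token-bucket limiter described above can be sketched in a few lines. This version keeps state in process memory, which is an assumption made for readability: a distributed limiter would run the same refill-check-decrement step atomically in a shared store. The class and field names are illustrative.

```typescript
interface Bucket {
  tokens: number;
  lastRefill: number; // epoch milliseconds
}

interface Decision {
  allowed: boolean;
  remaining: number; // whole tokens left, e.g. for an X-RateLimit-Remaining header
  retryAfterSec: number; // 0 when allowed; otherwise a value for Retry-After
}

class TokenBucketLimiter {
  private readonly buckets = new Map<string, Bucket>();
  private readonly capacity: number;
  private readonly refillPerSec: number;
  private readonly now: () => number;

  constructor(capacity: number, refillPerSec: number, now: () => number = Date.now) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.now = now;
  }

  take(key: string): Decision {
    const now = this.now();
    const bucket = this.buckets.get(key) ?? { tokens: this.capacity, lastRefill: now };

    // Refill by elapsed × rate, capped at capacity; ignore clocks that move backwards.
    const elapsedSec = Math.max(0, now - bucket.lastRefill) / 1000;
    bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsedSec * this.refillPerSec);
    bucket.lastRefill = now;
    this.buckets.set(key, bucket);

    if (bucket.tokens >= 1) {
      bucket.tokens -= 1;
      return { allowed: true, remaining: Math.floor(bucket.tokens), retryAfterSec: 0 };
    }
    // Seconds until one whole token is available again; the caller responds with 429.
    const retryAfterSec = Math.ceil((1 - bucket.tokens) / this.refillPerSec);
    return { allowed: false, remaining: 0, retryAfterSec };
  }
}
```

A guard or middleware would call `take(clientKey)`, respond with 429 and `Retry-After: retryAfterSec` when `allowed` is false, and otherwise pass the request through with the remaining count in a response header.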