Idempotency, Retries & Designing for Failure
Stage 3 · Architecture & System Design · B.U.I.L.D. letter: L
You vibe-coded a checkout flow in an afternoon and it works great — until a customer's internet hiccups mid-request, your frontend retries the payment, and they get charged twice. Distributed systems don't fail rarely; they fail constantly, in small, invisible ways. The engineers who build things that last aren't the ones who prevent failure — they're the ones who design around it.
⚠️ The vibe trap
When you're building alone and testing on localhost, every call succeeds, every response arrives, every operation completes exactly once. You ship to production and suddenly the network drops packets, APIs time out, servers restart mid-request, and your "send email once" function runs three times during a deploy. The vibe trap is treating failure as an edge case to handle later — in distributed systems, failure is the normal path, and the happy path is the lucky one. A retried payment that double-charges a customer doesn't just lose you money; it destroys trust you can't buy back.
🔑 Idempotency — the same operation twice, same result
If a caller can safely retry an operation and the end result is the same as if it had run exactly once, the operation is idempotent.
GET /users/42 is naturally idempotent — you can call it a thousand times and nothing changes. POST /payments is not naturally idempotent — two calls means two charges. The fix is an idempotency key: a unique ID the caller generates for each logical operation and sends with every attempt. Your server stores the result the first time it succeeds and returns the cached result for any duplicate.
// server: idempotency-key handler for POST /payments
const processedKeys = new Map(); // In production: use Redis or a DB table
async function paymentsHandler(req, res) {
  const idempotencyKey = req.headers['idempotency-key'];
  if (!idempotencyKey) {
    return res.status(400).json({ error: 'idempotency-key header required' });
  }
  // Already processed this exact logical operation: replay the stored result.
  // The body is identical to the first response; the replay is flagged in a
  // header instead (the header name here is illustrative).
  if (processedKeys.has(idempotencyKey)) {
    res.setHeader('Idempotent-Replay', 'true');
    return res.status(200).json(processedKeys.get(idempotencyKey));
  }
  // First time we've seen this key: actually do the work.
  // Note: two concurrent requests with the same key can both reach this point;
  // a production store needs an atomic "insert if absent" to close that gap.
  try {
    const charge = await stripe.charges.create(
      {
        amount: req.body.amountCents,
        currency: 'usd',
        source: req.body.token,
      },
      // Forward the key so Stripe also de-duplicates if this process dies
      // after charging but before storing the result
      { idempotencyKey }
    );
    const result = { chargeId: charge.id, status: 'ok' };
    // Store before responding so a retry after a lost response replays this result
    processedKeys.set(idempotencyKey, result);
    return res.status(201).json(result);
  } catch (err) {
    // Do NOT cache errors: let the caller retry with the same key
    return res.status(502).json({ error: err.message });
  }
}
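// client: generate once per logical operation, reuse on every retry
function generateIdempotencyKey() {
  // crypto.randomUUID() is built into current browsers (secure contexts) and
  // recent Node.js versions; it avoids the collision risk of Math.random()
  return `pay-${crypto.randomUUID()}`;
}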
Mental model: Think of the idempotency key as a receipt number. The warehouse ships the order when it sees receipt #1042 for the first time. If the courier drops the paperwork and you resubmit #1042, the warehouse checks its ledger, sees it already shipped, and hands you the same confirmation slip instead of shipping again.
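The ledger check in this picture is only safe if two copies of receipt #1042 arriving at the same moment cannot both find the ledger empty. One way to sketch that, assuming a single Node.js process, is to record the in-flight promise under the key before the work starts, so concurrent duplicates wait on the same attempt. The names `runOnce` and `ledger` are hypothetical, and across several server instances the same idea would need an atomic insert in a shared store instead of a local Map.

```javascript
// Sketch: single-process idempotency ledger that also covers concurrent duplicates.
// `runOnce` and `ledger` are hypothetical names used for illustration only.
const ledger = new Map(); // key -> Promise of the result

function runOnce(key, work) {
  if (ledger.has(key)) {
    return ledger.get(key); // duplicate: share the first attempt's outcome
  }
  const pending = Promise.resolve()
    .then(work)
    .catch((err) => {
      ledger.delete(key); // failures are not remembered, so the caller may retry
      throw err;
    });
  ledger.set(key, pending); // recorded synchronously, before the work begins
  return pending;
}

// Usage: two simultaneous submissions of the same receipt number
let shipped = 0;
const ship = async () => {
  shipped += 1;
  return { confirmation: 'slip-1042' };
};
Promise.all([runOnce('1042', ship), runOnce('1042', ship)]).then(([a, b]) => {
  console.log(`shipments: ${shipped}, same slip: ${a === b}`);
});
```

Storing the promise rather than the finished result is the design choice that matters here: the key is claimed before the first `await`, so there is no window in which a second request can slip past the check.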
Why it matters: Retries are unavoidable. The question is whether your system handles them gracefully or silently corrupts data. Stripe and many other payment APIs accept idempotency keys for exactly this reason.
Common mistake: Generating a new idempotency key on every retry. That defeats the entire purpose — the key must be generated once per logical operation and re-sent on every attempt of that same operation.
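A minimal sketch of the correct shape, assuming a hypothetical `send` function that performs a single attempt and throws on failure. The key is created once, outside the loop, so every attempt carries the same value. Backoff between attempts is left out to keep the example short.

```javascript
const { randomUUID } = require('crypto');

// Sketch: one key per logical operation, reused on every attempt.
// `send` is a hypothetical function that performs a single attempt
// and throws on failure.
async function payWithRetries(send, payment, maxAttempts = 3) {
  const idempotencyKey = randomUUID(); // generated once, outside the loop
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await send(payment, { 'idempotency-key': idempotencyKey });
    } catch (err) {
      lastError = err; // the same key goes out again on the next pass
    }
  }
  throw lastError;
}
```

If the key were generated inside the loop, the server would treat each retry as a brand-new payment.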
⏱️ Retries with Exponential Backoff + Jitter
When a call fails, waiting before retrying — and waiting longer each time — protects both you and the service you're calling.
Instant retries under failure hammer an already-struggling service and can cause cascading outages across every caller. Exponential backoff doubles the wait on each attempt. Jitter (random noise added to the delay) spreads out retries so a thousand clients don't all wake up and retry at the exact same millisecond.
// Retry with exponential backoff and full jitter
async function fetchWithRetry(url, options = {}, maxAttempts = 4) {
  const BASE_DELAY_MS = 200;   // start at 200 ms
  const MAX_DELAY_MS = 10000;  // never wait more than 10 s
  const TIMEOUT_MS = 5000;     // per-attempt timeout
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const controller = new AbortController();
      // Always set a timeout: never wait forever
      const timeoutId = setTimeout(() => controller.abort(), TIMEOUT_MS);
      let response;
      try {
        response = await fetch(url, {
          ...options,
          signal: controller.signal,
        });
      } finally {
        // Clear the timer even when fetch rejects, so it never outlives the attempt
        clearTimeout(timeoutId);
      }
      // Treat 5xx as retriable; 4xx (client errors) go back to the caller as-is
      if (response.status >= 500 && attempt < maxAttempts) {
        throw new Error(`Server error ${response.status}`);
      }
      return response;
    } catch (err) {
      if (attempt === maxAttempts) throw err; // out of attempts: surface the error
      // Exponential backoff ceiling: 200ms → 400ms → 800ms → …, capped at MAX_DELAY_MS
      const exponentialDelay = Math.min(BASE_DELAY_MS * 2 ** (attempt - 1), MAX_DELAY_MS);
      // Full jitter: wait a random time in [0, exponentialDelay)
      const delayMs = Math.random() * exponentialDelay;
      console.warn(`Attempt ${attempt} failed (${err.message}). Retrying in ${Math.round(delayMs)}ms…`);
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}
Mental model: Imagine the server is a cashier having a rough day. Tapping them on the shoulder every 100ms makes things worse for everyone. Waiting a bit, then a bit more, gives them time to recover — and adding random jitter means you and the other hundred customers in line don't all tap simultaneously.
Why it matters: A naive while (true) { retry(); } loop under a partial outage will amplify the outage into a full one. Backoff with jitter is the difference between a 30-second blip and a 30-minute meltdown.
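To make the schedule concrete, this sketch computes the upper bound of each wait for the numbers used in this lesson (200 ms base, 10 s cap). With full jitter, the actual wait is a random value below each bound. `backoffCeilings` and `fullJitter` are hypothetical helper names.

```javascript
// Sketch: the backoff ceilings for each retry, plus the full-jitter draw.
// `backoffCeilings` and `fullJitter` are hypothetical helpers for illustration.
function backoffCeilings(baseMs, maxMs, retries) {
  const ceilings = [];
  for (let attempt = 1; attempt <= retries; attempt++) {
    // Double on each attempt, but never exceed the cap
    ceilings.push(Math.min(baseMs * 2 ** (attempt - 1), maxMs));
  }
  return ceilings;
}

// `random` is injectable so the draw can be made deterministic in tests
function fullJitter(ceilingMs, random = Math.random) {
  return random() * ceilingMs;
}

console.log(backoffCeilings(200, 10000, 8).join(', '));
// prints: 200, 400, 800, 1600, 3200, 6400, 10000, 10000
```

The cap matters as much as the doubling: without it, the seventh retry would wait up to 12.8 seconds and the delays would keep growing.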
Common mistake: Retrying on all errors. A 400 Bad Request means your request was malformed — retrying it will always fail. Only retry on 5xx (server fault) and network-level failures like timeouts. Check the status code before retrying.
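One way to make that check explicit is a small classifier, sketched here as a hypothetical `isRetriable` helper. Treating 408 (Request Timeout) and 429 (Too Many Requests) as retriable is this sketch's own assumption, since those two codes describe timing or rate problems rather than a malformed request. Adjust the set to the API you are calling.

```javascript
// Sketch: decide whether a failed attempt is worth retrying.
// `isRetriable` is a hypothetical helper; the 408/429 exception is an assumption.
const RETRIABLE_4XX = new Set([408, 429]);

function isRetriable({ status, networkError = false }) {
  if (networkError) return true;                   // timeouts, dropped connections
  if (status >= 500 && status <= 599) return true; // server-side fault
  return RETRIABLE_4XX.has(status);                // other 4xx: fix the request instead
}

console.log(isRetriable({ status: 503 })); // true
console.log(isRetriable({ status: 400 })); // false
```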
🔌 Circuit Breakers — stop hammering a dead dependency
When a downstream service is failing, open the circuit so you stop sending it requests — fast-fail instead of slow-fail.
A circuit breaker has three states: Closed (normal — requests flow through), Open (dependency is down — fast-fail everything immediately), and Half-Open (testing recovery — let one request through to see if the service is back). Without a circuit breaker, every user request hangs waiting for a timeout, your thread pool fills up, and a failure in one service topples every service that depends on it.
class CircuitBreaker {
  constructor(fn, { failureThreshold = 3, recoveryTimeMs = 15000 } = {}) {
    this.fn = fn;
    this.failureThreshold = failureThreshold;
    this.recoveryTimeMs = recoveryTimeMs;
    this.failureCount = 0; // consecutive failures since the last success
    this.state = 'CLOSED'; // CLOSED | OPEN | HALF_OPEN
    this.openedAt = null;
  }
  async call(...args) {
    if (this.state === 'OPEN') {
      const elapsed = Date.now() - this.openedAt;
      if (elapsed < this.recoveryTimeMs) {
        throw new Error('Circuit OPEN: fast-failing without calling the dependency');
      }
      // Enough time has passed: probe with one request
      this.state = 'HALF_OPEN';
    } else if (this.state === 'HALF_OPEN') {
      // A probe is already in flight: keep fast-failing until it settles
      throw new Error('Circuit HALF_OPEN: probe in flight');
    }
    try {
      const result = await this.fn(...args);
      // Success: reset the breaker
      this.failureCount = 0;
      this.state = 'CLOSED';
      return result;
    } catch (err) {
      this.failureCount++;
      // A failed probe reopens immediately; otherwise open at the threshold
      if (this.state === 'HALF_OPEN' || this.failureCount >= this.failureThreshold) {
        this.state = 'OPEN';
        this.openedAt = Date.now();
        console.error(`Circuit breaker OPENED after ${this.failureCount} failures`);
      }
      throw err;
    }
  }
}
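// Usage
const breaker = new CircuitBreaker(
  async (userId) => {
    const r = await fetch(`https://recommendations-service/recs/${userId}`);
    // fetch() resolves even on HTTP error statuses, so throw on non-2xx;
    // otherwise the breaker would never see a failing service
    if (!r.ok) throw new Error(`Recommendations service returned ${r.status}`);
    return r.json();
  },
  { failureThreshold: 3, recoveryTimeMs: 15000 }
);
async function getRecommendations(userId) {
  try {
    return await breaker.call(userId);
  } catch {
    // Graceful degradation: return popular items instead of crashing the page
    return getDefaultRecommendations();
  }
}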
Mental model: Your house's circuit breaker trips when a wire overloads. It doesn't keep sending electricity hoping the problem resolves itself; it cuts power fast to prevent damage. Software circuit breakers do the same, with one difference: you reset a household breaker by hand once the fault is fixed, while a software breaker tests recovery on its own through the Half-Open probe.
Why it matters: Cascading failures — where one slow dependency makes every service that calls it slow, which makes every service calling those services slow — are how distributed systems fail catastrophically. Circuit breakers contain the blast radius.
Common mistake: Setting the failure threshold too high (so the breaker never trips) or the recovery window too short (so it opens and closes constantly, letting storms of failures through repeatedly). Tune these numbers with real production data.
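Before tuning those numbers against production data, it can help to trace how a breaker moves through its states during a simulated outage. The sketch below uses a compact breaker with an injectable clock so no real waiting is needed. `TinyBreaker`, `traceOutage`, and the `now` option are hypothetical names for this illustration, and the threshold and recovery values are the ones used earlier in this lesson.

```javascript
// Sketch: a compact breaker with an injectable clock, used only to trace states.
// `TinyBreaker`, `traceOutage`, and `now` are hypothetical names.
class TinyBreaker {
  constructor(fn, { failureThreshold, recoveryTimeMs, now = Date.now }) {
    Object.assign(this, { fn, failureThreshold, recoveryTimeMs, now });
    this.failures = 0;
    this.state = 'CLOSED';
    this.openedAt = 0;
  }
  async call() {
    if (this.state === 'OPEN') {
      if (this.now() - this.openedAt < this.recoveryTimeMs) throw new Error('fast-fail');
      this.state = 'HALF_OPEN'; // recovery window elapsed: allow a probe
    }
    try {
      const result = await this.fn();
      this.failures = 0;
      this.state = 'CLOSED';
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN';
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}

async function traceOutage() {
  let clock = 0;       // fake time in milliseconds
  let healthy = false; // the dependency starts out down
  const dependency = async () => {
    if (!healthy) throw new Error('down');
    return 'ok';
  };
  const breaker = new TinyBreaker(dependency, {
    failureThreshold: 3,
    recoveryTimeMs: 15000,
    now: () => clock,
  });
  const trace = [];
  const attempt = async () => {
    await breaker.call().catch(() => {}); // ignore the error, record the state
    trace.push(breaker.state);
  };
  await attempt(); await attempt(); await attempt(); // three real failures
  await attempt();                                   // fast-fail while open
  clock = 15000; await attempt();                    // probe fails, breaker reopens
  clock = 30000; healthy = true; await attempt();    // probe succeeds, breaker closes
  return trace;
}

traceOutage().then((trace) => console.log(trace.join(' -> ')));
// prints: CLOSED -> CLOSED -> OPEN -> OPEN -> OPEN -> CLOSED
```

Re-running the trace with a shorter `recoveryTimeMs` shows the flapping described above: the breaker probes sooner and reopens again on every failed probe.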
🌊 Graceful Degradation — serve less, not nothing
When a dependency fails, give users something useful rather than an error screen.
Graceful degradation is the user-visible expression of resilience. Instead of propagating an exception to the user's screen, you decide in advance: "If the recommendations service is down, show bestsellers. If the profile image CDN is down, show an avatar placeholder. If the search index is slow, return cached results from 5 minutes ago."
Request → Try live recommendations service
│
├─ SUCCESS → return fresh recommendations
│
└─ FAILURE (circuit open / timeout / 5xx)
│
├─ Try cache (Redis / CDN edge)
│ │
│ ├─ CACHE HIT → return stale-but-useful data
│ │ (log staleness, surface "results may be outdated")
│ │
│ └─ CACHE MISS → return safe default
│ (popular items, empty state, "try again later")
│
└─ NEVER propagate a raw 500 to the user
Log the failure, alert on-call, but give the UI something to render
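The ladder above can be sketched as a single function. This is a minimal version under stated assumptions: `fetchLive`, `cache`, `cacheKey`, and `defaults` are hypothetical inputs supplied by the caller, and a plain `Map` stands in for Redis or a CDN edge cache.

```javascript
// Sketch of the fallback ladder: live call, then cache, then a safe default.
// All parameter names are hypothetical; a Map stands in for a real cache.
async function withFallback({ fetchLive, cache, cacheKey, defaults, log = console.warn }) {
  try {
    const fresh = await fetchLive();
    // Remember the last good answer so a later failure has something to serve
    cache.set(cacheKey, { value: fresh, storedAt: Date.now() });
    return { data: fresh, source: 'live', stale: false };
  } catch (err) {
    log(`live call failed: ${err.message}`); // log and alert, but keep going
    const cached = cache.get(cacheKey);
    if (cached) {
      // Stale but useful; the age lets the UI say "results may be outdated"
      return { data: cached.value, source: 'cache', stale: true, ageMs: Date.now() - cached.storedAt };
    }
    // Nothing cached: give the UI a safe default to render
    return { data: defaults, source: 'default', stale: true };
  }
}
```

Returning `source` and `stale` alongside the data keeps the decision about what to tell the user in the UI layer, while the raw error never leaves this function.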
Mental model: A pilot whose primary radio fails doesn't land the plane — they switch to the backup radio, then the emergency frequency. Your system needs the same hierarchy of fallbacks designed before the failure, not improvised during an incident at 2 a.m.
Why it matters: Users tolerate slightly stale data far better than they tolerate blank screens. "Here are last week's results — we're having a small issue" keeps trust intact. An unhandled 500 destroys it.
Common mistake: Only designing the happy path and treating fallbacks as an afterthought. The fallback logic needs to be written, deployed, and tested (by intentionally breaking the dependency in staging) before you ever need it in production.
🌐 The Fallacies of Distributed Computing
Before you assume your system will behave like localhost, internalize the eight fallacies that bite every engineer who doesn't:
- The network is reliable. (It isn't — packets drop, cables get cut, DNS fails.)
- Latency is zero. (A call inside one datacenter typically takes around a millisecond or less; a cross-continent call can easily take 100ms or more, and each hop adds up.)
- Bandwidth is infinite. (Sending large payloads under load will surprise you.)
- The network is secure. (It isn't — design as if every call can be intercepted or replayed.)
- Topology doesn't change. (Services move, IPs change, new nodes join and leave.)
- There is one administrator. (Multiple teams own multiple services — no single person knows everything.)
- Transport cost is zero. (Serialization, TLS, and routing all have real CPU and time costs.)
- The network is homogeneous. (Clients use different SDKs, versions, and protocols.)
These were documented in the 1990s. Engineers still re-learn them the hard way every decade. You don't have to.
The one sentence to carry forward: Every remote call in a distributed system is an operation that might not complete, might complete more than once, and might return a result you can't fully trust. Design accordingly.
🛠️ Your mission
Pick one operation in your current project that has real-world consequences if it runs more than once — a payment, an email send, a database write, a webhook handler.
- Make it idempotent. Add an idempotency-key header (or a unique requestId field in the body). On your server, check the key against a table or in-memory store before executing. Return the cached result on duplicates.
- Add a timeout and backoff to one outbound call. Find any fetch() or third-party SDK call in your codebase that has no timeout. Wrap it with fetchWithRetry (or the pattern from this lesson) with a 5-second abort timeout and at least three attempts using exponential backoff with jitter.
- Design one fallback. For that same operation, write down (or implement) what your system should return if the external service is completely unavailable. Implement the fallback branch, even if it's just returning a default value with a log line.
✅ You're done when…
- Your critical operation appears in the Production-Readiness Checklist with idempotency, timeout, and fallback all checked off
- Calling your endpoint twice with the same idempotency key produces exactly one side effect (one charge, one email, one write) and returns the same response body both times
- Your retry wrapper applies exponential backoff with jitter and does not retry on 4xx client errors
- You can explain in plain English what state your circuit breaker would be in during a full downstream outage, and what users would see
- A teammate can read your fallback logic and understand, without asking you, what the system serves when the primary call fails
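The "same key, one side effect, same body" item lends itself to an automated check. The sketch below runs against a hypothetical in-memory `handlePayment` stand-in so it is self-contained; in your project, the two calls would go to your real endpoint and the side-effect count would come from your payment provider's test mode or your database.

```javascript
// Sketch: verify "same key, one side effect, same response body".
// `handlePayment`, `charges`, and `results` are stand-ins for a real endpoint,
// payment provider, and idempotency store.
const charges = [];
const results = new Map();

async function handlePayment(idempotencyKey, body) {
  if (results.has(idempotencyKey)) return results.get(idempotencyKey);
  charges.push({ amountCents: body.amountCents }); // the side effect
  const result = { chargeId: `ch_${charges.length}`, status: 'ok' };
  results.set(idempotencyKey, result);
  return result;
}

async function checkIdempotency() {
  const first = await handlePayment('key-abc', { amountCents: 500 });
  const second = await handlePayment('key-abc', { amountCents: 500 });
  return {
    sideEffects: charges.length,
    sameBody: JSON.stringify(first) === JSON.stringify(second),
  };
}

checkIdempotency().then((report) => console.log(report));
```

A passing run reports one side effect and `sameBody: true`; anything else means a duplicate slipped through.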
➡️ Next: the Capstone — Design a Complete System. Build It Right, Or Don't Build It At All. 🏛️