Background Jobs & Async Work
Stage 3 · Backend & APIs · B.U.I.L.D. letter: D
The moment you make a user wait six seconds for a confirmation email, you've broken their trust — and you didn't have to.
⚠️ The vibe trap
You wired up the "sign up" button and it works: the route creates the user, sends a welcome email via a third-party service, resizes their avatar, and finally returns 200 OK. During testing that felt fine. In production, under load, each of those extra steps adds hundreds of milliseconds, sometimes seconds, to every single request. The request stays open while it waits on network calls it can't control, and CPU-heavy steps like image resizing can block the event loop outright. Under moderate traffic, open requests pile up, unrelated requests start timing out, and your users see a blank spinner. The fix is not a faster computer. It is understanding that the HTTP response and the work triggered by that request are two completely separate things.
🧵 The queue + worker pattern
Every slow operation follows the same three-step shape:
- Enqueue: the web process drops a small job description into a queue and immediately returns 202 Accepted to the caller.
- Process: a separate worker process pulls jobs off the queue and does the real work, completely outside the HTTP cycle.
- Persist: the worker updates the database when it finishes (or fails), so the rest of the system can react.
Think of the HTTP request handler as a restaurant waiter. The waiter takes your order, hands it to the kitchen, and immediately walks back to greet the next table. The waiter never stands at the stove. The kitchen (your worker) does the cooking on its own schedule.
The simplest in-process illustration — using Node's EventEmitter as a stand-in queue — shows the shape clearly before you reach for Redis or a managed queue:
// queue.js: a minimal in-process job queue (dev/demo only)
import { EventEmitter } from 'events';

const emitter = new EventEmitter();

// Enqueue: caller drops a job and moves on immediately
export function enqueue(jobType, payload) {
  // setImmediate defers the emit to a later turn of the event loop,
  // so the calling function always returns before the job starts.
  setImmediate(() => emitter.emit(jobType, payload));
}

// Register a handler: your "worker" for this job type
export function onJob(jobType, handler) {
  emitter.on(jobType, async (payload) => {
    try {
      await handler(payload);
    } catch (err) {
      // EventEmitter ignores the promise a handler returns, so catch here;
      // otherwise a failed job becomes an unhandled rejection.
      console.error(`[${jobType}] job failed:`, err.message);
    }
  });
}

// routes/users.js: the web process
import express from 'express';
import { createUser } from '../db/users.js';
import { enqueue } from '../queue.js';

const router = express.Router();

router.post('/users', async (req, res) => {
  const user = await createUser(req.body);

  // Hand the slow work to the queue; do NOT await email sending here
  enqueue('send-welcome-email', { userId: user.id, email: user.email });

  // Return immediately. The email will send in the background.
  res.status(202).json({ id: user.id, status: 'created' });
});

export default router;

// workers/email.js: registered at startup, runs in the background
import { onJob } from '../queue.js';
import { sendEmail } from '../lib/mailer.js';

onJob('send-welcome-email', async ({ userId, email }) => {
  await sendEmail({
    to: email,
    subject: 'Welcome to the platform',
    template: 'welcome',
    data: { userId },
  });
  console.log(`Welcome email sent to ${email}`);
});
Why this matters: The HTTP route returns as soon as the user row is written, without waiting on the email provider. The email sends whenever the event loop is free. Your users get an instant response and still receive their email.
Common mistake: await-ing the slow operation inside the route handler. If you write await sendEmail(...) directly in the route, you have not moved anything to the background; the response still cannot be sent until the email provider answers.
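The Persist step from the three-step shape is the one the EventEmitter demo leaves out. Here is a minimal sketch of it, assuming an in-memory Map as a stand-in for a jobs table. The names jobs, enqueueTracked, and getJobStatus are hypothetical and not part of the lesson's queue.js.

```javascript
// Sketch: tracking job state so the rest of the system can see the outcome.
// ASSUMPTION: `jobs` is an in-memory Map standing in for a database table.
const jobs = new Map();
let nextJobId = 1;

// Enqueue: record the job as "queued", schedule it, and return its id at once.
function enqueueTracked(handler, payload) {
  const id = nextJobId++;
  jobs.set(id, { status: 'queued', error: null });

  setImmediate(async () => {
    jobs.get(id).status = 'running';
    try {
      await handler(payload);
      jobs.get(id).status = 'done'; // Persist: success is recorded
    } catch (err) {
      jobs.get(id).status = 'failed'; // Persist: so is failure
      jobs.get(id).error = err.message;
    }
  });

  return id;
}

// Anything else in the system (a status endpoint, a dashboard) can read this.
function getJobStatus(id) {
  const job = jobs.get(id);
  return job ? job.status : 'unknown';
}
```

One way to use this: return the job id alongside the 202 so a client can poll a status endpoint later. In a real system the Map would likely be a table, so the state survives a restart.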
🔁 Idempotent jobs and safe retries
Networks fail. Processes crash. Your worker will sometimes die mid-job. When it restarts and picks up the same job again, running it a second time must produce the same result as running it once. That property is called idempotency.
Design every job handler so it is safe to run multiple times. The pattern: check if the work has already been done before doing it.
// workers/resizeAvatar.js
import { onJob } from '../queue.js';
import { getUserById, markAvatarResized } from '../db/users.js';
import { resizeImage } from '../lib/images.js';

onJob('resize-avatar', async ({ userId, originalUrl }) => {
  const user = await getUserById(userId);

  // IDEMPOTENCY CHECK: if already done, skip silently and succeed
  if (user.avatarResized) {
    console.log(`Avatar for user ${userId} already resized. Skipping.`);
    return;
  }

  const resizedUrl = await resizeImage(originalUrl, { width: 256, height: 256 });
  await markAvatarResized(userId, resizedUrl);
  console.log(`Avatar resized for user ${userId}`);
});
Why this matters: Without the idempotency check, a crash-and-retry doubles (or triples) the work — duplicate emails sent, duplicate rows inserted, duplicate charges made. Idempotency is what makes retries safe instead of dangerous.
Common mistake: Using a job ID as the idempotency key but forgetting to persist it. If the job completes but the DB write fails, the next retry sees no record of completion and runs the whole thing again. When the side effect lives in your own database, always commit the completion marker in the same transaction as that side effect.
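The job-ID-as-key idea can be sketched as below. This is an illustration under stated assumptions, not the lesson's implementation: completed is an in-memory Map standing in for a table with a unique constraint on the job ID, and runOnce is a hypothetical helper name.

```javascript
// Sketch: an idempotency key persisted alongside the job's result.
// ASSUMPTION: `completed` is an in-memory Map standing in for a table with a
// unique constraint on job_id; in a real database the result and the marker
// would be written in one transaction.
const completed = new Map();

async function runOnce(jobId, work) {
  // Already recorded as done: return the stored result, do not repeat the work.
  if (completed.has(jobId)) {
    return { skipped: true, result: completed.get(jobId) };
  }
  const result = await work();
  completed.set(jobId, result); // completion marker and result stored together
  return { skipped: false, result };
}
```

Two runs that overlap in time could both pass the has() check in this sketch. A unique constraint in the database is the usual way to close that gap, because the second insert fails instead of silently repeating the work.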
⏳ Retries with exponential backoff
Transient failures — a third-party API returning 503, a brief network blip — should trigger an automatic retry. But retrying instantly in a tight loop hammers the already-struggling service. The solution is exponential backoff: wait longer before each successive attempt.
// lib/retryJob.js
// A generic wrapper that retries a job handler with exponential backoff.
const MAX_ATTEMPTS = 5;
const BASE_DELAY_MS = 500; // waits between the 5 attempts: 500ms, 1s, 2s, 4s

export async function withRetry(jobName, payload, handler) {
  let attempt = 0;

  while (attempt < MAX_ATTEMPTS) {
    try {
      await handler(payload);
      return; // success: exit the loop
    } catch (err) {
      attempt++;

      if (attempt >= MAX_ATTEMPTS) {
        // All retries exhausted: route to dead-letter handling
        await sendToDeadLetter(jobName, payload, err);
        return;
      }

      const delayMs = BASE_DELAY_MS * Math.pow(2, attempt - 1);
      console.warn(
        `[${jobName}] attempt ${attempt} failed: ${err.message}. ` +
          `Retrying in ${delayMs}ms…`
      );
      await sleep(delayMs);
    }
  }
}

function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Placeholder: in production this writes to a dead_letter_jobs table
async function sendToDeadLetter(jobName, payload, err) {
  console.error(`[DEAD LETTER] ${jobName} permanently failed:`, {
    payload,
    error: err.message,
  });
  // await db.query(
  //   'INSERT INTO dead_letter_jobs (job_name, payload, error_message) VALUES ($1, $2, $3)',
  //   [jobName, JSON.stringify(payload), err.message]
  // );
}

// workers/emailWithRetry.js: using the retry wrapper
import { onJob } from '../queue.js';
import { withRetry } from '../lib/retryJob.js';
import { sendEmail } from '../lib/mailer.js';

onJob('send-welcome-email', (payload) =>
  withRetry('send-welcome-email', payload, async ({ userId, email }) => {
    await sendEmail({ to: email, subject: 'Welcome!', template: 'welcome', data: { userId } });
  })
);
Why this matters: A flat retry loop (while (failed) retry()) can amplify a third-party outage into a thundering herd that makes the outage worse. Exponential backoff spreads the load and gives the upstream service time to recover.
Common mistake: Not adding jitter. If hundreds of jobs all failed at the same moment, they all retry at exactly 500ms, 1000ms, 2000ms… together. Add a small random jitter (delayMs + Math.random() * 200) to spread the spike.
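The jitter tweak is small enough to sketch as one function. The helper name backoffDelay and its option names are hypothetical, and the random source is passed in only so the result can be checked deterministically.

```javascript
// Sketch: exponential backoff with additive jitter.
// ASSUMPTION: `random` is injectable for testing; it defaults to Math.random,
// which returns a value in [0, 1).
function backoffDelay(attempt, { baseMs = 500, jitterMs = 200, random = Math.random } = {}) {
  // attempt is 1-based: 1 -> base, 2 -> 2x base, 3 -> 4x base, ...
  const exponential = baseMs * Math.pow(2, attempt - 1);
  return exponential + random() * jitterMs;
}

// The un-jittered waits between five attempts: 500, 1000, 2000, 4000 (ms).
const schedule = [1, 2, 3, 4].map((n) => backoffDelay(n, { jitterMs: 0 }));
console.log(schedule);
```

With the defaults, the first retry lands somewhere between 500 ms and 700 ms instead of at exactly 500 ms, so a batch of jobs that failed together no longer retries in lockstep.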
💀 Dead-letter handling: accepting permanent failure
Some jobs will never succeed — an email address is permanently invalid, an external API will never accept a particular payload, the data itself is corrupted. Retrying these forever wastes resources and hides real bugs. The pattern is a dead-letter queue (DLQ): a holding area for jobs that have exhausted all retries.
// db/deadLetter.js: writing failed jobs to Postgres for human review
import { query } from './index.js';

export async function recordDeadLetter({ jobName, payload, error, attempts }) {
  await query(
    `INSERT INTO dead_letter_jobs (job_name, payload, error_message, attempts, failed_at)
     VALUES ($1, $2, $3, $4, NOW())`,
    [jobName, JSON.stringify(payload), error.message, attempts]
  );
}
The matching migration creates the table:
-- supabase/migrations/XXX_dead_letter_jobs.sql
CREATE TABLE IF NOT EXISTS dead_letter_jobs (
  id            BIGSERIAL PRIMARY KEY,
  job_name      TEXT NOT NULL,
  payload       JSONB NOT NULL,
  error_message TEXT,
  attempts      INT DEFAULT 0,
  failed_at     TIMESTAMPTZ DEFAULT NOW(),
  resolved_at   TIMESTAMPTZ
);
An admin dashboard query surfaces what needs human attention:
// admin: fetch unresolved dead-letter jobs
const { rows } = await query(
  `SELECT * FROM dead_letter_jobs WHERE resolved_at IS NULL ORDER BY failed_at DESC`
);
Why this matters: Without a DLQ, permanently-failing jobs either loop forever (consuming resources) or disappear silently (you never know). A DLQ makes failure visible, auditable, and actionable.
Common mistake: Treating the DLQ as a graveyard you never look at. Wire up a simple alert — even a daily cron that emails you if dead_letter_jobs has unresolved rows older than 24 hours — so failures surface before they become user complaints.
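The "alert on stale rows" idea can be sketched as a pure function that a daily cron could call. This assumes rows shaped like the dead_letter_jobs table, with failed_at and resolved_at already parsed into Date objects. The function names are hypothetical, and actually sending the message is left out.

```javascript
// Sketch: decide whether the dead-letter table needs a human's attention.
// ASSUMPTION: `rows` has the shape of dead_letter_jobs rows, with failed_at
// as a Date and resolved_at as a Date or null.
const DAY_MS = 24 * 60 * 60 * 1000;

function findStaleDeadLetters(rows, now = new Date(), maxAgeMs = DAY_MS) {
  return rows.filter(
    (row) => row.resolved_at === null && now - row.failed_at > maxAgeMs
  );
}

// Returns a one-line alert message, or null when there is nothing to report.
function buildAlert(rows, now = new Date()) {
  const stale = findStaleDeadLetters(rows, now);
  if (stale.length === 0) return null;
  const names = [...new Set(stale.map((row) => row.job_name))].join(', ');
  return `${stale.length} dead-letter job(s) unresolved for over 24h: ${names}`;
}
```

Keeping the decision separate from the delivery means you can test the 24-hour rule with fixed dates, then hand the resulting string to whatever mailer or chat hook you already use.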
🕐 Scheduled and cron jobs
Not all background work is triggered by a user request. Some jobs run on a schedule: send a weekly digest every Monday, purge expired sessions nightly, regenerate a sitemap every hour. These are cron jobs — named after the Unix cron daemon.
// cron/scheduler.js: using the 'node-cron' package
import cron from 'node-cron';
import { purgeExpiredSessions } from '../workers/sessions.js';
import { sendWeeklyDigests } from '../workers/digests.js';

// node-cron uses the server's local time unless a timezone is passed,
// so pin the schedules to UTC explicitly.
const options = { timezone: 'Etc/UTC' };

// Runs every night at 2:00 AM UTC
cron.schedule('0 2 * * *', async () => {
  console.log('[cron] Purging expired sessions…');
  try {
    await purgeExpiredSessions();
  } catch (err) {
    console.error('[cron] Session purge failed:', err.message);
  }
}, options);

// Runs every Monday at 8:00 AM UTC (cron: minute hour day month weekday)
cron.schedule('0 8 * * 1', async () => {
  console.log('[cron] Sending weekly digests…');
  try {
    await sendWeeklyDigests();
  } catch (err) {
    console.error('[cron] Digest send failed:', err.message);
  }
}, options);

// workers/sessions.js: the scheduled job itself
import { query } from '../db/index.js';

export async function purgeExpiredSessions() {
  const { rowCount } = await query(
    `DELETE FROM sessions WHERE expires_at < NOW()`
  );
  console.log(`Purged ${rowCount} expired sessions.`);
}
Mental model: A cron job is just a background job that enqueues itself on a timer instead of being enqueued by a route.
Why this matters: Without a scheduled purge, tables like sessions, password_reset_tokens, and rate_limit_windows grow without bound. Scheduled maintenance keeps your DB lean and your app fast.
Common mistake: Running cron jobs inside the same process as your web server when you have multiple web server instances (e.g., Heroku dynos, Fly.io machines). Every instance will fire the cron, doubling or tripling execution. Use a dedicated worker dyno/process for cron, or add a distributed lock.
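One possible shape for the distributed lock is a Postgres advisory lock. The sketch below is an assumption-heavy illustration rather than a drop-in: runExclusive is a hypothetical helper, client is assumed to be a single dedicated connection exposing query(text, params) that resolves to { rows } (the shape node-postgres uses), and lockKey is any integer you reserve for the task.

```javascript
// Sketch: let only one instance run a scheduled task at a time, using a
// Postgres session-level advisory lock.
// ASSUMPTIONS: `client` is one dedicated connection with
// query(text, params) -> { rows }; `lockKey` is an integer reserved for
// this task.
async function runExclusive(client, lockKey, task) {
  const { rows } = await client.query(
    'SELECT pg_try_advisory_lock($1) AS locked',
    [lockKey]
  );
  if (!rows[0].locked) {
    return false; // another instance holds the lock; skip this run
  }
  try {
    await task();
    return true;
  } finally {
    // Session-level locks must be released on the same connection.
    await client.query('SELECT pg_advisory_unlock($1)', [lockKey]);
  }
}
```

Note the limit of this approach: it only prevents overlapping runs. If one instance finishes before another fires, the task can still run twice, so a very fast job may also need a "last run at" record. A dedicated cron process avoids the problem entirely.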
🛠️ Your mission
Find one slow operation in your current project — sending email, calling a third-party API, resizing an image, generating a PDF — and move it to a background job.
- Create a queue.js file (or wire up BullMQ / Supabase pg_notify if you already have Redis or Postgres) following the enqueue/worker shape above.
- Refactor the slow route to call enqueue(...) instead of await slowThing(...), and change the response to 202 Accepted.
- Create a worker file that handles the job, includes an idempotency check, and wraps execution in withRetry.
- Write a dead-letter record to your database if all retries fail.
- If you have any recurring maintenance (token expiry, digest emails), add one cron entry.
✅ You're done when…
- The slow route responds in under 100 ms (verify with curl -w "\nTotal: %{time_total}s\n") and is no longer blocked by the expensive operation.
- The job handler is idempotent: running it twice on the same payload produces the same outcome as running it once (test by calling the handler manually with a payload that already has a completion record).
- Permanent failures write a row to dead_letter_jobs (or equivalent) rather than disappearing silently; confirm by triggering a handler that always throws and checking the table.
- At least one scheduled job is registered and its cron expression is documented with the timezone it targets (Production-Readiness Checklist item: all background processes are observable and their schedules are documented).
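The "handler that always throws" check is slow to run against real backoff delays. One way to make it instant is a test-friendly variant of the retry loop with the sleep and the dead-letter sink passed in. This is a sketch, not the lesson's withRetry: runWithRetry and checkPermanentFailure are hypothetical names, and an array stands in for the dead_letter_jobs table.

```javascript
// Sketch: a retry loop with injectable sleep and dead-letter sink, so the
// permanent-failure check finishes instantly.
// ASSUMPTION: a test-friendly variant, not the lesson's withRetry.
async function runWithRetry(handler, payload, { maxAttempts = 5, baseMs = 500, sleep, deadLetter }) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await handler(payload);
      return { ok: true, attempts: attempt };
    } catch (err) {
      if (attempt === maxAttempts) {
        // Retries exhausted: record the failure instead of dropping it.
        await deadLetter({ payload, error: err.message, attempts: attempt });
        return { ok: false, attempts: attempt };
      }
      await sleep(baseMs * Math.pow(2, attempt - 1));
    }
  }
}

// The check: a handler that always throws must end up in the dead-letter sink.
async function checkPermanentFailure() {
  const deadLetters = []; // stand-in for the dead_letter_jobs table
  const waits = [];
  const result = await runWithRetry(
    async () => { throw new Error('always fails'); },
    { userId: 1 },
    {
      sleep: async (ms) => { waits.push(ms); }, // record the delay, do not wait
      deadLetter: async (row) => { deadLetters.push(row); },
    }
  );
  return { result, deadLetters, waits };
}
```

Calling checkPermanentFailure() should yield one dead-letter entry after five attempts. Swapping the array for a real insert gives you the "check the table" version of the same test.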
➡️ Next: Webhooks & Integrations. Build It Right, Or Don't Build It At All. 🏛️