API Rate Limiting Strategies: 3 Implementations, 1 Clear Winner

You almost certainly have app.use(rateLimit({...})) somewhere in your Express app. It works. You moved on with your life. But you never actually picked an algorithm — you got whatever express-rate-limit hands you by default, which is a fixed window in memory.

Three algorithms matter for HTTP APIs: fixed window, sliding window log, and token bucket. The wrong one will quietly let bursts through, throttle real users for behavior they thought was fine, or both. The right one is a 30-line middleware. So which is yours, and what does the code actually look like?

Which Rate Limiting Algorithm Should I Use?

Use sliding window log for most APIs — it’s fair, accurate, and handles bursts gracefully. Use token bucket for high-throughput APIs that need controlled burst tolerance (above ~1000 RPS per key, or workloads where short bursts are normal). Use fixed window only for internal services where burst-at-boundary issues don’t matter.

That’s the verdict. The rest of this article shows what each one looks like as Express middleware backed by Redis, and exactly why the default you have right now is leaking in two different ways.

The Default You Already Have (And Why It Bursts)

Here’s what most Express apps ship with:

import rateLimit from 'express-rate-limit';

app.use(rateLimit({
  windowMs: 60_000, // 1 minute
  max: 100,
}));

That uses a fixed window with in-memory storage. Both halves are problems.

The boundary burst. A fixed window resets cleanly at 12:00:00. So a user can fire 100 requests at 11:59:59 and 100 more at 12:00:00 — 200 requests in two seconds, against a stated limit of 100 per minute. The math is technically honest. The user-facing behavior is not.
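The boundary burst is easy to reproduce without Redis or Express. The sketch below is an in-memory stand-in for a fixed-window counter (the `createFixedWindow` helper and the timestamps are illustrative, not part of express-rate-limit) that replays the two bursts described above.

```javascript
// In-memory fixed-window counter, for illustration only.
// createFixedWindow is a hypothetical helper, not a library API.
function createFixedWindow({ limit, windowMs }) {
  const counts = new Map(); // window index -> request count
  return function allow(nowMs) {
    const windowIndex = Math.floor(nowMs / windowMs);
    const count = (counts.get(windowIndex) ?? 0) + 1;
    counts.set(windowIndex, count);
    return count <= limit;
  };
}

const allow = createFixedWindow({ limit: 100, windowMs: 60_000 });
const boundary = 60_000 * 10; // any timestamp that sits exactly on a window boundary

let allowed = 0;
for (let i = 0; i < 100; i++) if (allow(boundary - 1_000)) allowed++; // "11:59:59"
for (let i = 0; i < 100; i++) if (allow(boundary)) allowed++;         // "12:00:00"

console.log(allowed); // 200: every request passes, twice the stated per-minute limit
```

The two bursts land in different window indexes, so neither counter ever exceeds 100.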

The multi-instance lie. In-memory counters live in a single Node process. The moment you scale to two instances behind a load balancer, each one counts independently. Four instances? Your “100/min” is now up to 400/min, and the exact number a client gets depends on how the load balancer spreads its requests. If you’ve ever stared at production metrics that don’t match your stated limit and assumed users were cheating, this is usually why.
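A rough simulation of that arithmetic, assuming four instances and a round-robin load balancer (both are assumptions for the sake of the example; real balancers distribute less evenly):

```javascript
// Each "instance" keeps its own private counter, like the default in-memory store.
// createInstanceCounter is a hypothetical stand-in for one Node process.
function createInstanceCounter(limit) {
  let count = 0;
  return () => ++count <= limit;
}

const LIMIT = 100;
const instances = Array.from({ length: 4 }, () => createInstanceCounter(LIMIT));

// One client sends 1,000 requests inside a single window; the assumed
// round-robin balancer hands them to the four instances in turn.
let allowedTotal = 0;
for (let i = 0; i < 1_000; i++) {
  const instance = instances[i % instances.length];
  if (instance()) allowedTotal++;
}

console.log(allowedTotal); // 400: each instance admits its own 100
```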

Both problems have the same fix: move the counter to Redis, and pick an algorithm on purpose. Here’s the simplest version.

Strategy 1: Fixed Window with Redis (When It’s Actually Fine)

Fixed window in Redis is two commands: increment a key for the current window, set a TTL on first hit. If the count exceeds your limit, reject.

import Redis from 'ioredis';
const redis = new Redis();

export function fixedWindow({ limit, windowMs }) {
  return async (req, res, next) => {
    const window = Math.floor(Date.now() / windowMs);
    const key = `ratelimit:fixed:${req.ip}:${window}`;

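    // INCR creates the key at 1 on the first hit of this window.
    const count = await redis.incr(key);
    if (count === 1) {
      // Set the TTL once so old window keys clean themselves up.
      // INCR and PEXPIRE are separate calls: if the process dies between them
      // the key keeps no TTL, so wrap both in MULTI or Lua if that matters to you.
      await redis.pexpire(key, windowMs);
    }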

    const remaining = Math.max(0, limit - count);
    res.setHeader('X-RateLimit-Limit', limit);
    res.setHeader('X-RateLimit-Remaining', remaining);

    if (count > limit) {
      return res.status(429).json({ error: 'rate_limited' });
    }
    next();
  };
}

This is a real fix for the multi-instance problem. All your Node processes increment the same Redis key, so 100/min means 100/min regardless of how many instances you run.

It does not fix the boundary burst. That’s by design — the window resets cleanly every windowMs, so a user who paces themselves to the boundary still gets a 2x effective limit.

When that’s fine: internal admin APIs, background job triggers, anything where “occasionally allows double the limit at minute boundaries” is acceptable because nobody’s gaming it. When it’s not: anything user-facing, public endpoints, paying customers. They’ll notice 429s for behavior they thought was within the rules, and they’ll be right.

For everything in that second bucket, you want a true rolling window. That’s about 10 more lines.

Strategy 2: Sliding Window Log (The Right Default)

The idea: store a timestamp per request in a Redis sorted set keyed by user or IP. On each request, remove timestamps older than now - windowMs, count what’s left, then add the new one. The window is always exactly the last N seconds — no resets, no boundary burst.

export function slidingWindow({ limit, windowMs }) {
  return async (req, res, next) => {
    const now = Date.now();
    const key = `ratelimit:sliding:${req.ip}`;
    const cutoff = now - windowMs;
    // Unique member, generated once so the same value can be removed on rejection.
    const member = `${now}-${Math.random()}`;

    // Purge, count, add, expire in one MULTI/EXEC round trip.
    const results = await redis.multi()
      .zremrangebyscore(key, 0, cutoff)
      .zcard(key)
      .zadd(key, now, member)
      .pexpire(key, windowMs)
      .exec();

    const count = results[1][1]; // ZCARD reply: requests already in the window
    const remaining = Math.max(0, limit - count - 1);

    res.setHeader('X-RateLimit-Limit', limit);
    res.setHeader('X-RateLimit-Remaining', remaining);

    if (count >= limit) {
      await redis.zrem(key, member); // undo the add for the rejected request
      return res.status(429).json({ error: 'rate_limited' });
    }
    next();
  };
}

The whole sequence — purge, count, add, expire — runs as a single pipelined round trip. One network hop per request. The unique member suffix (${now}-${Math.random()}) keeps two requests in the same millisecond from collapsing into one sorted-set entry.

Why this is fair: it’s a true rolling window. A user who sends 100 requests in a one-second burst gets throttled for the next 59 seconds, exactly as the limit promised. No clean resets. No 2x burst at the boundary. The accounting matches what you told them.
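The fairness claim can be checked with an in-memory version of the same purge, count, add steps. This is a sketch for one key with an injected clock; `createSlidingWindowLog` is a hypothetical helper, and unlike the Redis middleware it simply skips logging a rejected request instead of adding and removing it.

```javascript
// In-memory sliding window log for a single key, for illustration only.
function createSlidingWindowLog({ limit, windowMs }) {
  const timestamps = []; // request times, oldest first
  return function allow(nowMs) {
    const cutoff = nowMs - windowMs;
    // Purge: drop entries at or before the cutoff (same bound as ZREMRANGEBYSCORE 0 cutoff).
    while (timestamps.length > 0 && timestamps[0] <= cutoff) timestamps.shift();
    if (timestamps.length >= limit) return false; // count
    timestamps.push(nowMs);                       // add
    return true;
  };
}

const allowSliding = createSlidingWindowLog({ limit: 100, windowMs: 60_000 });
const minuteBoundary = 60_000 * 10;

let passed = 0;
for (let i = 0; i < 100; i++) if (allowSliding(minuteBoundary - 1_000)) passed++;
for (let i = 0; i < 100; i++) if (allowSliding(minuteBoundary)) passed++;
console.log(passed); // 100: the second burst is rejected, boundary or not

// Capacity only returns once the first burst is a full window old.
console.log(allowSliding(minuteBoundary + 58_999)); // false
console.log(allowSliding(minuteBoundary + 59_000)); // true
```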

The honest cost: one sorted-set entry per request per key, so memory grows linearly with the number of requests each key makes inside the window. Fine for SaaS dashboards, mobile app backends, public REST APIs under ~1000 RPS per key. It becomes painful well above that, and that’s the point at which you switch.

This is the right default for almost everyone reading. If you only change one thing after closing this tab, change to sliding window log keyed by user ID for authenticated traffic and IP for anonymous traffic. You’re done.

But if your API is past the throughput line, sliding window log will start costing you Redis memory you don’t want to spend. Time for the third one.

Strategy 3: Token Bucket (For Burst-Tolerant, High-Throughput APIs)

Each key gets a bucket with capacity tokens. It refills at refillRate tokens per second. Each request consumes one. If the bucket has tokens, allow. If not, reject. The bucket allows controlled bursts (drain to zero quickly) followed by paced sustained traffic (refill rate sets the long-run limit).
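Before the Redis version, here is the same refill-then-consume logic as plain JavaScript with an injected clock. It is a single-process sketch for one key (the `createTokenBucket` name is illustrative), useful mainly for seeing how `capacity` and `refillRate` interact.

```javascript
// In-memory token bucket for a single key, for illustration only.
function createTokenBucket({ capacity, refillRate }) {
  let tokens = capacity; // a new bucket starts full
  let lastMs = null;     // time of the previous call
  return function allow(nowMs) {
    if (lastMs === null) lastMs = nowMs;
    const elapsedSeconds = Math.max(0, nowMs - lastMs) / 1000;
    tokens = Math.min(capacity, tokens + elapsedSeconds * refillRate); // refill, capped
    lastMs = nowMs;
    if (tokens < 1) return false; // not enough for one request
    tokens -= 1;                  // consume
    return true;
  };
}

const allowBucket = createTokenBucket({ capacity: 10, refillRate: 2 }); // 2 tokens per second

let burst = 0;
for (let i = 0; i < 15; i++) if (allowBucket(0)) burst++;
console.log(burst); // 10: the burst drains the bucket, the other 5 are rejected

console.log(allowBucket(500));   // true: 0.5 s refilled exactly one token
console.log(allowBucket(500));   // false: empty again
console.log(allowBucket(3_000)); // true: 2.5 s later, 5 tokens are back
```

`capacity` bounds the size of a burst; `refillRate` sets the sustained rate once the bucket is empty.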

The trick is atomicity. You need to read the current token count, calculate how many tokens have refilled since last access, decrement by one, and write back — without another request sneaking in between. Doing this in separate Redis commands is a race condition waiting to surface at 3 AM. Lua script.

-- token_bucket.lua
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refillRate = tonumber(ARGV[2]) -- tokens per second
local now = tonumber(ARGV[3])        -- ms

local bucket = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(bucket[1]) or capacity
local ts = tonumber(bucket[2]) or now

local delta = math.max(0, now - ts) / 1000
tokens = math.min(capacity, tokens + delta * refillRate)

local allowed = tokens >= 1
if allowed then tokens = tokens - 1 end

redis.call('HSET', key, 'tokens', tokens, 'ts', now) -- HMSET is deprecated since Redis 4.0
redis.call('PEXPIRE', key, math.ceil(capacity / refillRate * 1000))
return { allowed and 1 or 0, math.floor(tokens) }

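The middleware reads the script from disk once at startup and sends it to Redis with EVAL on each request. (ioredis also offers defineCommand, which registers the script and calls it by hash, if you’d rather not ship the source every time.)

import fs from 'node:fs';

const script = fs.readFileSync('./token_bucket.lua', 'utf8');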

export function tokenBucket({ capacity, refillRate }) {
  return async (req, res, next) => {
    const key = `ratelimit:bucket:${req.ip}`;
    const [allowed, remaining] = await redis.eval(
      script, 1, key, capacity, refillRate, Date.now()
    );
    res.setHeader('X-RateLimit-Limit', capacity);
    res.setHeader('X-RateLimit-Remaining', remaining);
    if (!allowed) return res.status(429).json({ error: 'rate_limited' });
    next();
  };
}

When to actually reach for this: APIs above ~1000 RPS per key, APIs where short bursts are normal and desirable (a user uploading 50 photos in a batch, a payment processor flushing a queue), ML inference endpoints where smoothness matters more than strict per-minute caps. Memory cost is O(1) per key — two fields, regardless of request volume. That’s the real reason it scales.

When it’s overkill: every other API. Sliding window log is easier to explain to your team, easier to debug from the Redis CLI, and easier to reason about when the on-call alert fires. Don’t pick token bucket because it sounds sophisticated. Pick it because you’ve measured something sliding window log can’t handle.

You’ve picked an algorithm. The middleware works. There’s one piece left that determines whether your API is actually pleasant to integrate with — and almost every rate-limiting tutorial skips it.

The Response Pattern Nobody Gets Right

Returning HTTP 429 is the easy part. Returning it correctly is what separates a rate limiter from a production-ready rate limiter.

Status code: 429 Too Many Requests. Not 503. Not 403. 429 is the status defined for exactly this case (RFC 6585), and many HTTP clients and retry libraries already back off on it and honor Retry-After. Use 503 and you’re telling the world your service is down.

Headers to send on every rate-limited response:

HTTP/1.1 429 Too Many Requests
Retry-After: 42
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1717689600

Retry-After is seconds until the next allowed request. X-RateLimit-Reset is the Unix timestamp of the next window reset (or, for token bucket, when the next token will be available). Set both. Clients can use whichever is cleaner for their backoff logic.
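How those two values are derived depends on the algorithm. The helpers below are one possible way to compute them, not something the middleware above already does. The function names are hypothetical, and for the sliding window the oldest timestamp is assumed to come from something like ZRANGE key 0 0 WITHSCORES.

```javascript
// Convert an absolute reset time (ms) into the two header values.
function describeReset(nowMs, resetMs) {
  return {
    resetEpochSeconds: Math.ceil(resetMs / 1000),                       // X-RateLimit-Reset
    retryAfterSeconds: Math.max(0, Math.ceil((resetMs - nowMs) / 1000)), // Retry-After
  };
}

// Fixed window: the counter resets at the next multiple of windowMs.
function fixedWindowReset(nowMs, windowMs) {
  return describeReset(nowMs, (Math.floor(nowMs / windowMs) + 1) * windowMs);
}

// Sliding window log: a slot frees up when the oldest logged request ages out.
function slidingWindowReset(nowMs, windowMs, oldestTimestampMs) {
  return describeReset(nowMs, oldestTimestampMs + windowMs);
}

// Token bucket: time until the bucket holds one whole token again.
function tokenBucketReset(nowMs, tokens, refillRate) {
  const waitMs = tokens >= 1 ? 0 : ((1 - tokens) / refillRate) * 1000;
  return describeReset(nowMs, nowMs + waitMs);
}

console.log(fixedWindowReset(18_000, 60_000).retryAfterSeconds);            // 42
console.log(slidingWindowReset(100_000, 60_000, 82_000).retryAfterSeconds); // 42
console.log(tokenBucketReset(10_000, 0.5, 2).retryAfterSeconds);            // 1
```

Rounding up is deliberate: a client that waits the advertised number of seconds should not arrive a few milliseconds early and get another 429.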

Set the same X-RateLimit-* headers on successful responses too. A well-behaved client paces itself when it sees Remaining: 3 instead of waiting for the 429. That single change can cut your 429 rate noticeably without changing your limits.
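On the client side, pacing could look something like the sketch below. The helper name and the policy (spread the remaining budget evenly until the reset) are assumptions chosen for illustration; real clients may prefer a simpler threshold.

```javascript
// Hypothetical client-side helper: how long to wait before the next request,
// given values parsed from X-RateLimit-Remaining and X-RateLimit-Reset.
function delayBeforeNextRequest({ remaining, resetEpochSeconds }, nowMs) {
  const msUntilReset = Math.max(0, resetEpochSeconds * 1000 - nowMs);
  if (remaining <= 0) return msUntilReset;     // out of budget: wait for the reset
  return Math.floor(msUntilReset / remaining); // otherwise spread what is left evenly
}

// 3 requests left and 30 s until the reset: wait about 10 s between calls.
console.log(delayBeforeNextRequest({ remaining: 3, resetEpochSeconds: 130 }, 100_000)); // 10000
```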

A small JSON body for humans:

{ "error": "rate_limited", "retryAfter": 42, "message": "Try again in 42 seconds." }

Useful for debugging from a browser. Costs you 80 bytes.
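Putting the status, the Retry-After header, and the body together might look like this. `sendRateLimited` is a hypothetical helper; it relies only on the chainable `status()`, `set()`, and `json()` methods of an Express response.

```javascript
// Sends the 429 with Retry-After and a small JSON body.
// `res` is assumed to be an Express response object.
function sendRateLimited(res, retryAfterSeconds) {
  const unit = retryAfterSeconds === 1 ? 'second' : 'seconds';
  return res
    .status(429)
    .set('Retry-After', String(retryAfterSeconds))
    .json({
      error: 'rate_limited',
      retryAfter: retryAfterSeconds,
      message: `Try again in ${retryAfterSeconds} ${unit}.`,
    });
}
```

In any of the middlewares above, this would replace the bare `res.status(429).json({ error: 'rate_limited' })` line once a retry delay has been computed.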

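The IETF draft headers. An IETF draft (draft-ietf-httpapi-ratelimit-headers) standardizes unprefixed names. Earlier revisions define RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset, where RateLimit-Reset is the number of seconds until the reset rather than a Unix timestamp; later revisions fold these into RateLimit and RateLimit-Policy fields. It is still a draft, so check the current revision before you commit to a shape. Sending the unprefixed headers alongside the X- ones during the transition is cheap and keeps existing clients working.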
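A sketch of emitting both families from one place, assuming the three-header form of the draft, in which RateLimit-Reset carries seconds remaining rather than a Unix timestamp. `rateLimitHeaders` is a hypothetical helper and the draft may have changed since; treat the unprefixed names as an assumption to verify.

```javascript
// Builds both header families from one set of values.
function rateLimitHeaders({ limit, remaining, resetEpochSeconds }, nowMs) {
  const secondsUntilReset = Math.max(0, Math.ceil(resetEpochSeconds - nowMs / 1000));
  return {
    // De facto X- headers: reset as a Unix timestamp, as used in this article.
    'X-RateLimit-Limit': String(limit),
    'X-RateLimit-Remaining': String(remaining),
    'X-RateLimit-Reset': String(resetEpochSeconds),
    // Draft-style headers (assumed three-header revision): reset as seconds remaining.
    'RateLimit-Limit': String(limit),
    'RateLimit-Remaining': String(remaining),
    'RateLimit-Reset': String(secondsUntilReset),
  };
}
```

Express accepts an object of headers, so a middleware can call `res.set(rateLimitHeaders({ limit, remaining, resetEpochSeconds }, Date.now()))` on both successful and rejected responses.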

Key by the right dimension. Algorithm choice is half the decision. The other half is what you key on. Authenticated requests should key by req.user.id — IP isn’t stable enough for mobile users on flaky networks. Anonymous requests key by IP. Expensive endpoints (search, AI inference, image uploads) get a separate, stricter limit keyed by endpoint plus user. Stack the middleware: a global limit per user, plus a per-endpoint limit on the routes that hurt when abused. Your REST API error handling already follows this layered pattern. Rate limiting fits the same shape.
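One way to express that keying rule is a small key builder that the middlewares call instead of hard-coding `req.ip`. The `rateLimitKey` name and the `scope` parameter are illustrative, and `req.user` is assumed to be populated by whatever auth middleware runs earlier.

```javascript
// Hypothetical key builder: user ID when authenticated, IP otherwise,
// with a scope so per-endpoint limits get their own counters.
function rateLimitKey(req, scope = 'global') {
  const subject = req.user?.id != null ? `user:${req.user.id}` : `ip:${req.ip}`;
  return `ratelimit:${scope}:${subject}`;
}

console.log(rateLimitKey({ user: { id: 7 }, ip: '203.0.113.9' }));  // ratelimit:global:user:7
console.log(rateLimitKey({ ip: '203.0.113.9' }, 'search'));         // ratelimit:search:ip:203.0.113.9
```

Stacking then means mounting one limiter with the default scope globally and a second, stricter one with a scope such as 'search' on the expensive routes. If the app runs behind a load balancer or reverse proxy, `req.ip` is only the client address when Express's `trust proxy` setting is configured; otherwise every anonymous request appears to come from the proxy.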

The Decision Rule

Back to the original question: you have express-rate-limit on defaults. Should you change it?

Yes, almost certainly. Here’s the rule, stated plainly:

Most APIs → sliding window log. Fair, accurate, no boundary burst, easy to explain on a whiteboard. This is the right answer if you’re not sure.
Above ~1000 RPS per key or burst-tolerant workloads → token bucket. You’ll feel the memory cost of sliding window log before you feel anything else.
Internal services you control end-to-end → fixed window is fine. Don’t overthink it. Pick the simplest thing that meets the SLA.

Whichever you pick: move the state to Redis if you have more than one Node instance, set the four headers, and key by user-or-IP-or-endpoint depending on what’s being protected. That’s the whole job.

Ship the change behind a feature flag, watch your 429 rate for a week, and you’ll know if you got the key right and the limit right. Rate limiting is one of the few places in software where the right answer shows up in your metrics within days — not months. The default is almost never the right answer. Now you know which one is.