Rate Limiting Algorithms — Interactive Visual Guide & Python Implementations

Why Rate Limiting?

Rate limiting is a critical technique used to control the rate of requests that clients can make to an API or service. It protects your systems from being overwhelmed, ensures fair usage among clients, and prevents abuse.

🛡️

Protection

Prevent DDoS attacks and resource exhaustion

⚖️

Fairness

Ensure equal access for all clients

💰

Cost Control

Manage infrastructure costs effectively

API Server

Healthy

Token Bucket

The most flexible rate limiter — allows controlled bursts

How it works

A bucket holds tokens up to a maximum capacity
Tokens are added at a constant refill rate
Each request consumes one token
If no tokens available, the request is rejected
Allows bursty traffic when tokens have accumulated

💡 Key Insight

Token Bucket is widely used (e.g., by Amazon and Stripe) because it smooths out traffic while still allowing short bursts. If a client hasn't made requests in a while, they can "spend" their accumulated tokens in a burst.

Parameters

Bucket Capacity 5

Refill Rate (tokens/sec) 1

🐍 View Python Implementation

import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.last_refill = time.time()

    def allow_request(self) -> bool:
        now = time.time()
        elapsed = now - self.last_refill
        
        # Refill tokens based on elapsed time
        self.tokens = min(
            self.capacity,
            self.tokens + elapsed * self.refill_rate
        )
        self.last_refill = now
        
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

Token Bucket

0 / 5

💧 Refilling...

Request Log

Leaking Bucket

Smooths out bursty traffic — processes requests at a constant rate

How it works

Requests are added to a queue (bucket)
The queue has a fixed size; if full, new requests are dropped
Requests are processed (leaked) from the queue at a constant rate
This ensures a steady output rate regardless of input burstiness

💡 Key Insight

Unlike Token Bucket, Leaking Bucket does not allow bursts — it always processes at a constant rate. Think of it like water dripping from a leaky bucket: no matter how fast you pour water in, it drips out at the same pace.

Parameters

Queue Size 5

Leak Rate (req/sec) 1

🐍 View Python Implementation

import time
from collections import deque

class LeakingBucket:
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity
        self.leak_rate = leak_rate # requests per second
        self.queue = deque()
        self.last_leak = time.time()

    def _leak(self):
        now = time.time()
        elapsed = now - self.last_leak
        leaked = int(elapsed * self.leak_rate)
        
        if leaked > 0:
            # Process (leak) requests from queue
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now

    def add_request(self) -> bool:
        self._leak()
        if len(self.queue) < self.capacity:
            self.queue.append(time.time())
            return True
        return False

Leaking Bucket Queue

0 / 5

Processing at constant rate

✅ Processed

Request Log

Fixed Window Counter

Simple and memory-efficient — counts requests in fixed time windows

How it works

Time is divided into fixed windows (e.g., every 10 seconds)
Each window has a counter that starts at zero
Each request increments the counter
If the counter exceeds the limit, requests are rejected
When a new window begins, the counter resets to zero

⚠️ The Boundary Problem

The main weakness: a burst of requests at the end of one window and the start of the next can effectively allow 2x the rate limit. Watch the visualization to see this in action!

Parameters

Window Size (seconds) 10

Max Requests per Window 5

🐍 View Python Implementation

import time

class FixedWindowCounter:
    def __init__(self, limit: int, window_size: float):
        self.limit = limit
        self.window_size = window_size
        self.count = 0
        self.window_start = time.time()

    def allow_request(self) -> bool:
        now = time.time()
        
        if now - self.window_start >= self.window_size:
            # Reset the counter and slide window start
            self.count = 0
            self.window_start = now
            
        if self.count < self.limit:
            self.count += 1
            return True
        return False

Current Window 10s

0 / 5

Window History

Request Log

Sliding Window Log

Precise rate limiting — tracks every request timestamp

How it works

Keeps a log of timestamps for each request
When a new request arrives, remove outdated entries (older than window size)
Count remaining entries — if count < limit, allow and add timestamp
If count ≥ limit, reject the request
The window slides continuously — no boundary issues!

💡 Key Insight

This solves the Fixed Window boundary problem but requires more memory — it stores every request timestamp. The trade-off is precision vs. memory usage.

Parameters

Window Size (seconds) 10

Max Requests per Window 5

🐍 View Python Implementation

import time

class SlidingWindowLog:
    def __init__(self, limit: int, window_size: float):
        self.limit = limit
        self.window_size = window_size
        self.timestamps = []

    def allow_request(self) -> bool:
        now = time.time()
        cutoff = now - self.window_size
        
        # Filter out expired timestamps
        self.timestamps = [t for t in self.timestamps if t >= cutoff]
        
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False

Requests in window: 0 / 5

Stored timestamps: 0

Timestamp Log

Request Log

Sliding Window Counter

Best of both worlds — combines Fixed Window efficiency with Sliding Window accuracy

How it works

Uses two adjacent fixed windows (previous + current)
Calculates a weighted count based on the overlap
Formula: weighted = prev_count × overlap% + current_count
If weighted count < limit → allow, else → reject
Achieves sliding window accuracy with minimal memory

💡 Key Insight

This is the sweet spot for most use cases. You only store two counters instead of every timestamp, yet get much better precision than fixed windows. Used by Cloudflare and many production systems.

Parameters

Window Size (seconds) 10

Max Requests per Window 5

🐍 View Python Implementation

import time

class SlidingWindowCounter:
    def __init__(self, limit: int, window_size: float):
        self.limit = limit
        self.window_size = window_size
        self.prev_count = 0
        self.current_count = 0
        self.window_start = time.time()

    def allow_request(self) -> bool:
        now = time.time()
        elapsed = now - self.window_start
        
        if elapsed >= self.window_size:
            # Shift window counters
            self.prev_count = self.current_count
            self.current_count = 0
            self.window_start = now
            elapsed = 0.0
            
        # Estimate requests based on overlap weight
        overlap = 1.0 - (elapsed / self.window_size)
        weighted_count = self.prev_count * overlap + self.current_count
        
        if weighted_count < self.limit:
            self.current_count += 1
            return True
        return False

Previous Window

Weight: 0%

Current Window

10s

Weighted Count = 0 × 100% + 0 = 0

Status: ✅ Under limit (0 / 5)

Request Log

HTTP Response Standards

How production APIs communicate rate limits to clients

X-RateLimit-Limit Standard

Specifies the maximum number of requests allowed within the current rate limiting window.

X-RateLimit-Limit: 100

X-RateLimit-Remaining Standard

Indicates the number of requests remaining in the current window before a block occurs.

X-RateLimit-Remaining: 42

X-RateLimit-Reset Standard

Specifies the Unix epoch timestamp indicating exactly when the current window resets and counters clear.

X-RateLimit-Reset: 1719747600

Retry-After Block Header

Sent during a block. Tells the client exactly how many seconds to wait before trying the request again.

Retry-After: 30

🚨

HTTP Status Code 429: Too Many Requests

When a client exceeds their rate limit, servers reject the connection with an HTTP Status Code 429 Too Many Requests. Standardized under RFC 6585, clients should back off and observe the Retry-After header to prevent getting banned.

Algorithm Comparison

Understanding the trade-offs to choose the right algorithm

Feature	Token Bucket	Leaking Bucket	Fixed Window	Sliding Log	Sliding Counter
Memory Usage	Low	Low	Low	High	Low
Accuracy	High	High	Medium	Excellent	High
Burst Handling	Allows Bursts	Smooths Out	Boundary Issue	Precise	Good
Implementation	Simple	Simple	Simplest	Complex	Moderate
Used By	Amazon, Stripe	Network Routers	Simple APIs	Financial Systems	Cloudflare
Best For	APIs with burst tolerance	Steady output needed	Simple use cases	Strict compliance	Most production systems

Which algorithm should you use?

🏎️

Need burst tolerance?

Use Token Bucket

🚰

Need steady output?

Use Leaking Bucket

🎯

Need simplicity?

Use Fixed Window Counter

🔬

Need precision?

Use Sliding Window Log

⚡

Need balance?

Use Sliding Window Counter

Frequently Asked Questions

Got questions about rate limiting? We've got answers.

What is the difference between rate limiting and throttling?

Rate limiting generally refers to hard limits placed on requests (e.g., maximum 100 requests per minute), where any requests above the limit are rejected immediately with an HTTP 429 error.

Throttling is the process of slowing down request execution or limiting bandwidth/processing rates dynamically rather than failing outright. Leaking Bucket is a good example of a throttling rate limiter.

Why is Redis commonly used for distributed rate limiters?

In distributed environments with multiple server instances, rate limiter state must be shared. Redis is ideal because:

It stores state in-memory for microsecond operations.
It supports atomic operations (like INCR and Lua scripts) to prevent race conditions.
It supports expiring keys via EXPIRE, which makes implementing Fixed Window rate limiters straightforward.

How does the Sliding Window Counter algorithm improve on the Fixed Window approach?

The Fixed Window Counter algorithm suffers from a boundary problem where a client can make double their limit at the edge of window resets. For instance, if the limit is 10/min, they can send 10 requests at 0:59 and another 10 at 1:01, resulting in 20 requests in 2 seconds.

The Sliding Window Counter solves this by taking a weighted average of the request count from both the previous and current window, smoothing out the rate calculation without storing all request timestamps like a Sliding Log would.

Is there an RFC standard for rate limiting headers?

While headers like X-RateLimit-* are widely used by services like GitHub and Stripe, they are not standard RFCs. However, the IETF is working on standardizing rate limit headers under RFC 9126, suggesting response headers like RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset (replacing the "X-" prefix).