Distributed systems

Traffic management

End-to-end design

Advanced

Global Quotas (Hierarchical Budgets Across Regions and Fleets)

Design worldwide quotas without putting a globally serialized dependency in the request path, using hierarchical allocation, leased budgets, and bounded overshoot.

Traffic management sits under Distributed systems , so the page stays concrete about local mechanics without losing the larger distributed-systems context.

global-quotas

budget-allocation

multi-region

hierarchical-limits

leases

fairness

Family

Distributed systems → Traffic management

Admission control, protection mechanisms, and shaping request flow before downstream systems melt.

Builds on

2 topics

These pages provide the mental model or mechanism that this design assumes.

Related directions

Nearby topics help compare alternative mechanisms without flattening everything into one answer.

Learning paths

Follow a curated path when you want the surrounding systems context instead of a single isolated deep dive.

Problem

A local rate limiter is easy compared with this requirement:

limit tenant T to 5 million requests per minute worldwide across 10 regions and thousands of instances

The naive answer is:

keep one global counter and decrement it for every request

That creates a critical-path dependency that is:

high latency
high blast radius
hard to scale
vulnerable to partitions

Global quotas are a design exercise in hierarchical budget allocation, not a reason to serialize every request across the planet.

First principle

For large systems, the request path should usually spend from a local lease, not from the absolute global truth.

That means a quota system is layered:

global configured quota
  -> regional budget
    -> instance-local lease
      -> per-request spend

The request path becomes local and cheap.

The trade-off is bounded overshoot.

What is the actual guarantee?

Be explicit.

Possible guarantees:

Hard quota

Never exceed the global cap.

This is very expensive operationally and often incompatible with a low-latency multi-region path.

Soft quota with bounded overshoot

May exceed quota by at most the sum of outstanding leases plus reconciliation delay.

This is the practical production target.

Global capacity is shared according to demand and priority, not just a fixed cap.

This is often what internal platforms really need.

Recommended architecture

Split into:

quota control plane
global allocator
regional quota stores
local lease caches in gateways/services

policy store
  -> allocator
      -> regional assignments
          -> regional state store
              -> node-local leases
                  -> request path

Control plane

Stores:

quota definitions
subject dimensions
priority
burst reserve policy
fail-open / fail-closed behavior

Allocator

Periodically decides how much budget each region may spend.

Regional store

Tracks currently available regional budget and issues leases to local nodes.

Node-local lease cache

Consumes requests from already leased tokens without remote dependency per call.

Global quota object

A quota usually needs:

{
  "quota_id": "tenant-123-api-read",
  "scope": "tenant:123",
  "unit": "request",
  "global_limit_per_minute": 5000000,
  "burst_reserve": 500000,
  "priority": "standard",
  "allocation_policy": "demand_weighted",
  "fail_mode": "closed"
}

There may also be multiple nested quotas:

per-tenant global
per-product global
per-endpoint global

Those should not all be enforced with strong cross-key atomicity in the request path unless the product absolutely requires it.

Allocation loop

The allocator runs every interval, such as every 500 ms or every few seconds.

Inputs:

configured global limit
current regional demand
outstanding leases by region
region health
backlog / queueing / rejection rates

Outputs:

new regional available budget
optional reserve transfers

Allocation strategies

Static weighted split

regional_budget[r] = global_quota * weight[r]

Simple and stable, but wastes capacity when traffic distribution is uneven.

Demand-weighted split

regional_budget[r] = global_quota * smoothed_demand[r] / sum(smoothed_demand)

Higher efficiency, but can oscillate if the demand signal is noisy.

Use smoothing:

smoothed_demand = alpha * recent + (1 - alpha) * previous

Base plus reserve

Give every region:

a baseline allocation
access to a shared reserve pool

This is a strong default when you want both predictability and elasticity.

Regional leasing

Inside each region, the quota store issues leases to local nodes.

Example:

regional budget = 50,000 tokens
node A lease = 2,000
node B lease = 2,000
...

Nodes spend locally and only contact the regional store when renewing or acquiring more lease.

Benefits:

lower request-path latency
much lower regional state-store QPS
resilience to short control-plane hiccups

Trade-off:

overshoot bound grows with lease size and number of active nodes

Overshoot math

If each of N nodes can hold up to L outstanding tokens, then node-local overshoot is bounded roughly by:

N * L

Add regional allocator lag and in-flight requests, and total overshoot becomes:

regional_outstanding_leases
+ allocator_interval_drift
+ in_flight_unreported_usage

This is the right engineering conversation.

Do not claim exactness if the system is explicitly using leases.

Lease sizing

Large leases:

lower coordination overhead
increase overshoot
slow fair redistribution

Small leases:

improve accuracy
increase regional store QPS
cause more renewal chatter

A practical strategy is adaptive lease sizing:

small for low-volume or near-limit subjects
larger for steady high-throughput subjects

Near-depletion behavior

As a quota approaches exhaustion, the system should tighten control:

issue smaller leases
increase refresh frequency
maybe disable opportunistic burst reserve

This reduces end-of-window overshoot and improves the quality of the final decisions.

Burst reserve

Many systems want:

steady base allocation
temporary bursts for hot regions

Use a reserve pool:

global_limit = base_allocations + reserve_pool

Regions request reserve only when local demand exceeds base.

Reserve grants should have:

expiry
priority ordering
fairness policy

Otherwise the busiest region can starve everyone else.

Failure modes

Allocator unavailable

Regions continue spending existing assigned budget until it expires.

Regional store unavailable

Nodes spend existing local lease; new lease issuance pauses.

One region partitioned

That region may continue on stale budget until local leases expire. This is why local budgets must be bounded and expiring.

Region death

Allocator should reclaim its unused future budget and redistribute elsewhere.

This is why allocator state must distinguish:

assigned
leased
spent
expired / reclaimable

Reconciliation

You need a slower reconciliation path to compare:

expected spend
lease issuance
observed decision logs
regional reports

This catches:

accounting drift
buggy clients
allocator bugs
nodes that overspent stale lease

Request-path performance and accounting accuracy usually live on different time horizons, so they deserve different pipelines.

Hard vs soft deny

When quota is nearly exhausted, decisions can become policy-aware:

premium traffic gets remaining reserve
background or low-priority work is cut first
regional fairness caps apply

This turns quota from a blunt binary tool into an actual capacity management system.

Relationship to rate limiting

A rate limiter is a mechanism for local enforcement.

A global quota system is the policy and allocation layer above that mechanism.

The usual combination is:

quota allocator decides budgets
regional limiter enforces assigned regional budget
local nodes spend from leases

That is why global quotas belong in the same learning path as rate limiting but are not the same design.

Relationship to distributed locking

You do not usually need a global lock for every spend.

But you may need coordination for:

allocator leadership
ownership of quota shard processing
mutually exclusive policy mutation workflows

That is where a lock or leadership primitive may appear, but it should stay off the request path.

Observability

Track:

configured quota
assigned regional budget
outstanding regional leases
outstanding node leases
estimated overshoot bound
denied requests by quota and region
reserve utilization
reallocation frequency

Operators should be able to answer:

which region is exhausting quota
whether the issue is real spend or lease fragmentation
how much overshoot risk currently exists

Common mistakes

1. globally serialized request-path spend

This creates a fragile worldwide dependency.

2. no explicit overshoot model

Then the design silently violates its own contract.

3. huge leases near quota exhaustion

This causes ugly end-of-window bursts.

4. static regional split for highly dynamic traffic

This wastes capacity and causes false denial.

5. no reconciliation loop

Then you trust approximate issuance forever with no correction path.

What the senior answer sounds like

I would not enforce a worldwide quota by decrementing one global counter in the request path. Instead, I would use hierarchical budgets: a global allocator periodically assigns regional budgets based on configured quota, demand, and health; each region issues expiring leases to local nodes; and the request path spends from those local leases. That keeps request latency regional and scalable while giving an explicit bounded-overshoot model. Lease size should shrink near quota exhaustion, and I would maintain a slower reconciliation pipeline to compare assigned, leased, and spent usage. The key trade-off is correctness versus latency: I would choose bounded eventual correctness for most high-throughput APIs and reserve globally serialized hard quotas only for domains where that cost is truly justified.

Key takeaways

Global quotas are usually a hierarchical allocation problem, not a global counter problem.
Keep the request path local by spending from regional and node-local leases.
Be explicit about bounded overshoot; that is the real correctness contract.
Use smoothing, reserve pools, and adaptive lease sizing to balance efficiency and fairness.
Separate fast-path allocation from slow-path reconciliation and accounting.

Included paths

Use these routes when you want this page to stay anchored inside a larger systems-learning progression.

Traffic control core

Start with bucket math, then move into rate limiting, reliability controls, feedback loops, and saturation management.

Global policy enforcement

Learn how policy definition, distributed enforcement, and multi-region coordination fit together for large control surfaces.

Control loops and stability

Connect integrators, PI-style controllers, hysteresis, anti-windup, and oscillation detection to the distributed systems that use them.

Build on these first

These pages supply the mechanism or vocabulary that this design assumes.

Traffic management

End-to-end design

Designing a Rate Limiter (at Scale, Production-Grade)

Design a limiter that is actually deployable: low-latency enforcement, burst handling, distributed quotas, multi-region coordination, and failure-safe behavior.

Control plane

Trade-off

Distributed Locking (Leases, Fencing Tokens, and When Not to Use It)

Design distributed locking with explicit guarantees, stale-owner protection, and realistic failure semantics instead of assuming a lock magically creates correctness.

Related directions

These topics live nearby conceptually, even if they are not strict prerequisites.

Traffic management

Foundation

Token Bucket, GCRA, and Virtual Time

Understand token-based rate limiting mathematically: saturated integrators, debt-space duals, and why token bucket and GCRA are the same policy in different coordinates.

Traffic management

End-to-end design

Load Shedding (Protecting Latency Under Saturation)

Design admission control that drops the right work at the right time, using concurrency, queue depth, cost, and priority instead of letting the service fail slowly.

Control plane

End-to-end design

Feature Flags Control Plane (Versioning, Distribution, and Safe Rollouts)

Design a feature flag platform that supports low-latency local evaluation, strong auditability, deterministic targeting, and safe configuration rollouts across a fleet.

Reliability

Building block

Feedback Control for Autoscaling and Load Shedding

Use PI/PID ideas the way production systems actually do: filtered signals, clamped actions, weak predictive bias, and layered controllers instead of textbook loops.

Paths that include this topic

Follow one of these sequences if you want a guided next step instead of open-ended browsing.

Traffic control core

Start with bucket math, then move into rate limiting, reliability controls, feedback loops, and saturation management.

Token Bucket, GCRA, and Virtual Time Designing a Rate Limiter (at Scale, Production-Grade) Global Quotas (Hierarchical Budgets Across Regions and Fleets) Load Shedding (Protecting Latency Under Saturation) Circuit Breakers (State Machines, Hysteresis, and Fast Failure) Feedback Control for Autoscaling and Load Shedding Idempotency and Retries (Without Multiplying Load) Anti-Windup, Hysteresis, and Oscillation in Distributed Control Loops

Global policy enforcement

Learn how policy definition, distributed enforcement, and multi-region coordination fit together for large control surfaces.

Designing a Rate Limiter (at Scale, Production-Grade) Global Quotas (Hierarchical Budgets Across Regions and Fleets) Feature Flags Control Plane (Versioning, Distribution, and Safe Rollouts) Distributed Locking (Leases, Fencing Tokens, and When Not to Use It)

Control loops and stability

Connect integrators, PI-style controllers, hysteresis, anti-windup, and oscillation detection to the distributed systems that use them.

Token Bucket, GCRA, and Virtual Time Designing a Rate Limiter (at Scale, Production-Grade) Global Quotas (Hierarchical Budgets Across Regions and Fleets) Load Shedding (Protecting Latency Under Saturation) Feedback Control for Autoscaling and Load Shedding Anti-Windup, Hysteresis, and Oscillation in Distributed Control Loops

From the blog

Pair the atlas with the broader engineering writing on the site when you want editorial context around the systems mechanisms.

Don't eat too much red meat.

Too much iron ?

January 24, 2026

3 min read

myth science

Linux is just that much better

Yes, even on Apple hardware.

December 9, 2025

2 min read

linux macos

Git rebase is always the wrong choice.

Please, stop making everyone suffer.

December 6, 2025

4 min read

git

Problem

First principle

What is the actual guarantee?

Hard quota

Soft quota with bounded overshoot

Fair-share quota

Recommended architecture

Control plane

Allocator

Regional store

Node-local lease cache

Global quota object

Allocation loop

Allocation strategies

Static weighted split

Demand-weighted split

Base plus reserve

Regional leasing

Overshoot math

Lease sizing

Near-depletion behavior

Burst reserve

Failure modes

Allocator unavailable

Regional store unavailable

One region partitioned

Region death

Reconciliation

Hard vs soft deny

Relationship to rate limiting

Relationship to distributed locking

Observability

Common mistakes

1. globally serialized request-path spend

2. no explicit overshoot model

3. huge leases near quota exhaustion

4. static regional split for highly dynamic traffic

5. no reconciliation loop

What the senior answer sounds like

Key takeaways

Included paths

Traffic control core

Global policy enforcement

Control loops and stability

Build on these first

Designing a Rate Limiter (at Scale, Production-Grade)

Distributed Locking (Leases, Fencing Tokens, and When Not to Use It)

Related directions

Token Bucket, GCRA, and Virtual Time

Load Shedding (Protecting Latency Under Saturation)

Feature Flags Control Plane (Versioning, Distribution, and Safe Rollouts)

Feedback Control for Autoscaling and Load Shedding

More from Traffic management

Token Bucket, GCRA, and Virtual Time

Designing a Rate Limiter (at Scale, Production-Grade)

Load Shedding (Protecting Latency Under Saturation)

Paths that include this topic

Traffic control core

Global policy enforcement

Control loops and stability

From the blog

Don't eat too much red meat.

Linux is just that much better

Git rebase is always the wrong choice.