Distributed Locking (Leases, Fencing Tokens, and When Not to Use It)

Design distributed locking with explicit guarantees, stale-owner protection, and realistic failure semantics instead of assuming a lock magically creates correctness.

Problem

Teams often say:

we need a distributed lock

What they usually mean is one of several different needs:

elect one active worker
prevent concurrent mutation of one resource
serialize a critical section
avoid duplicate job execution
coordinate ownership during failover

Those are not identical problems, and a lock is often the wrong tool.

A staff-level answer starts with:

what guarantee do we actually need, and can we avoid a lock entirely?

What a lock can and cannot guarantee

A distributed lock tries to give:

mutual exclusion
liveness
crash recovery

But in real systems you must also think about:

clock skew
pauses from GC or scheduling
network partitions
stale holders resuming after lease expiry

That last one is the killer.

Even if a lease expires, the original holder may resume and still issue writes unless the protected resource can reject stale ownership.

That is why fencing tokens matter more than the lock itself.

First question: can you avoid locking?

Before designing one, check cheaper alternatives:

Idempotency

If duplicate execution is acceptable as long as side effects are deduped, use idempotency instead of exclusive ownership.
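
A minimal sketch of that idea, assuming a hypothetical `PaymentLedger` resource and caller-supplied idempotency keys (neither name comes from a real library). Two workers may both run the job; only the first application has an effect.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentLedger:
    """Toy resource that dedupes side effects by idempotency key."""
    balance: int = 0
    # Maps idempotency key -> result recorded by the first application.
    applied: dict = field(default_factory=dict)

    def credit(self, idempotency_key: str, amount: int) -> int:
        # A repeated key returns the recorded result without re-applying.
        if idempotency_key in self.applied:
            return self.applied[idempotency_key]
        self.balance += amount
        self.applied[idempotency_key] = self.balance
        return self.balance


ledger = PaymentLedger()
first = ledger.credit("job-123", 50)   # applied
second = ledger.credit("job-123", 50)  # duplicate execution, deduped
```

In a real store the lookup and the write would need to be one atomic operation (for example a unique constraint on the key); the in-memory version skips that.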

Sharding / single-writer partitioning

If all work for a key routes to one owner, you may not need a lock.
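
One way to sketch the routing, assuming a fixed worker list that every router agrees on (the `owner_for` helper and worker names are illustrative):

```python
import hashlib


def owner_for(key: str, workers: list) -> str:
    """Map a key to exactly one worker using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]


workers = ["worker-0", "worker-1", "worker-2"]
assignments = {k: owner_for(k, workers) for k in ["acct-1", "acct-2", "acct-3"]}
```

This removes the lock only while membership is stable and agreed upon. Changing the worker list moves keys between owners, and that handoff is itself a coordination problem.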

Compare-and-set on the resource itself

If the storage engine supports conditional updates, the resource can often protect itself directly.
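
As an illustration, here is a toy versioned cell standing in for a store with conditional updates. The `VersionedCell` class is an assumption for this sketch; a real engine would expose the same idea as a conditional write or a version column.

```python
import threading


class VersionedCell:
    """Toy stand-in for a store that supports conditional updates."""

    def __init__(self, value):
        self._lock = threading.Lock()  # models the store's internal atomicity
        self.value = value
        self.version = 0

    def read(self):
        with self._lock:
            return self.value, self.version

    def compare_and_set(self, expected_version: int, new_value) -> bool:
        with self._lock:
            if self.version != expected_version:
                return False  # someone else wrote first; caller must re-read
            self.value = new_value
            self.version += 1
            return True


cell = VersionedCell({"replicas": 3})
value, version = cell.read()
won = cell.compare_and_set(version, {"replicas": 4})
lost = cell.compare_and_set(version, {"replicas": 5})  # stale version
```

The losing writer learns it lost and can re-read and retry, with no separate lock service involved.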

CRDT / mergeable state

If concurrent updates can be merged safely, exclusive ownership is unnecessary.
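
A grow-only counter is probably the smallest example of mergeable state. In this sketch each replica only increments its own slot, and merging takes the per-replica maximum; the replica names are made up.

```python
def merge(a: dict, b: dict) -> dict:
    """Merge two grow-only counters by taking the per-replica maximum."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}


def value(counter: dict) -> int:
    """Total count is the sum of every replica's slot."""
    return sum(counter.values())


# Two replicas increment independently, with no lock between them.
replica_a = {"a": 3}
replica_b = {"a": 1, "b": 2}

merged_ab = merge(replica_a, replica_b)
merged_ba = merge(replica_b, replica_a)
```

Because the merge is commutative and idempotent, replicas can exchange state in any order and still converge.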

Locks are often the sign that concurrency was not modeled explicitly enough.

When locking is justified

It can be justified when:

one leader should run a background scheduler
exactly one worker should hold an operational role at a time
a scarce external resource cannot defend itself
you need serialized reconfiguration of shared state

Even then, the design must be explicit about stale holders and recovery.

Lease-based lock model

Never assume a lock is permanent.

Use a lease:

resource_id
owner_id
lease_expiration
fencing_token

The holder must renew periodically.

If renewals stop, another owner can acquire the lease.

This handles crash recovery, but it still does not stop the old owner from acting after it wakes up.
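
The record above can be sketched as an in-memory table. `LeaseTable` is a hypothetical stand-in for a coordinator, and the current time is passed in explicitly so the behavior is easy to follow; a real coordinator would use its own clock and replicate this state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    resource_id: str
    owner_id: str
    lease_expiration: float
    fencing_token: int


class LeaseTable:
    """In-memory stand-in for a coordinator; `now` is supplied by the caller."""

    def __init__(self, duration: float):
        self.duration = duration
        self.leases = {}
        self.next_token = 1

    def acquire(self, resource_id: str, owner_id: str, now: float) -> Optional[Lease]:
        current = self.leases.get(resource_id)
        if current is not None and current.lease_expiration > now:
            return None  # still held by a live lease
        lease = Lease(resource_id, owner_id, now + self.duration, self.next_token)
        self.next_token += 1  # each successful acquisition gets a larger token
        self.leases[resource_id] = lease
        return lease

    def renew(self, lease: Lease, now: float) -> bool:
        current = self.leases.get(lease.resource_id)
        if (current is None or current.fencing_token != lease.fencing_token
                or current.lease_expiration <= now):
            return False  # expired or taken over; the caller must stop
        current.lease_expiration = now + self.duration
        return True


table = LeaseTable(duration=15.0)
a = table.acquire("scheduler", "worker-a", now=0.0)
blocked = table.acquire("scheduler", "worker-b", now=10.0)  # lease still live
b = table.acquire("scheduler", "worker-b", now=20.0)        # A stopped renewing
a_renewed = table.renew(a, now=21.0)
```

Note what the table cannot do: it knows worker A lost the lease, but nothing here prevents A from writing to the protected resource anyway.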

Fencing tokens

Every successful acquisition returns a monotonically increasing token:

token = 41

Any write to the protected resource must include that token, and the resource must reject writes with older tokens.

Example:

worker A acquires lock with token 41
worker A pauses for 30 seconds
lease expires
worker B acquires lock with token 42
worker B writes to storage with token 42
worker A wakes up and tries to write with token 41
storage rejects stale token 41

Without fencing, the stale owner can still corrupt state after losing the lock.

This is the single most important technical point in distributed locking.
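
The timeline above can be replayed against a toy resource. `FencedStorage` is an illustrative name; the only behavior that matters is that it remembers the highest token it has accepted and refuses anything lower.

```python
class StaleTokenError(Exception):
    pass


class FencedStorage:
    """Toy resource that remembers the highest fencing token it has accepted."""

    def __init__(self):
        self.highest_token = 0
        self.data = None

    def write(self, token: int, data) -> None:
        if token < self.highest_token:
            raise StaleTokenError(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.data = data


storage = FencedStorage()
storage.write(42, "from B")  # B took over with token 42

try:
    storage.write(41, "from A after waking up")  # A still believes it owns the lock
    a_rejected = False
except StaleTokenError:
    a_rejected = True
```

Worker A never needs to notice that its lease expired. The resource makes the decision, which is why the check belongs there rather than in the client.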

Where to store the lock

Consensus-backed store

Examples:

ZooKeeper
etcd
Consul (with caveats depending on the consistency mode in use)

Strengths:

clear lease/session semantics
linearizable operations
suitable for leader election and coordination

Weaknesses:

more operational complexity than a cache
not something to put blindly in every high-QPS path

Redis

Useful for coarse coordination or best-effort leadership, but be careful.

A single Redis instance with:

SET key value NX PX ttl

can approximate a lease.

What it does not solve by itself:

stale owner writes after expiry
durability across failover if not configured appropriately
correctness under ambiguous failover timing

So if Redis is used, it should generally be paired with fencing at the protected resource and used only where the residual risk is acceptable.
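
A hedged sketch of the `SET key value NX PX ttl` pattern, assuming the redis-py client, whose `set` method accepts `nx` and `px` keyword arguments and whose `eval` runs a server-side script. `FakeClient` is a hypothetical in-memory stand-in so the example runs without a server; it ignores expiry and mimics only the two calls used here.

```python
import uuid

# Delete the key only if it still holds this owner's value, so a holder whose
# lease already expired cannot delete a successor's key.
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""


def try_acquire(client, key: str, ttl_ms: int):
    """Return a random owner value on success, or None if the key is held."""
    owner_value = uuid.uuid4().hex
    # With redis-py this issues: SET key owner_value NX PX ttl_ms
    if client.set(key, owner_value, nx=True, px=ttl_ms):
        return owner_value
    return None


def release(client, key: str, owner_value: str) -> bool:
    """Compare-and-delete via the server-side script above."""
    return client.eval(RELEASE_SCRIPT, 1, key, owner_value) == 1


class FakeClient:
    """Hypothetical stand-in for a Redis client; no TTL handling."""

    def __init__(self):
        self.store = {}

    def set(self, name, value, nx=False, px=None):
        if nx and name in self.store:
            return None
        self.store[name] = value
        return True

    def eval(self, script, numkeys, key, arg):
        if self.store.get(key) == arg:
            del self.store[key]
            return 1
        return 0


client = FakeClient()
mine = try_acquire(client, "lock:report", ttl_ms=15000)
theirs = try_acquire(client, "lock:report", ttl_ms=15000)
wrong_release = release(client, "lock:report", "not-the-owner")
right_release = release(client, "lock:report", mine)
```

The random owner value only protects the release step. It is not a fencing token: it is not ordered, so the protected resource cannot use it to tell a stale holder from the current one.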

Redlock and why people argue about it

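Redlock is the algorithm proposed by the Redis author for taking a lock across several independent Redis instances by acquiring it on a majority of them. It is controversial not because distributed locking is impossible, but because people often ask a caching layer to provide coordination semantics stronger than the whole system can actually enforce.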

The right engineering question is:

what are the failure assumptions?
what is the cost of stale dual ownership?
does the resource itself verify fencing?

If the answer to the second question is “dual ownership would corrupt money or metadata,” use a coordination system with linearizable semantics and enforce fencing at the resource boundary.

Lock acquisition flow

With a consensus-backed lease service:

create or renew a session
try to acquire resource key under that session
receive fencing token
perform protected work while renewing session
release or let session expire

Pseudocode:

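lease = coordinator.acquire(resource_id, owner_id)
token = lease.fencing_token

try:
    while work_remaining:
        # Stop on any renewal failure or ambiguity; do not keep writing.
        if not coordinator.renew(lease):
            break
        # The resource, not the client, rejects stale tokens.
        write(resource, token, mutation)
finally:
    # Best effort: if release fails, the lease still expires on its own.
    coordinator.release(lease)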

The resource write must compare the token against the highest token it has accepted.

Resource-side enforcement

This is the hard requirement many designs miss.

The lock service alone cannot guarantee safety if:

the client pauses
the network partitions
the client continues writing after lease loss

The resource itself must reject stale owners.

Examples:

SQL row

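-- Use <= so the current holder can keep writing with its own token;
-- zero affected rows means a newer token has already been accepted.
UPDATE jobs
SET owner_token = :token, state = :new_state
WHERE job_id = :id AND owner_token <= :token;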
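
A runnable version of the conditional update, using Python's built-in `sqlite3` with an in-memory database. The `jobs` schema and the `fenced_update` helper are assumptions for illustration. The sketch compares with `<=` rather than `<` so that the current holder can write more than once with its own token, and it treats zero affected rows as a rejected write.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (job_id TEXT PRIMARY KEY, owner_token INTEGER, state TEXT)"
)
conn.execute("INSERT INTO jobs VALUES ('job-1', 0, 'pending')")


def fenced_update(conn, job_id: str, token: int, new_state: str) -> bool:
    """Apply the write only if no newer token has been accepted."""
    cur = conn.execute(
        "UPDATE jobs SET owner_token = :token, state = :new_state "
        "WHERE job_id = :id AND owner_token <= :token",
        {"token": token, "new_state": new_state, "id": job_id},
    )
    conn.commit()
    # Zero affected rows: the write was fenced off (or the row does not exist).
    return cur.rowcount == 1


b_ok = fenced_update(conn, "job-1", 42, "running-by-b")
a_ok = fenced_update(conn, "job-1", 41, "running-by-a")  # stale holder
b_again = fenced_update(conn, "job-1", 42, "done-by-b")  # same holder, same token
state = conn.execute("SELECT state FROM jobs WHERE job_id = 'job-1'").fetchone()[0]
```

The caller has to check the return value. A silently ignored zero-row update looks like success to a stale worker.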

Object store metadata

Store highest token in metadata and reject lower tokens.

Downstream service

Require X-Fencing-Token header and compare against stored max token.
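
A sketch of that check inside a request handler. The `FencedEndpoint` class, the plain-dict headers, and the choice of 400 and 409 status codes are all assumptions for illustration, not a prescribed protocol.

```python
class FencedEndpoint:
    """Toy handler; `headers` is a plain dict standing in for HTTP headers."""

    def __init__(self):
        self.max_token = 0
        self.applied = []

    def handle(self, headers: dict, body: str) -> int:
        try:
            token = int(headers.get("X-Fencing-Token"))
        except (TypeError, ValueError):
            return 400  # missing or malformed token: refuse rather than guess
        if token < self.max_token:
            return 409  # stale owner
        self.max_token = token
        self.applied.append(body)
        return 200


endpoint = FencedEndpoint()
from_b = endpoint.handle({"X-Fencing-Token": "42"}, "mutation from B")
from_a = endpoint.handle({"X-Fencing-Token": "41"}, "mutation from A")
no_token = endpoint.handle({}, "anonymous mutation")
```

In a real service the compare and the update of the stored maximum must be atomic with the mutation itself, typically by pushing the comparison down into the datastore.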

Leader election is just specialized locking

For one active scheduler:

elect leader with lease
followers stay idle
leader renews session
on lease loss, leader stops acting immediately

Important: the leader must stop doing work on uncertainty, not just on explicit revocation.

If the renewal path is unhealthy or ambiguous, keep working only if the safety model explicitly allows split brain. Usually it should not.
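
One way to express "stop on uncertainty" in a leader loop. The `run_leader` function and its parameters are hypothetical; `renew` and `do_work` stand in for the coordinator call and the protected work. The loop stops when renewal fails, when its outcome is unknown, or when the local clock says the lease is too close to expiry.

```python
import time
from typing import Callable


def run_leader(
    renew: Callable[[], bool],
    do_work: Callable[[], None],
    lease_duration: float,
    safety_margin: float,
    max_steps: int,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Do one unit of work per successful renewal; stop on any doubt."""
    for _ in range(max_steps):
        sent_at = clock()  # measure from before the request, not after the reply
        try:
            renewed = renew()
        except Exception:
            return "stopped: renewal outcome unknown"
        if not renewed:
            return "stopped: lease lost"
        # Conservative: the coordinator extended the lease no earlier than sent_at.
        deadline = sent_at + lease_duration
        if clock() >= deadline - safety_margin:
            return "stopped: too close to expiry"
        do_work()
    return "finished"
```

A process can still pause between the deadline check and `do_work`, so this loop narrows the window for split brain but does not close it. Fencing at the resource remains the actual safety mechanism.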

Time and renewal

Lease renewal interval should be much shorter than lease duration.

Example:

lease duration = 15s
renew every 5s with jitter

Why shorter?

leaves room to retry a missed renewal before expiry (with the numbers above, attempts at 5s and 10s both land before the 15s expiry)
reduces false leadership loss due to transient blips

But if the lease is too long:

failover is slow

If it is too short:

benign pauses trigger churn

This is a classic trade-off between failover latency and stability under benign pauses.
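
The arithmetic behind the 15s / 5s example can be checked with a short sketch. Both helpers are illustrative, and jittering only downward (so a delay never exceeds the nominal interval) is a design choice made for this example, not something the lease model requires.

```python
import math
import random


def attempts_before_expiry(lease_duration: float, renew_interval: float) -> int:
    """Count renewal attempts that land strictly before the lease expires."""
    return math.ceil(lease_duration / renew_interval) - 1


def next_renew_delay(renew_interval: float, jitter_fraction: float,
                     rng: random.Random) -> float:
    """Jitter only downward so a delay never exceeds the nominal interval."""
    return renew_interval * (1.0 - jitter_fraction * rng.random())


attempts = attempts_before_expiry(lease_duration=15.0, renew_interval=5.0)
tolerated_misses = attempts - 1  # at least one attempt must succeed
delay = next_renew_delay(5.0, jitter_fraction=0.2, rng=random.Random(7))
```

With these numbers there are two attempts before expiry, so one can fail without losing the lease. Shrinking the renewal interval buys more retries without lengthening failover time, which depends only on the lease duration.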

Common anti-patterns

1. lock without fencing

This is the biggest one.

2. assuming process-local unlock always happens

Crash and pause are normal failure modes.

3. using a lock to hide missing idempotency

Usually the wrong abstraction.

4. placing locking in a high-QPS request path

Lock services are coordination tools, not generic data-plane databases.

5. treating “I wrote a key in Redis” as the whole correctness argument

It is not.

Observability

Track:

acquisition latency
renewal failures
lease loss count
dual-owner suspicion events
stale-token write rejections
lock hold duration

If stale-token rejections spike, holders are outliving their leases: either pauses are longer than expected or lease durations are too short for the workload.

What the senior answer sounds like

I would avoid distributed locking unless the problem truly requires exclusive ownership. If it does, I would model it as a lease plus fencing token, not as a magical mutex. The coordinator should give monotonically increasing fencing tokens, and the protected resource must reject writes from stale holders using those tokens. That is what protects correctness under pause, partition, and delayed delivery. For high-value coordination like leader election or serialized control-plane mutation, I would prefer a consensus-backed store with explicit session semantics. Redis can be acceptable for best-effort coordination, but only when the residual stale-owner risk is acceptable and the resource still enforces fencing.

Key takeaways

The real problem is rarely “get a lock”; it is to define the guarantee and the failure model.
A lease without fencing tokens does not stop stale owners from writing.
Prefer idempotency, sharding, or conditional writes when they solve the problem more directly.
Use consensus-backed coordination for strong safety needs; use Redis only with eyes open.
Keep locking off the hot path unless coordination itself is the product.


