
Circuit Breakers (State Machines, Hysteresis, and Fast Failure)

Design circuit breakers that actually stabilize a fleet: rolling windows, half-open probes, dependency-scoped state, and clean interaction with retries and load shedding.

Problem

When a dependency starts failing, naive clients do the worst possible thing:

  1. keep sending traffic
  2. wait too long for timeouts
  3. retry aggressively
  4. saturate their own worker pool
  5. amplify the outage

A circuit breaker exists to fail fast once the dependency is already known bad.

The goal is not decorative resilience. The goal is to stop one unhealthy dependency from converting into:

  • thread pool exhaustion
  • queue growth
  • event-loop lag
  • retry storms
  • system-wide cascading failure

State machine

The canonical breaker has three states:

CLOSED -> OPEN -> HALF_OPEN -> CLOSED

CLOSED

Traffic flows normally and outcomes are observed.

OPEN

Requests are rejected immediately or sent to fallback.

HALF_OPEN

A small number of probe requests are allowed through to test recovery.

This is the right abstraction, but the implementation details decide whether the breaker stabilizes the system or flaps uselessly.
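
As a minimal sketch of that state machine (the names are illustrative, not from any particular library), the states and legal transitions can be made explicit so an illegal transition is a bug rather than a silent behavior change:

from enum import Enum

class BreakerState(Enum):
    CLOSED = "closed"        # traffic flows normally, outcomes observed
    OPEN = "open"            # reject immediately or send to fallback
    HALF_OPEN = "half_open"  # small probe budget tests recovery

# The only transitions the canonical breaker should ever make.
ALLOWED = {
    BreakerState.CLOSED:    {BreakerState.OPEN},
    BreakerState.OPEN:      {BreakerState.HALF_OPEN},
    BreakerState.HALF_OPEN: {BreakerState.OPEN, BreakerState.CLOSED},
}

def transition(current: BreakerState, target: BreakerState) -> BreakerState:
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal breaker transition {current.value} -> {target.value}")
    return target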

The breaker key

Never use one giant global breaker.

Breakers should usually be scoped by:

  • caller service
  • callee dependency
  • operation / endpoint
  • sometimes region or tenant class

Example:

payments-api -> fraud-service -> POST /score

Why this matters:

  • failures are often endpoint-specific
  • one degraded method should not blackhole all traffic
  • blast radius should match the actual fault domain
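
A minimal sketch of that scoping, assuming a per-key registry of breaker instances (the names and fields are hypothetical):

from dataclasses import dataclass

@dataclass(frozen=True)
class BreakerKey:
    caller: str      # "payments-api"
    dependency: str  # "fraud-service"
    operation: str   # "POST /score"
    region: str = "" # optional finer scope when the fault domain needs it

# One breaker per key, so a degraded POST /score cannot blackhole
# healthy calls to other fraud-service operations.
breakers: dict[BreakerKey, object] = {}

key = BreakerKey("payments-api", "fraud-service", "POST /score")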

Tripping logic

The worst implementation is:

if 5 failures in a row => open

This is too brittle. It ignores traffic volume, latency, and error ratios.

Better signal: rolling statistical window

Track over a recent horizon:

  • total requests
  • failures
  • timeouts
  • latency percentiles
  • local saturation signals

Then trip on a minimum-volume threshold plus ratio/latency policy.

Example:

window = 10s rolling
minimum volume = 100 requests
open if:
  failure_ratio > 50%
  OR timeout_ratio > 30%
  OR p99_latency > 2s for sustained interval

The minimum-volume guard prevents one or two unlucky failures from opening rarely used endpoints.
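
A hedged sketch of that trip policy as code (the thresholds mirror the example above; the window accounting itself lives elsewhere):

from dataclasses import dataclass

@dataclass
class WindowStats:
    total_requests: int
    failures: int
    timeouts: int
    p99_latency_s: float   # p99 over the rolling window, in seconds

MIN_VOLUME = 100  # minimum-volume guard: never trip on a handful of requests

def should_open(w: WindowStats) -> bool:
    if w.total_requests < MIN_VOLUME:
        return False
    failure_ratio = w.failures / w.total_requests
    timeout_ratio = w.timeouts / w.total_requests
    return (
        failure_ratio > 0.50
        or timeout_ratio > 0.30
        or w.p99_latency_s > 2.0   # stand-in for "sustained" tail-latency blowup
    )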

Hysteresis

Breakers need different thresholds for opening and closing.

Without hysteresis:

  • dependency partially recovers
  • breaker closes too early
  • traffic floods back
  • dependency fails again
  • breaker reopens

That oscillation is itself a failure mode.

A practical rule:

  • open on strong evidence of failure
  • close only after probe success plus a cooldown

This is the same reason control systems avoid using one threshold for both directions.
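
A sketch of that asymmetry, with separate rules for each direction (the numbers are placeholders, not recommendations):

import time

OPEN_FAILURE_RATIO = 0.50   # open only on strong evidence of failure
COOLDOWN_S = 30             # stay OPEN at least this long before probing
PROBES_REQUIRED = 5         # probe successes needed before closing again

def may_close(probe_successes: int, opened_at: float, now: float | None = None) -> bool:
    # Closing requires both a cooldown and successful probes; opening only
    # requires the failure evidence. That asymmetry is the hysteresis.
    now = time.time() if now is None else now
    return (now - opened_at) >= COOLDOWN_S and probe_successes >= PROBES_REQUIRED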

Half-open behavior

Half-open is where many weak designs go wrong.

Bad version:

after 10 seconds, let all traffic through again

That is not a probe. That is a surge.

Use a small controlled budget:

allow N concurrent probes or M requests/s in HALF_OPEN

If enough probes succeed:

  • transition to CLOSED

If probes fail:

  • transition back to OPEN

This keeps recovery testing cheap and prevents synchronized re-flooding.
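
One way to sketch that probe budget is as a per-process concurrency cap (names are illustrative):

class ProbeBudget:
    """Caps how many probes may be in flight while HALF_OPEN."""

    def __init__(self, limit: int = 3):
        self.limit = limit
        self.inflight = 0

    def try_acquire(self) -> bool:
        # Requests beyond the budget are rejected fast, not queued,
        # so recovery testing stays cheap for the dependency.
        if self.inflight >= self.limit:
            return False
        self.inflight += 1
        return True

    def release(self) -> None:
        self.inflight = max(0, self.inflight - 1)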

Interaction with timeouts

A breaker without strict timeouts is incomplete.

Requests must have:

  • connection timeout
  • request timeout
  • total deadline

If timeouts are too long, the breaker trips late, and by then the caller has already tied up scarce resources waiting on doomed calls.

If timeouts are too short, the breaker records spurious failures from a dependency that is healthy but merely variable.

The timeout should reflect:

  • normal latency distribution
  • queueing model
  • user-facing deadline
  • cost of retry
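
As a sketch, the three budgets can be explicit values plus a helper that keeps each attempt inside the total deadline (the numbers are placeholders derived from a hypothetical healthy latency profile):

CONNECT_TIMEOUT_S = 0.1   # connection setup should succeed quickly or fail fast
REQUEST_TIMEOUT_S = 0.5   # one attempt, sized near the healthy p99, not the mean
TOTAL_DEADLINE_S = 1.5    # the whole operation, including any retries

def attempt_timeout(deadline: float, now: float) -> float:
    # Each attempt gets the smaller of its own timeout and whatever is left
    # of the total deadline, so retries can never overrun the caller.
    return max(0.0, min(REQUEST_TIMEOUT_S, deadline - now))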

Interaction with retries

Correct layering:

  1. apply short timeout
  2. consult breaker
  3. if CLOSED, execute
  4. if retryable failure, maybe retry within budget
  5. update breaker metrics from outcomes

The breaker should gate retries too.

Once OPEN:

  • new first attempts should fail fast
  • retries should fail fast

Otherwise the retry layer bypasses the protection.
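
A minimal sketch of that layering, assuming a breaker object with allow_request() and record_outcome() in the spirit of the pseudocode later on (the exception names are hypothetical):

import time

class BreakerOpenError(Exception): pass
class RetryableError(Exception): pass

MAX_ATTEMPTS = 2  # first attempt plus at most one retry

def call_with_breaker(breaker, do_request, deadline: float):
    last_error = None
    for _ in range(MAX_ATTEMPTS):
        # The breaker gates every attempt, including retries, so once it is
        # OPEN the retry layer cannot bypass the protection.
        if not breaker.allow_request():
            raise BreakerOpenError("failing fast: breaker is open")
        if time.monotonic() >= deadline:
            break  # total deadline spent; do not start another attempt
        try:
            result = do_request()
            breaker.record_outcome(success=True)
            return result
        except RetryableError as err:
            breaker.record_outcome(success=False)
            last_error = err
    raise last_error or TimeoutError("deadline exceeded before any attempt")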

Fallbacks

Open does not have to mean “return 500.”

Possible fallbacks:

  • stale cache
  • partial response
  • downgraded feature mode
  • asynchronous acceptance instead of synchronous completion

Examples:

  • recommendation service breaker opens -> serve cached recommendations
  • feature flag dependency opens -> use last-known config snapshot
  • avatar service opens -> return default avatar URL

But be careful: bad fallbacks can hide real outages too well and delay detection.
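
A sketch of the stale-cache fallback, with a metric emitted so the fallback cannot silently mask the outage (the metric helper and cache shape are placeholders):

def fallback_metric(name: str) -> None:
    pass  # placeholder for the real metrics client

def get_recommendations(user_id, breaker, fetch_live, cache: dict):
    if not breaker.allow_request():
        fallback_metric("recs.fallback.stale_cache")
        return cache.get(user_id, [])     # last-known-good, possibly stale
    try:
        fresh = fetch_live(user_id)
        breaker.record_outcome(success=True)
        cache[user_id] = fresh            # refresh the fallback value
        return fresh
    except Exception:
        breaker.record_outcome(success=False)
        fallback_metric("recs.fallback.stale_cache")
        return cache.get(user_id, [])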

Distributed vs local breaker state

Use local process breaker state for the hot path.

Why local?

  • it is cheap
  • it reacts to the caller’s actual experience
  • it avoids putting another dependency in front of every call

Do not centralize breaker state unless there is a strong reason. Centralized breaker systems are often slower, less accurate, and create shared failure modes.

What can be centralized:

  • breaker configuration
  • thresholds
  • feature gates
  • telemetry aggregation

What should remain local:

  • rolling windows
  • current open/closed decision
  • half-open probe accounting
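
A sketch of that split: thresholds arrive from a centrally managed config snapshot refreshed out of band, while the window and the decision stay in-process (the config interface is hypothetical):

class LocalBreaker:
    def __init__(self, config_source):
        self.config_source = config_source  # centralized: thresholds, feature gates
        self.window = []                    # local: rolling outcome samples
        self.state = "CLOSED"               # local: the open/closed decision itself

    def thresholds(self):
        # Cheap read of the last-synced config snapshot; never a network call
        # on the request path, so a config-service outage cannot block decisions.
        return self.config_source.snapshot()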

Latency-aware breaking

Error rate alone is not enough.

A dependency can still return 200 while becoming operationally toxic due to latency.

Example:

  • success rate stays 98%
  • p99 latency jumps from 40 ms to 4 s
  • caller thread pool exhausts anyway

That is why many systems trip on:

  • failure ratio
  • timeout ratio
  • or extreme tail latency

Some systems even use saturation signals like queueing delay or outstanding concurrency as a trip input.
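
A rough sketch of extracting the tail-latency signal from the rolling window (a real implementation would use a histogram or sketch rather than sorting raw samples):

def p99_seconds(latencies_s: list[float]) -> float:
    if not latencies_s:
        return 0.0
    ordered = sorted(latencies_s)
    idx = min(len(ordered) - 1, int(0.99 * len(ordered)))
    return ordered[idx]

# A 98% success rate can coexist with p99_seconds(window) jumping from
# 0.04 to 4.0; the latency trip condition exists to catch exactly that.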

Example local data structure

state
opened_at
cooldown_until
rolling_success_count
rolling_failure_count
rolling_timeout_count
rolling_latency_histogram
half_open_probe_limit
half_open_inflight

The rolling window can be implemented with:

  • fixed time buckets
  • exponentially decayed counters
  • ring buffers per second

Fixed buckets are usually simple and good enough.
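
A sketch of the fixed-bucket variant, with one bucket per second over a ten-second horizon (the class and field names follow the structure above but are otherwise illustrative):

import time
from collections import deque

class FixedBucketWindow:
    """Rolling success/failure/timeout counts over N one-second buckets."""

    def __init__(self, horizon_s: int = 10):
        self.horizon_s = horizon_s
        self.buckets = deque(maxlen=horizon_s)  # [second, successes, failures, timeouts]

    def record(self, success: bool, timed_out: bool = False) -> None:
        sec = int(time.time())
        if not self.buckets or self.buckets[-1][0] != sec:
            self.buckets.append([sec, 0, 0, 0])
        self.buckets[-1][1 if success else 2] += 1
        if timed_out:
            self.buckets[-1][3] += 1

    def totals(self) -> tuple[int, int, int]:
        # Ignore buckets that have aged out of the horizon; idle periods
        # mean the deque can still hold stale entries.
        cutoff = int(time.time()) - self.horizon_s
        live = [b for b in self.buckets if b[0] > cutoff]
        return (
            sum(b[1] for b in live),  # successes
            sum(b[2] for b in live),  # failures
            sum(b[3] for b in live),  # timeouts
        )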

Pseudocode

allow_request():
  if state == OPEN and now < cooldown_until:
    return REJECT_FAST

  if state == OPEN and now >= cooldown_until:
    state = HALF_OPEN
    half_open_inflight = 0        # fresh probe budget for this recovery attempt

  if state == HALF_OPEN:
    if half_open_inflight >= probe_limit:
      return REJECT_FAST
    half_open_inflight += 1
    return ALLOW_PROBE

  return ALLOW_NORMAL

record_outcome(result):
  update rolling metrics

  if state == CLOSED and should_open(metrics):
    state = OPEN
    cooldown_until = now + open_interval

  else if state == HALF_OPEN:
    half_open_inflight -= 1
    if should_reopen(result, metrics):
      state = OPEN
      cooldown_until = now + open_interval
    else if enough_probe_success(metrics):
      state = CLOSED
      reset rolling metrics       # start the CLOSED period with a clean window

Preventing modal collapse in large fleets

In a large fleet, thousands of processes can make the same breaker decision at once.

Use jitter in:

  • open cooldown duration
  • half-open probe timing

Otherwise all instances probe simultaneously and recreate a thundering herd at recovery time.
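
A sketch of jittering the cooldown (the base value and spread are placeholders):

import random

BASE_COOLDOWN_S = 30

def jittered_cooldown_s() -> float:
    # +/-20% spread so thousands of instances do not all leave OPEN and
    # probe the recovering dependency in the same instant.
    return BASE_COOLDOWN_S * random.uniform(0.8, 1.2)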

Interaction with load shedding

Circuit breakers protect against unhealthy dependencies.

Load shedding protects against local saturation.

These should cooperate:

  • dependency breaker opens -> fewer doomed calls consume resources
  • local load shedding activates -> low-priority requests never reach expensive dependency path

This is the combination that keeps a service alive under partial failure plus overload.

Operational metrics

Track:

  • opens per dependency
  • time spent open
  • half-open success ratio
  • fast-fail count
  • fallback count
  • latency before and after breaker activation

If breakers are always open, the dependency is down.

If breakers flap constantly, the thresholds are wrong or the dependency is oscillating.

If breakers never open during known incidents, they are probably too conservative.

Common mistakes

1. one breaker for an entire dependency

This causes unnecessary blast radius.

2. opening on tiny sample sizes

That creates noisy false positives.

3. no half-open limit

Recovery turns into a surge.

4. ignoring latency and only counting HTTP 500s

You miss the dependency that is “technically up” but operationally unusable.

5. retries outside the breaker

This nullifies the protection.

What the senior answer sounds like

I would implement circuit breakers as local per-dependency, per-operation state machines with rolling-window metrics, minimum-volume guards, and hysteresis. The breaker should open on sustained failure, timeout, or extreme latency rather than a few consecutive bad calls. In OPEN it should fail fast, and in HALF_OPEN it should allow only a small probe budget with jitter so recovery does not trigger a herd. Breakers must compose with timeouts, retries, and load shedding: short deadlines feed the breaker accurate signals, retries are bounded and blocked once the breaker is open, and local overload controls stop low-value work before it consumes scarce resources. Configuration can be centralized, but the decision itself should stay local to avoid another critical-path dependency.

Key takeaways

  • A circuit breaker is a state machine plus statistical decision rule, not an if-statement on recent failures.
  • Use rolling windows, minimum volume, hysteresis, and limited half-open probes.
  • Scope breakers to the real fault domain: dependency plus operation, not the whole world.
  • Pair breakers with timeouts, retries, and load shedding or they will not stabilize the system.
  • Keep breaker decisions local; centralize config and telemetry, not the hot-path state.
