Feature Flags Control Plane (Versioning, Distribution, and Safe Rollouts)
Design a feature flag platform that supports low-latency local evaluation, strong auditability, deterministic targeting, and safe configuration rollouts across a fleet.
Problem
At small scale, a feature flag is a boolean in a database.
At large scale, a feature flag platform becomes a control plane:
- thousands of flags
- millions of evaluations per second
- deterministic rollout semantics
- low-latency local decision making
- auditable changes
- instant kill switches
If the platform is badly designed, the “safe rollout” mechanism becomes one of the most dangerous systems in the company.
First principle
A feature flag system is not primarily a UI problem.
It is a distributed configuration system with two planes:
- control plane: author, validate, version, distribute
- data plane: evaluate locally on the request path
That separation is the main design move.
Do not put the configuration database on the hot path of request evaluation.
Goals
The platform should support:
- boolean and multivariate flags
- percentage rollouts
- targeting by tenant, user, region, app version, capability
- instant kill switches
- monotonic rollout progression
- full audit history
- local evaluation in microseconds to low milliseconds
- eventual recovery if a push update is missed
Non-goals
Do not promise:
- globally synchronous config updates on every host at the exact same instant
- unbounded rule complexity with no effect on evaluation cost
- unlimited ad hoc predicates in the request path
Production flag systems work because the request-path evaluator is constrained and compiled.
High-level architecture
operator UI / automation API
-> config authoring service
-> validation / compilation service
-> durable config store
-> distribution stream
SDK / gateway / service process
<- push update stream
<- periodic snapshot polling
-> local in-memory evaluator
Key idea:
- authoring and storage can be slower and richer
- evaluation must be local, tiny, and deterministic
Data model
A flag definition usually needs:
{
"flag_key": "checkout-new-tax-engine",
"version": 17,
"type": "boolean",
"default_variant": "off",
"rules": [
{
"priority": 10,
"match": {
"tenant_tier": ["enterprise"],
"region": ["us-east1", "us-west1"]
},
"variant": "on"
},
{
"priority": 20,
"rollout": {
"attribute": "tenant_id",
"percentage": 5
},
"variant": "on"
}
],
"owner": "checkout-platform",
"created_by": "alice",
"created_at": "...",
"change_reason": "progressive rollout after load test"
}
Important properties:
- stable flag_key
- monotonic version
- explicit default
- deterministic ordering
- audit metadata
Deterministic rollout
Percentage rollout must be sticky.
Bad version:
if rand() < 0.05 then on
This flips the user experience on every evaluation.
Correct version:
bucket = H(flag_key || subject_id) % 10000
enabled if bucket < percentage * 100
That gives:
- deterministic assignment
- monotonic expansion from 5% to 10%
- no per-request randomness
Pick the subject carefully:
- user_id if the user experience should stay sticky
- tenant_id if the rollout should be tenant-wide
- session_id only when session-level variance is acceptable
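A minimal sketch of the sticky bucketing in Python, assuming SHA-256 as the stable hash and the 10,000-bucket space from the formula above; the helper names are illustrative:

import hashlib

BUCKETS = 10_000  # two decimal places of rollout precision

def rollout_bucket(flag_key: str, subject_id: str) -> int:
    # Stable hash of flag_key || subject_id: the same subject always
    # lands in the same bucket, across restarts and across languages.
    digest = hashlib.sha256(f"{flag_key}:{subject_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS

def in_rollout(flag_key: str, subject_id: str, percentage: float) -> bool:
    # Raising percentage from 5 to 10 only adds subjects; nobody who
    # was already enabled falls out, so expansion is monotonic.
    return rollout_bucket(flag_key, subject_id) < percentage * (BUCKETS / 100)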
Local evaluation
Evaluation should happen from a local in-memory snapshot.
A typical per-request flow:
- extract evaluation context
- lookup compiled flag object by key
- execute ordered match rules
- if rollout rule applies, compute deterministic hash bucket
- return variant
There should be no:
- database round trip
- remote RPC to flag service
- dynamic code execution
The evaluator should be predictable enough that teams are comfortable using flags on latency-sensitive paths.
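A condensed sketch of that flow, assuming compiled flag objects shaped like the data model above sit in a process-local dict, and reusing the in_rollout helper from the previous sketch; the names are illustrative:

def evaluate(snapshot: dict, flag_key: str, context: dict) -> str:
    flag = snapshot.get(flag_key)
    if flag is None:
        return "off"  # unknown flag: fall back to a safe default
    for rule in flag["rules"]:  # already sorted by priority at compile time
        match = rule.get("match")
        if match is not None:
            # Every listed attribute must be one of the allowed values.
            if all(context.get(attr) in allowed for attr, allowed in match.items()):
                return rule["variant"]
            continue
        rollout = rule.get("rollout")
        if rollout is not None:
            subject = context.get(rollout["attribute"], "")
            if in_rollout(flag_key, subject, rollout["percentage"]):
                return rule["variant"]
    return flag["default_variant"]

Everything the loop touches is already in memory, so the worst case is a handful of dict lookups and one hash.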
Compilation step
Do not ship raw user-authored JSON directly into the hot path.
Compile into:
- normalized rule trees
- canonical match predicates
- validated type-safe attributes
- precomputed priority order
- efficient rollout hashing metadata
Compilation is where you reject:
- invalid attributes
- impossible comparisons
- overlapping rules with unexpected shadowing
- illegal rollout configurations
This is how the control plane prevents operators from shipping pathological logic to every host.
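A small sketch of the kind of checks the compiler can run before anything is distributed; the validator shape and the known-attribute set are assumptions:

KNOWN_ATTRIBUTES = {"tenant_id", "tenant_tier", "user_id", "region", "app_version"}

def validate(flag: dict) -> list[str]:
    errors = []
    priorities = [rule["priority"] for rule in flag["rules"]]
    if len(priorities) != len(set(priorities)):
        errors.append("duplicate rule priorities create ambiguous ordering")
    for rule in flag["rules"]:
        for attr in rule.get("match", {}):
            if attr not in KNOWN_ATTRIBUTES:
                errors.append(f"unknown attribute: {attr}")
        rollout = rule.get("rollout")
        if rollout and not 0 <= rollout["percentage"] <= 100:
            errors.append("rollout percentage must be within [0, 100]")
    return errors  # any entry means the config is rejected before distribution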
Distribution model
Use push plus periodic pull.
Push
Good for:
- fast propagation
- kill switches
- low steady-state staleness
Examples:
- streaming gRPC
- xDS-like configuration
- Kafka / PubSub fanout through a config delivery layer
Periodic pull / snapshot revalidation
Good for:
- recovering missed updates
- simplifying late join
- verifying local cache integrity
A robust platform uses both.
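One way to combine the two in an SDK, sketched with an assumed push callback and snapshot-fetch transport (both hypothetical):

import time

class FlagSync:
    def __init__(self, client, poll_interval_s: float = 60.0):
        self.client = client   # hypothetical transport to the config service
        self.snapshot = {}     # last-known-good compiled flags
        self.version = 0
        self.poll_interval_s = poll_interval_s

    def on_push(self, update: dict) -> None:
        # Push path: apply only if strictly newer, so a stale or
        # reordered message can never roll the snapshot backward.
        if update["version"] > self.version:
            self.snapshot, self.version = update["flags"], update["version"]

    def poll_loop(self) -> None:
        # Pull path, run on a background thread: repairs missed pushes,
        # serves late joiners, and revalidates the local cache.
        while True:
            time.sleep(self.poll_interval_s)
            try:
                snap = self.client.fetch_snapshot(min_version=self.version)
                if snap:
                    self.on_push(snap)
            except Exception:
                pass  # keep serving the last-known-good snapshot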
Versioning and monotonicity
Every host should know:
- current snapshot version
- last applied change
- last successful sync time
Requests and logs should record:
flag_key
flag_version
variant
match_reason
Without version visibility, debugging a rollout is guesswork.
Staged rollout model
A safe rollout system supports these phases:
- shadow evaluation: compute but do not act
- internal users / test tenants
- small percentage rollout
- regional / tenant expansion
- global enable
Shadow mode is especially important for flags that change:
- data writes
- billing behavior
- routing
- authorization
You want proof that the targeting logic matches reality before turning on the effect.
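A minimal shadow-mode sketch: compute both paths, act only on the old one, and record disagreement. The two engine functions are placeholders, and evaluate is the local evaluator sketched earlier:

import logging

log = logging.getLogger("flag_shadow")

def legacy_tax_engine(order): ...   # existing production path (placeholder)
def new_tax_engine(order): ...      # candidate path under test (placeholder)

def compute_tax(order, snapshot: dict, context: dict):
    old_result = legacy_tax_engine(order)      # the behavior users actually see
    if evaluate(snapshot, "checkout-new-tax-engine", context) == "on":
        new_result = new_tax_engine(order)     # computed, never acted on
        if new_result != old_result:
            log.warning("shadow mismatch: old=%s new=%s", old_result, new_result)
    return old_result

A run of zero mismatches is the proof you want before the flag starts changing real behavior.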
Kill switch design
Some flags exist mainly for emergency disable.
Requirements for kill switches:
- highest distribution priority
- tiny evaluation cost
- clear ownership
- tested regularly
If the system can only disable a broken feature after waiting 10 minutes for polling, it is not an operational kill switch.
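One way to keep that path cheap, assuming kill switches are compiled into a flat set consulted before any rule evaluation; the shape is illustrative:

class KillSwitches:
    def __init__(self):
        self.disabled: frozenset[str] = frozenset()

    def apply_update(self, update: dict) -> None:
        # Kill-switch updates are applied ahead of full snapshot
        # recompilation, so disable takes effect immediately.
        self.disabled = frozenset(update["disabled_flags"])

    def is_killed(self, flag_key: str) -> bool:
        return flag_key in self.disabled  # O(1): no rules, no hashing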
Multi-region behavior
Most control planes are eventually consistent across regions.
That is acceptable if you are explicit:
- normal propagation target: e.g. seconds
- emergency propagation target: e.g. sub-second to a few seconds
- stale snapshot tolerance: bounded and monitored
If a flag changes routing, auth, quota, or safety behavior, you may need stronger guarantees such as ordered regional rollouts or explicit operator control over region activation.
Failure modes
Control plane unavailable
Data plane should keep serving from last-known-good snapshot.
Push channel broken
Polling should repair state.
Bad config published
Need version rollback, change freeze, and kill switch path.
Partial fleet on old version
Need per-host version telemetry so rollout health is visible immediately.
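Last-known-good serving usually means persisting each applied snapshot locally, so even a freshly restarted host can evaluate flags while the control plane is down. A sketch, with an assumed cache path and format:

import json, os, tempfile

SNAPSHOT_PATH = "/var/lib/flags/snapshot.json"  # assumed local cache location

def persist_snapshot(flags: dict, version: int) -> None:
    # Write-then-rename so a crash mid-write never corrupts the cache.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(SNAPSHOT_PATH))
    with os.fdopen(fd, "w") as f:
        json.dump({"version": version, "flags": flags}, f)
    os.replace(tmp, SNAPSHOT_PATH)

def load_last_known_good() -> tuple[dict, int]:
    try:
        with open(SNAPSHOT_PATH) as f:
            data = json.load(f)
        return data["flags"], data["version"]
    except (OSError, ValueError):
        return {}, 0  # no cache yet: serve defaults until the first sync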
Schema validation and policy linting
Strong systems reject dangerous config before distribution.
Useful validators:
- attribute existence and type checking
- duplicate priority detection
- dead rule detection
- rollout subject sanity
- flag dependency cycle detection
Example anti-pattern:
if region == "us-east1" then on
if tenant_tier == "enterprise" then off
If ordering is unclear, operators will reason incorrectly about the result.
Flag dependencies
Avoid deep flag dependency chains on the request path.
Bad:
flag A depends on flag B depends on flag C depends on experiment D
That creates:
- evaluation complexity
- hidden precedence
- operator confusion
If dependencies are necessary, compile them into an acyclic resolved form and enforce depth limits.
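A sketch of the enforcement side, assuming the compiler sees a mapping from each flag to the flags it reads; the depth limit and names are illustrative:

def check_dependencies(deps: dict[str, list[str]], max_depth: int = 2) -> None:
    # deps maps each flag to the flags it reads, e.g. {"A": ["B"], "B": ["C"]}.
    def depth(flag: str, seen: tuple) -> int:
        if flag in seen:
            raise ValueError(f"dependency cycle through {flag}")
        children = deps.get(flag, [])
        if not children:
            return 0
        return 1 + max(depth(child, seen + (flag,)) for child in children)

    for flag in deps:
        if depth(flag, ()) > max_depth:
            raise ValueError(f"{flag} exceeds dependency depth {max_depth}")

Rejecting the chain at compile time keeps the request-path evaluator flat.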
Observability
Track:
- flag evaluation count by key and variant
- snapshot versions by host
- push propagation lag
- polling freshness lag
- config rejection count
- rollout distribution skew
For sensitive flags, sampled evaluation logs are useful:
{
"flag_key": "checkout-new-tax-engine",
"version": 17,
"subject": "tenant_123",
"variant": "on",
"reason": "percentage_rollout(bucket=311)"
}
Do not log raw PII. Use stable identifiers or hashes.
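A small sketch of that, assuming a keyed hash over the raw identifier; the salt handling is illustrative and would live in a secret store:

import hashlib, hmac

LOG_SALT = b"rotate-me-out-of-band"  # assumed secret, never in flag config

def loggable_subject(raw_id: str) -> str:
    # Keyed hash: stable enough to join sampled logs on,
    # useless for recovering the original identifier.
    return hmac.new(LOG_SALT, raw_id.encode(), hashlib.sha256).hexdigest()[:16]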
Common mistakes
1. remote evaluation on the request path
This turns the flag service into a dependency for every request.
2. non-sticky percentage rollout
That causes flicker and invalid experiments.
3. no audit trail
Then incident review devolves into screenshots and memory.
4. too much rule expressiveness
An unrestricted DSL often becomes impossible to reason about or evaluate cheaply.
5. no rollback discipline
If rollback is manual and slow, operators stop trusting the platform.
What the senior answer sounds like
I would design feature flags as a control plane plus local evaluation data plane. Operators write versioned configurations through an authoring API, the config is validated and compiled into a restricted efficient representation, and then distributed to services through push plus periodic snapshot polling. Request-path evaluation must be local, deterministic, and sticky for percentage rollout using a stable hash of flag key and subject. The system needs first-class audit metadata, rollout stages, kill switches, and per-host version telemetry, because the real production problems are bad config, partial propagation, and unclear precedence rather than just boolean lookup.
Key takeaways
- A serious flag system is a distributed configuration platform, not a database table.
- Keep remote storage out of the hot path; evaluate from local compiled snapshots.
- Percentage rollout must be deterministic and sticky.
- Use push plus pull, version every snapshot, and log applied versions.
- Auditability, rollback, and validation are part of the design, not extras.