LSM Tree
Trade read complexity for write throughput by buffering writes in memory, flushing sorted runs, and compacting them in the background.
Write-optimized storage through sorted runs
Writes land in memory first, then flush to disk as immutable sorted files.
Family
Systems & storage
Database indexes, cache policies, query operators, sharding, and external-memory algorithms.
Choose this over that
Write-optimized storage is a different trade-off, not a free upgrade
LSM Trees win by postponing order maintenance, then paying for it later with read and compaction complexity.
B+ Tree
Choose this when: Ordered reads, point lookups, and range scans should remain predictable on the foreground path.
Choose something else when: Background compaction and write buffering are acceptable trade-offs for higher write throughput.
LSM Tree
Choose this when: The workload is write-heavy enough that buffered writes and compaction are worth the extra read complexity.
Choose something else when: The system needs simpler read paths or more direct page-level updates.
External Merge Sort
Choose this when: You are reasoning about the merge-heavy I/O pattern that compaction is built from.
Choose something else when: The question is about a long-lived indexed storage layout, not a sorting pass.
Problem
B+ Trees are excellent when reads and ordered updates dominate, but some workloads are overwhelmingly write-heavy.
If every insert immediately modifies a page-organized on-disk structure, write amplification and random I/O can become the bottleneck.
Intuition
An LSM Tree delays expensive on-disk organization.
Instead of maintaining one perfectly updated tree at all times, it stages the write path:
- accept writes into an in-memory sorted structure
- append them to a write-ahead log for durability
- flush full memory buffers into immutable sorted files
- compact those files later in large sequential batches
So the design trades eager organization for batched organization.
Core components
- WAL: append-only durability log
- memtable: in-memory sorted structure, often a skip list
- immutable memtable: waiting to flush
- SSTables: immutable sorted files on disk
- Bloom filters / indexes: help avoid checking every SSTable on reads
- compaction: background merge process that rewrites and cleans overlapping runs
Write path
```
put(key, value):
    append (key, value) to WAL
    memtable.insert(key, value)
    if memtable is full:
        freeze memtable as immutable
        create new empty memtable
        flush the immutable memtable to a new SSTable (in the background)
```
The foreground write is cheap because it mostly touches memory and the end of the WAL.
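To make that concrete, here is a minimal, single-threaded Python sketch of the write path. Everything in it is invented for illustration (the LSMWriteBuffer class, the MEMTABLE_LIMIT threshold, JSON lines as the WAL format); a real engine would use a skip list for the memtable, a binary SSTable format, and a background thread for the flush.

```python
import json
import os

MEMTABLE_LIMIT = 4  # tiny threshold so the flush path is easy to observe

class LSMWriteBuffer:
    """Toy write path: WAL append + sorted in-memory buffer + flush to SSTable."""

    def __init__(self, data_dir):
        self.data_dir = data_dir
        os.makedirs(data_dir, exist_ok=True)
        self.wal = open(os.path.join(data_dir, "wal.log"), "a")
        self.memtable = {}       # a dict here; real engines use a skip list or similar
        self.sstable_count = 0

    def put(self, key, value):
        # 1. Durability first: append the record to the write-ahead log.
        self.wal.write(json.dumps([key, value]) + "\n")
        self.wal.flush()
        # 2. Then update the in-memory table (newest value wins).
        self.memtable[key] = value
        # 3. If the memtable is full, flush it as an immutable sorted file.
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self._flush()

    def _flush(self):
        path = os.path.join(self.data_dir, f"sstable_{self.sstable_count:04d}.json")
        # Write keys in sorted order: this is what makes the file an SSTable.
        with open(path, "w") as f:
            json.dump(sorted(self.memtable.items()), f)
        self.sstable_count += 1
        self.memtable = {}  # fresh, empty memtable for new writes
```

Note the ordering inside put: the WAL append happens before the memtable insert, so a crash between the two steps can be replayed from the log.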
Read path
```
get(key):
    if key is in memtable:
        return newest value
    if key is in immutable memtables:
        return newest value
    for each relevant SSTable from newest to oldest:
        if Bloom filter says definitely absent:
            continue
        if on-disk index may contain key:
            search the SSTable
            if found:
                return value
    return NOT_FOUND
```
Reads are harder than writes because the newest value may exist in several layers.
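Continuing the same toy layout, a read-path sketch might look like the following. An exact Python set of keys stands in for the Bloom filter: it plays the same "skip this file" role, but a real Bloom filter would be probabilistic and far smaller. All names are again illustrative.

```python
import bisect
import glob
import json

NOT_FOUND = object()  # sentinel, so None stays a legal stored value

def load_tables(data_dir):
    """Load SSTables newest-first; an exact key set stands in for a Bloom filter."""
    tables = []
    for path in sorted(glob.glob(f"{data_dir}/sstable_*.json"), reverse=True):
        with open(path) as f:
            entries = json.load(f)  # sorted list of [key, value] pairs
        keys = [k for k, _ in entries]
        tables.append((set(keys), keys, entries))
    return tables

def get(key, memtable, tables):
    # Newest data first: the memtable always wins over any on-disk file.
    if key in memtable:
        return memtable[key]
    for key_set, keys, entries in tables:  # newest SSTable to oldest
        if key not in key_set:             # "definitely absent" -> skip this file
            continue
        i = bisect.bisect_left(keys, key)  # binary search within the sorted run
        if i < len(keys) and keys[i] == key:
            return entries[i][1]
    return NOT_FOUND
```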
Compaction
Compaction is the background merge engine that keeps the system from degenerating into infinitely many tiny sorted files.
```
compact(selectedTables):
    stream the selected SSTables in key order
    keep only the newest visible version of each key
    write merged output into larger SSTables
    delete old input tables
```
This is basically external merge work plus version cleanup.
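A compaction sketch under the same assumptions (each table is a list of (key, value) pairs sorted by key, tables ordered newest-first) is essentially a heap-based k-way merge with duplicate suppression:

```python
import heapq

def compact(tables):
    """Merge sorted runs into one, keeping only the newest version of each key.

    `tables` is ordered newest-first; each table is a list of (key, value)
    pairs sorted by key, the shape the flush sketch above produces.
    """
    # Tag every entry with its table's age (0 = newest) so that when the same
    # key appears in several runs, the newest version sorts first.
    tagged = [
        [(key, age, value) for key, value in table]
        for age, table in enumerate(tables)
    ]
    merged = []
    last_key = object()  # sentinel that never equals a real key
    for key, _age, value in heapq.merge(*tagged):  # k-way merge over sorted runs
        if key != last_key:  # first hit for a key is its newest version
            merged.append((key, value))
            last_key = key
        # older versions of an already-emitted key are dropped here
    return merged
```

A production engine streams the runs from disk instead of materializing them in memory, but the merge logic has the same shape as external merge sort.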
Leveled vs. tiered compaction
| Policy | Shape | Strength | Weakness |
|---|---|---|---|
| Leveled | Each level has bounded overlap | Better point reads, lower space amplification | More write amplification |
| Tiered / size-tiered | Several runs per level before merge | Better write throughput | More read amplification |
This is the same recurring systems trade-off: write now and clean later, or organize more aggressively up front.
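One toy way to see the difference in code (all names, budgets, and fanouts invented for this sketch): size-tiered compaction waits for several runs to pile up at a level, while leveled compaction reacts as soon as a level outgrows its size budget.

```python
def tiered_pick(levels, runs_per_level=4):
    """Size-tiered trigger: merge a level once enough runs pile up there."""
    for depth, runs in enumerate(levels):
        if len(runs) >= runs_per_level:
            return depth  # merge all runs at this depth into one run below
    return None

def leveled_pick(level_bytes, base=64 * 2**20, growth=10):
    """Leveled trigger: merge as soon as a level exceeds its size budget."""
    for depth, size in enumerate(level_bytes):
        if size > base * growth**depth:
            return depth  # push overlapping data down into depth + 1
    return None
```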
Worked example
Suppose key user:7 is updated three times.
- newest value lives in the memtable
- old values still exist in one or more SSTables
- reads must search newest structures first
- compaction eventually merges those files and discards stale versions
So an LSM Tree is not just a tree. It is a pipeline of sorted structures across time.
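In miniature, with hypothetical values v1 through v3, the whole story fits in a few lines of Python:

```python
# Three updates to user:7, spread across the layers of a toy store.
sstables = [
    [("user:7", "v2")],  # newer on-disk run, flushed second
    [("user:7", "v1")],  # older on-disk run, flushed first
]
memtable = {"user:7": "v3"}  # newest value, still in memory

def get(key):
    if key in memtable:       # layer 1: the memtable wins
        return memtable[key]
    for run in sstables:      # layer 2: newest run to oldest
        for k, v in run:
            if k == key:
                return v      # first hit is the newest on-disk version
    return None

assert get("user:7") == "v3"  # reads always see the newest version

# Compacting the two runs keeps only v2 and discards the stale v1:
# [("user:7", "v2")], [("user:7", "v1")]  ->  [("user:7", "v2")]
```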
Complexity and trade-offs
Exact costs depend on compaction policy, but the qualitative picture is stable.
| Operation | Typical story |
|---|---|
| Write | Very fast foreground path, usually close to append + memory insert |
| Point read | May touch several levels, but Bloom filters cut many misses |
| Range scan | Good sequential behavior, but may need to merge overlapping runs |
| Space | Extra space is needed during compaction and while multiple versions coexist |
The key systems terms are:
- write amplification: how much extra rewriting compaction creates
- read amplification: how many structures a read may need to inspect
- space amplification: how much extra storage is consumed by duplicates and overlap
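With deliberately invented numbers: if an application writes 1 GB of new data and compaction rewrites each byte roughly four more times on its way down the levels, the device absorbs about 5 GB, so write amplification is about 5; if a point read consults one memtable and three SSTables, read amplification is 4; and if stale versions linger at 20% of the live data size between compactions, space amplification is roughly 1.2.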
LSM Tree vs. B+ Tree
| Question | LSM Tree | B+ Tree |
|---|---|---|
| Write-heavy workload | Usually better | Often worse |
| Point reads | Can be good, but depends on filters and levels | Very predictable |
| Range scans | Good, but may merge multiple runs | Excellent ordered leaf scans |
| Core cost | Background compaction | Page updates and rebalancing |
Key takeaways
- An LSM Tree is built around buffering, flushing, and compaction.
- The foreground write path is cheap because it writes sequentially and updates memory first.
- Reads are harder, which is why Bloom filters, indexes, and compaction policy matter so much.
- Compaction is the hidden algorithmic heart of the structure.
- LSM Trees are the natural choice when write throughput matters enough to justify more complex read behavior.
Practice problems
| Problem | Difficulty | Key idea |
|---|---|---|
| LSM Tree overview | Hard | Memtables, SSTables, and compaction |
| RocksDB overview | Hard | Real storage-engine trade-offs |
| LevelDB implementation notes | Hard | Sorted runs plus background merging |
Relation to other topics
- Skip List is a common memtable implementation.
- Bloom Filter & Cuckoo Filter explain how reads can skip many SSTables cheaply.
- External Merge Sort captures the merge-heavy flavor of compaction work.
- B+ Tree is the classic ordered-index alternative when the workload trade-offs differ.
Build on these first
These topics supply the mental model or underlying structure that this page assumes.
Merge Sort
Divide the array in half, sort each half, merge them back. Merge sort is the gold standard for stable, predictable O(n log n) sorting.
Skip List
Layer multiple forward-pointer lists on top of each other so ordered search behaves like a randomized tree while staying list-based.
Related directions
These topics live nearby conceptually, even if they are not direct prerequisites.
B-Tree
Store many keys per node so ordered search stays shallow and expensive page reads stay low.
B+ Tree
Keep routing keys in internal nodes and all records in linked leaves so point lookups stay shallow and range scans stay fast.
External Merge Sort
Sort data larger than RAM by generating sorted runs and merging them with mostly sequential disk I/O.
Bloom Filter & Cuckoo Filter
Use tiny approximate membership structures to reject absent keys cheaply before touching slower storage or larger indexes.
More from Systems & storage
Stay in the same family when you want parallel variations of the same mental model.
B-Tree
Store many keys per node so ordered search stays shallow and expensive page reads stay low.
B+ Tree
Keep routing keys in internal nodes and all records in linked leaves so point lookups stay shallow and range scans stay fast.
Cache Eviction Strategies (LRU, LFU, ARC, TinyLFU)
Treat cache eviction as an algorithm choice: recency, frequency, adaptation, and admission all optimize different workloads.
Consistent Hashing
Map keys onto a ring so cluster membership changes move only nearby keys instead of reshuffling the whole keyspace.
Paths that include this topic
Follow one of these sequences if you want a guided next step instead of open-ended browsing.
Storage engines
Compare fanout-heavy indexes, write-optimized trees, and external-memory sorting in one storage-oriented path.