
LSM Tree

Trade read complexity for write throughput by buffering writes in memory, flushing sorted runs, and compacting them in the background.

Tags: lsm-tree · storage · database · rare

[Interactive visualization: memtable → WAL → L0 SSTable]
Reads consult multiple components, often using Bloom filters to skip SSTables that definitely do not contain the key.

Write-optimized storage through sorted runs

Writes land in memory first, then flush out as immutable sorted files.


Choose this over that

Write-optimized storage is a different trade-off, not a free upgrade

LSM Trees win by postponing order maintenance, then paying for it later with read and compaction complexity.


B+ Tree

Choose this when: Ordered reads, point lookups, and range scans should remain predictable on the foreground path.

Choose something else when: Background compaction and write buffering are acceptable trade-offs.

LSM Tree (current topic)

Choose this when: The workload is write-heavy enough that buffered writes and compaction are worth the extra read complexity.

Choose something else when: The system needs simpler read paths or more direct page-level updates.


External Merge Sort

Choose this when: You are reasoning about the merge-heavy I/O pattern that compaction is built from.

Choose something else when: The question is about a long-lived indexed storage layout, not a sorting pass.

Problem

B+ Trees are excellent when reads and ordered updates dominate, but some workloads are overwhelmingly write-heavy.

If every insert immediately modifies a page-organized on-disk structure, write amplification and random I/O can become the bottleneck.

Intuition

An LSM Tree delays expensive on-disk organization.

Instead of maintaining one perfectly updated tree at all times, it stages the write path:

  1. accept writes into an in-memory sorted structure
  2. append them to a write-ahead log for durability
  3. flush full memory buffers into immutable sorted files
  4. compact those files later in large sequential batches

So the design trades eager organization for batched organization.

Core components

  • WAL: append-only durability log
  • memtable: in-memory sorted structure, often a skip list
  • immutable memtable: waiting to flush
  • SSTables: immutable sorted files on disk
  • Bloom filters / indexes: help avoid checking every SSTable on reads
  • compaction: background merge process that rewrites and cleans overlapping runs
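
To hold the pieces together, here is a minimal Python skeleton that maps each component to a field. It is an illustration only, not any engine's actual layout: a plain dict stands in for the skip-list memtable, a list of tuples for each SSTable, and a key set for each Bloom filter.

class LSMStore:
    def __init__(self):
        self.wal = []                  # append-only durability log (a file in practice)
        self.memtable = {}             # active in-memory table (often a skip list)
        self.immutable_memtables = []  # frozen memtables waiting to flush
        self.sstables = []             # immutable sorted runs, newest first
        self.filters = []              # one per SSTable, to skip runs missing the key
        # compaction would run in a background thread in a real engine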

Write path

put(key, value):
    append (key, value) to WAL
    memtable.insert(key, value)

    if memtable is full:
        freeze memtable as immutable
        create new empty memtable
        flush immutable memtable to a new SSTable

The foreground write is cheap because it mostly touches memory and the end of the WAL.
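
The same path in runnable form, as a sketch under deliberate simplifications: lists and dicts stand in for the WAL file and memtable, the flush happens inline rather than in the background, and MEMTABLE_LIMIT is an illustrative name rather than any engine's setting.

MEMTABLE_LIMIT = 3        # tiny threshold so flushes are easy to trigger

wal = []                  # stand-in for an append-only log file
memtable = {}             # stand-in for a sorted in-memory structure
sstables = []             # newest first; each SSTable is a sorted list of (key, value)

def put(key, value):
    wal.append((key, value))          # durability first
    memtable[key] = value             # then the in-memory structure
    if len(memtable) >= MEMTABLE_LIMIT:
        sstables.insert(0, sorted(memtable.items()))  # flush an immutable sorted run
        memtable.clear()              # real engines freeze and swap instead
        # WAL entries covered by the flush could be dropped here

for i in range(9):
    put(f"user:{i % 3}", i)           # repeated keys shadow older versions

print(sstables)                       # three runs; newer values sit in newer runs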

Read path

get(key):
    if key is in memtable:
        return newest value

    if key is in immutable memtables:
        return newest value

    for each relevant SSTable from newest to oldest:
        if Bloom filter says definitely absent:
            continue
        if on-disk index may contain key:
            search the SSTable
            if found:
                return value

    return NOT_FOUND

Reads are harder than writes because the newest value may exist in several layers.
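
Here is that layering as a runnable sketch, with exact key sets standing in for Bloom filters (a real filter can also return false positives, which merely costs a wasted lookup). The data is hand-built to mirror the layers above.

from bisect import bisect_left

memtable = {"user:7": "v3"}                    # newest write
immutable_memtables = [{"user:2": "x"}]        # frozen memtables, newest first
sstables = [                                   # sorted runs on disk, newest first
    [("user:5", "y"), ("user:7", "v2")],
    [("user:7", "v1"), ("user:9", "z")],
]
filters = [{k for k, _ in run} for run in sstables]  # exact sets as filter stand-ins

def get(key):
    if key in memtable:
        return memtable[key]
    for imm in immutable_memtables:
        if key in imm:
            return imm[key]
    for run, keys in zip(sstables, filters):   # newest SSTable first
        if key not in keys:                    # "definitely absent": skip this run
            continue
        i = bisect_left(run, (key,))           # binary search within the sorted run
        if i < len(run) and run[i][0] == key:
            return run[i][1]
    return None                                # NOT_FOUND

print(get("user:7"))                           # v3: memtable shadows v2 and v1
print(get("user:9"))                           # z: only the oldest run is searched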

Compaction

Compaction is the background merge engine that keeps the system from degenerating into infinitely many tiny sorted files.

compact(selectedTables):
    stream the selected SSTables in key order
    keep only the newest visible version of each key
    write merged output into larger SSTables
    delete old input tables

This is basically external merge work plus version cleanup.
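
In runnable form it is a k-way merge. The sketch below assumes each run holds (key, seq, value) entries sorted by key and then sequence number, with a larger seq meaning a newer write; heapq.merge does the streaming. A real compactor would also drop tombstones once no older run could still contain the key.

import heapq

def compact(runs):
    # runs are sorted lists of (key, seq, value); heapq.merge streams them
    # in (key, seq) order, so the last entry seen for a key is the newest
    out = []
    for entry in heapq.merge(*runs):
        if out and out[-1][0] == entry[0]:
            out[-1] = entry           # newer version replaces the older one
        else:
            out.append(entry)
    return out                        # one merged, deduplicated sorted run

older = [("a", 1, "a1"), ("b", 2, "b1")]
newer = [("a", 3, "a2"), ("c", 4, "c1")]
print(compact([older, newer]))
# [('a', 3, 'a2'), ('b', 2, 'b1'), ('c', 4, 'c1')] -- stale ('a', 1, 'a1') is gone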

Leveled vs. tiered compaction

Policy               | Shape                               | Strength                                      | Weakness
Leveled              | Each level has bounded overlap      | Better point reads, lower space amplification | More write amplification
Tiered / size-tiered | Several runs per level before merge | Better write throughput                       | More read amplification

This is the same recurring systems trade-off: write now and clean later, or organize more aggressively up front.

Worked example

Suppose key user:7 is updated three times.

  1. newest value lives in the memtable
  2. old values still exist in one or more SSTables
  3. reads must search newest structures first
  4. compaction eventually merges those files and discards stale versions

So an LSM Tree is not just a tree. It is a pipeline of sorted structures across time.

Complexity and trade-offs

Exact costs depend on compaction policy, but the qualitative picture is stable.

Operation  | Typical story
Write      | Very fast foreground path, usually close to append + memory insert
Point read | May touch several levels, but Bloom filters cut many misses
Range scan | Good sequential behavior, but may need to merge overlapping runs
Space      | Extra space is needed during compaction and while multiple versions coexist

The key systems terms are:

  • write amplification: how much extra rewriting compaction creates
  • read amplification: how many structures a read may need to inspect
  • space amplification: how much extra storage is consumed by duplicates and overlap
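
To make the terms concrete, here is back-of-envelope arithmetic under a common simplified model (each level roughly size_ratio times larger than the one above; leveled compaction rewriting a byte about size_ratio times per level, tiered about once per level). The numbers illustrate the shape of the trade-off, not any real engine's behavior.

size_ratio = 10   # assumed growth factor between adjacent levels
levels = 4        # assumed number of on-disk levels

# leveled: little overlap per level, but merges rewrite data repeatedly
leveled_write_amp = size_ratio * levels   # ~40x rewriting per byte
leveled_read_amp = levels                 # ~1 candidate run per level

# tiered: runs accumulate per level, so writes are cheap and reads pay
tiered_write_amp = levels                 # ~4x rewriting per byte
tiered_read_amp = size_ratio * levels     # up to ~40 runs before Bloom filters help

print(leveled_write_amp, leveled_read_amp)  # 40 4
print(tiered_write_amp, tiered_read_amp)    # 4 40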

LSM Tree vs. B+ Tree

Question             | LSM Tree                                       | B+ Tree
Write-heavy workload | Usually better                                 | Often worse
Point reads          | Can be good, but depends on filters and levels | Very predictable
Range scans          | Good, but may merge multiple runs              | Excellent ordered leaf scans
Core cost            | Background compaction                          | Page updates and rebalancing

Key takeaways

  • An LSM Tree is built around buffering, flushing, and compaction.
  • The foreground write path is cheap because it writes sequentially and updates memory first.
  • Reads are harder, which is why Bloom filters, indexes, and compaction policy matter so much.
  • Compaction is the hidden algorithmic heart of the structure.
  • LSM Trees are the natural choice when write throughput matters enough to justify more complex read behavior.

Practice problems

Problem                      | Difficulty | Key idea
LSM Tree overview            | Hard       | Memtables, SSTables, and compaction
RocksDB overview             | Hard       | Real storage-engine trade-offs
LevelDB implementation notes | Hard       | Sorted runs plus background merging

Relation to other topics

  • Skip List is a common memtable implementation.
  • Bloom Filter & Cuckoo Filter explain how reads can skip many SSTables cheaply.
  • External Merge Sort captures the merge-heavy flavor of compaction work.
  • B+ Tree is the classic ordered-index alternative when the workload trade-offs differ.
