Skip to main content
Back to the DSA atlas
Systems & storage Algorithm Advanced

External Merge Sort

Sort data larger than RAM by generating sorted runs and merging them with mostly sequential disk I/O.

sortingdiskdatabaserare

Interactive visualization

Use the controls to connect the idea to concrete operations before diving into the full write-up.

heap over run heads
->
emit smallest
->
refill from that run

Sort more data than fits in memory

External merge sort turns merge sort into an I/O-aware pipeline: create sorted runs that fit in memory, then merge them with sequential reads instead of random access.

Family

Systems & storage

Database indexes, cache policies, query operators, sharding, and external-memory algorithms.

Builds on

2 topics

Read these first if you want the surrounding context.

Unlocks

1 next topic

Use these follow-ups when you want to apply the current idea in a richer setting.

Learning paths

2

This topic appears in curated progressions so you can keep moving without guessing the next step.

Choose this over that

Use external sorting when the data outgrows RAM

The comparison is not just "sort or not." It is whether ordered disk passes are cheaper than alternative operators or layouts.

Sorting Open topic

Merge Sort

Choose this when: The whole dataset fits comfortably in memory and plain in-memory recursion is enough.

Choose something else when: Sorting spills beyond RAM and disk passes dominate.

Current topic

External Merge Sort

Choose this when: The data exceeds RAM and sequential multi-pass I/O is the practical way to produce sorted order.

Choose something else when: The problem is really about maintaining a long-lived index or a write-optimized store.

Systems & storage Open topic

Hash Join & Merge Join

Choose this when: The sorted order will immediately feed a merge join or another ordered database operator.

Choose something else when: The workload is not actually paying for order downstream.

Problem

Plain merge sort assumes the whole array fits in memory. Database and analytics systems routinely need to sort data far larger than RAM.

When that happens, the real enemy is not comparison count. It is disk I/O, especially random I/O.

Intuition

External merge sort keeps the merge-sort idea but changes the cost model.

Instead of splitting recursively inside RAM, it works in two big phases:

  1. run generation: read as much as memory can hold, sort that chunk in RAM, and write it back as a sorted run
  2. k-way merging: merge many sorted runs using sequential reads and writes

The algorithm is trying to touch disk in long, predictable sweeps instead of random jumps.

Phase 1: generate sorted runs

Assume memory can hold M pages and the input has N pages.

generate_runs(input, M):
    runs <- []
    while input still has data:
        chunk <- read up to M pages
        sort chunk in memory
        run <- write chunk back to disk
        runs.append(run)
    return runs

After this phase, the input has been transformed into ceil(N / M) sorted runs on disk.

Phase 2: k-way merge

Use one input buffer per run, one output buffer, and a min-heap that stores the smallest current item from each run.

k_way_merge(runs):
    heap <- empty min-heap

    for each run:
        load its first buffered item into heap

    while heap is not empty:
        item, runId <- heap.pop_min()
        append item to output buffer
        if output buffer is full:
            flush it to disk
        if runId has more buffered data:
            push next item from runId into heap
        else if runId has more disk pages:
            load next page from runId
            push its first item into heap

The heap keeps the current minimum among all active runs, which is exactly the same idea as merge sort generalized from 2 runs to k runs.

Worked example

Suppose the file has N = 1000 pages and memory holds M = 100 pages.

Run generation

  • read 100 pages
  • sort them in memory
  • write them back
  • repeat 10 times

Now there are 10 sorted runs.

Merge

If memory can support 10 input buffers plus 1 output buffer, the whole file can be merged in one pass.

Total I/O:

  • read + write during run generation -> 2N
  • read + write during merge -> 2N

So the whole sort costs about 4N page I/Os.

That is the type of accounting systems engineers actually care about.

Complexity

Let R = ceil(N / M) be the number of initial runs, and let k be the merge fan-in.

MeasureCost
Run generation CPUin-memory sort per chunk
Merge CPUO(N log k) comparisons over all items
Initial runsR = ceil(N / M)
Extra merge passesceil(log_k R)
Total I/Oabout 2N * (1 + ceil(log_k R)) page transfers

The asymptotic CPU view matters less than the number of full disk passes.

When external merge sort is the right tool

SituationWhy it fits
ORDER BY on data larger than memorySorting can be staged through disk
Sort-merge joinMerge join needs ordered inputs
LSM compaction-like pipelinesMany sorted runs need to be merged efficiently
Offline data preparationSequential disk passes are acceptable

External merge sort vs. in-memory merge sort

PropertyIn-memory merge sortExternal merge sort
Primary costCPU comparisonsDisk passes and buffer management
Working setFull array fits in RAMData exceeds RAM
Merge shapeUsually 2-wayOften k-way
Key optimization targetClean recursion and cache localitySequential I/O and fewer full passes

Key takeaways

  • External merge sort is merge sort adapted to the memory hierarchy, especially disks and pages.
  • Run generation creates sorted chunks; k-way merging stitches them back together.
  • A heap is the natural data structure for the k-way merge frontier.
  • The most important formula is not a pure CPU bound. It is the number of full read/write passes over the data.
  • This algorithm is foundational for database sorting, merge joins, and LSM-style compaction work.

Practice problems

ProblemDifficultyKey idea
External sorting overviewHardRun generation plus multi-way merge
CMU storage notesHardSorting under page constraints
Use The Index, Luke - sortingHardWhy ordered external processing matters in SQL engines

Relation to other topics

  • Merge Sort supplies the split-and-merge intuition.
  • Heap usually powers the k-way merge frontier.
  • LSM Tree compaction has the same merge-heavy flavor, just spread across levels and files.
  • Hash Join & Merge Join rely on external sorting when ordered inputs are not already available.

Build on these first

These topics supply the mental model or underlying structure that this page assumes.

What this enables

Once the current idea feels natural, these are the most useful next jumps.

Related directions

These topics live nearby conceptually, even if they are not direct prerequisites.

More from Systems & storage

Stay in the same family when you want parallel variations of the same mental model.

Paths that include this topic

Follow one of these sequences if you want a guided next step instead of open-ended browsing.

Storage engines

Compare fanout-heavy indexes, write-optimized trees, and external-memory sorting in one storage-oriented path.

Database operators

See how external sorting and join strategies turn in-memory algorithms into database execution plans.

From the blog

Pair the atlas with recent writing from the rest of the site when you want a broader engineering perspective alongside the topic graph.