Hash Join & Merge Join
Compare the two most important database join strategies: build-and-probe hashing versus ordered merge scanning.
Interactive visualization
Use the controls to connect the idea to concrete operations before diving into the full write-up.
Different joins win under different access patterns
Hash join is usually the default when equality predicates dominate and one side fits comfortably as a hash table.
Family
Systems & storage
Database indexes, cache policies, query operators, sharding, and external-memory algorithms.
Builds on
2 topics
Read these first if you want the surrounding context.
Unlocks
0 next topics
Use these follow-ups when you want to apply the current idea in a richer setting.
Learning paths
1
This topic appears in curated progressions so you can keep moving without guessing the next step.
Choose this over that
Choose the join operator that matches the data layout
Hash join and merge join can produce the same rows while paying very different preparation costs.
Hash join
Choose this when: The join is an equality join, one side fits in memory, and no useful order already exists.
Choose something else when: Sorted order already exists or should be preserved for downstream work.
Merge join
Choose this when: Both sides are already sorted or sorting them still pays off because ordered pipelines matter.
Choose something else when: Hashing one side is cheaper than creating sorted inputs.
External Merge Sort
Choose this when: You need the disk-aware sorting pipeline that often makes merge join possible at scale.
Choose something else when: The inputs are already ordered or hash join is the better operator.
Problem
A SQL join is not one algorithm. A database optimizer must choose a physical operator based on:
- join predicate shape
- whether inputs are already sorted
- how much memory is available
- whether disk I/O will dominate the work
Two of the most important operators are hash join and merge join.
Intuition
Both operators solve the same high-level task: pair rows from R and S whose join keys match.
They differ in what structure they build first:
- hash join builds a hash table on one side, then probes it
- merge join relies on both sides being sorted, then walks them like a merge step
So the choice is really:
should we pay for hashing, or should we pay for sorted order?
Hash join
Hash join is the standard equality-join workhorse.
Algorithm
Choose the smaller input as the build side.
hash_join(R, S):
H <- empty hash table
for row in R:
H[row.key].append(row)
for row in S:
for match in H[row.key]:
emit combine(match, row)
Why it works
The build phase groups rows by join key. The probe phase turns each lookup on S into a constant-time average hash-table access.
Best use case
- equality joins
- smaller build side fits comfortably in memory
- no useful sorted order already exists
If the build side does not fit in memory, databases fall back to partitioned or grace-hash variants that hash both sides into smaller chunks first.
Merge join
Merge join wins when both inputs are sorted on the join key, or when sorting them is still cheaper than building and probing hashes.
Algorithm
merge_join(R_sorted, S_sorted):
i <- 0
j <- 0
while i < len(R_sorted) and j < len(S_sorted):
if R_sorted[i].key < S_sorted[j].key:
i <- i + 1
else if R_sorted[i].key > S_sorted[j].key:
j <- j + 1
else:
key <- R_sorted[i].key
rGroup <- all rows in R_sorted with this key
sGroup <- all rows in S_sorted with this key
emit every pair from rGroup x sGroup
advance i and j past those groups
Why it works
Once both sides are sorted, smaller keys can never match larger keys later. That lets the pointers move only forward.
Best use case
- inputs are already sorted by index or previous operators
- range-oriented ordered pipelines matter
- merge order will be reused by downstream operators
Worked example
Suppose we join:
Orders(order_id, customer_id)Customers(customer_id, name)
Hash join view
Build a hash table on Customers by customer_id, then stream through Orders and probe each order’s customer.
Merge join view
If both tables are already sorted by customer_id, keep two pointers and advance whichever side is smaller until keys line up.
The output rows are identical. The cost profile is not.
Complexity and cost model
Let |R| = r and |S| = s.
| Operator | CPU view | Memory / I/O story |
|---|---|---|
| Hash join | O(r + s) average after building the table | Great if build side fits in memory; spills require partitioning |
| Merge join with pre-sorted inputs | O(r + s) | Excellent when sorted order already exists |
| Merge join with sorting required | O(r log r + s log s + r + s) | External merge sort may dominate if inputs are large |
In real systems, the I/O story often matters more than the CPU formula.
When to choose which
| Situation | Better choice | Reason |
|---|---|---|
| Equality join, one side fits in memory, no useful ordering | Hash join | Cheap build-and-probe path |
| Both sides already sorted by join key | Merge join | Reuse existing order and scan once |
| Ordered output helps later operators | Merge join | Sorted pipeline stays useful downstream |
| Inputs are huge and sorting would be expensive | Hash join if memory permits | Avoid paying sort cost |
Key takeaways
- Join strategy is an algorithm choice, not just a database implementation detail.
- Hash join turns the join key into a hash-table lookup problem.
- Merge join turns the join key into a two-pointer ordered-scan problem.
- If sorted order already exists, merge join can be extremely attractive.
- If no order exists and an equality join dominates, hash join is usually the first candidate.
Practice problems
| Problem | Difficulty | Key idea |
|---|---|---|
| Hash join overview | Hard | Build side plus probe side |
| Sort-merge join overview | Hard | Ordered two-pointer join |
| CMU 15-445 execution notes | Hard | Physical operator trade-offs inside query engines |
Relation to other topics
- Hash Map is the core data structure behind hash join.
- Merge Sort and External Merge Sort explain where merge join gets its ordered inputs.
- B+ Tree indexes often provide the ordered access pattern that makes merge-style processing attractive.
Build on these first
These topics supply the mental model or underlying structure that this page assumes.
Merge Sort
Divide the array in half, sort each half, merge them back. Merge sort is the gold standard for stable, predictable O(n log n) sorting.
Hash Map
O(1) average-case lookups via hashing. Understand collision resolution, load factors, and when hash maps are the right (or wrong) choice.
Related directions
These topics live nearby conceptually, even if they are not direct prerequisites.
B+ Tree
Keep routing keys in internal nodes and all records in linked leaves so point lookups stay shallow and range scans stay fast.
External Merge Sort
Sort data larger than RAM by generating sorted runs and merging them with mostly sequential disk I/O.
More from Systems & storage
Stay in the same family when you want parallel variations of the same mental model.
B-Tree
Store many keys per node so ordered search stays shallow and expensive page reads stay low.
B+ Tree
Keep routing keys in internal nodes and all records in linked leaves so point lookups stay shallow and range scans stay fast.
Cache Eviction Strategies (LRU, LFU, ARC, TinyLFU)
Treat cache eviction as an algorithm choice: recency, frequency, adaptation, and admission all optimize different workloads.
Consistent Hashing
Map keys onto a ring so cluster membership changes move only nearby keys instead of reshuffling the whole keyspace.
Paths that include this topic
Follow one of these sequences if you want a guided next step instead of open-ended browsing.
Database operators
See how external sorting and join strategies turn in-memory algorithms into database execution plans.
From the blog
Pair the atlas with recent writing from the rest of the site when you want a broader engineering perspective alongside the topic graph.