Skip to main content

Cache Hierarchy

CPUs can execute instructions billions of times per second. Main memory (DRAM) takes ~200 CPU cycles to respond to a request. Without caches, your CPU would sit idle 99% of the time, waiting for data.

The solution is a memory hierarchy: a series of progressively larger, slower, and cheaper storage levels, with the fastest layers sitting closest to the CPU.

Interactive Visualizer

Click any tier to see its specs. Enter any address and press Access Memory to simulate which cache level serves the request.

Cache Hierarchy Visualizer
Click any level for details. Use the simulator below to trace a memory access through the hierarchy.
Registers
~256 B – 1 KB · < 1 ns · 1 cycle
L1 Cache
32 – 64 KB · 1 – 4 ns · 4 cycles
L2 Cache
256 KB – 1 MB · 5 – 12 ns · 12 cycles
L3 Cache
4 – 32 MB · 20 – 50 ns · 40 cycles
Main Memory (DRAM)
8 – 64 GB · 60 – 100 ns · 200 cycles
NVMe SSD
256 GB – 4 TB · 50 – 200 μs · 100,000+ cycles
Simulate a Memory Access

Why Multiple Cache Levels?

There is a hard physical tradeoff:

  • Faster memory → smaller capacity, higher cost, more power
  • Larger capacity → slower access time, cheaper per bit
LevelTypical SizeLatencyNotes
Registers~1 KB< 1 nsNamed; not addressable
L1 Cache32–64 KB1–4 cyclesSplit I/D; per-core
L2 Cache256 KB – 1 MB10–15 cyclesUnified; per-core
L3 Cache4–32 MB30–50 cyclesShared across all cores
DRAM8–64 GB200+ cyclesOff-chip
NVMe SSD256 GB – 4 TB~100,000 cyclesPersistent

Core Concepts

Cache Hit vs. Miss

  • Cache hit: the requested data is found in the cache → served at cache speed
  • Cache miss: data not found → fetch from the next level down (fill the cache line, then retry)

Average Memory Access Time (AMAT)

$$AMAT = HitTime + MissRate \times MissPenalty$$

With L1 hit time = 4 cycles, miss rate = 5%, and L2 access time = 12 cycles:

AMAT = 4 + 0.05 × 12 = 4.6 cycles

Miss the L2 and go to DRAM (200 cycles) and the penalty jumps dramatically.

Cache Lines

Caches don't store individual bytes — they store cache lines (typically 64 bytes). When a miss occurs, the entire 64-byte block is fetched and placed into the cache. This exploits spatial locality.

Locality

Caches work because programs exhibit locality:

TypeDefinitionExample
Temporal localityRecently accessed data will likely be accessed again soonA loop counter accessed every iteration
Spatial localityData near a recently accessed address will likely be needed soonIterating through an array

Cache Organization

A cache is divided into sets. Each set holds N ways (slots):

  • Direct-mapped (1-way): each address maps to exactly one cache slot. Fast to look up, but suffers conflict misses.
  • N-way set-associative: each address can go into any of N slots in its set. Better hit rate at the cost of more comparison hardware.
  • Fully associative: any block can go anywhere. Used in TLBs; impractical for large caches.

Replacement Policies

When a cache set is full and a new block must be loaded:

  • LRU (Least Recently Used) — evict the block accessed least recently. Best approximation to optimal.
  • FIFO — evict the oldest block. Simpler.
  • Random — surprisingly competitive with LRU and very cheap to implement. Used in some ARM designs.

Performance Impact

Writing Cache-Friendly Code
// BAD: column-major traversal — 1 cache miss per access
for (int j = 0; j < N; j++)
for (int i = 0; i < N; i++)
sum += A[i][j];

// GOOD: row-major traversal — 1 cache miss per 8 accesses (64B line / 8B double)
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
sum += A[i][j];

For a 1024×1024 double matrix, the row-major loop runs ~5× faster on a modern CPU due to L1 hit rate alone.

Further Reading

  • Ulrich Drepper — What Every Programmer Should Know About Memory (free PDF)
  • Patterson & Hennessy — Computer Organization and Design, Ch. 5