Cache Hierarchy
CPUs can execute instructions billions of times per second. Main memory (DRAM) takes ~200 CPU cycles to respond to a request. Without caches, your CPU would sit idle 99% of the time, waiting for data.
The solution is a memory hierarchy: a series of progressively larger, slower, and cheaper storage levels, with the fastest layers sitting closest to the CPU.
Interactive Visualizer
Click any tier to see its specs. Enter any address and press Access Memory to simulate which cache level serves the request.
Cache Hierarchy Visualizer
Click any level for details. Use the simulator below to trace a memory access through the hierarchy.Registers
~256 B – 1 KB · < 1 ns · 1 cycleL1 Cache
32 – 64 KB · 1 – 4 ns · 4 cyclesL2 Cache
256 KB – 1 MB · 5 – 12 ns · 12 cyclesL3 Cache
4 – 32 MB · 20 – 50 ns · 40 cyclesMain Memory (DRAM)
8 – 64 GB · 60 – 100 ns · 200 cyclesNVMe SSD
256 GB – 4 TB · 50 – 200 μs · 100,000+ cyclesSimulate a Memory Access
Why Multiple Cache Levels?
There is a hard physical tradeoff:
- Faster memory → smaller capacity, higher cost, more power
- Larger capacity → slower access time, cheaper per bit
| Level | Typical Size | Latency | Notes |
|---|---|---|---|
| Registers | ~1 KB | < 1 ns | Named; not addressable |
| L1 Cache | 32–64 KB | 1–4 cycles | Split I/D; per-core |
| L2 Cache | 256 KB – 1 MB | 10–15 cycles | Unified; per-core |
| L3 Cache | 4–32 MB | 30–50 cycles | Shared across all cores |
| DRAM | 8–64 GB | 200+ cycles | Off-chip |
| NVMe SSD | 256 GB – 4 TB | ~100,000 cycles | Persistent |
Core Concepts
Cache Hit vs. Miss
- Cache hit: the requested data is found in the cache → served at cache speed
- Cache miss: data not found → fetch from the next level down (fill the cache line, then retry)
Average Memory Access Time (AMAT)
$$AMAT = HitTime + MissRate \times MissPenalty$$
With L1 hit time = 4 cycles, miss rate = 5%, and L2 access time = 12 cycles:
AMAT = 4 + 0.05 × 12 = 4.6 cycles
Miss the L2 and go to DRAM (200 cycles) and the penalty jumps dramatically.
Cache Lines
Caches don't store individual bytes — they store cache lines (typically 64 bytes). When a miss occurs, the entire 64-byte block is fetched and placed into the cache. This exploits spatial locality.
Locality
Caches work because programs exhibit locality:
| Type | Definition | Example |
|---|---|---|
| Temporal locality | Recently accessed data will likely be accessed again soon | A loop counter accessed every iteration |
| Spatial locality | Data near a recently accessed address will likely be needed soon | Iterating through an array |
Cache Organization
A cache is divided into sets. Each set holds N ways (slots):
- Direct-mapped (1-way): each address maps to exactly one cache slot. Fast to look up, but suffers conflict misses.
- N-way set-associative: each address can go into any of N slots in its set. Better hit rate at the cost of more comparison hardware.
- Fully associative: any block can go anywhere. Used in TLBs; impractical for large caches.
Replacement Policies
When a cache set is full and a new block must be loaded:
- LRU (Least Recently Used) — evict the block accessed least recently. Best approximation to optimal.
- FIFO — evict the oldest block. Simpler.
- Random — surprisingly competitive with LRU and very cheap to implement. Used in some ARM designs.
Performance Impact
// BAD: column-major traversal — 1 cache miss per access
for (int j = 0; j < N; j++)
for (int i = 0; i < N; i++)
sum += A[i][j];
// GOOD: row-major traversal — 1 cache miss per 8 accesses (64B line / 8B double)
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
sum += A[i][j];
For a 1024×1024 double matrix, the row-major loop runs ~5× faster on a modern CPU due to L1 hit rate alone.
Further Reading
- Ulrich Drepper — What Every Programmer Should Know About Memory (free PDF)
- Patterson & Hennessy — Computer Organization and Design, Ch. 5