1-Memory

18 minute read


DRAM

Dynamic Random Access Memory: address -> data

Content-addressable memory: key -> address (Unit 3)

On the PCB, DRAM is packaged in a DIMM (Dual Inline Memory Module)

  • Usually 8 chips (9-10 if error correction)
  • Connected to the CPU/SoC over a DDR (double data rate) bus
  • DDR: data is transferred on both clock edges (effectively double the transfer rate)
    • GDDR: graphics
    • LPDDR: low-power
  • DDR5: latest standard
    • Data bus width: 64 bits (32x2)
    • Address bus: 21 bits (Longer addresses take multiple cycles; slow if too wide)
    • Command bus: 7 bits (?)
    • 64 ms refresh
  • DDR standard: JEDEC; allows different manufacturers to interoperate
  • In the SoC, the Memory Controller (MC) controls the DRAM
    • There may be more than one MC
    • Each MC can be attached to more than one DRAM module
    • Each DDR connection is called a channel

DRAM Cell

Gate (G): connected to the word line; drain (D): connected to the bit line

Write:

  1. Activate word line
  2. bit line = value

Read:

  1. Activate the word line
  2. The cell discharges onto the bit line -> analog sense amplifier (amplifies the small swing to a full value)
  3. Take the output and restore the value into the cell (reads are destructive)

Subarray

aka “mats”; kept roughly square to normalize latency

  • Bit line connected to sense amplifiers
  • Address -> decoder -> word line -> cell -> subarray -> mat -> sub-bank (~4) -> bank -> chip -> rank -> channel -> DIMM

Bank groups (BG0-7) on a DRAM die

Page buffer in one bank: the sense amplifiers that hold one row of data (the row buffer)

  • Each row can be as wide as 8K
  • Refreshing can happen concurrently (other banks can be refreshed while one is read)
  • Any command from the processor is broadcast to all dies

E. DRAM bus runs at 2400 MHz. What is the peak bandwidth between the memory and the SoC?

  • 2400 MHz * 2 (DDR) * 8 bytes = 38.4 GB/s (maximum)

  • Don’t forget the x2 for double data rate
  • 64-bit bus: 8 bytes per transfer (see the sketch below)
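
A minimal sketch of the arithmetic, assuming 2400 MHz is the bus clock (so data moves on both edges) and a 64-bit data bus:

#include <stdio.h>

int main(void) {
    double clock_mhz   = 2400.0;                 /* bus clock from the example */
    double transfers   = clock_mhz * 1e6 * 2.0;  /* DDR: both clock edges */
    double bytes_per_t = 64.0 / 8.0;             /* 64-bit data bus -> 8 B per transfer */
    double peak_gbs    = transfers * bytes_per_t / 1e9;
    printf("peak bandwidth ~ %.1f GB/s\n", peak_gbs);  /* ~38.4 GB/s */
    return 0;
}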

Commands

REFRESH

  • \(t_{REFI}\): maximum supported refresh interval (~4000 ns); a REFRESH must be issued at least this often
    • Limits the peak BW
  • \(t_{REFC}\): blocking time for a refresh (~200 ns)
    • E. \(1 - 200/4000 \approx 95\%\) of the BW remains available

READ

  • ACT (activate row): transfers the selected row into the row buffer at the bottom of the bank
    • Energizes a particular word line for subsequent accesses
    • \(t_{RCD}\) (row-to-column delay) from ACT to CAS
  • CAS (column select)
    • \(t_{CL}\) (column-read / CAS latency) until the data is read out
    • Data is then burst out (burst length of 16)
  • PRE (precharge): so the next ACT can happen
    • Precharges the bit lines to \(\frac{V_{DD}}{2}\) to drive the sense amplifier
    • \(t_{RAS}\) (row-access strobe delay): from one ACT to the next ACT of the same bank
    • If reading a different row in the same bank, \(t_{RP}\) (row-precharge delay) between PRE and the next ACT
      • Not necessary if reading the same row

Error

Soft Error: caused by particle strikes (E. \(\alpha\) particles), which may trigger bit flips

  • Want to detect and recover
  • Store data redundantly (E. 2x to detect, 3x to correct)
  • ECC (E. 2D parity)

Hard Error: E. a DIMM-SoC bus line stuck at 0 all the time

Failure model

FIT (failures in time): failures per 1 GiB of memory per 1 billion hours of operation

  • ~15 for memory
  • ~100 for GPU memory

SECDED (single-error correction, double-error detection): corrects 1-bit faults, detects 2-bit faults

  • Typically done in 64-bit chunks

Chipkill: make sure the DIMM still works even if one chip fails

Correction Scheme

TMR (Triple Modular Redundancy): give 3 copies to have a majority vote

  • Very cumbersome (3x the loads plus a majority-vote comparison; only 1/3 of the capacity is usable)
  • Can correct any number of bit faults (as long as they are confined to one copy)

Error Correction Code

  • Sometimes treated as a black box
  • Must also support errors in the check bits themselves
  • Data \(d\) (E. 64 bits) + check code (E. 16 bits)
  • Multiply \(d\) by a constant \(m\) and store \(d \cdot m\) in memory
  • When retrieving, divide by \(m\). If there is no remainder, the data is good
  • If there is a remainder \(r\), we know which bit to flip:
\[d' = d\times m \pm 2^k\]

where \(k\) is the position of the bit flip. To correct,

\[\frac{d'}m = \frac{d\times m}{m}\pm \frac{2^k}m\]

Need to find \(m\) such that \(\frac{2^k}{m}\) produces a unique remainder for every \(k\) (E. \(m = 243\))

  • Constraint on \(m\): minimize the number of check bits
  • For double-error detection: a double flip produces a remainder that matches no single-bit flip (a toy sketch of this scheme follows below)
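
A toy sketch of this multiply-by-\(m\) scheme, shrunk to 16-bit data with the \(m = 243\) constant mentioned above so everything fits in a 32-bit word; the helper names are made up for illustration:

#include <stdint.h>
#include <stdio.h>

#define M     243u   /* constant from the notes; 2^k mod 243 is unique for these positions */
#define NBITS 24     /* 16 data bits * 243 fits in 24 bits */

/* Encode: store d * m in "memory". */
static uint32_t ecc_encode(uint16_t d) { return (uint32_t)d * M; }

/* Decode: if the word divides evenly by m, it is clean; otherwise try
 * flipping each bit position until the result divides evenly. */
static int ecc_decode(uint32_t word, uint16_t *out) {
    if (word % M == 0) { *out = (uint16_t)(word / M); return 0; }      /* clean */
    for (int k = 0; k < NBITS; k++) {
        uint32_t cand = word ^ (1u << k);                              /* undo a flip at k */
        if (cand % M == 0) { *out = (uint16_t)(cand / M); return 1; }  /* corrected */
    }
    return -1;  /* remainder matches no single-bit flip: detected but uncorrectable */
}

int main(void) {
    uint16_t d = 0xBEEF, got = 0;
    uint32_t stored = ecc_encode(d);
    stored ^= (1u << 9);                          /* inject a single-bit fault */
    int status = ecc_decode(stored, &got);
    printf("status=%d data=0x%04X\n", status, got);  /* expect status=1, data=0xBEEF */
    return 0;
}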

Error Correction Scrubs (ECS)

Errors accumulate over time until they can no longer be corrected

  • Solution: scrub memory periodically
  • Error correction happens in the memory controller: the line must be read from and written back to the chip
  • Also a blocking operation

DDR5 only does SEC (no double-bit detection) (Q: is it required on the chip, or ???)

Random Access Strobe timing

\[t_{RAS} = t_{RCD} + t_{CL} + t_{RP}\]

E. DDR5 memory, 1 DIMM on the system, 8 GiB, no ECC, clock 3000 MHz. Estimate the latency of 2 successive (back-to-back) reads to two different rows in the same bank

  • 8 chips on the DIMM, each with 8 Gib of memory; together they output 64 bits of data
  • \[t_{RCD} + t_{CL}\cdot n + t_{RP} + t_{RCD} + t_{CL}\cdot n\]
    • Multiple \(t_{CL}\)s to read out a burst of data
    • No need for the last precharge

E2. Same conditions, but to the same row

  • \[t_{RCD} + t_{CL}\cdot n + t_{CCD} + t_{CL}\cdot n\]
  • No need to pay \(t_{RP}\) twice
  • \(t_{CCD}\): column-to-column delay (E1 and E2 are worked out in the sketch below)

E3. Two different banks

E4. Two different DIMMs
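
A small sketch that plugs illustrative numbers into E1 and E2 (the nanosecond values below are assumptions, not from the notes):

#include <stdio.h>

int main(void) {
    /* Illustrative timing parameters in ns (assumed values) */
    double tRCD = 15.0, tCL = 15.0, tRP = 15.0, tCCD = 5.0;
    int n = 1;  /* number of column reads (bursts) per row, as in the notes */

    /* E1: two reads to different rows in the same bank */
    double diff_row = tRCD + tCL * n + tRP + tRCD + tCL * n;
    /* E2: two reads to the same row: no precharge/activate in between */
    double same_row = tRCD + tCL * n + tCCD + tCL * n;

    printf("different rows, same bank: %.1f ns\n", diff_row);  /* 75.0 ns */
    printf("same row (row-buffer hit): %.1f ns\n", same_row);  /* 50.0 ns */
    return 0;
}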

Cache

Speed limited by memory

  • Even worse due to instruction fetching (each instruction fetch is itself a memory access)
  • Store these locally within the SoC in faster memory -> cache

SRAM

Bistable element (two cross-coupled inverters), one side holding the value and the other its complement

  • On die
  • Same process technology for memory as with CPU
  • Planar CMOS, trench capacitor

Hierarchy

  • CPU
  • L1I, L1D (L1I -> CPU -> L1D pipeline)
  • L2
  • L3
  • Main memory

Each layer has ~10x the capacity and ~10x the access time of the layer above

  • Temporal locality: loop
  • Spatial locality: array

Physical organization

2D array

  • Line (block) size: 64B
  • x number of cache lines

Address -> decoder -> tag array -> data array

When caching an address \(x\), the tag is \(x\) shifted right by \(\log_2\)(cache line size) bits (the index bits are also excluded from the tag)

Associativity

  • Direct mapped: each address maps to exactly one index
  • Set associative: each address maps to one set containing multiple ways; the line may be in any of them. The tags of all ways in the set are checked in parallel
  • On a hit, the block offset selects the data within the line (see the sketch below)
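
A minimal sketch of the index/tag/offset split, assuming 64 B lines and 64 sets (the L1D geometry used later in these notes):

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6   /* log2(64 B line) */
#define INDEX_BITS  6   /* log2(64 sets)   */

int main(void) {
    uint64_t addr   = 0x7ffd1234abcdULL;  /* arbitrary example address */
    uint64_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint64_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag=0x%llx index=%llu offset=%llu\n",
           (unsigned long long)tag, (unsigned long long)index,
           (unsigned long long)offset);
    return 0;
}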

Policy

Eviction policy

  • Random
  • Least recently used item (LRU)
  • Most recently used item (MRU)
  • OPT: relies on oracle knowledge of future accesses
  • Adaptive (based on ML algorithms)

Store policy

Hit

  • Write-back: only update L1, mark line dirty
  • Write-thru: updates propagate to the higher levels (L2, L3, …)
    • Simple coherence
    • Slow

Miss
  • Write allocate: bring data to cache and write
    • Useful for calloc()
    • Good temporal locality
    • Works well with writeback
  • Write around: skip the level that missed; update the level where the data hits
    • Useful if there is no further reuse (E. logging, streaming media, free())

Between-level policy on a miss (level specific):

  • Inclusive cache: \(level_x \subseteq level_{x+1}\). Lines are stored in all intermediate levels
  • Exclusive cache: a cache line exists uniquely in one level.
    • aka victim caching (evicted lines move to the next level)
  • NINE: neither inclusive nor exclusive; no commitment, in-between

MSHR

Performance optimizations for loads

LD x
LD x+4
LD x+8
LD x+12

Suppose x misses in L1D. Cache Controller issues to L2

  • While L2 is accessing x, processor looks for x+4
    • It would be wasteful to issue another miss for x+4 (it is in the same cache line)
    • The processor folds subsequent misses into the same pending cache line

This is handled by the MSHR (Miss Status Handling Registers), which tracks all pending misses

  • A table of missed addresses, each recording the PCs of all subsequent loads to that line
  • One table for each level of cache

The HW table has a limited number of address entries and a limited number of PCs per address.

  • If there is no space, the CPU stalls (no folding available)

Write Buffer

Performance optimization for store misses

st x 
st x+4
...

Buffer up the stores in a write buffer (a mini cache holding addresses and one line of data each)

  • Once a line is filled, it is evicted and sent to the next level. If you load an address that is still in the buffer, the data is forwarded directly from the buffer

Example

Q: How many lookups happen in L2, and how many writes go to the DRAM?

  • L1D MSHR: 4 entries miss[3:0], each can handle 16 misses
  • Suppose we load 32 KiB of ints, with a 64 B cache line size

  • The first miss to a line is recorded in an MSHR entry
  • Subsequent misses on the same line are buffered (folded)
  • For the L2 layer, only 32 KiB / (16 * 4 B) = 512 cache line lookups are needed
    • Without MSHR, 64x more
    • 16 * 4: 16 folded misses * 4 bytes per int = 64 B
      • Exactly the L1D line size, since 16 ints (the most common 4 B type) fill one line
    • Every 16 misses collapse into one L2 lookup (counted in the sketch below)
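
A quick sketch of the counting above, reading “32K of ints” as 32 KiB of data (an assumption):

#include <stdio.h>

int main(void) {
    int data_bytes     = 32 * 1024;              /* 32 KiB of ints (assumed) */
    int line_size      = 64;                     /* L1D cache line size */
    int int_size       = 4;
    int folds_per_line = line_size / int_size;   /* 16 misses fold into one MSHR entry */
    int l2_lookups     = data_bytes / line_size; /* one L2 lookup per line: 512 */
    printf("misses folded per line: %d\n", folds_per_line);
    printf("L2 cache-line lookups:  %d\n", l2_lookups);
    return 0;
}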

L1D store buffer: can combine consecutive writes to 1 cache line

Suppose we store 16 KiB of ints, with a 64 B cache line size

  • If a store misses in L1D, it is sent to L2, but not individually
    • The write buffer collects consecutive writes into a single cache line, E. st c[15:0] for ints
    • When the line is filled, one request is sent to L2 to write it
  • Without a store buffer: 4096 L2 writes of 4 bytes each
  • With a store buffer: 4096 / 16 = 256 L2 writes of 64 bytes each
    • Same data, but less metadata (location, write size, etc.)
  • In L2 (still a write-allocate system) these are all misses -> requests go to DRAM
    • No additional merging possible. All misses are cache line size
  • If you alternate writes, st c[i]; st d[i], the store buffer becomes useless because the two streams conflict (see the sketch below)
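
A companion sketch for the store side, again reading “16K ints” as 16 KiB of data (an assumption):

#include <stdio.h>

int main(void) {
    int data_bytes = 16 * 1024;              /* 16 KiB of ints (assumed) */
    int int_size   = 4;
    int line_size  = 64;
    int stores     = data_bytes / int_size;  /* 4096 individual 4 B stores */
    int l2_writes  = data_bytes / line_size; /* 256 full-line writes after coalescing */
    printf("without store buffer: %d writes of %d B\n", stores, int_size);
    printf("with store buffer:    %d writes of %d B\n", l2_writes, line_size);
    return 0;
}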

DRAM traffic

  • The Memory Controller (MC) receives all miss traffic from the SoC and issues requests (PRE, ACT, column R/W) to the DRAM

    Assume

  • 16 GiB DIMM
  • 8 bank groups (each DRAM chip is x8, i.e. it sends 8 bits of data)
    • Each BG has a row buffer
    • 8 open rows per DRAM
    • 64 open rows per DIMM
  • 4 sub-banks per bank group
  • 2 subchannels per DIMM
  • Miss traffic trace
  • Assume no code misses. Data only
  • Reason from L2 output
read A[0:15]
read B[0:15]   
// still stuck here, but processor will keep requesting
// non-blocking write
read A[16:31]
read B[16:31]
write C[0:15]

Q: How should the PA -> DRAM address mapping be set up to improve parallelism?

  • Make sure A, B, and C map to different banks so they can be accessed concurrently
    • If they share a bank, they block each other and rows must repeatedly be closed and reopened
    • E. BG0 holds a slice of A, BG1 a slice of B
  • When the MC converts cache-line requests into DRAM commands, it can process them out of order
    • Reads are more critical: schedule them before the writes
    • Priority: instruction fetches > reads > writes
    • aka memory scheduling (a toy address mapping is sketched below)
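
A toy PA -> DRAM mapping that illustrates the idea; the bit positions are made-up assumptions, chosen so nearby cache lines land in different bank groups:

#include <stdint.h>
#include <stdio.h>

/* Toy PA -> DRAM mapping. Consecutive 64 B lines alternate subchannels, and
 * lines 128 B apart rotate across bank groups, so streams like A, B and C
 * can proceed in parallel instead of fighting over one row buffer. */
typedef struct { unsigned channel, bank_group, bank, row, column; } dram_addr_t;

static dram_addr_t map_pa(uint64_t pa) {
    dram_addr_t d;
    d.channel    = (pa >> 6)  & 0x1;   /* 2 subchannels          */
    d.bank_group = (pa >> 7)  & 0x7;   /* 8 bank groups          */
    d.bank       = (pa >> 10) & 0x3;   /* 4 banks per group      */
    d.column     = (pa >> 12) & 0x3FF; /* column within the row  */
    d.row        =  pa >> 22;          /* remaining bits -> row  */
    return d;
}

int main(void) {
    for (uint64_t pa = 0; pa < 4 * 128; pa += 128) {  /* step through addresses 128 B apart */
        dram_addr_t d = map_pa(pa);
        printf("PA 0x%04llx -> ch %u BG %u bank %u\n",
               (unsigned long long)pa, d.channel, d.bank_group, d.bank);
    }
    return 0;
}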

Cache causality

Reasons for misses:

  • Cold (compulsory): first reference
  • Capacity: too small (even if fully associative)
  • Conflict: too many addresses map to the same set
  1. Assume cache is infinitely large and fully associative
    • Any miss will be cold -> label them
    • Solution: pre-load all data into cache (prefetch)
      • Hardware prefetcher (ML-based)
    • Also increase block size
  2. Finite capacity, FA
    • Misses: cold + capacity
    • Solution: increase capacity
  3. Finite capacity, SA/DM
    • Will see all three kinds of misses
    • Subtract the previous two counts -> conflict misses
    • Solution: increase associativity
    • change cache indexing (hashing function)
    • change replacement policy
  • More capacity -> more latency
  • Larger block size -> more latency (to a lesser extent)
  • Higher associativity -> more latency

Fully associative structure

CAM: content-addressable memory

  • (tag, data) pairs
  • Search the tag in parallel in every row. If match, data out.
    • Typically 1 clock cycle
    • Number of items searched in 1 cycle: search scope of a scan
    • XNOR compares every bit of the tag -> the per-bit results are ANDed together
    • The key bits are broadcast to every cell -> latency increases very rapidly with size

PORT

How many reads/writes can be done to a given memory at a time

  • E. 1R1W; 1RW; 2R1W
  • 1RW: standard SRAM array. Only one address (word line) can be energized at a time
  • 2R: an additional word line and bit lines attached to every cell, plus two sets of sense amplifiers. More transistors controlling the read -> more latency

Example:

  • RF (register file): 2nR, nW (n is the number of concurrent instructions in flight)
  • L1: 2R, 1W
  • L2: 1RW

HW1

  • clflush(ra): evicts the cache line containing ra. If the cache line is 64 bytes, it only needs to be issued once per 64 B chunk (every ra with ra % 64 == 0); a timing sketch using these tools follows this list
  • rdtsc: timer, issued before and after
    • Need to serialize the instructions
    • Use rdtscp
    • Or insert cpuid just before the rdtsc
    • lfence
  • mfence: memory fence; orders all of the core’s prior loads/stores before any later ones

Noise reduction techniques:
  • Pin process to specific cores: taskset
  • Keep certain pages pinned in DRAM: mlock
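
A rough sketch of the measurement loop these tools are used in (GCC/Clang on x86 with the <x86intrin.h> intrinsics; exact fencing choices vary):

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp, _mm_clflush, _mm_lfence, _mm_mfence */

static volatile int target;   /* the memory location we time */

int main(void) {
    unsigned aux;
    _mm_clflush((const void *)&target);   /* evict the line so the load misses */
    _mm_mfence();                         /* make sure the flush has completed */

    uint64_t start = __rdtsc();
    _mm_lfence();                         /* serialize: don't start the load early */
    int v = target;                       /* the timed access */
    uint64_t end = __rdtscp(&aux);        /* rdtscp waits for prior loads to finish */
    _mm_lfence();

    printf("access took ~%llu cycles (value %d)\n",
           (unsigned long long)(end - start), v);
    return 0;
}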

Virtual Memory

A very large linear array; an abstraction that allows portable code

  • 0 to \(2^N-1\).
  • The next location is previous location + 1
  • Not all physically addressable

Caching

How to provide such a VM illusion?

Physical memory = populated DRAM (~GB) + swap space (E. SSD, HDD, tape, ~TB)

Solution: cache VM chunks in physical memory. This raises a few questions:

  • Block size? page size (OS page size, not to be confused with DRAM row buffer size)
    • In modern OS, 4 kB, 1 MB, 2 MB, 1 GB
  • Associativity? fully associative
    • When malloc() is called, a chunk is allocated and mapped to DRAM pages
      • Allocates anonymous data, not backed by the hard drive
      • After enough malloc()s, DRAM fills up and pages are moved to swap space
      • If the swap space also fills up: Out Of Memory (OOM). The illusion can no longer be maintained
    • File-backed data resides on non-volatile storage
      • Placed in the DRAM page cache when read
    • If data misses in DRAM, the fetch from swap space is very expensive; full associativity removes conflict misses
      • The search is done in software (by the OS)
    • Can’t do FA for the lower-level (hardware) caches, since hardware search algorithms can’t execute in nanoseconds at that scale
  • Eviction policy?
  • Store hit policy? exclusive between swap space and DRAM
    • Except page cache in DRAM is inclusive (caches hard drive files)
      • Flush command pushes page cache to hard drive
    • For anonymous data, DRAM is the final destination
  • Store miss policy? write allocated to DRAM

This is called a software-managed cache

  • Replacement in SW (higher latency)
  • Fully associative

Page Table

Data and metadata (analogous to a cache tag)

  • Page address
  • Permission (rwx)
  • WC (write combining: can multiple consecutive writes be combined?)
  • UC (uncacheable)
  • User/kernel

Need to know which pages are present in DRAM + swap

Data structure: the page table. It needs to:

  • Locate items quickly (E. an n-ary tree)
  • Allocate/deallocate pages
  • Be small in size

Page walk

aka search through the tree for address translation

  • Implemented in HW or SW

    E. if the tree has 5 levels, 5 memory reads are needed to translate one address (a simulated walk is sketched below).
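
A simulated sketch of such a walk, using a 4-level tree with 9 index bits per level and a 12-bit page offset (an x86-64-style split); the in-memory layout here is a toy, not a real MMU format:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LEVELS 4
#define FANOUT 512           /* 9 index bits per level */

typedef struct node { struct node *child[FANOUT]; uint64_t ppn; } node_t;

/* One memory read per level: this is why a deeper tree means more reads. */
static int walk(node_t *root, uint64_t va, uint64_t *pa) {
    node_t *n = root;
    for (int level = LEVELS - 1; level >= 0; level--) {
        unsigned idx = (va >> (12 + 9 * level)) & (FANOUT - 1);
        if (level == 0) {                       /* leaf holds the PPN */
            if (!n->child[idx]) return -1;      /* page fault */
            *pa = (n->child[idx]->ppn << 12) | (va & 0xFFF);
            return 0;
        }
        n = n->child[idx];
        if (!n) return -1;                      /* page fault */
    }
    return -1;
}

int main(void) {
    /* Build a one-path toy tree mapping the page containing VA 0x12345678. */
    node_t *root = calloc(1, sizeof *root), *n = root;
    uint64_t va = 0x12345678;
    for (int level = LEVELS - 1; level >= 0; level--) {
        unsigned idx = (va >> (12 + 9 * level)) & (FANOUT - 1);
        n->child[idx] = calloc(1, sizeof *n->child[idx]);
        n = n->child[idx];
    }
    n->ppn = 0xABCDE;

    uint64_t pa;
    if (walk(root, va, &pa) == 0)
        printf("VA 0x%llx -> PA 0x%llx\n",
               (unsigned long long)va, (unsigned long long)pa);
    return 0;
}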

TLB

Cache page walks -> Translation Lookaside Buffer

  • Special cache for VA -> PA pairs
  • Same level as L1
  • One page walk done, cached to TLB

VM cannot be directly mapped to cache, since different programs will have different VA. Need to first translate to PA

  • Can map VAs directly only if we are sure the PA is not referenced through any other addresses

Program -> VA -> PA (page walk; cached in the TLB, otherwise expensive) -> cache
  • TLB latency + cache latency
  • 50% overhead :(

VIPT

Optimization: given VA, lookup in TLB and cache concurrently

VIPT caches (Virtually Indexed Physically Tagged)

  • Virtual address to compute index
  • Physical address to check tag
  • The VA -> PA translation must preserve the index & offset bits (they come from the page offset)
  • index + offset bits \(\le \log_2\)(page size)

Translation overhead

  • PA only: embedded system
    • If you know exactly how many processes the system has.
    • No dynamism
  • Page table
    • High radix structure (tree) to optimize the search
    • 4-5x overhead!
  • Cache VA -> PA translation in TLB
    • Virtual page number -> physical page number
    • Only pay cost once
    • TLB is fully associative, since misses are very expensive
  • VIPT: hide TLB lookup latency under cache latency
    • Address fields: virtual page number | cache index | cache offset

Optimizations

Cache size is limited by the page offset bits, which is suboptimal. Solution: make the cache set associative, so each index can store multiple tags.

  • Allows larger total cache size for same index bits
  • If TLB miss, raise an exception (TLB fault), handled by OS or hardware microcode
    • Page walk, and fill in TLB
    • Build HW caches for each page walk level (again)
      • Known as MMU caches
  • If no translation exists (first-time access): page fault

Example

Virtual -> physical -> cache -> DRAM

int *a = malloc(4096 * sizeof(int));  // dynamically allocated
int *b = malloc(4096 * sizeof(int));  // same for b and c
int *c = malloc(4096 * sizeof(int));
for (int i = 0; i < 4096; i++)
    c[i] = a[i] + b[i];

Suppose a 64-bit VA space and a 40-bit PA space -> \(2^{40}\) bytes of max DRAM capacity

  • OS page size: 4 KiB

Max page

  • Max working set size (data only) = 3 * 4096 * 4 B = 48 KiB -> 12 page table entries
  • malloc() will align allocations to page boundaries
    • .align assembler directive

Physical Config

Suppose:

  • L1D: 64 KiB, 64 B lines x 64 sets x 16 ways
  • L2D: 256 KiB, 64 B lines x 16k
  • L3: 8 MiB, 64 B lines
  • Writeback for hit, write around for miss
  • 4 KiB page size
  • VIPT

Q: What is the size of the L1D tag array, assuming a VIPT cache?

  • Page size = 64 B x 64 sets = 4 KiB
  • Tag = 40 - 6 - 6 = 28 bits per line (using the physical address size)
  • Total: 28 bits * (64 sets * 16 ways) = 28 Kib = 3.5 KiB

  • Assume a FA, 16-entry TLB
    • TLB is 1 valid bit + VPN tag + PPN tag
    • Size of one entry = 1 + (64-12) + (40-12) = 81 bits
  • If the TLB also includes the bits used for MRU replacement,
    • 4 bits of index for MRU entry, 4 more bits
  • If include LRU replacement,
    • 4 bits for each MRU entry, 16 more bits
    • OR a matrix with pairwise comparison bits between entries: 16 * 15 / 2 = 120 bits
      • More data, but simpler comparison algorithm
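
A quick sketch of the sizing arithmetic, using the configuration stated above (64-bit VA, 40-bit PA, 4 KiB pages, 64-set x 16-way x 64 B L1D):

#include <stdio.h>

int main(void) {
    int va_bits = 64, pa_bits = 40, page_bits = 12;      /* 4 KiB pages */
    int offset_bits = 6, index_bits = 6, ways = 16;      /* 64 B lines, 64 sets */

    int tag_bits = pa_bits - index_bits - offset_bits;   /* 40 - 6 - 6 = 28 */
    int lines    = (1 << index_bits) * ways;             /* 64 * 16 = 1024 lines */
    printf("L1D tag array: %d bits/line * %d lines = %d bits (%.1f KiB)\n",
           tag_bits, lines, tag_bits * lines, tag_bits * lines / 8.0 / 1024.0);

    int tlb_entry = 1 + (va_bits - page_bits) + (pa_bits - page_bits);  /* valid + VPN + PPN */
    printf("TLB entry: %d bits\n", tlb_entry);            /* 1 + 52 + 28 = 81 */
    return 0;
}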

Code Optimization

To improve locality.

E. A 2D (byte) matrix is laid out linearly in memory

  • Load the base address
  • Add the offset (0)
  • Store this address somewhere else in memory

With a cache (E. a 4-entry FA cache, line size 4 B, writeback on store hit, write allocate on store miss)

Suppose we transpose a matrix: row accesses will hit, but column accesses will miss

  • Memory is in row-major order, so the column accesses will miss!
  • Change how the arrays are accessed to increase locality
  • Fetch a square block of data into the cache -> cache blocking (see the sketch below)
    • Set up the loop indices so that the amount of data being accessed fits in the cache
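
A minimal sketch of cache blocking applied to a transpose; the 1024x1024 size and 32-element tile are illustrative choices, not from the notes:

#include <stdlib.h>
#include <stdio.h>

#define N 1024   /* matrix dimension (assumed) */
#define B 32     /* block edge, chosen so a BxB tile of src and dst fits in cache */

/* Transpose src into dst one BxB tile at a time, so both the rows of src and
 * the columns of dst being touched stay resident in the cache. */
static void transpose_blocked(const unsigned char *src, unsigned char *dst) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    dst[j * N + i] = src[i * N + j];
}

int main(void) {
    unsigned char *src = malloc((size_t)N * N), *dst = malloc((size_t)N * N);
    for (size_t k = 0; k < (size_t)N * N; k++) src[k] = (unsigned char)k;
    transpose_blocked(src, dst);
    printf("dst[1] = %d (should equal src[N] = %d)\n", dst[1], src[N]);
    free(src); free(dst);
    return 0;
}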