4321 SRAM


Published: December 31, 2025

Authors: Ming Gong, Charlotte Chen

Now you are at the next stage of 4321: the mighty SRAM

Many parts of this PS can be done in parallel. Get your teammate to work!

Floorplan

Throughout this article, an array written as \(x\times y\) always stands for row \(\times\) column

There are a lot of ways to floorplan the SRAM. You don’t have to, but it would be really nice if all the peripherals match the width of the SRAM array.

Because the SRAM cell is extremely dense, we use column multiplexing, sharing one set of R/W circuitry for each two adjacent columns

Logically, the array is 8x8 (8 wordlines, 3-bit addresses; 8 bits for iobus).
Physically, we lay it out as 4x16 (16 bitline pairs), where each pair of adjacent physical columns maps to one logical column.
Essentially, we fold half of the rows into the columns: one wordline address bit becomes the column MUX select.
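
A minimal sketch of this folding. The article does not say which address bit becomes the column MUX select, so the sketch assumes the least significant one, and it assumes physical columns 2b and 2b+1 belong to logical bit b. The function name `logical_to_physical` is hypothetical.

```python
def logical_to_physical(addr: int, bit: int) -> tuple[int, int]:
    """Map a logical (wordline address, iobus bit) to a physical (row, column).

    Assumption: addr[0] (the LSB) is the column MUX select and addr[2:1]
    picks one of the 4 physical wordlines. The real design may use a
    different address bit.
    """
    if not (0 <= addr < 8 and 0 <= bit < 8):
        raise ValueError("addr and bit must be in 0..7")
    col_sel = addr & 1          # one address bit becomes the MUX select
    row = addr >> 1             # the remaining 2 bits decode 4 wordlines
    col = 2 * bit + col_sel     # adjacent physical columns share one logical bit
    return row, col


# Every logical location must land on its own physical cell.
cells = {logical_to_physical(a, b) for a in range(8) for b in range(8)}
print(len(cells), "distinct physical cells")
```

Whichever address bit you pick, the check at the bottom should still report 64 distinct cells in a 4x16 grid.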

SRAM Array

8x8 Layout

The following demo array was given in Fall 2025. Take a moment to appreciate this fabulous SRAM.

Press Ctrl-F to view the blocks:

Above is a piece of v1d1_demo_array_4x4 (8x4), which contains the following:

v1d1_x1
v1d1_wells_strap_x2 (1x2) (1x4)
- v1d1_wells_strap_x1 (power strap)
v1d1_row_edge_x2 (2x1) (4x1) (I think they are decap)
- v1d1_row_edge_x1 (x4)
v1d1_corner_edge_x1
M2 and M4 pins
There is an extra M3 layer at the left that will cause DRC errors. Delete it.

A level down the hierarchy:

Inner Cells

Now press Shift-F to show the layers. Try hiding NW and M4, analyze layer by layer, and appreciate such a fantastic design.

Below is its intensely annotated stick diagram. I highlighted the cell at wordline<2>, bit<0>, clearly showing the coupled inverters and access transistors.

Try to make your own layout just as cute!

Don’t forget to check DRC and LVS of the SRAM cell. If there are nontrivial errors, tell Shepard to fix immediately!

4x16 Layout

Enough appreciation—it’s time to build.

The provided 8×4 array is constructed from 4×4 blocks. Your task is to reorganize this into 4×16. As long as you understand what’s happening, this is very manageable.

You can assemble a 4x4 cell with 1x1 cells, and then piece 4 of them for 4x16.

Make a 4x4 schematic and symbol with four v1d1_x1 symbols
- 4 wordlines, bitline <3:0>
Generate a 4x4 layout.
- Move entire rows/columns to save effort
- Make sure they are perfectly aligned. I like to check alignment at the vias, as their sizes are fixed
  - Sanity check: Does your 4x4 dimension match with the sample 4x4 layout?
- Add instances of the top and bottom wells_strap_x2 to the layout. They are not active components, so they are not part of the schematic.
Create a 4x16 schematic and symbol with four of your 4x4
Generate a 4x16 layout
- Add row_edge and corner_edge instances to the layout
- Again, make sure everything’s perfectly aligned
Check DRC and LVS
Add the pins.
- The width of the M2 wires varies, but there is a pretty clear “center line”
- Measure the distance between the center lines of bit<0> and bit<2>. You should get 2.1 um. This is the bit pitch I’ve been talking about
- Measure the distance between the M2 wires to the left of bit<0> and to the right of bit_bar<15> (GND). You should get 16.8 um. If not, something is misaligned.
- Add M2 pins for power and bitlines. They don’t have to be perfectly square.
- Add M3 pins for wordlines and power
- Add M4 pins for power
- You don’t have to be perfect for now. We will refine later
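
The two measurements in the pin checklist are tied together by simple arithmetic, sketched here in integer nanometers to avoid floating-point noise. The sketch assumes that the 2.1 um spacing between bit<0> and bit<2> spans exactly two physical columns. The variable names are illustrative.

```python
BIT_PITCH_NM = 2100        # center line of bit<0> to center line of bit<2>
PHYSICAL_COLUMNS = 16      # bitline pairs in the 4x16 layout
COLUMNS_PER_BIT = 2        # assumption: bit<0> and bit<2> are two columns apart

# Pitch of a single physical column, then the full array width.
column_pitch_nm = BIT_PITCH_NM // COLUMNS_PER_BIT
array_width_nm = column_pitch_nm * PHYSICAL_COLUMNS

print(column_pitch_nm / 1000, "um per physical column")
print(array_width_nm / 1000, "um across all 16 columns")
```

If your measured width differs from this product, look for a column that is shifted by a fraction of the pitch.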

Decoder

Shepard has probably shown off multiple decoder designs in lecture and practice exams. However, for this project, most people end up converting to a static CMOS decoder.

With column MUXing, only two address bits need decoding, into four wordlines.
Wordlines must be qualified with phi_1.
- Address change happens when phi_2 is high
- Evaluation happens when phi_1 is high.
- Multiple (even transiently) active wordlines will short the cells, leading to catastrophic data corruption.
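
A behavioral sketch of the qualification rule, not the gate-level circuit: a wordline goes high only when phi_1 is high and the 2-bit row address selects it. The function name `wordlines` and the 4-row default are assumptions for illustration.

```python
def wordlines(addr: int, phi_1: bool, rows: int = 4) -> list[int]:
    """Wordline i is high only when phi_1 is high and addr == i."""
    if not 0 <= addr < rows:
        raise ValueError("address out of range")
    return [int(phi_1 and addr == i) for i in range(rows)]


for addr in range(4):
    # Exactly one wordline fires during evaluation.
    assert sum(wordlines(addr, phi_1=True)) == 1
    # No wordline fires while the address is allowed to change.
    assert sum(wordlines(addr, phi_1=False)) == 0

print(wordlines(2, True))
```

The same two assertions make a useful checklist when probing wordlines in simulation: one-hot while phi_1 is high, all low otherwise.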

We chose a 4-in-1 NAND-NOT layout that fits neatly within four SRAM rows and scales naturally with predecoding.

The drawback? It’s kind of huge in width. It would be better for this project if we could fit the decoders into a squarer shape.

The idea is simple:

Spam M3 horizontal wires for all signals and power (Yes they fit)
Use M2 to fetch the signals vertically from M3
Use our M2-VIA1-M1-CO-PO stack to control the gates
Route the outputs with M1-VIA1-M2-VIA2-M3

Implementing it is tricky, but once you have one block, the rest is simple.

Read Write

Below is the 2-bit (4 bitline pairs) R/W schematic, closely following lecture

Column MUX selects between adjacent columns.
Write Select pulls down the bitlines.
- We merged the write and data NMOSes to save a stack height
Read Driver taps from bit_bar, amplifies with a skewed inverter and a bus driver

Testing

Make sure the pre-extraction simulation works consistently before starting layout, so you can isolate errors as early as possible.

At the schematic level, I’ve summarized a few common failure modes based on my own experience and that of my classmates.

Forgot to power vdd!/VDD!/VDD.
- If node voltages hover near 0V, or 0.5V, it’s very likely a power issue
- The SRAM cell implicitly uses vdd!. After extraction, it may appear as VDD!.
- At schematic level, there is a simple way to power all:
  - Use 1V vdc to drive vdd! relative to gnd!.
  - Use 1V vdc sources between all other power nets and gnd!
  - Use 0V vdc sources between all other ground nets and gnd!
Wordline glitch
- Addresses must change only when phi_2 is high.
- Probe all wordlines to ensure they are one-hot.
Off-by-one inversion
- The lecture circuit is inverted
- Simple test: invert your input vector
Clock phase overlap
- The circuit should work fine if phi_1 and phi_2 are both 50% duty cycle.
- If problems appear, try:
  - Lengthen the clock period
  - Reduce the duty cycle
Readability and writability
- These are mostly handled in the SRAM cell.
- Make sure the transistors on the bitline add minimal parasitics.
  - Close-to-minimum sizings are fine.
- Probe bit and bit_bar to see if it’s a skew issue
Tristate issues
- Only one driver should be connected to iobus
- When writing, iobus is driven by testbench sources. The read drivers are tristated.
- When reading, iobus is driven by the read driver. Use transmission gates in the testbench to disconnect the testbench voltage sources!
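
The tristate rule can be expressed as a small checker: at any moment, at most one source may actively drive a given iobus bit. This is a hedged behavioral sketch, not a simulator feature. `resolve_bus` and the driver names are hypothetical.

```python
from typing import Optional


def resolve_bus(drivers: dict[str, Optional[int]]) -> Optional[int]:
    """Return the value on one iobus bit, or None if the bus floats.

    Each driver maps to 0, 1, or None (tristated or disconnected).
    """
    active = {name: v for name, v in drivers.items() if v is not None}
    if len(active) > 1:
        raise RuntimeError(f"bus contention between {sorted(active)}")
    return next(iter(active.values()), None)


# Write cycle: the testbench drives, the read driver is tristated.
print(resolve_bus({"testbench": 1, "read_driver": None}))
# Read cycle: the testbench is disconnected by its transmission gate.
print(resolve_bus({"testbench": None, "read_driver": 0}))
```

Passing two non-None drivers raises an error, which mirrors the contention you would see as a mid-rail voltage on iobus.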

Stick Diagram

This is legacy layout. Horizontal gates may save a lot more space

We used a vertical Poly layout for this part. You will see the pros and cons in a moment. We laid out two units at a time.

Column MUX

Four NMOS devices steer either column into the R/W circuitry.

We initially attempted a single-row diffusion-shared layout and missed by a fraction of a um :( The final solution uses gate sharing instead.

A lot of wires, but also a lot of space

The pulldown path is short. Good for delay.

Write Pulldown

Now to the interesting (and hard) part. From lecture, we know that we can control the write pulldown NMOS of bit with write AND iobus (iobus_bar for bit_bar). For stability, we also want to qualify this with phi_1. The logic is:

write AND iobus<i> AND phi_1

(write AND phi_1) is the same for all bits. We can factor out the shared term and supply it from outside, drastically reducing the complexity for each cell!

And bubble push:

(write NAND phi_1) NOR iobus_bar<i>

Omg this is too beautiful
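
A quick truth-table check of the bubble push, written as a sketch with hypothetical function names. By De Morgan, the three-input AND equals a NOR of the shared (write NAND phi_1) term and the complement of the data term, so the polarity of the data input flips when moving to the NOR form.

```python
from itertools import product


def nand(a: int, b: int) -> int:
    return 1 - (a & b)


def nor(a: int, b: int) -> int:
    return 1 - (a | b)


def pulldown_and(write: int, phi_1: int, data: int) -> int:
    """Original form: write AND phi_1 AND data."""
    return write & phi_1 & data


def pulldown_factored(write_nand_phi1: int, data_bar: int) -> int:
    """Factored form: the shared NAND term is generated once outside the cell."""
    return nor(write_nand_phi1, data_bar)


for write, phi_1, data in product((0, 1), repeat=3):
    shared = nand(write, phi_1)
    assert pulldown_and(write, phi_1, data) == pulldown_factored(shared, 1 - data)

print("AND form and NAND/NOR form agree on all 8 input combinations")
```

Running the same loop against your own gate arrangement is a cheap way to catch an accidental inversion before simulation.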

Each bit needs both iobus<i> and iobus_bar<i>, one for bit<i> and the other for bit_bar<i>. Fortunately, there’s ample space to squeeze in an inverter.

Vertical Poly allows aggressive diffusion sharing if the S/D contacts and metal are aligned directly under the M2 bit grid. Otherwise, you may waste a lot of space.

Read Driver

The read driver’s pretty straightforward.

A skewed inverter to handle bitlines
A large tristate driver for iobus

We chose a 4:1 width ratio. You should test it to make sure it works at schematic level.

We then used a (600/300)x3 C²MOS tristate bus driver. Note that it has a stack height of 2. It is discussed on p. 393 of the textbook.

Do not add inverters after a tristate driver. That defeats the point.

With careful tuning and diffusion sharing, you can make everything perfectly fit

What if things don’t fit?

Cry, but don’t overcry. You still have to face it

Check the basics. Start with diffusion sharing, efficient routing/viaing, avoid oversizing, etc.
Resize devices. People constantly miss this. Ask yourself: Is the device on the critical path? How slow would it be if I size it smaller? Can I finger it differently? Can I orient it differently?
Use straight lines. A bent route increases congestion not only for itself but for its neighbors as well! Try to make routes as straight as possible, or shift them away from tight regions.
Detour. If you’ve really tried, take a detour. Find gaps on each layer, and consider moving your routing to these gaps.
(Temporarily) move to a higher Metal layer. This is detour on another dimension. Only do this for short, local routing, as it may horribly interfere with your global routing plans. Vias are also not cheap.
Accept tradeoffs. If there is truly no other way, you have to increase your spacing. Note that this is not an excuse for sloppy layouts.
A slightly area-inefficient design is not a failure; it’s a deliberate tradeoff. In fact, you can often reclaim the space by fitting in power straps, inverters, or decaps.

An example of diffusion sharing and detour (M1)

CONGRATS ON FINISHING HALF THE DESIGN PROJECT!!!

Peripheral-Peripheral

The rest of the circuit is what I call the “peripheral” of peripheral circuits, which includes:

PMOS for cell precharge
Logic for (write NAND phi_1)
A couple of inverted control signals
Power grids and pins

This is where overall layout organization starts to hurt. The main challenges are:

Decide transistor size to drive the shared control signals
- Calculate the total fanout
- I typically make it under FO4, which is fine in practice
- I chose area over delay, not accurately optimizing large-fanout drivers.
Find a place to place such transistors
Find a place to prettily place such transistors
Floorplan the grids. Integrate well with the rest of the design
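
A rough sizing sketch for the fanout step. It assumes gate capacitance scales with gate width and ignores wire capacitance, so treat the result as a starting point only. The function name, the 5 nm snap grid, and the example load are all hypothetical.

```python
import math


def min_driver_width_nm(load_widths_nm: list[int], max_fanout: float = 4.0,
                        grid_nm: int = 5) -> int:
    """Smallest driver input gate width, snapped up to the grid, with fanout <= max_fanout.

    Fanout here is (total load gate width) / (driver input gate width).
    """
    total_load = sum(load_widths_nm)
    return math.ceil(total_load / max_fanout / grid_nm) * grid_nm


# Hypothetical example: a control signal driving one 300 nm gate in each of 8 bit slices.
loads = [300] * 8
width = min_driver_width_nm(loads)
print(width, "nm driver input width, fanout =", sum(loads) / width)
```

Trading area for delay, as described above, amounts to accepting a larger `max_fanout` for signals that are not on the critical path.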

Here’s how I did it:
Draw the core transistors within the grid. Pass DRC. (proof of concept)
Roughly connect the remaining structures (even if not DRC clean) to pass LVS (proof of concept)
Go back and refine the details. Resolve the remaining DRC issues.
STOP for now. Don’t try to make it perfect, as you will probably revisit it

Disclaimer: these sizings are for an older version. Calculate your own!

With these decisions in place, you can lay out the whole thing:

Calibre

It’s a pain dealing with third-party libraries.

LVS

You may get a few LVS warnings about M2 pin shorts. These come from the v1d1_x1 cells’ M2 pin labels. The M4 pins are fine.

We externally connected power through M4 and M2 at the boundary
As long as you leave the “shorted” pins in the center untouched, it should work fine

PEX

You might get the following warnings from extraction:

WARNING: [FDI3034] Schematic instance XI24/XI63/XI3/XI0<0>/M0 not found, use found instance XI24/XI63/XI3/XI0<0> instead.
WARNING: [FDI3046] Failed to create mapping for device "nchpg_sr". Netlist for "XI24/XI63/XI3/XI0<0>/M0" instance has more pins than schematic view.
WARNING: [FDI3014] Could not find cell mapping for device nchpg_sr. Ignoring instance XI24/XI63/XI3/XI0<0>/M0.

Those are fine, since the internal schematic for the 6T SRAM cell is not given. The entire instance will be used for the extraction.

If post-extraction simulation doesn’t work, try a more tolerant testbench timing.

The post-extraction delays are around 50 ns.
