OS Notes: Disk and IO

12 minute read

Published: April 26, 2026

18.1. IO

Issue commands to device
Catch interrupts
Handle error Diverse, portability, performance

Controller and Device

Device expose a finite set of registers to communicate with CPU

Command buffer for start/stop
Data buffer that can be read/written
- E. address, size

Communication Methods

IO Port Space

Old interface, two address spaces

IO port, number assigned
Access with privileged instructions
- in portnum, target
- out portnum, source

Memory-Mapped IO

Mapped to some unique physical offset

Device listens (snoops) a physical range

Advantages

Written in C
No special protection
Convenient memory interface

Communicate through struct pci_config

Inspect with lspci

Need to disable cache!!!

Through page table cache_disabled bit

Interrupt

Raise interrupt when device finished work (or wants attention)

Interrupt controller issues interrupt
CPU acks interrupt
Messaged Signaled Interrupts
- Pin count restrictions Write a small interrupt-describing data block describing data to a special mmap()ed IO address triggers interrupt
- Very programmable

Precise Interrupts

Simple

PC saved in known place
All instructions before fully executed
All instruction after not executed (or no side effects)
Execution state of PC known
What if OOO, many instructions in-flight? Which instruction gets interrupted??

IO Software

Error handling
- Close to hardware as possible
Sync vs asyn
Buffering
Sharable

Performing IO

Programmed IO

CPU does all work
Busy-waiting (pooling)

copy_from_user(buffer, p, count);
for (i = 0; i < count; i++) {
	while (*printer_status_reg != READY) 
		;
	*print_data_register = p[i];
}
return_to_user();

Waste CPU cycle

Interrupt-Drive IO

Process blocked while waiting, another process scheduled

When ready raises interrupt, then copies data

Direct Memory Access (DMA)

Ask DMA controller to move data as a bulk

Has access to system, but independent from CPU
When DMA controller moves data, CPU can do something else
- aka “copy core”
Interrupt when done ![[Pasted image 20260406202800.png|566]]

Memory Protection

Security concerns

Protection with another translation table: IOMMU

Similar to page table protects via MMU
If invalid address, exception raised

Example Devices

Clock

HW Crystal oscillator + pulse counter + holding register

Interrupt when counter becomes 0

SW E. CPU stat, preemption, alarm call, system watchdog (deadlock detection)

When process sleep(10), need to release this process from blocking state after 10 s Create a clock header that wakes a sequence of events

Network Driver

Receive: Network card need to maintain an ring buffer

Throttle TCP/IP or device, when either side too frequent
High degree of concurrency

18.2. Storage Devices

HDD

Cylinder and sector (512b each)
RW head

New devices expose uniform Logical Block Address

Disk controller translates LBA to CHS
OS views as 1D array of sector groups (blocks)

Sequential access much faster than random

Seek Performance

Move to specific track and keep it there

Need more accurate for write, since errors are catastrophic
1. Seeking (track)
- Critical for small loads
- Should to be optimized, schedule requests to shorten seeks
  - Move some data to memory: file system caching
    1. Rotational delay (sector)
- Critical for sequential transfer
  1. Byte transfer
- Much wasted

Disk Management

Placement/ordering requests

Make as sequential as possible
Group as long seeks
Con: if crash, may leave in inconsistent state

Considerations

Internal/external fragmentation
File growth
Sequential/random access time
Metadata overhead

1. Contiguous Allocation

FS allocates space all at once

Space managed by examining bitmap
Metadata: base, size (simple)

Pros

Easy, low overhead
Fast sequential access Cons
External fragmentation
- Can move around
Hard to grow
2. Extent-Based Allocation
Multiple** contiguous regions for file (extent)
ana. memory segmentation
Metadata: array of (base, size)

3. Linked Allocation

Data block part of linked list

Metadata for next pointer

Pros

No external fragmentation
- Similar to paging
File can grow easily
Easy to implement Cons
Slow sequential access
Hard to compute random access
Some overhead

The LL pointers can be stored in a dedicated pointer table Pros

Faster random access: FAT lookup Cons
Slow sequential
Need recovery when crash
4. Indexed Allocation
Metadata: Array of pointers (index) to data blocks
Blocks allocated on demand
i-nodes

Pros

No external fragmentation
Easy growth
- Multiple levels of indexing
Fast random access Cons
Overhead: index blocks
Slow sequential

Scheduling Disk Requests

1. First Come First Served

Process on the order received Pros

Easy, fair Cons
No locality
More average latency

2. Shortest Positioning Time First

Pick request with shortest seek time

High locality, throughput
Starvation
Don’t know which request fastest

3. Elevator

Sweep across disk, serving all requests passed

SPTF, but only in one direction
Switch direction only if no further requests
Locality, bounded waiting

SSD

Organization

Page -> block -> plane (parallel access) -> die -> chip

**Functions

Read
- Very random! Less power, faster
Erase
Program Write takes two steps

Writing

Need to be done at block level

Backup
Erase
Change some pages in memory
Re-program from memory to SSD Very expensive!
- May wear down cell!
- Will not write in place, due to wear balancing

In-Memory Map

Keep memory map of logical -> physical page number

On write, pick unused page, mark previous page free
Repeated write will remap

Metadata

Allocated
Written
Obsolete

100: after power failure, partially written state

Requires a lot of memory, wastes RAM in host

Solution: store FTL map in flash itself! Problem: write amplification
Small writes to empty blocks will lead to intense fragmentation!
Must periodically defragment

Disk Scheduling

No seek time for SSD

No op: pass request directly to device
MQ-deadline
Wear Level Focus: coalesce write

RAID

[!info] Redundant Array of Independent Disks
Used as SLED (Single Large Expensive Disk) logically, but each block may parallel IO
Hard Drive Errors
Checksum error
Transient
Permanent: catastrophic
Seek error: aggressive overshoot
Programming error: requesting nonexistent sector
Controller error

RAID Levels

Level 0
- No redundancy, just logical partition into strips
Level 1
- Duplicate all data
- Double the cost :(
- No write penalty (parallelized)
Level 2
- Data stripping (very small)
  - Parallel access
- Use Hamming code for ECC ($\log(N)$)
  - Correct 1 bit, detect 2 bits
Level 3
- Similar to 2, but only single redundant disk
  - Independent of disk array size!
- Parallel access
- High data transfer rate, but only one IO at a time
Level 4
- Larger strips
- Bit-by-bit parity strips, all stored in parity disk
- Write penalty when IO
Level 5
- Level 4, but parity distributed across all disks
- Loss of any one disk is fine
Level 6
- Two parity calculations, stored in separate blocks on different disks
- Substantial write penalty ![[Pasted image 20260413210602.png]]

Logical Volume Manager

Volume: partition of hard drive

LVM allows combining physical partitions to volume groups logically

IO requests directed to correct physical disk

19. Linux File System

1. Original UNIX

Superblock: sentinel (E. number of blocks, pointer to free list)
Inode: contiguous block, point to data
Data Problem: slow
- Blocks too small
- Index too large
- Transfer rate low - Poor clustering related objects
- ls inefficient
- Inode not local to each other

Modern UNIX FS

Multilevel indexed block allocation

Designed with disk geometry (cylinder groups)

BFS

Partition disk into multiple OSes, or multiple file systems Top level

Super block: FS metadata
Boot block: place bootloader, for hardware
Cylinder group partitions: contain Inode and data blocks
- Minimize disk seeks Allocation goals - Sequential blocks in adjacent sectors - Inode the same cylinder group as file data - Inodes in a directory in same cylinder group Cylinder group
Metadata: super block/CG info
Inode bitmap
Block bitmap
Inodes (128 B each)
Data blocks (4 KB each)

Reserve scattered empty spaces for near allocations

Inode

One Inode for one file, uniquely identified

Directory: also a file, with Inode and data blocks

One dentry per file, with filename and Inode pointer

See via

ls -i

Has direct data block pointers, single indirect, double indirect, …

Extent-based: contiguous mapping of certain blocks on the physical disk

Linking

Static Linking

aka hard link
ln orig_file new_file
ln -p orig_file new_file
Two filenames pointing to the same I-node File deleted only after all references deleted

Symbolic Linking

Special file, data is the name/path to another file

The name doesn’t need to exist, since not pointing to I-node
Symbolic link has its own I-node
Extra overhead, since need to open twice
```
ln -s orig_file new_file
```

Syscalls

Reading file

struct stat *statbuf;     // modes, number of links, ID, ...
int stat(const char *pathname, struct stat *statbuf);
int fstat(int fd, struct stat *statbuf);
int lstat(const char *pathname, struct stat *statbuf);   // for ls

Reading directory

struct *dirent;            // single entry information
DIR *opendir(const char *name);           // open directory, get opaque handler
struct dirent *readdir(DIR *dirp);

DIR *dir;
struct dirent *entry;

dir = opendir("/");
while ((entry = readdir(dir)) != NULL))
	printf(" %s\n", entry->d_name);
closedir(dir);

For ls -ls, open file, and do lstat on that

LFS: Log-structured FS

Disk caching optimized, so most seeks are writes. LFS optimizes writes

Buffer writes within memory. Once full, write to a log (checkpoint)

Segment write is sequential, fast Read still random, but mostly handled by cache
Problem: scatter I-nodes
Need an I-node map pointing to I-nodes
Map is cached, and has fixed disk location Kernel runs a cleaner thread
Treat disk as a circular buffer
If crash in-between, losing data Avoid overhead
Group old data in the same segment
Avoid strict circular log, but a list of segments

Segment

Inode interleaved with data

Need store identity of I-node, block
Crashes
Assume write happens atomically
B: I-node bitmap update
I: I-node added
D: Dentry added to directory data block
fsck: File System Consistency Check
Very long time
Can’t fix all senarios, such as B'I D'
Block Consistency
Consistent: either in free list or used list
Missing block: add it back to free list
Duplicate free blocks: fix free list
Duplicate used blocks pointed by 2 I-nodes: make a copy
Journaling
Keep a log of my actions, before updating FS

If crash in the middle, replay the log

Better than fsck
1. Commit dirty blocks to journal
2. Write commit
3. Write to actual disk
4. Free journal space

Still expensive. Need two disk writes

Data journaling: all writes, expensive
Metadata journaling: only metadata
Ordered journal: write file to FS first, then journal metadata
- Possible inconsistency
  Virtual FS
  Various FS under UNIX

Interface: open(), close(), read(), etc.

Does not expose implementation details

Creation

mkfs -t ext3 /dev/sda6

Creates an ext3 FS on partition sda6
Will wipe out all existing data

Mounting

mount -t type device directory

Hard disk mounted to /
Other FS (E. temp, network) …

Data Structures

struct file: instance of open file

Per-process fdtable entry, allowing sharing by copying file pointer
Flags, current position

struct dentry: hard link that contains link and inode number

Break absolute path into dentries
Expensive path resolution, need recursive find

struct inode

[!info] HW7

const struct inode_operations *i_op

Interface to read, write, seek, …

struct super_block: descriptor of a mounted FS

Journaling

Change scheduling algorithm for a particular disk device

Special Purpose FS

RamFS (built on RAM)

HW6

Device Drivers

Interface between specific hardware from software

Loaded through Loadable Kernel Modules (LKMs)
Virtual files, can be interacted with syscalls open(), read()
- Functions specified in struct file_operations
Device type (from stat <file>): major and minor number

E. `/dev/nullzero`

static struct file_operations fops = { 
	.owner = THIS_MODULE, 
	.open = nullzero_open,
	.read = nullzero_read, 
	.write = nullzero_write, 
	.release = nullzero_release, // Called during `close()` 
	.unlocked_ioctl = nullzero_ioctl,   // called during ioctl()
};

static int nullzero_open(struct inode *ino, struct file *fp) { return 0; }
static int nullzero_release(struct inode *ino, struct file *fp) { return 0; }
static ssize_t nullzero_read(struct file *fp, char __user *buf, 
	size_t len, loff_t *off); 
static ssize_t nullzero_write(struct file *fp, const char __user *buf, 
	size_t len, loff_t *off);

fp->private_data is a generic void * to store per-open state information

`ioctl()`

Syscall whose behavior can be specified per device
Fall out of scope of read() and write(), for larger device control

#define NULLZERO_IOC_SET_CHAR _IOW(NULLZERO_IOC_MAGIC, 1, char)
static long nullzero_ioctl(struct file *filep, 
						   unsigned int cmd, unsigned long arg) { 
	char new_c; 
	switch (cmd) { 
	case NULLZERO_IOC_SET_CHAR: 
		if (get_user(new_c, (char __user *)arg)) // Similar to `copy_from_user()` 
			return -EFAULT; 
		c = new_c; 
		break; 
	default: 
		return -ENOTTY; // "Not a typewriter"; for invalid `ioctl()` 
	} 
	return 0; 
}

Share on

Bluesky Facebook LinkedIn X (formerly Twitter)

Ming Gong

18.1. IO

Category

Controller and Device

Communication Methods

IO Port Space

Memory-Mapped IO

Interrupt

Messaged Signaled Interrupts

Precise Interrupts

IO Software

Performing IO

Programmed IO

Interrupt-Drive IO

Direct Memory Access (DMA)

Example Devices

Clock

Network Driver

18.2. Storage Devices

HDD

Seek Performance

Disk Management

1. Contiguous Allocation

2. Extent-Based Allocation

3. Linked Allocation

4. Indexed Allocation

Scheduling Disk Requests

1. First Come First Served

2. Shortest Positioning Time First

3. Elevator

SSD

Organization

Writing

In-Memory Map

Disk Scheduling

RAID

Hard Drive Errors

RAID Levels

Logical Volume Manager

19. Linux File System

1. Original UNIX

Modern UNIX FS

BFS

Inode

Linking

Static Linking

Symbolic Linking

Syscalls

LFS: Log-structured FS

Segment

Crashes

fsck: File System Consistency Check

Block Consistency

Journaling

Virtual FS

Creation

Mounting

Data Structures

Special Purpose FS

HW6

Device Drivers

E. /dev/nullzero

ioctl()

Share on

You May Also Enjoy

Understanding Hazards

Understanding Hazards

42030青春版–一套刚刚好的乐高

Embedded Project: CircuitSim

`fsck`: File System Consistency Check

E. `/dev/nullzero`

`ioctl()`