OS Notes: ASP
1-2. Intro
- Abstraction
- Manages and hides detail
- Protection
Early OS
Just a library of standard features
- Assume one program at a time
- No malicious program
- Poor utilization (idle). Solution: protection
- Preemption: reclaim CPU from a running process
- Kernel regains control whenever interval timer fires
- Supervisor mode
- Access control
Preemption
```c
#include <sys/time.h>
#include <assert.h>
#include <stdio.h>

double GetTime() {
    struct timeval t;
    int rc = gettimeofday(&t, NULL);
    assert(rc == 0);
    return (double) t.tv_sec + (double) t.tv_usec / 1e6;
}

void Spin(int howlong) {
    double t = GetTime();
    while ((GetTime() - t) < (double) howlong)
        ; // do nothing in loop
}

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <string>\n", argv[0]);
        return 1;
    }
    char *str = argv[1];
    while (1) {
        Spin(1);
        printf("%s\n", str);
    }
}
```
Protection
Can break the system with fork() bomb
- Limit number of processes/CPU
- Kill misbehaving programs
Interposition
- Check if accesses are legal
System Call
App invokes the kernel via a special HW instruction
File descriptors are inherited by child processes
OS Structures
- Monolithic: app directly above OS
- Microkernel: app with modular design, called through system calls
- Resilience to application bugs
Memory
If you can’t name it, you can’t touch it
Address translation (done in HW) ![[mem.png]]
Advantages
- Kernel-only VA
- Can’t be touched by apps
- Read-only VA
- Sharing code pages (libraries)
- Inter-process communication, memory mapping
- Disable execution
System contexts
- User-level
- Kernel process context: syscall, exception handling
- Kernel code not associated with process: timer/device interrupt
- Context switch code: change the running process
- Idle
Multitasking
Context switching is not always better if a lot of memory is required

`strace` traces the system calls a program makes
Linked List
- Linear movement (not good at random access)
- Circular doubly linked
```c
#include <linux/list.h>

struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

struct fox {
    // data
    struct list_head list; // embedded value, not a pointer
};
```
Find the parent structure with only a pointer to its field (find `fox` from `list_head`):
```c
#define container_of(ptr, type, member) ({ \
    const typeof( ((type *)0)->member ) *__mptr = (ptr); \
    (type *)( (char *)__mptr - offsetof(type, member) ); })

struct fox *red_fox;
red_fox = kmalloc(sizeof(*red_fox), GFP_KERNEL);
INIT_LIST_HEAD(&red_fox->list);
```
List Head
Special head pointer, created with a macro:
```c
static LIST_HEAD(head);
```
Manipulation
```c
#include <linux/list.h>

list_add(struct list_head *new, struct list_head *head);
// add immediately after head
list_add_tail(struct list_head *new, struct list_head *head);
// add immediately BEFORE head
list_del(struct list_head *entry);
// does not free any memory
list_del_init(struct list_head *entry);
// delete, then reinitialize
```
Traversal
```c
list_for_each_entry(pos, head, member)
```
- `pos`: a temporary pointer for the iterated variable
- `head`: pointer to the `list_head` node
- `member`: variable name of `struct list_head` in `pos`
Remove
```c
list_for_each_entry_safe(pos, next, head, member)
```
- `next` stores the next list entry, so it’s safe to remove the current entry
3. Linux
- Firmware code performs a power-on self test, pre-initializes hardware
- Firmware checks Master Boot Record (MBR) of hard disks to identify a bootable partition
- Bootloader in MBR reads configuration files, loads and launches the kernel from the disk
Booting Linux
- Kernel program starts
- Hardware configures, including VA
- Start process 0
- Virtual task
- Start scheduler
- Process 0 forks process 1 to run `kernel_init`
- Process 0 becomes the idle task, calls `schedule()`
- `schedule()` context switches to other processes from the runqueue
- Process 1 runs and `execve`s to “/init”
  - `init` is a regular program, such as `systemd`
- At some point, we return to the kernel again
Processes
- If a process’s parent dies before reaping the child, it becomes an orphan and gets adopted by process 1 (`init`)
- If the child dies before the parent reaps it, it’s a zombie process
  - Shows as `<defunct>` in the process tree
Signal
```c
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>

static void sig_int(int signo)
{
    printf("stop pressing ctrl-c!\n");
}

int main()
{
    if (signal(SIGINT, &sig_int) == SIG_ERR) {
        perror("signal() failed");
        exit(1);
    }
    int i = 0;
    for (;;) {
        printf("%d\n", i++);
        sleep(1);
    }
}
```
Kernel Modules
Kernel Module Interface to dynamically link object files into running kernel
Create a Module
```c
#include <linux/module.h>
#include <linux/printk.h>

// called when loaded
int hello(void) {
    printk(KERN_INFO "Hello\n");
    return 0; // return an error code
}

// called when removed
void goodbye(void) {
    printk(KERN_INFO "Goodbye\n");
}

// register module entry and exit points
module_init(hello);
module_exit(goodbye);

// metadata
MODULE_DESCRIPTION("A basic Hello World module");
MODULE_AUTHOR("cs4118");
MODULE_LICENSE("GPL");
```
Build a Module
```make
obj-m += hello.o

all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
```
- `obj-m`: object files to be built as kernel modules
- `obj-y`: object files linked into the main kernel image (we’re dynamically loading, so use `obj-m`)
- `-C` changes the directory to `/lib/modules/$(shell uname -r)/build`
  - `$(shell uname -r)`: evaluates `uname -r` (kernel release) with `shell`
  - That folder contains the kernel Makefile
- `M=$(PWD)` lets the kernel Makefile return to the module source directory before building `modules` or `clean`
Load a Module
```sh
sudo su
insmod hello.ko
lsmod        # check mods installed
rmmod hello  # no need for .ko
```
Read the kernel log buffer:
```sh
dmesg
dmesg -c            # clear the buffer
dmesg --level=warn
```
Fault
If a fault happens in your kernel module, the kernel jumps to its exception handlers
- The kernel provides two levels of fault handlers
- Beyond that, the kernel crashes
  - Due to limited hardware support
4.1 Signals
Process Groups
How to do other work while a process is running:
```sh
proc1 | proc2 & # send to bg
[1] 7106
proc3
```
- `7106`: pid of the last process
- `jobs`: list all jobs
- `Ctrl-Z`: suspend the foreground job
- `bg <job>`: resume `<job>` in the background
- `fg <job>`: bring a job from bg to fg
Sending Signals
```c
#include <signal.h>
int kill(pid_t pid, int signo);
int raise(int signo);
```
- If `pid` < 0, the signal is sent to the process group with `pgid == -pid`
- If `pid` == -1, the signal is sent to every process the sender has permission to signal

To the foreground process group:
- `Ctrl-C`: `SIGINT`
- `Ctrl-\`: `SIGQUIT`
- `Ctrl-Z`: `SIGTSTP`
```c
typedef void (*sighandler_t)(int);
sighandler_t signal(int signum, sighandler_t handler);
```
Handler can also be `SIG_IGN` or `SIG_DFL`
Issues
A slow system call (e.g. `read()`) can be interrupted by a signal
- Fix: check for `EINTR` in `errno` and restart `read()`

```c
#include "apue.h"

static void sig_alrm(int);

int main(void)
{
    int n;
    char line[MAXLINE];

    if (signal(SIGALRM, sig_alrm) == SIG_ERR)
        err_sys("signal(SIGALRM) error");

    alarm(10);
    if ((n = read(STDIN_FILENO, line, MAXLINE)) < 0)
        err_sys("read error");
    write(STDOUT_FILENO, line, n);
}

static void sig_alrm(int signo)
{
    /* nothing to do, just return to interrupt the read */
}
```
**Lost signal**: disposition set with `signal()` is **reset** after each signal
- Fix: set disposition again after detecting the signal
- Race condition: getting another signal before reset :(
Functions that use **static** data structures `malloc()`, `free()`, `printf()` are not **signal safe**
- Can't be called in **asynchronous** contexts
### Race Condition
```c
static void sig_alrm(int signo) { /* just interrupts pause() */ }

unsigned int sleep(unsigned int seconds) {
    signal(SIGALRM, sig_alrm);
    alarm(seconds); /* start the timer */
    pause();        /* next caught signal wakes us up */
    return alarm(0); /* turn off timer, return unslept time */
}
```
- Race condition: the alarm can fire before `pause()`, leaving us asleep forever
- Each alarm resets the handler

`sigaction`: stable handler disposition
- See https://mg-04.github.io/articles/asp/4#portable-solution for detail
4.2 File IO
[!info] Everything is a file in UNIX
- Good with portability and permissions
```c
int open(const char *path, int oflag, mode_t mode);
// need to specify mode if file is being created
```
- Creates an entry in the process’s File Descriptor Table (FDT)
- Offset
- Options
- Metadata
- Returns the file descriptor
```c
int close(int fildes);
```
- Deletes the file table entry
- If the process exits without closing, the fd leaks until reclaimed by the OS
- Might lose buffered data too

```c
off_t lseek(int fildes, off_t offset, int whence);
```
- `whence`: `SEEK_SET`, `SEEK_CUR`, `SEEK_END`, plus `offset`
- A hole in the file is useful for storing sparse structures (physical < logical size)
- Returns the new position
```c
ssize_t read(int fildes, void *buf, size_t nbyte);
ssize_t write(int fildes, const void *buf, size_t nbyte);

// for sockets
ssize_t recv(int socket, void *buffer, size_t length, int flags);
ssize_t send(int socket, const void *buffer, size_t length, int flags);
```
- Return the number of bytes actually read/written, which may be less than requested
C stdio
Wrapped version in C stdio library
- Reduce the number of system calls while performing stream operations
```c
FILE *fopen(const char *pathname, const char *mode);                     // open()
int fclose(FILE *stream);                                                // close()
int fseek(FILE *stream, long offset, int whence);                        // lseek()
size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream);        // read()
size_t fwrite(const void *ptr, size_t size, size_t nmemb, FILE *stream); // write()
```
Kernel Data Structure
![[file.jpg]]
- File table entry may be shared
- Independent processes have different process/file table entries, pointing to same v-nodes
- i-node specifies how files stored on disk
- v-node: API of i-node
Process Relations
- If `fork()` happens before `open()`, `write()`s will overwrite each other, since each process’s file table entry stores its own offset
- If `open()` happens before `fork()`, `write()`s land one after another, since the offset is shared through the same file table entry
dup()
```c
int dup(int oldfd);
```
Duplicates the pointer to the file table entry and returns the next available file descriptor
5.1 IPC
Unnamed Pipe
A kernel object; the pipe itself is not copied over `fork()`
```c
#include <unistd.h>
int pipe(int fd[2]);
```
- `fd[0]` is opened for reading
- `fd[1]` is opened for writing
- Returns 0 if OK, -1 on error

After `fork()`, the parent and child both hold the pipe’s file descriptors
Connect the input and output of two processes:
- Parse `argv`
```c
int main(int argc, char **argv)
{
    int fd[2];
    pid_t pid1, pid2;

    // Split arguments ["cmd1", ..., "--", "cmd2", ...] into
    // ["cmd1", ...] and ["cmd2", ...]
    char **argv1 = argv + 1; // argv for the first command
    char **argv2;            // argv for the second command
    for (argv2 = argv1; *argv2; argv2++) {
        if (strcmp(*argv2, "--") == 0) {
            *argv2++ = NULL;
            break;
        }
    }
```
- Piping and `fork()`
```c
    pipe(fd);
    if ((pid1 = fork()) == 0) {
        close(fd[0]);   // Close read end of pipe
        dup2(fd[1], 1); // Redirect stdout to write end of pipe
        close(fd[1]);   // stdout already writes to pipe, close spare fd
        execvp(*argv1, argv1);
        // Unreachable
    }
    if ((pid2 = fork()) == 0) {
        close(fd[1]);   // Close write end of pipe
        dup2(fd[0], 0); // Redirect stdin from read end of pipe
        close(fd[0]);   // stdin already reads from pipe, close spare fd
        execvp(*argv2, argv2);
        // Unreachable
    }
```
- Parent handling
```c
    // Parent does not need either end of the pipe
    close(fd[0]);
    close(fd[1]);
    waitpid(pid1, NULL, 0);
    waitpid(pid2, NULL, 0);
    return 0;
}
```
- Need error checks
```sh
./connect2 ls -ls -- grep "connect"
# equivalent to:
ls -ls | grep "connect"
```
[!warning] Pipe is a FIFO
Data is not broadcast; each byte is consumed by only one reader
- Writes up to 4K (`PIPE_BUF`) are atomic
Named Pipe
#include <sys/stat.h>
int mkfifo(const char *path, mode_t mode);
- Creates a FIFO at
path - Half-duplex
Unnamed Semaphore
- `sem_post()`: increment
- `sem_wait()`: wait until > 0, then decrement
Usage
- Binary: lock
- Counting: resource sharing
- Ordering
```c
#include <semaphore.h>
int sem_init(sem_t *sem, int pshared, unsigned int value);
int sem_destroy(sem_t *sem);
```
- Needs shared memory (`mmap()`)
- Or use a named semaphore
Named Semaphore
Created under /dev/shm
```c
sem_t *sem_open(char *name, int oflag, ... /* mode_t mode, unsigned int value */ );
int sem_close(sem_t *sem);
int sem_unlink(char *name); // deletes from fs
```
Modifying Semaphore
- `sem_trywait()` does not block
- `sem_wait()` blocks
- `sem_timedwait()` blocks until the semaphore is available or the timeout expires
  - Can’t implement this with `SIGALRM` due to race conditions
Memory Mapped IO
Map file into address space
- Private: changes not flushed to disk or other processes
- Shared: other processes can see changes
- Anonymous: not backed by a file (`fd = -1`, `flag = MAP_ANON`)
  - `malloc()`: actual allocation
  - `mmap()`: reservation of address space, protection, sharing, alignment
- Named: backed by a file

```c
#include <sys/mman.h>
void *mmap(void *addr, size_t len, int prot, int flag, int fd, off_t off);
```
- `addr`: map target
  - If `NULL`, let the OS decide
- `len`: size to allocate
- `prot`: read, write, exec
- `flag`: `MAP_PRIVATE`, `MAP_SHARED`, `MAP_ANON`
- `fd`: file we want to map
  - -1 for anonymous
- Returns a pointer to the region
counter.c
Declare a counter with a semaphore lock
```c
#include <assert.h>
#include <semaphore.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define LOOPS 2000

struct counter {
    sem_t sem;
    int cnt;
};

static struct counter *counter = NULL;

static void inc_loop() {
    for (int i = 0; i < LOOPS; i++) {
        sem_wait(&counter->sem);
        counter->cnt++;
        sem_post(&counter->sem);
    }
}

int main(int argc, char **argv) {
    // Create a shared anonymous memory mapping, set global pointer to it
    counter = mmap(
        /*addr=*/NULL,
        sizeof(struct counter),
        PROT_READ | PROT_WRITE, // Region is readable and writable
        MAP_SHARED | MAP_ANON,  // share with children
        /*fd=*/-1,
        /*offset=*/0
    );

    // Mapping is already zero-initialized.
    assert(counter->cnt == 0);
    sem_init(&counter->sem, /*pshared=*/1, /*value=*/1);

    pid_t pid;
    if ((pid = fork()) == 0) {
        inc_loop();
        return 0;
    }
    inc_loop();
    waitpid(pid, NULL, 0);

    printf("Total count: %d, Expected: %d\n", counter->cnt, LOOPS * 2);

    sem_destroy(&counter->sem);
    assert(munmap(counter, sizeof(struct counter)) == 0);
    return 0;
}
```
5.2 Threads
Share virtual memory and fd
- Each thread has its own stack
```c
#include <pthread.h>
int pthread_create(pthread_t *thread,
                   const pthread_attr_t *attr,
                   void *(*start_routine)(void *),
                   void *arg);
int pthread_join(pthread_t thread, void **retval);
```
Mutex
```c
#include <pthread.h>
int pthread_mutex_init(pthread_mutex_t *mutex, pthread_mutexattr_t *attr);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex); // nonblocking if fail
int pthread_mutex_unlock(pthread_mutex_t *mutex);

#include <time.h>
int pthread_mutex_timedlock(pthread_mutex_t *mutex, struct timespec *tsptr);
```
Deadlock
- Locking the same mutex twice
- Mutual lock of 2 mutexes

Need strict lock ordering
Condition Variables
Signal
```c
#include <pthread.h>
int pthread_cond_init(pthread_cond_t *restrict cond,
                      const pthread_condattr_t *restrict attr);
int pthread_cond_destroy(pthread_cond_t *cond);
int pthread_cond_wait(pthread_cond_t *restrict cond,
                      pthread_mutex_t *restrict mutex);
int pthread_cond_timedwait(pthread_cond_t *restrict cond,
                           pthread_mutex_t *restrict mutex,
                           const struct timespec *restrict tsptr);
int pthread_cond_signal(pthread_cond_t *cond);
int pthread_cond_broadcast(pthread_cond_t *cond);
```
```c
#include <pthread.h>

struct msg {
    struct msg *m_next;
};

struct msg *workq;
pthread_cond_t qready = PTHREAD_COND_INITIALIZER;
pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

void process_msg(void)
{
    struct msg *mp;
    for (;;) {
        pthread_mutex_lock(&qlock);
        while (workq == NULL)
            pthread_cond_wait(&qready, &qlock);
        mp = workq;
        workq = mp->m_next;
        pthread_mutex_unlock(&qlock);
        /* now process the message mp */
    }
}

void enqueue_msg(struct msg *mp)
{
    pthread_mutex_lock(&qlock);
    mp->m_next = workq;
    workq = mp;
    pthread_cond_signal(&qready);
    pthread_mutex_unlock(&qlock);
}
```
Other Mechanisms
- Spinlock (used a lot in the kernel, for small hold time)
- Barriers
6 Advanced IO
[!info] Skim this
Nonblocking IO: `io_uring` avoids polling the kernel through system calls ![[io_ring.png]]
IO Multiplexing
E. a network server watching many connections:
- Alternate nonblocking reads, or
- Let the kernel multiplex the IO
## `select()`
Wait on an fd set:
```c
#include <sys/select.h>
int select(int maxfdp1,         // max fd plus 1, or simply FD_SETSIZE
           /* main params */
           fd_set *readfds,     // see if they're ready for reading
           fd_set *writefds,    // see if they're ready for writing
           fd_set *exceptfds,   // see if exceptional condition
           struct timeval *tvptr); // timeout

void FD_ZERO(fd_set *fdset);         // initialize fdset
void FD_SET(int fd, fd_set *fdset);  // adds fd to fdset
void FD_CLR(int fd, fd_set *fdset);  // removes fd from fdset
int FD_ISSET(int fd, fd_set *fdset); // tests if fd is in fdset
```
Inefficient
## `poll()`
Modern version; the caller passes an array of `struct pollfd`, with no fixed `FD_SETSIZE` limit
```c
#include <poll.h>

struct pollfd {
    int fd;        // fd to monitor
    short events;  // POLLIN for read, POLLOUT for write
    short revents; // events that actually occurred
};

int poll(struct pollfd fds[],
         nfds_t nfds, // number of fds to monitor
         int timeout);
```
epoll()
Edge-triggered API
Domain Sockets
To share memory:
- Threads: shared by default
- Related processes: anonymous `mmap()`
- Unrelated processes: file-backed `mmap()`

To pass data:
- Related processes: unnamed `pipe()` (half-duplex)
- Unrelated processes: named `mkfifo()` (half-duplex)
- Distant processes: TCP/UDP sockets (full duplex)
Unix Domain Sockets
- Unnamed: `socketpair(AF_UNIX, ...);`
  - Duplex pipe
- Named: `socket(AF_UNIX, ...);`
  - Special file

Can transport an open fd
- Passes the i-node information

```c
int socketpair(int domain, int type, int protocol, int sv[2]);
```
- The socket “buffers” the data from the fd
recv-unix.c
```c
int main(int argc, char **argv)
{
    const char *name = argv[1];
    int num_to_recv = atoi(argv[2]);
    struct sockaddr_un un;
    int fd, len;

    // create a UNIX domain datagram socket
    fd = socket(AF_UNIX, SOCK_DGRAM, 0);

    // remove the socket file if it exists already
    unlink(name);

    // fill in the socket address structure
    memset(&un, 0, sizeof(un));
    un.sun_family = AF_UNIX;
    strcpy(un.sun_path, name);
    len = offsetof(struct sockaddr_un, sun_path) + strlen(name);

    // bind the name to the descriptor
    if (bind(fd, (struct sockaddr *)&un, len) < 0)
        die("bind failed");

    char buf[num_to_recv + 1];
    for (;;) {
        memset(buf, 0, sizeof(buf));
        int n = recv(fd, buf, num_to_recv, 0);
        if (n < 0)
            die("recv failed");
        else
            printf("%d bytes received: \"%s\"\n", n, buf);
    }

    close(fd);
    unlink(name);
    return 0;
}
```
send-unix.c
```c
int main(int argc, char **argv)
{
    const char *name = argv[1];
    const char *msg = argv[2];
    int num_repeat = atoi(argv[3]);
    struct sockaddr_un un;
    int fd, len, i;

    // create a UNIX domain datagram socket
    fd = socket(AF_UNIX, SOCK_DGRAM, 0);

    // fill in the socket address structure with server's address
    memset(&un, 0, sizeof(un));
    un.sun_family = AF_UNIX;
    strcpy(un.sun_path, name);
    len = offsetof(struct sockaddr_un, sun_path) + strlen(name);

    for (i = 0; i < num_repeat; i++) {
        int n = sendto(fd, msg, strlen(msg), 0, (struct sockaddr *)&un, len);
        printf("sent %d bytes\n", n);
    }

    close(fd);
    return 0;
}
```
```sh
./recv-unix filename 1000
./send-unix filename hello 100
```
7. System Calls
Address Space
Virtually addressed
- The moment you touch something, you will get a backing in physical memory
```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLKSZ (1<<22) // 4MB

void wait4input(const char* msg) { puts(msg); getchar(); }

int main() {
    char* arr;
    int i, j;
    printf("mypid=%d\n", getpid());
    wait4input("before big loop");
    for (i = 0; i < 1000; i++) {
        //wait4input("before malloc");
        arr = malloc(BLKSZ);
        for (j = 0; j < BLKSZ; j++)
            arr[j] = '1';
    }
    wait4input("done");
    return 0;
}
```
Go to the process filesystem:
```sh
cd /proc/2265
cat maps # shows the entire address space
```
Getting a block on the heap:
- Modern allocators use `mmap()` instead of repeatedly calling `brk()`
- Ideally, you get a contiguous region
- `mmap()` is more flexible; `brk()` forces a contiguous region
Interrupts
- Hardware
- Async (network, timer)
- Exceptions
- Synchronous
- Division by 0, page fault
- Software
- Synchronous
- syscall, x86 assembly interrupt, debugger
On a `read()` call:
```asm
movl $__NR_read, %eax
int $0x80
ret
```
Now enter kernel mode:
- Kernel looks up `0x80` in the Interrupt Descriptor Table -> `system_call`
- Jumps to `system_call` (saves registers), which accesses `sys_call_table[%eax]`
- `sys_call_table` is read at index `__NR_read` -> `sys_read`
- Jumps to `sys_read()`, which does the real read
Parameters
Passed via registers. If larger, use a struct pointer (E. struct sigaction)
Convention for function calls (typical):
- Arguments passed through registers
- Sometimes as stack offsets
- Volatile: may be changed on return
- Non-volatile: must be saved and restored by callee
- Encoded in Application Binary Interface (ABI)
The kernel must validate user-supplied buffers, checking boundaries and permissions, to avoid reading or writing kernel memory on the user’s behalf once in kernel mode
Kernel Actions
Need to enter and return kernel mode, branch to the position of sys_read() function in the kernel ![[syscall.png]]
HW3
System Calls
```c
SYSCALL_DEFINE0(getpid) // macro; expands to asmlinkage long sys_getpid(void)
{
    return task_tgid_vnr(current); // returns current->tgid, a long
}
```
- `asmlinkage`: tells the compiler to look only on the stack for arguments
  - Required for all syscalls
- Returns `long` in the kernel and `int` in user space
- `sys_*` naming convention

Syscall number
- Can’t be reused
- All registered syscalls are in `sys_call_table`
Implementation
- Verify parameters
- Pointers
copy_to_user(dst, src, size)writes into user spacecopy_from_user()reads from user space- Return num bytes failed to copy on error
```c
SYSCALL_DEFINE3(silly_copy,
                unsigned long *, src, unsigned long *, dst, unsigned long, len)
{
    unsigned long buf;
    if (copy_from_user(&buf, src, len))
        return -EFAULT;
    if (copy_to_user(dst, &buf, len))
        return -EFAULT;
    return len;
}
```
- Check capability
```c
SYSCALL_DEFINE4(reboot,
                int, magic1, int, magic2, unsigned int, cmd, void __user *, arg)
{
    if (!capable(CAP_SYS_BOOT))
        return -EPERM;
    // ...
}
```
Context
- The kernel is in process context during syscall execution
- The `current` pointer points to the current task
- Can sleep (E. blocking call), preemptible
Registration
- Add an entry to the end of the syscall table
  - For each architecture
- Define the syscall number in `<asm/unistd.h>`
- Compile into the kernel image
Kernel Modules
Exported symbols are visible to any loadable module
- Dynamically linked to the kernel
```c
#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>

static int rday_1 = 1;
int rday_2 = 2;
int rday_3 = 3;
EXPORT_SYMBOL(rday_3);

static int __init hi(void) {
    printk(KERN_INFO "module m2 being loaded.\n");
    return 0;
}

static void __exit bye(void) {
    printk(KERN_INFO "module m2 being unloaded.\n");
}

module_init(hi);
module_exit(bye);
```
- `rday_1` does not show in the symbol table (static), does not show up in kernel space
- `rday_2` and `rday_3` are flagged as global data objects
- `rday_3` additionally gets entries in the module symbol and string tables, `ksymtab` and `kstrtab`, so other modules can link against it
Need to fix syscall table and syscalls.h
8. Run/Wait Queues
Timing `gettimeofday()` calls in a loop:
- Inconsistent results
  - Due to scheduling and interrupts
- Calling it through `syscall()` is much slower
  - The user library reads the time from the kernel directly (a shared page), avoiding the actual `syscall()` trap overhead
- Same story for opening a file
Process States
In Linux, task can be a process/thread
```c
#define TASK_RUNNING          0x0000
#define TASK_INTERRUPTIBLE    0x0001
#define TASK_UNINTERRUPTIBLE  0x0002
```
- Running: runnable (running, or waiting to run)
- Interruptible: sleep waiting for something, can be awakened if a signal received
- Uninterruptible: also sleeps waiting, but can’t be awakened by signals
Run Queue
A linked list of `task_struct` (linked via `children` and `sibling`)

Each CPU has a `run_queue` linking tasks in the `TASK_RUNNING` state
- Each with a separate `list_head`, forming its own disjoint run queue
![[run_queue.png]]
Wait Queue
Not embedded in task_struct
- In fact, a wait queue entry is a parent of `task_struct`
- Since we don’t know in advance which element to wake up ![[wait_queue.png]]
Data Structure
```c
struct wait_queue_head {
    spinlock_t lock;
    struct list_head task_list;
};

struct waitqueue {
    struct task_struct *task;
    wait_queue_func_t func; // callback function for wait condition
    struct list_head entry;
};
```
Interface
```c
#include <linux/wait.h>
wait_event_interruptible()
// a bit outdated, doesn't account for synchronization
```
Functions:
- `prepare_to_wait()`: add to wait queue, change state to `TASK_INTERRUPTIBLE`
- `signal_pending()`: check for signals that interrupt the wait; if so, break
- `schedule()`: save state and sleep
- `finish_wait()`: change state to `TASK_RUNNING`, remove from wait queue
Scheduling
- `pick_next_task()`: choose a new task from the run queue
- `context_switch()`: put the current task to sleep, and start running the new task
Sleeping
- `wait_event()`: issue a wait object
  - Enqueue on the wait queue
  - Remove from the run queue
- `schedule()`
  - `pick_next_task()`
  - `context_switch()`
- Run other tasks
Waking Up
- When the event happens, `wake_up()`
- Calls `try_to_wake_up()` on each task
- Enqueues them on the run queue
- Waits for another task to call `schedule()`, which may choose a previously sleeping task
- The woken task checks its condition again; if true, `finish_wait()`

Never goes back directly to running on the CPU

States
- Running on CPU
- To sleep, ![[switch_fsm.png]]
E. `read()`
- Trap into kernel
- Save registers onto the kernel stack
- Device driver issues an IO request to the device
- Put the calling process to sleep
- Another process starts running
- Device receives input, raises a hardware interrupt
- Trap into kernel again, enqueue the waiting process on the run queue
- The current task eventually calls `schedule()`
  - Another process on the run queue starts running
Context Switch
Happens when the current task blocks OR is preempted via interrupt (easier in ARM) TODO
- Save and restore CPU registers (written in assembly)
Preemption
If a process does not give up the CPU
- Use external timer to interrupt ![[force_switch.png]]
9. Interrupts
Interrupt Contexts
Limitations: can’t sleep
- No associated running task, so it can’t interact with wait queues or `schedule()`
- Doesn’t know when it could wake up
- `kmalloc()`, `copy_to_user()` may trigger IO, so they can’t be called from interrupt context

All handlers share one interrupt stack per processor
Handling
Idea: Defer most work for later, only time-critical ones
- Top half (first level)
- ACK the interrupt
- Fast, to make CPU get back to work
- Bottom Half (second level)
- Bulk process when CPU idle (E. large file received over network)
  - Kernel tasklets, threads

A single interrupt will not nest, so handlers need not be reentrant
- But they can be interrupted by another type of interrupt
[!info] E. network package arrival
- Top half: ACK arrival, move from NIC to memory, prepare for future packets
- Bottom half: propagate through networking stack (TCP/IP), can be deferred
Sync
All concurrent data must be protected
Mutex (Sleeping)
```c
pthread_mutex_lock(&balance_lock);
++balance;
pthread_mutex_unlock(&balance_lock);
```
Calling task put to sleep while waiting for lock
[!warning] Is sleeping always a good idea?
- Bad: sleep overhead
Spin Lock
[!warning] No preemption while spinning
Interrupts still fire normally, but the task can’t be rescheduled
- With `preempt_count`: only preemptible when it is 0
A (broken) naive spin lock:
```c
int flag = 0;

lock() {
    while (flag == 1)
        ;
    flag = 1;
}

unlock() {
    flag = 0;
}
```
Issues
- Multithread race condition between the test and the set; needs to be atomic
- Wastes CPU cycles

Both threads can assume ownership: ![[Pasted image 20260224194546.png]]

Need an atomic instruction `test_and_set(&flag)`. Hardware pseudocode:
```c
int test_and_set(int *lock) {
    int old = *lock;
    *lock = 1;
    return old;
}
```
Linux Kernel
`spin_lock()` and `spin_unlock()`
- Keep critical sections small
- Must not lose the CPU while holding the lock
- Must not call any function that may sleep
  - E. `kmalloc()`, `copy_from_user()`
- Prevents kernel preemption: `++preempt_count`
- OK for hardware interrupts, unless the handler also tries to take the lock
  - Deadlock
- Used by process context or a single interrupt
IRQ: interrupt request
`spin_lock_irqsave()`, `spin_unlock_irqrestore()`
- Lock a shared resource that may also be used by an interrupt handler
- Saves the current interrupt state, then disables all interrupts on the local CPU
  - Restored later
- Used by process and interrupt contexts, or by more than one interrupt

`spin_lock_irq()`, `spin_unlock_irq()`: same, but unconditionally disable/enable interrupts
Preemption
Kernel forcefully reclaim CPU
- Tracks a per-process `TIF_NEED_RESCHED` flag
- `preempt_count` tracks how many atomic sections the CPU has entered; can’t preempt if nonzero
- If the flag is set, preempt by calling `schedule()` when:
  - Returning to user space (from a syscall or interrupt handler)
  - Returning to the kernel from an interrupt handler, only if `preempt_count` is zero
  - `preempt_count` drops to 0 (E. right after releasing a lock)
  - A task in kernel mode calls `schedule()` itself (E. a blocking syscall)
10. Synchronization
- Safety: mutual exclusion
- Liveness (progress):
  - must allow some thread to proceed;
  - must not depend on threads outside the critical section
- Bounded waiting
- No assumptions about speed or CPU count
1. Disable Interrupts
```c
lock()   { disable_interrupt(); }
unlock() { enable_interrupt(); }
```
Bad:
- Privileged operations, user program can’t use
- Can’t work on multiprocessors (
*_interrupt()only applies for current CPU) - Can’t have long critical sections
2. Software Spin Lock
Peterson’s algorithm Assume:
- Load and store are atomic
- In-order execution
```c
int flag[2] = {0, 0};

lock() {
    flag[self] = 1;
    while (flag[1 - self] == 1)
        ;
}

unlock() {
    flag[self] = 0;
}
```
Not live: can deadlock, race condition
```c
int turn = 0;

lock() {
    while (turn == 1 - self)
        ;
}

unlock() {
    turn = 1 - self;
}
```
Not live: Can’t enter twice in a row
Peterson
```c
int turn = 0;
int flag[2] = {0, 0};

lock() {
    flag[self] = 1;  // I need the lock
    turn = 1 - self; // I give away the turn
    while (flag[1 - self] == 1 && turn == 1 - self)
        ;
}

unlock() {
    flag[self] = 0;
}
```
Spin while:
- the other thread needs the lock, AND
- it’s the other thread’s turn

`turn` acts as a tie-breaker to prevent deadlock when both threads are “polite”
- The last thread to write `turn` yields to the other

[!warning] Bad if reordered
The memory load of `flag[1-self]` will typically be reordered before the stores
3. Hardware Spin Lock
Fence
Ensures all memory operations before the fence complete before those after it
- Expensive
- `mfence`: all memory
- `lfence`: all loads
- `sfence`: all stores

```c
lock() {
    flag[self] = 1;
    turn = 1 - self;
    sfence;
    while (flag[1-self] == 1 && turn == 1 - self)
        ;
}
```
Atomic TS
x86: use `xchgl` to atomically swap two locations
```c
int test_and_set(volatile int *lock) {
    int old;
    asm("xchgl %0, %1"
        : "=r"(old), "+m"(*lock) // output
        : "0"(1)                 // input
        : "memory"               // clobber
    );
    return old;
}
```
- `"=r"(old)`: write to a (general-purpose) register, `old`
  - Holds the old value after the swap
- `"+m"(*lock)`: read and write memory `*lock`
- `"0"(1)`: initializes operand 0 to 1
  - Puts 1 into operand 0 (`old`’s register); it is swapped into `*lock`, and the old value is returned
- `"memory"`: prevents the compiler from reordering memory accesses across the asm
4. Sleep Lock
```c
lock() {
    while (test_and_set(&flag)) {
        add_wait_queue();
        schedule();
    }
    // do critical section
}

unlock() {
    flag = 0;
    if (/* thread in wait queue */)
        // wake up one thread
}
```
Issues
- Lost wakeup
  - After calling TS, before adding to the wait queue, the other thread unlocks
- Wrong thread gets the lock
  - Newer threads may steal the lock
  - Fix: make `unlock()` directly transfer the lock to a waiting thread
Fix by adding a guard lock to avoid losing wakeups
- `prepare_to_yield()` protects the gap between `guard = 0` and `yield()`

```c
typedef struct mutex_t {
    int flag;
    int guard;
    queue_t *q;
} mutex_t;

void lock(mutex_t *m) {
    while (test_and_set(&m->guard)) // begin protected
        ;
    if (m->flag == 0) {  // mutex available
        m->flag = 1;
        m->guard = 0;
    } else {             // unavailable, sleep
        enqueue(m->q, self);
        prepare_to_yield(); // prepare to yield
        m->guard = 0;       // end protected
        yield();
    }
}

void unlock(mutex_t *m) {
    while (test_and_set(&m->guard)) // begin protected
        ;
    if (queue_empty(m->q))
        m->flag = 0;
    else // transfer to the next thread DIRECTLY
        wakeup(dequeue(m->q));
    m->guard = 0; // end protected
}
```
## RW Lock
- **Read lock** (shared): multiple may hold
- **Write lock** (exclusive): only one can hold
```c
struct rwlock_t {
    int nreader;
    lock_t guard;
    lock_t lock;
};

write_lock(rwlock_t *l) {
    lock(&l->lock);
}

write_unlock(rwlock_t *l) {
    unlock(&l->lock);
}

read_lock(rwlock_t *l) {
    lock(&l->guard);
    ++l->nreader;
    if (l->nreader == 1) // first reader
        lock(&l->lock);
    unlock(&l->guard);
}

read_unlock(rwlock_t *l) {
    lock(&l->guard);
    --l->nreader;
    if (l->nreader == 0)
        unlock(&l->lock);
    unlock(&l->guard);
}
```
[!warning] Writers can be starved
New readers keep arriving; need to hold off readers for pending write requests
Anti-Starve
Use a writer lock: while a writer is waiting for the actual lock, no new readers can add themselves
- Existing readers will drain and release `lock`

```c
struct rwlock_t {
    int nreader;
    lock_t guard;
    lock_t lock;
    lock_t writer;
};

write_lock(rwlock_t *l) {
    lock(&l->writer);
    lock(&l->lock);
    unlock(&l->writer);
}

read_lock(rwlock_t *l) {
    lock(&l->writer);
    lock(&l->guard);
    ++l->nreader;
    if (l->nreader == 1) // first reader
        lock(&l->lock);
    unlock(&l->guard);
    unlock(&l->writer);
}
```
## Semaphore
> Now kernel context
```cpp
class Semaphore {
int lockvar; // atomicity
int value;
Queue<Thread*> waitQ;
void Init(int v);
void P(); // down
void V(); // up
};

void Semaphore::Init(int v) {
    value = v;
    waitQ.init();
}
```
If it runs out of resources, the thread adds itself to the wait queue
- Must be atomic
- First line disables interrupts
- Last line enables interrupts
- Or a spin lock, since the critical section is short
```cpp
void Semaphore::P() {
    lock(&lockvar);
    value--;
    if (value < 0) {
        waitQ.add(current_thread);
        current_thread->status = blocked;
        unlock(&lockvar);
        schedule();
    } else
        unlock(&lockvar);
}
```
```cpp
void Semaphore::V() {
    lock(&lockvar);
    value++;
    if (value <= 0) {
        Thread *thd = waitQ.getNextThread();
        scheduler->add(thd);
    }
    unlock(&lockvar);
}
```
Example: Producer-Consumer
- E.g., a Linux pipe
- Block producer when buffer full
- Block consumer if no data
```c
// Ring buffer
Semaphore empty = N;
Semaphore full = 0;
Mutex mutex = 1;
T buffer[N];
int widx = 0, ridx = 0;

Producer(T item) {
    P(&empty);
    P(&mutex);
    buffer[widx] = item;
    widx = (widx+1) % N;
    V(&mutex);
    V(&full);
}
// same for consumer: P(&full) ... V(&empty)
```
- If the buffer read/write moves outside of `mutex`, it won't work
Barrier
Coordinate multiple threads to all reach a specific point before allowed to continue
- Rendezvous: waiting point
Critical point: no thread is allowed to execute it until everyone reaches the rendezvous

```c
rendezvous:
    sem_wait(mutex);
    count++;
    if (count == n)
        sem_post(barrier);
    sem_post(mutex);

    sem_wait(barrier);
    sem_post(barrier); // turnstile: reusable, wakes up the next thread

critical_point:
    sem_wait(mutex);
    count--;
    if (count == 0)
        sem_wait(barrier); // consumes the final post left floating
    sem_post(mutex);
```
Can use interlocking `barrier2` with `barrier` for `critical_points` in a loop
## RCU
Lock-free Synchronization
RW locks are very **slow**
- Even though writes don't happen often
- Doesn't scale to a large number of CPUs
**Read-Copy-Update**
- Multiple readers **and** one writer can run simultaneously
- A reader may see **old**, but consistent data
For reading, grab a **copy** of the shared pointer
For writing, (under a lock) allocate a **new** resource, then publish the pointer to it
```c
struct foo {
int a;
int b;
} *global_foo;
// global_foo initialized elsewhere
DEFINE_SPINLOCK(foo_lock);
void get(int *p, int *q) {
struct foo *copy_foo;
// 1) begin reading (no lock)
rcu_read_lock();
// 2) copy pointer once
copy_foo = rcu_dereference(global_foo);
// 3) access data using copy_foo
*p = copy_foo->a;
*q = copy_foo->b;
// 4) end reading (no unlock)
rcu_read_unlock();
}
```

`rcu_dereference()` grabs the pointer
- Includes a memory barrier

```c
void set(int x, int y) {
struct foo *new_foo = kmalloc(...);
struct foo *old_foo;
// 1) synchronize multiple writers
spin_lock(&foo_lock);
// 2) copy old pointer once
old_foo = rcu_dereference_protected(
global_foo, lockdep_is_held(&foo_lock));
// 3) update data
new_foo->a = old_foo->a + x;
new_foo->b = old_foo->b + y;
// 4) switch pointer
rcu_assign_pointer(global_foo, new_foo);
spin_unlock(&foo_lock);
// 5) wait for old readers to finish
synchronize_rcu();
// 6) free old struct
kfree(old_foo);
}
```

`synchronize_rcu()`: makes the writer wait until every single active reader has finished
- Safe to `kfree()` the old struct after it returns

[!info] In the Linux kernel, the read-side primitives are implemented largely as compiler barriers that keep the critical sections ordered properly
Summary
- Spinlocks: Small, fast, non-sleeping locks used when holding time is short, such as in interrupt handlers.
- Mutexes: Sleeping locks for mutual exclusion, offering better performance for longer critical sections where sleeping is allowed.
- Semaphores: Traditional locks that can allow multiple holders, used for complex synchronization.
- Read/Write Locks (rwlock / rwsem): Allow multiple simultaneous readers but only one exclusive writer.
- Read-Copy-Update (RCU): A mechanism for high-read-frequency scenarios, allowing readers to run without locks.
Contention
A failed lock acquire. It is a function of:
- Frequency of attempts
- Lock hold time
- Number of threads
Monitor
Object with a set of monitor procedures
- Only one active thread at a time
- Java `synchronized`
- 1 mutex + N condition variables in a class object
Midterm Recitation
Concurrency
- Rule: All need to respect the lock
- Risk: deadlock
- `lock(A); lock(A);` (self-deadlock)
- Interleaved locks of `A`, `B`

[!warning] Prioritize HW, lectures, lecture code examples
