16 Linux Container

Published: July 25, 2025

VM and container

VM: a guest OS kernel runs on top of a host OS (with the help of specialized software) that controls the actual hardware

Gives the guest OS the illusion of a physical machine
Processes still touch the hardware, though translation may be needed
VMs do not know of each other’s existence
Pro: hard separation, very secure
Con: each VM runs an entire OS kernel, which is resource intensive

Container: a group of processes that has its own view of the underlying hardware and OS

One OS kernel, shared by all containers
Not complete isolation
Need to make sure processes in different containers can’t communicate

Linux container

The Linux kernel has no notion of containers, only processes. The container abstraction lives in userspace.

Userspace tools (e.g. Docker or Podman) hide the complexity involved in managing containers. However, once the container is set up, it is just a group of processes to the Linux kernel.

`cgroup`

Resource control group

Organize processes into hierarchical groups, whose resource usage can be monitored and limited
Remain until removed or the system is rebooted
Managed through cgroupfs, a pseudo file system mounted at /sys/fs/cgroup

# Go to the mounting location of the cgroup fs (file system)
cd /sys/fs/cgroup/

# List the controllers (knobs) that are available
cat cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc

# List the controllers that are enabled for child cgroups.
cat cgroup.subtree_control
cpuset cpu io memory pids

# Create a child cgroup
sudo su
mkdir asp
cd asp

Creating a child directory creates a child cgroup. Next, cap the CPU usage of all processes added to asp at 20%.

You can subdivide asp further and assign a limit to each child (carved out of asp’s 20%)

# By default, CPU usage is unbounded. Let's limit it to 20%.
cat cpu.max
max 100000
echo "20000 100000" > cpu.max    # quota / period, in microseconds
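The two numbers in cpu.max are a quota and a period, both in microseconds: the cgroup may use at most quota microseconds of CPU time in every period. A minimal sketch of turning that pair into a percentage of one CPU follows; the function name cpu_max_percent is made up for illustration and is not part of any cgroup tooling.

```shell
# Hypothetical helper: convert the contents of cpu.max into a percentage of one CPU.
# $1 is the file content, e.g. "20000 100000" or "max 100000".
cpu_max_percent() {
    quota=${1%% *}     # field before the space: quota in microseconds, or "max"
    period=${1##* }    # field after the space: period in microseconds
    if [ "$quota" = "max" ]; then
        echo "unlimited"
    else
        echo $(( $quota * 100 / $period ))
    fi
}

cpu_max_percent "max 100000"      # prints: unlimited
cpu_max_percent "20000 100000"    # prints: 20
cpu_max_percent "200000 100000"   # prints: 200 (two full CPUs)
```

Inside the asp directory, `cpu_max_percent "$(cat cpu.max)"` would report the limit currently in effect.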

Start a stress process and add it to the asp cgroup

# Test CPU usage with stress, which spins *one* CPU computing sqrt()
stress -c 1 &       # runs in the background, say with PID 1000

cat cgroup.procs
                    # empty for now
echo 1000 > cgroup.procs
# now the stress process is limited to 20% CPU

# spawn another stress
stress -c 1 &       # not limited yet, say with PID 2000
echo 2000 > cgroup.procs
# now each runs at about 10%

Max number of processes

cat pids.max
max          # no limit for now

echo 6 > pids.max  # set limit to 6

Once the cgroup holds 6 processes, any further fork() (e.g. from bash) fails

Namespace

Resource isolation: containers

Wraps a global resource in a private (virtual) view
E.g. file system, process list, network resources
The unshare system call creates a new namespace (disassociating the caller from the ones it previously shared)
```
unshare --fork --user --pid --map-root-user --mount-proc /bin/bash
```
--fork: fork off /bin/bash as a child of unshare
--user: create a new user namespace
--pid: create a new PID namespace
--map-root-user: map the current user to root inside the namespace
--mount-proc: create a new mount namespace and mount procfs in it. The bash PID will be 1.

The prompt now shows root@spoc, and processes outside the namespace are no longer visible:

ps uf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   9200  5120 pts/68   S    15:36   0:00 /bin/bash
root           8  0.0  0.0  10464  3200 pts/68   R+   15:36   0:00 ps uf

Outside the namespace:

ps uf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
mg4264   1897751  0.4  0.0   9284  5248 pts/81   Ss   15:37   0:00 -bash
mg4264   1897763  0.0  0.0  10464  3200 pts/81   R+   15:37   0:00  \_ ps uf
mg4264   1895523  0.0  0.0   9284  5120 pts/68   Ss   15:07   0:00 -bash
mg4264   1897702  0.0  0.0   6192  1792 pts/68   S    15:36   0:00  \_ unshare --fork --user --pid --map-root-user --mount-proc /bin/bash
mg4264   1897703  0.0  0.0   9200  5120 pts/68   S+   15:36   0:00      \_ /bin/bash

Seen from outside, the same bash runs as the normal user (mg4264) and has a different PID (1897703 instead of 1).
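One way to confirm that two processes live in different namespaces is to compare the symlinks under /proc/<pid>/ns, which the kernel exposes for every process: two processes share a namespace exactly when the link targets match. A minimal sketch follows; the helper names ns_id and same_ns are made up for illustration.

```shell
# Hypothetical helpers built on the /proc/<pid>/ns symlinks.

# Print the namespace identifier of a process, e.g. "pid:[4026531836]".
# $1 = PID (or "self"), $2 = namespace type (pid, user, mnt, net, ...)
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# Succeed if two processes share the namespace of the given type.
# Returns 2 if either link cannot be read.
# $1, $2 = PIDs, $3 = namespace type
same_ns() {
    a=$(ns_id "$1" "$3") || return 2
    b=$(ns_id "$2" "$3") || return 2
    [ "$a" = "$b" ]
}

# A shell is trivially in the same PID namespace as itself.
if same_ns $$ $$ pid; then
    echo "same PID namespace"
fi
```

For the session above, running `same_ns 1897751 1897703 pid` from the outside shell would be expected to fail, since 1897703 is the bash started by unshare. Reading another process's ns links may require matching credentials.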

Capabilities

Further limit what processes created by root can do

Privileged process: bypasses all kernel permission checks
Unprivileged process: subject to full permission checking
Capabilities divide the root permissions into categories (e.g. change file owner, kill other users’ processes)
E.g. sshd splits the handling of sensitive information across child processes, which drop capabilities
- In case something is hijacked, this minimizes the damage

seccomp()

System call that operates on the Secure Computing (seccomp) state of the calling process

Many syscalls were developed before namespaces existed and were never updated, so calling them may create vulnerabilities
Limit the system calls a process can make to the ones it needs
E.g. with SECCOMP_SET_MODE_STRICT, the process can only call read(), write(), _exit() and sigreturn(), not even open()
- If the program is hacked, the attacker can’t do much
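Both mechanisms leave a visible trace in /proc/<pid>/status: the CapEff line holds the effective capability set as a hexadecimal bitmask, and the Seccomp line holds the seccomp mode (0 disabled, 1 strict, 2 filter). A minimal sketch for decoding them follows; the helper names and the example mask are made up for illustration, and the capability numbers in the comments come from linux/capability.h.

```shell
# Hypothetical helpers for inspecting capability and seccomp state.

# Succeed if bit $2 is set in the hexadecimal capability mask $1.
# Capability numbers: CAP_CHOWN=0, CAP_KILL=5, CAP_SYS_ADMIN=21.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Translate the numeric Seccomp field into a readable name.
seccomp_mode_name() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "strict" ;;
        2) echo "filter" ;;
        *) echo "unknown" ;;
    esac
}

# Example mask with CAP_CHOWN and CAP_KILL set but CAP_SYS_ADMIN cleared.
mask=00000000a80425fb
has_cap "$mask" 0  && echo "CAP_CHOWN present"
has_cap "$mask" 21 || echo "CAP_SYS_ADMIN dropped"
seccomp_mode_name 1    # prints: strict
```

For a live process, the mask can be read with `grep CapEff /proc/self/status` and the mode with `grep '^Seccomp:' /proc/self/status`.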

Lab 6 Zookeeper

Linux container

Part 1

Create a process using clone()

./sol-part1 /bin/bash
# execute clone to create /bin/bash
# will see /bin/bash created

Allocate child stack with mmap()
Create a child process with clone()
Signal the child through the pipe to start running, then close the pipe
waitpid() for child to terminate
- If there’s a signal, forward to parent

Part 2

Create namespaces

clone() has a bunch of flags to create isolated namespaces
```
./sol-part2 /bin/bash
# changed to root user root@zoo
```

2.2

A bit of programming to install the subuid and subgid maps

cat /etc/subuid
# gives a UID range allocated

Each user is allocated 65536 user IDs, starting from some number

In the root@zoo namespace, I can create a user, but it needs to be mapped to a real user

getuid() and getgid() to get the current user and group ID
newuidmap and newgidmap to install the mappings (they write to /proc/<container-pid>/uid_map and gid_map)
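Each line of /etc/subuid has the form name:start:count. The mapping step can be sketched as follows: look up the caller's range, then build a newuidmap command that maps container UID 0 to the caller's own UID and container UIDs from 1 upward to the subordinate range. The function names, the optional file argument, the user alice, and the PID and UID values are all made up for illustration, and the command is printed rather than executed.

```shell
# Hypothetical helper: print "start count" for a user from a subuid-style file.
# $1 = user name, $2 = file to read (defaults to /etc/subuid)
subid_range() {
    awk -F: -v user="$1" '$1 == user { print $2, $3; exit }' "${2:-/etc/subuid}"
}

# Hypothetical helper: print the newuidmap command for a container process.
# $1 = container PID, $2 = caller's UID, $3 = range start, $4 = range count
uid_map_command() {
    # Container UID 0 maps to the caller; container UIDs 1.. map to the range.
    echo "newuidmap $1 0 $2 1 1 $3 $4"
}

# Example with a made-up entry instead of the real /etc/subuid.
demo=$(mktemp)
echo "alice:100000:65536" > "$demo"
range=$(subid_range alice "$demo")                   # "100000 65536"
uid_map_command 4242 1000 "${range% *}" "${range#* }"
# prints: newuidmap 4242 0 1000 1 1 100000 65536
rm -f "$demo"
```

The same shape applies to /etc/subgid and newgidmap, with group IDs in place of user IDs.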

Part 3

Drop dangerous capabilities

./sol-part3 /bin/bash
cd   # still has a view of the entire file system

Part 4

Virtualize the file system

In /opt/asp, there is a tar file

cd /opt/asp
tar tf zoo-fs.tar
# lists the archive contents (the file tree of a Linux distribution)

cd
mkdir zoo-fs
cd zoo-fs
(umask 0 && tar xf /opt/asp/zoo-fs.tar)
# extract the archive with the umask temporarily cleared, so that
# files/directories retain their intended permissions
# () runs in a subshell, so it doesn't affect my current umask
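The effect of the parentheses can be checked in isolation. A small sketch, using a throwaway directory from mktemp rather than the lab's archive: the umask change applies to whatever is created inside the subshell and is gone once the subshell exits.

```shell
demo_dir=$(mktemp -d)       # scratch directory, unrelated to zoo-fs
umask 022                   # a typical default, set explicitly for the demo

(
    umask 0                     # only affects this subshell
    mkdir "$demo_dir/inside"    # created with mode 777
)
mkdir "$demo_dir/outside"   # created with mode 755, the outer umask still applies

stat -c '%a %n' "$demo_dir/inside" "$demo_dir/outside"
umask                       # still reports 0022
```

Note that `stat -c` is the GNU coreutils form; other systems may spell the format option differently.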

sol-part4 -r ./zoo-fs
# -r: use ./zoo-fs as the root file system

Call mount() to set up mount namespace

You now have an isolated bash shell. You can run ls, but cowsay is not installed yet

Part 5

cgroup

part5 -c 20 -r ./zoo-fs -P /bin/bash
# limits CPU usage to 20%

stress is not yet installed inside the container. Go outside and copy it in

which stress
cp /usr/bin/stress .
# now you have stress in the namespace

stress -c 1

Outside, run htop: stress is running, and its CPU usage is restricted

Our limit is 50% per user

It is hard for a non-root user to create cgroups. Ask systemd (the init process) to provision a node in the cgroup directory hierarchy, so you can create your own cgroups under it

Part 6

Add networking to the container

A container started by an unprivileged user does not have access to network devices. Podman uses slirp, which implements the entire TCP/IP stack as a user program (instead of in the kernel)

ping mit.edu
apt install cowsay
apt install stress # (properly)

Ming Gong
