16 Linux Container

7 minute read

Published:

VM and container

VM: Guest OS kernel run on top of a host OS (with the help of specialized software) controlling the actual hardware

  • Give the illusion of physical machine to the guest OS
  • Processes still touch the hardware, translation may be needed.
  • VMs do not know the existence of each other
  • Pro: Hard separation, very secure
  • Con: each VM runs an entire OS kernel: resource intensive Container: A group of processes that has its own view of the underlying hardware and OS
  • One OS
  • Not complete isolation
  • Need to make sure processes in different containers can’t communicate

Linux container

Linux kernel has no notion of containers, only processes. Abstraction on user side

  • Userspace (E. Docker or Podman) hide the complexity involved with managing containers. However, once the container is set up, it is just a group of processes to the Linux kernel.

cgroup

Resource control group

  • Organize processes into hierarchal groups, whose resource usage can be monitored and limited
  • Remain until system rebooted
  • Managed thorough cgroupfs, a fake file system mounted as /sys/fs/cgroup
# Go to the mounting location of the cgroup fs (file system)
cd /sys/fs/cgroup/

# List the controllers (knobs) that are available
cat cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc

# List the controllers that are active.
cat cgroup.subtree_control 
cpuset cpu io memory pids

# Create a child cgroup
sudo su
mkdir asp
cd asp

Create child directory, specify that processes added to the asp directory’s CPU limit is 20%.

  • You can subdivide asp, and assign a limit to each (subdivided from the 20% limit)
# # By default, CPU usage is unbounded. Let's limit it to 20%.
cat cpu.max
max 100000
echo "20000 100000" > cpu.max    # numerator / denominator

Test a stress process and add it to cgroup asp

# test CPU usage with stress, spines *one* CPU with sqrt
stress -c 1 &       # moved to background, PID = 1000

cat croup.procs     
                    # none now
echo 1000 > cgroup.procs
# now stress process limited to 20% CPU

# spawn another stress
streess -c 1 &      # not limited rn, PID = 2000
echo 2000 > cgroup.procs
# now each running at 10%

Max number of children

cat pids.max
max          # no limit for now

echo 6 > pids.max  # set limit to 6

Bash’s fork() will fail if you are running more than 6 processes

Namespace

Resource isolation: containers

  • Wraps a global resource in a private (virtual) view
  • E. file system, process list, network resource System call unshare to create a new namespace (disassociate from previous shared ones)
    unshare --fork --user --pid --map-root-user --mount-proc /bin/bash
    
  • --fork: fork off /bin/bash
  • --user: create a new user namespace
  • --pid: Create new PID namespae
  • --map-root-user: map the current user to root
  • -mount-proc: create a new mount namespace and mount procfs in it. The bash PID will be 1.

Now your username changes to root@spoc. Can’t see outside processes

ps uf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   9200  5120 pts/68   S    15:36   0:00 /bin/bash
root           8  0.0  0.0  10464  3200 pts/68   R+   15:36   0:00 ps uf

Outside the namespace:

ps uf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
mg4264   1897751  0.4  0.0   9284  5248 pts/81   Ss   15:37   0:00 -bash
mg4264   1897763  0.0  0.0  10464  3200 pts/81   R+   15:37   0:00  \_ ps uf
mg4264   1895523  0.0  0.0   9284  5120 pts/68   Ss   15:07   0:00 -bash
mg4264   1897702  0.0  0.0   6192  1792 pts/68   S    15:36   0:00  \_ unshare --fork --user --pid --map-root-user --mount-proc /bin/bash
mg4264   1897703  0.0  0.0   9200  5120 pts/68   S+   15:36   0:00      \_ /bin/bash

normal user and different PID

Capabilities

Further limit what processes created by root can do, in general

  • Privileged process: bypass all kernel permission check
  • Unprivileged process Divide the root permissions into categories: capabilities (E. change owner, kill other users’ procs) E. sshd divides sensitive info to child processes, and drop capabilities
  • In case something is hijacked, minimize damage #-## seccomp() System call that operates on the Secure Computing (seccomp) state of the calling process
  • A lot of the syscalls are developed without namespaces. Not updated. Calling them may create vulnerabilities
  • Limit the system calls you can call if you don’t need to
  • E. if set SECCOMP_SET_MODE_STRICT: process can only call read(), write(), not even open()
    • If the program hacked, hacker can’t do much

Lab 6 Zookeeper

Linux container

Part 1

Create a process using clone()

./sol-part1 /bin/bash
# execute clone to create /bin/bash
# will see /bin/bash created
  1. Allocate child stack with mmap()
  2. Create a child process with clone()
  3. Signal child to run through the pipe. Then close the pipe
  4. waitpid() for child to terminate
    • If there’s a signal, forward to parent

Part 2

Create namespaces

  • clone() has a bunch of flags to create isolated namespaces
    ./sol-part2 /bin/bash
    # changed to root user root@zoo
    

2.2

A bit of programming installing subuid and subgid maps

cat /etc/subuid
# gives a UID range allocated

For each user, allocate 65536 user IDs, starting from a number In root@zoo namespace, I can create a user, but it needs to be mapped to a real user

  1. getuid() and getgid() to get current user and group ID
  2. newuidmap and newgidmap to install the mappings (writes to /proc/<container-pid>/uid_map)

Part 3

Drop dangerous capabilities using capabilities

./sol-part3 /bin/bash
cd   # have view of the entire file system

Part 4

Virtualize file system In /opt/asp, there is a tar file

cd /opt/asp
tar tf zoo-fs.tar
# shows entire file content (Linux distribution)

cd
mkdir zoo-fs
cd zoo-fs
(umask 0 tar xf /opt/asp/zoo-fs.tar)
# temporarily clear umask so that files/directories retain their intended permissions
# () runs in a subshell, so it doesn't affect my current umask
    # extract it

sol-part4 -r ./zoo-fs
# -r: mount ./zoo-fs as root

Call mount() to set up mount namespace

Now you would have an isolated bash shell. You can run ls, but we have not installed cowchat

Part 5

cgroup

part5 -c 20 -r ./zoo-fs -P /bin/bash
# limits CPU usage to 20%

Stress is not yet installed. Go outside and stress it

which stress
cp /usr/bin/stress .
# now you have stress in the namespace
stress -c 1

Outside run htop: see stress is running, and CPU is restricted

  • Our limit is 50% per user

Hard to make cgroups for a non-root user. Ask systemd (init process) to provision a node in the cgroup directory hierarchy, so you can put stuff under it

Part 6

Add networking to container Container does not have access to network devices. If you start as a user container. Podman uses slirp that implements the entire TCP stack as a user program (instead of in the kernel)

ping mit.edu
apt install cowsay
apt install stress # (properly)