16 Linux Container
VM and container
VM: A guest OS kernel runs on top of a host OS (with the help of specialized software, the hypervisor) that controls the actual hardware
- Gives the illusion of a physical machine to the guest OS
- Processes still touch the hardware; translation may be needed
- VMs are not aware of each other's existence
- Pro: Hard separation, very secure
- Con: each VM runs an entire OS kernel, which is resource intensive
Container: A group of processes that has its own view of the underlying hardware and OS
- One OS
- Not complete isolation
- Need to make sure processes in different containers can’t communicate
Linux container
The Linux kernel has no notion of containers, only processes. Containers are an abstraction built in userspace.
- Userspace tools (e.g. Docker or Podman) hide the complexity involved in managing containers. However, once the container is set up, it is just a group of processes to the Linux kernel.
cgroup
Resource control group
- Organizes processes into hierarchical groups, whose resource usage can be monitored and limited
- Cgroups remain until the system is rebooted
- Managed through cgroupfs, a fake (pseudo) file system mounted at /sys/fs/cgroup
# Go to the mounting location of the cgroup fs (file system)
cd /sys/fs/cgroup/
# List the controllers (knobs) that are available
cat cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
# List the controllers that are active.
cat cgroup.subtree_control
cpuset cpu io memory pids
# Create a child cgroup
sudo su
mkdir asp
cd asp
Create a child directory asp, and limit the processes added to it to 20% of one CPU in total.
- You can subdivide asp further and assign a limit to each child cgroup (carved out of the 20% limit)
# By default, CPU usage is unbounded. Let's limit it to 20%.
cat cpu.max
max 100000
echo "20000 100000" > cpu.max # quota / period, both in microseconds
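The two numbers in cpu.max are a quota and a period: the cgroup may use up to quota microseconds of CPU time in every period. As a minimal sketch of that arithmetic, the helpers below convert between a percentage of one CPU and the cpu.max fields. The function names `cpu_max_line` and `cpu_max_percent` are hypothetical, not part of any cgroup tooling, and the default period of 100000 is simply the value shown in the listing above.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they only do arithmetic and
# never touch /sys/fs/cgroup.

# Print the line to write into cpu.max for a given percentage of one CPU.
# $1 = percent, $2 = period in microseconds (assumed default: 100000)
cpu_max_line() {
    percent=$1
    period=${2:-100000}
    echo "$((period * percent / 100)) $period"
}

# Turn the two fields of cpu.max back into a percentage.
# A quota of "max" means the cgroup is not throttled.
cpu_max_percent() {
    quota=$1
    period=$2
    if [ "$quota" = "max" ]; then
        echo "unlimited"
    else
        echo "$((quota * 100 / period))%"
    fi
}

cpu_max_line 20               # prints: 20000 100000
cpu_max_percent max 100000    # prints: unlimited
cpu_max_percent 20000 100000  # prints: 20%
```

A percentage above 100 is meaningful here: a quota larger than the period lets the cgroup use more than one CPU.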
Start a stress process and add it to the asp cgroup
# test CPU usage with stress, which spins *one* CPU computing sqrt()
stress -c 1 & # moved to background, PID = 1000
cat cgroup.procs
# empty for now
echo 1000 > cgroup.procs
# now the stress process is limited to 20% CPU
# spawn another stress
stress -c 1 & # not limited right now, PID = 2000
echo 2000 > cgroup.procs
# now each is running at 10%
Max number of children
cat pids.max
max # no limit for now
echo 6 > pids.max # set limit to 6
Bash’s fork() will fail once the cgroup already holds 6 processes
Namespace
Resource isolation: containers
- Wraps a global resource in a private (virtual) view
- E.g. file system, process list, network resources
The unshare system call (and the unshare command that wraps it) creates new namespaces, disassociating the caller from the ones it previously shared
unshare --fork --user --pid --map-root-user --mount-proc /bin/bash
- --fork: fork off /bin/bash as a child of unshare
- --user: create a new user namespace
- --pid: create a new PID namespace
- --map-root-user: map the current user to root inside the namespace
- --mount-proc: create a new mount namespace and mount procfs in it. The bash PID will be 1.
Now your prompt changes to root@spoc, and you can’t see processes outside the namespace:
ps uf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 9200 5120 pts/68 S 15:36 0:00 /bin/bash
root 8 0.0 0.0 10464 3200 pts/68 R+ 15:36 0:00 ps uf
Outside the namespace:
ps uf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
mg4264 1897751 0.4 0.0 9284 5248 pts/81 Ss 15:37 0:00 -bash
mg4264 1897763 0.0 0.0 10464 3200 pts/81 R+ 15:37 0:00 \_ ps uf
mg4264 1895523 0.0 0.0 9284 5120 pts/68 Ss 15:07 0:00 -bash
mg4264 1897702 0.0 0.0 6192 1792 pts/68 S 15:36 0:00 \_ unshare --fork --user --pid --map-root-user --mount-proc /bin/bash
mg4264 1897703 0.0 0.0 9200 5120 pts/68 S+ 15:36 0:00 \_ /bin/bash
Seen from outside, the same /bin/bash runs as the normal user (mg4264) and has a different PID (1897703).
Capabilities
Further limit what processes created by root can do. In general:
- Privileged process: bypasses all kernel permission checks
- Unprivileged process: subject to full permission checking
Capabilities divide the root permissions into categories (e.g. change file owner, kill other users’ processes)
E.g. sshd hands sensitive work to child processes and drops the capabilities they do not need
- In case something is hijacked, this minimizes the damage
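Each capability is one bit in a mask, and the kernel reports a process's masks as hexadecimal fields such as CapEff in /proc/<pid>/status. The sketch below shows how such a mask could be tested for a single capability. The helper name `has_cap` and the example mask are made up for illustration; the bit numbers used (CAP_CHOWN = 0, CAP_KILL = 5, CAP_SYS_ADMIN = 21) come from the Linux capability header.

```shell
#!/bin/sh
# has_cap MASK BIT: succeed if capability number BIT is set in the hex MASK.
# Hypothetical helper; MASK is given without a leading "0x", as in
# the CapEff line of /proc/<pid>/status.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

CAP_CHOWN=0       # change file owner
CAP_KILL=5        # signal other users' processes
CAP_SYS_ADMIN=21  # broad administrative operations

# Made-up mask that keeps only CAP_CHOWN and CAP_KILL (bits 0 and 5).
example_mask=0000000000000021

has_cap "$example_mask" "$CAP_CHOWN" && echo "CAP_CHOWN present"
has_cap "$example_mask" "$CAP_KILL" && echo "CAP_KILL present"
has_cap "$example_mask" "$CAP_SYS_ADMIN" || echo "CAP_SYS_ADMIN dropped"
```

To inspect a real process, the mask can be taken from `grep CapEff /proc/self/status` and passed as the first argument.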
seccomp()
System call that operates on the Secure Computing (seccomp) state of the calling process
- Many syscalls were developed before namespaces existed and were never updated; calling them from a container may create vulnerabilities
- Limit the system calls a process can make to the ones it actually needs
- E.g. with SECCOMP_SET_MODE_STRICT, the process can only call read(), write(), _exit(), and sigreturn(), not even open()
- If the program is hacked, the attacker can’t do much
Lab 6 Zookeeper
Linux container
Part 1
Create a process using clone()
./sol-part1 /bin/bash
# execute clone to create /bin/bash
# will see /bin/bash created
- Allocate the child stack with mmap()
- Create a child process with clone()
- Signal the child to run through the pipe, then close the pipe
- waitpid() for the child to terminate
- If there’s a signal, forward it to the parent
Part 2
Create namespaces
clone() has a bunch of flags to create isolated namespaces.
./sol-part2 /bin/bash # changed to root user root@zoo
2.2
A bit of programming: installing the subuid and subgid maps
cat /etc/subuid
# gives a UID range allocated
Each user is allocated 65536 subordinate user IDs, starting from some number. In the root@zoo namespace, I can create a user, but it needs to be mapped to a real user on the host.
- getuid() and getgid() to get the current user and group ID
- newuidmap and newgidmap to install the mappings (they write to /proc/<container-pid>/uid_map and gid_map)
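Each line of uid_map has three fields: the first ID inside the namespace, the first ID outside it, and the length of the range. The sketch below translates a UID seen inside the container into the UID the host sees. The function name `map_uid` is hypothetical, and the example map (host UID 1000, subordinate range starting at 100000) uses made-up values rather than anything read from the lab machine.

```shell
#!/bin/sh
# map_uid INSIDE_UID: read uid_map-style lines ("inside outside count")
# on stdin and print the matching outside (host) UID.
# Returns 1 and prints nothing if the UID is not covered by any line.
map_uid() {
    target=$1
    while read -r inside outside count; do
        if [ "$target" -ge "$inside" ] && [ "$target" -lt $((inside + count)) ]; then
            echo $((outside + target - inside))
            return 0
        fi
    done
    return 1
}

# Assumed example map: root inside is the calling user (UID 1000) outside,
# and UIDs 1..65536 inside come from a subordinate range starting at 100000.
uid_map='0 1000 1
1 100000 65536'

printf '%s\n' "$uid_map" | map_uid 0    # prints: 1000
printf '%s\n' "$uid_map" | map_uid 33   # prints: 100032
```

Feeding the function the contents of a real /proc/<container-pid>/uid_map instead of the example string works the same way, since the file uses this three-column layout.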
Part 3
Drop dangerous capabilities from the container process
./sol-part3 /bin/bash
cd # have view of the entire file system
Part 4
Virtualize the file system
In /opt/asp, there is a tar file
cd /opt/asp
tar tf zoo-fs.tar
# shows entire file content (Linux distribution)
cd
mkdir zoo-fs
cd zoo-fs
(umask 0; tar xf /opt/asp/zoo-fs.tar)
# extract it with the umask temporarily cleared, so that files/directories retain their intended permissions
# () runs in a subshell, so it doesn't affect my current umask
sol-part4 -r ./zoo-fs
# -r: mount ./zoo-fs as root
Call mount() to set up the mount namespace
Now you have an isolated bash shell. You can run ls, but we have not installed cowsay yet
Part 5
cgroup
part5 -c 20 -r ./zoo-fs -P /bin/bash
# limits CPU usage to 20%
stress is not yet installed inside the container. Go outside and copy it in
which stress
cp /usr/bin/stress .
# now you have stress in the namespace
stress -c 1
Outside, run htop: stress is running, and its CPU usage is restricted
- Our limit is 50% per user
It is hard to make cgroups as a non-root user. Ask systemd (the init process) to provision a node in the cgroup directory hierarchy, so you can put stuff under it
Part 6
Add networking to the container
A container started by an unprivileged user does not have access to network devices. Podman uses slirp, which implements the entire TCP stack as a user program (instead of in the kernel)
ping mit.edu
apt install cowsay
apt install stress # (properly)