To isolate the processes running inside a container from its host system, a container engine uses the following four Linux features:

  • Namespaces
  • Control Groups
  • Secure Computing
  • Security-Enhanced Linux

Namespaces

Namespaces limit how much of its host’s resources a container can see and reach. This improves security and also restricts the resources available to the container.

The Linux command lsns lists the namespaces on a system along with their details.

The namespaces essential for containers are User, Mount, Unix Timesharing System, Process ID, Network, and Inter-Process Communication.
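
As a rough illustration of how these namespaces show up on a running system, the sketch below reads the symbolic links under /proc/<PID>/ns, where the kernel exposes one link per namespace in the form type:[inode]. Two processes that share a namespace have the same inode number for it. The helper names parse_ns_link and namespaces_of are made up for this example, and the sketch assumes a Linux host with Python 3.

```python
import os


def parse_ns_link(target):
    """Split a link target such as 'pid:[4026531836]' into (type, inode)."""
    ns_type, _, rest = target.partition(":")
    return ns_type, int(rest.strip("[]"))


def namespaces_of(pid="self"):
    """Return {link name: inode number} for a process, read from /proc/<pid>/ns."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


if __name__ == "__main__":
    # /proc/self/ns only exists on Linux, so skip the live demo elsewhere.
    if os.path.isdir("/proc/self/ns"):
        for name, inode in namespaces_of().items():
            print(f"{name:<20} {inode}")
```

Calling namespaces_of() with the PID of a container process and comparing the result with namespaces_of("self") on the host shows which namespaces differ. Reading the links of another user's process may require root.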

User

The users and groups created inside a container are separate from those on its host. Processes running inside the container as the root user can be mapped to a non-root user on the host.

Comparing the output of the id command on the host and inside a container shows that the container’s processes run with a different set of users and groups than other processes on your host.

Running id on host:

$ id
uid=1000(username) gid=1000(username) groups=1000(username),10(wheel) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

Running id inside a container created from the docker.io/library/httpd:latest container image:

root@14ed72afd62e:/usr/local/apache2# id
uid=0(root) gid=0(root) groups=0(root)
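
How a root user inside the container can be a non-root user on the host is described by the ID map of the user namespace, which the kernel exposes in /proc/<PID>/uid_map (and gid_map for groups). Each line holds three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range. A minimal sketch of that lookup in Python follows. The function names and the sample map are hypothetical; the map is chosen to resemble a rootless container started by the user with UID 1000 from the id output above.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside_start, outside_start, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            entries.append(tuple(int(field) for field in fields))
    return entries


def to_outside_id(entries, inside_id):
    """Translate an ID inside the namespace to the ID outside it (None if unmapped)."""
    for inside_start, outside_start, length in entries:
        if inside_start <= inside_id < inside_start + length:
            return outside_start + (inside_id - inside_start)
    return None


# Hypothetical map: container root is host UID 1000; all other container
# UIDs come from an assumed subordinate range starting at 100000.
SAMPLE_UID_MAP = """\
         0       1000          1
         1     100000      65536
"""

if __name__ == "__main__":
    entries = parse_id_map(SAMPLE_UID_MAP)
    for uid in (0, 1, 33):
        print(f"container uid {uid} -> host uid {to_outside_id(entries, uid)}")
```

With this sample map, root in the container is the ordinary user 1000 on the host, so a process that escapes the container does not gain root privileges there.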

Process ID (PID)

Each process running on the host has a unique Process ID (PID) assigned to it. The PIDs of processes running inside a container are separate from the PIDs assigned by the host. Due to this process ID isolation, a container can’t access the details of processes running on its host.

To fetch the list of PID namespaces you can use the command:

$ lsns -t pid
        NS TYPE NPROCS   PID USER  COMMAND
4026531836 pid     128  4996 user /usr/lib/systemd/systemd --user
4026533251 pid      16  4525 user nginx: worker process
...

Just like any other process, the main process of a container also has a PID assigned to it by the host. You can fetch the PID of a running container using the following command:

$ docker inspect -f '{{.State.Pid}}' deploy-hugo-server-1
7172

The ps aux command lists the running processes on the system along with their details. Filtering its output for the container’s PID shows the container’s main process as seen from the host.

$ ps aux | grep 7172
root        7172  0.0  0.4 759596 68860 ?        Ssl  Jan22   0:32 /usr/lib/hugo/hugo server --buildFuture --bind=0.0.0.0
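
The same process has one PID on the host and another inside the container’s PID namespace. On kernels that provide it (Linux 4.1 and later), the NSpid line in /proc/<PID>/status lists both, from the outermost namespace to the innermost. The sketch below parses that line. The function name and the sample status text are made up for illustration; only the host PID 7172 is taken from the output above, and the inner PID of 1 is an assumption.

```python
def pids_by_namespace(status_text):
    """Return the PIDs on the NSpid line of /proc/<PID>/status, outermost first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # older kernels do not report NSpid


# Invented excerpt of /proc/7172/status as it might look when read on the host.
SAMPLE_STATUS = "Name:\thugo\nPid:\t7172\nNSpid:\t7172\t1\n"

if __name__ == "__main__":
    pids = pids_by_namespace(SAMPLE_STATUS)
    print(f"PID on the host: {pids[0]}, PID inside the container: {pids[-1]}")
```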

Mount (mnt)

Processes in different mount namespaces see different sets of mounted filesystems. By using separate mount namespaces for different processes we can ensure that they won’t be able to access each other’s files.

You can use the df command to view the filesystems mounted on your system.

$ df
Filesystem     1K-blocks      Used Available Use% Mounted on
devtmpfs            4096         0      4096   0% /dev
tmpfs            7849324     42384   7806940   1% /dev/shm
tmpfs            3139732      2444   3137288   1% /run
...

A container has its own filesystem hierarchy, which you can view by running the df command in its shell.

root@14ed72afd62e:/usr/local/apache2# df
Filesystem     1K-blocks      Used Available Use% Mounted on
overlay        498443264 308159584 185270656  63% /
tmpfs              65536         0     65536   0% /dev
tmpfs            1569860       276   1569584   1% /etc/hosts
...

The host can also view it in the file /proc/<CONTAINER_PID>/mounts.

$ docker inspect -f '{{.State.Pid}}' deploy-hugo-server-1
7172

$ cat /proc/7172/mounts
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev tmpfs rw,seclabel,nosuid,size=65536k,mode=755,inode64 0 0
devpts /dev/pts devpts rw,seclabel,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666 0 0
...
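
Each line of /proc/<PID>/mounts has six space-separated fields: the device, the mount point, the filesystem type, the mount options, and two legacy numeric fields. As a small sketch, the Python below splits such lines into named fields so that, for example, all mounts flagged noexec can be picked out. The names MountEntry and parse_mounts are illustrative, and the sample text is copied from the output above.

```python
from typing import NamedTuple


class MountEntry(NamedTuple):
    device: str
    mount_point: str
    fs_type: str
    options: tuple


def parse_mounts(text):
    """Parse the text of /proc/<PID>/mounts into a list of MountEntry records."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        device, mount_point, fs_type, options = fields[:4]
        entries.append(MountEntry(device, mount_point, fs_type, tuple(options.split(","))))
    return entries


SAMPLE_MOUNTS = """\
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev tmpfs rw,seclabel,nosuid,size=65536k,mode=755,inode64 0 0
devpts /dev/pts devpts rw,seclabel,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666 0 0
"""

if __name__ == "__main__":
    for entry in parse_mounts(SAMPLE_MOUNTS):
        if "noexec" in entry.options:
            print(entry.mount_point, entry.fs_type)
```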

Unix Timesharing System (UTS)

The Unix Timesharing System (UTS) namespace allows a container to have its own hostname, separate from that of its host. We can verify this with the hostname command.

On the host:

[username@host-1 ~]$ hostname
host-1

Inside a container:

root@14ed72afd62e:/usr/local/apache2# hostname
14ed72afd62e

Network

Each container has its own network namespace, which gives it its own IP address and set of network ports. This allows the developer to run multiple processes inside the container and expose them over different network ports.

To access or communicate with a process inside the container from outside, port forwarding must be set up from the host.
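
Because every network namespace has its own sockets, the ports a container listens on can be inspected from the host through /proc/<CONTAINER_PID>/net/tcp, which reflects the network namespace of that process. The sketch below extracts the listening IPv4 TCP ports from text in that format. It is only an approximation: the sample rows are invented, the function name is hypothetical, and IPv6 sockets (listed in net/tcp6) are ignored.

```python
LISTEN_STATE = "0A"  # TCP_LISTEN as encoded in the "st" column


def listening_ports(proc_net_tcp_text):
    """Return the sorted TCP ports in LISTEN state from /proc/<PID>/net/tcp text."""
    ports = set()
    for line in proc_net_tcp_text.splitlines()[1:]:  # the first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        local_address, state = fields[1], fields[3]
        if state == LISTEN_STATE:
            _, hex_port = local_address.split(":")
            ports.add(int(hex_port, 16))  # the port is stored as hexadecimal
    return sorted(ports)


# Invented sample: one listening socket on port 0x0050 and one established connection.
SAMPLE_NET_TCP = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:0050 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 31337 1 0000000000000000 100 0 0 10 0
   1: 0200A8C0:0050 0100A8C0:D2F0 01 00000000:00000000 00:00000000 00000000     0        0 31400 1 0000000000000000 20 4 30 10 -1
"""

if __name__ == "__main__":
    print(listening_ports(SAMPLE_NET_TCP))
```

A port found this way is only reachable inside the container's network namespace until the host forwards a port to it.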

Inter-Process Communication (IPC)

Processes in the same IPC namespace can share resources such as shared memory, semaphores, and message queues. Keeping separate IPC namespaces ensures that the processes inside a container cannot access the IPC resources used by the host’s processes.

Time

Time namespaces have been available since the release of Linux kernel 5.6. They currently cover only the monotonic and boot-time clocks, so maybe in the future containers will be able to have a different time than their host.

Control Groups (Cgroups)

A control group (Cgroup) is created to allocate the host’s resources effectively among its processes. Cgroups are hierarchical, i.e. a child Cgroup can be created under a parent and inherits certain attributes from it.

The processes placed in a Cgroup can be prioritized, paused, resumed, or removed as a group, within the resources allocated to that Cgroup. Cgroups also help in monitoring the resources used by particular processes.

If you are using an OS with the systemd init system (to verify this you can use the command ps -p 1 -o comm=) then you can use the command systemctl list-units to list the units that systemd manages. systemd places its service, scope, and slice units into Cgroups. The command opens a table containing each unit’s name, state, and description. Unit names take the form <name>.<type>, like sys-devices-platform-serial8250-tty-ttyS0.device.

To view the hierarchy of Cgroups you can use the command systemd-cgls. It presents cgroups as a tree structure.

Control group /:
-.slice
├─user.slice (#1309)
│ └─user-1000.slice (#10010)
│   ├─user@1000.service (#10106)
│   │ ├─session.slice (#10394)
│   │ │ ├─dbus-broker.service (#10442)
│   │ │ │ ├─ 5088 /usr/bin/dbus-broker-launch --scope user
...
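
To find which Cgroup a particular process (for example a container’s main process) belongs to, you can read /proc/<PID>/cgroup. Each line has the form hierarchy-ID:controller-list:cgroup-path, and on a system using only cgroup v2 there is a single line starting with 0::. The following Python sketch parses that format. The function name and both sample paths are made up for illustration.

```python
def parse_cgroup_file(text):
    """Return (hierarchy_id, controllers, path) tuples from /proc/<PID>/cgroup text."""
    entries = []
    for line in text.splitlines():
        parts = line.split(":", 2)  # the path itself may contain colons
        if len(parts) != 3:
            continue
        hierarchy_id, controllers, path = parts
        controller_list = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), controller_list, path))
    return entries


# Hypothetical examples of a cgroup v2 line and a cgroup v1 line.
SAMPLE_V2 = "0::/user.slice/user-1000.slice/session-2.scope\n"
SAMPLE_V1 = "4:cpu,cpuacct:/docker/abc\n"

if __name__ == "__main__":
    for hierarchy_id, controllers, path in parse_cgroup_file(SAMPLE_V2 + SAMPLE_V1):
        print(hierarchy_id, controllers, path)
```

On cgroup v2 the reported path is relative to the cgroup filesystem mount point, which is usually /sys/fs/cgroup.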

Secure Computing (Seccomp)

Using Secure Computing (or seccomp) you can restrict which system calls your process can make to the host’s kernel.

A seccomp profile is a file that defines which system calls are allowed and which are restricted. The default seccomp profile used by Docker is default.json.

Docker allows you to define your own seccomp profile for a container in JSON format and pass it with the --security-opt option.

$ docker run --rm -it --security-opt seccomp=/path/to/seccomp/profile.json hello-world
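
As a hedged illustration of what such a profile can look like, the Python sketch below builds a small JSON profile that allows every system call by default and makes a chosen few fail with an error. The field names defaultAction, syscalls, names, and action and the SCMP_ACT_* values follow the format of Docker’s default profile. The function name and the list of denied calls are arbitrary choices for this example, and a real profile would more likely deny by default and list what is allowed.

```python
import json


def build_deny_profile(denied_syscalls):
    """Build a profile dict that allows everything except the given system calls."""
    return {
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            # Calls listed here return an error instead of being executed.
            {"names": sorted(denied_syscalls), "action": "SCMP_ACT_ERRNO"}
        ],
    }


if __name__ == "__main__":
    # "chmod" and "mkdir" are example choices only.
    profile = build_deny_profile(["mkdir", "chmod"])
    print(json.dumps(profile, indent=2))
```

Saving the printed JSON to a file gives a profile that can be passed to docker run through the seccomp security option.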

Security-Enhanced Linux (SELinux)

SELinux is a security architecture for GNU/Linux-based operating systems that defines access controls for files and processes. Its policies are enforced on users and processes to restrict their access to resources.

SELinux checks the SELinux context of a file or process to make decisions related to its access control. To view the SELinux context of a file, use the command ls -Z <FILENAME>, and to view it for a process, use the command ps -eZ | grep <PROCESS_NAME>.
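
An SELinux context is a colon-separated label of the form user:role:type:level, where the level part may itself contain colons. As a small sketch, the Python below splits a context into those four fields, using the context from the id output earlier in this post as sample input. The names SELinuxContext and parse_context are illustrative only.

```python
from typing import NamedTuple


class SELinuxContext(NamedTuple):
    user: str
    role: str
    type_: str
    level: str


def parse_context(label):
    """Split 'user:role:type[:level]' into its fields; level is '' when absent."""
    parts = label.split(":", 3)  # at most 3 splits keeps colons inside the level
    if len(parts) < 3:
        raise ValueError(f"not an SELinux context: {label!r}")
    level = parts[3] if len(parts) == 4 else ""
    return SELinuxContext(parts[0], parts[1], parts[2], level)


if __name__ == "__main__":
    context = parse_context("unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023")
    print(context.type_, context.level)
```

The type field is usually the part that access decisions for containers hinge on, which is why it is worth pulling out on its own.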


