Implementation of runc as of November 2024

December 8, 2024

runc is a container runtime that ships as a command-line tool. Although runc was split out of Docker in 2015, Docker still relies on it: the Docker Engine is built on top of containerd, which in turn uses runc to run containers. In this post, I explain some of the Linux features that the implementation of runc relies on.

Using runc

runc doesn’t require any Docker images to start a container. It can use local directories as the root file systems for containers. For example, you can start a container with the following commands:

mkdir rootfs
# export busybox from Docker into the rootfs directory
docker export $(docker create busybox) | tar -C rootfs -xvf -
# generate a specification in the format of ./config.json
runc spec
# create the container and attach to the launched /bin/sh in the container
# (runc run requires a container ID; "mycontainer" is an arbitrary name chosen here)
runc --root /tmp/runc run mycontainer

Before the container starts, the root directory, /, of busybox has already been exported into the rootfs directory on the host. runc then attaches to sh, which config.json defines as the entry point.
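To see where the entry point comes from, the following Go sketch reads a few fields of config.json. The field names (process.args, process.terminal, root.path) follow the OCI runtime specification; the spec struct and the parseSpec helper are names invented for this illustration and are not part of runc.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// spec mirrors only the fields of config.json that this sketch reads.
// It is a hypothetical type, not the struct runc uses.
type spec struct {
	Process struct {
		Terminal bool     `json:"terminal"`
		Args     []string `json:"args"`
	} `json:"process"`
	Root struct {
		Path string `json:"path"`
	} `json:"root"`
}

// parseSpec decodes the JSON document into the reduced spec type.
func parseSpec(data []byte) (spec, error) {
	var s spec
	err := json.Unmarshal(data, &s)
	return s, err
}

func main() {
	// Assumes config.json generated by `runc spec` is in the current directory.
	data, err := os.ReadFile("config.json")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	s, err := parseSpec(data)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("args=%v terminal=%v rootfs=%s\n", s.Process.Args, s.Process.Terminal, s.Root.Path)
}
```

Changing process.args in config.json changes which program the init process eventually becomes.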

Implementation of `runc run`

run is one of the runc subcommands; it creates a container and then starts it. The following diagram shows the procedure of the runc run process. run launches the runc init command as a child process without blocking on it. The invoked init process eventually becomes PID 1 of the container by calling execve on the args defined in config.json.

[Figure: overview]
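The parent/child relationship between run and init can be sketched as a program that re-executes itself with an init argument. This is a minimal illustration, assuming the parent starts the child through /proc/self/exe; that path and the newInitCmd helper are assumptions made here, and the final execve into the container's args is left out.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// newInitCmd builds a command that re-executes the current binary with an
// "init" argument. The helper name is hypothetical.
func newInitCmd() *exec.Cmd {
	cmd := exec.Command("/proc/self/exe", "init")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	return cmd
}

func main() {
	if len(os.Args) > 1 && os.Args[1] == "init" {
		// Child side: a real runtime would set up the container here and
		// then replace itself with the user's program via execve.
		fmt.Println("child: running as init, pid", os.Getpid())
		return
	}
	cmd := newInitCmd()
	// Start returns as soon as the child is launched; it does not wait.
	if err := cmd.Start(); err != nil {
		fmt.Fprintln(os.Stderr, "start:", err)
		os.Exit(1)
	}
	fmt.Println("parent: started init with pid", cmd.Process.Pid)
	if err := cmd.Wait(); err != nil {
		fmt.Fprintln(os.Stderr, "wait:", err)
		os.Exit(1)
	}
}
```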

tty of a container

A pseudoterminal (pty) is a pair of virtual character devices that form a bidirectional communication channel. One end is the master, which a process obtains by opening /dev/ptmx, and the other end is the slave. When a process opens /dev/ptmx, the corresponding slave is created in the /dev/pts directory.
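As a rough sketch of that sequence, the Go function below opens /dev/ptmx, unlocks the slave, and asks the kernel for the slave's index with the TIOCSPTLCK and TIOCGPTN ioctls. It uses only the standard syscall package; openPty is a hypothetical helper and not the code path runc itself uses.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// openPty opens a new pty master and returns it with the path of its slave.
func openPty() (*os.File, string, error) {
	master, err := os.OpenFile("/dev/ptmx", os.O_RDWR|syscall.O_NOCTTY, 0)
	if err != nil {
		return nil, "", err
	}
	// Writing 0 with TIOCSPTLCK unlocks the slave so that it can be opened.
	var unlock int32
	if _, _, errno := syscall.Syscall(syscall.SYS_IOCTL, master.Fd(),
		syscall.TIOCSPTLCK, uintptr(unsafe.Pointer(&unlock))); errno != 0 {
		master.Close()
		return nil, "", errno
	}
	// TIOCGPTN returns the index N of the slave device /dev/pts/N.
	var n uint32
	if _, _, errno := syscall.Syscall(syscall.SYS_IOCTL, master.Fd(),
		syscall.TIOCGPTN, uintptr(unsafe.Pointer(&n))); errno != 0 {
		master.Close()
		return nil, "", errno
	}
	return master, fmt.Sprintf("/dev/pts/%d", n), nil
}

func main() {
	master, slavePath, err := openPty()
	if err != nil {
		fmt.Fprintln(os.Stderr, "openPty:", err)
		os.Exit(1)
	}
	defer master.Close()
	fmt.Println("slave created at", slavePath)
}
```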

The init process creates a master and a slave and sends the file descriptor of the master to the run process through a socket pair. The run process calls setupIO, which creates the socket pair with socketpair:

parent, child, err := utils.NewSockPair("console")
if err != nil {
    return nil, err
}
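For readers unfamiliar with socketpair, here is a minimal standalone sketch of such a helper using only the standard library. newSockPair is a name chosen for this example; it approximates the idea behind utils.NewSockPair but is not copied from runc.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// newSockPair returns two connected AF_UNIX stream sockets wrapped as files.
// Whatever is written to one end can be read from the other.
func newSockPair(name string) (parent, child *os.File, err error) {
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM|syscall.SOCK_CLOEXEC, 0)
	if err != nil {
		return nil, nil, os.NewSyscallError("socketpair", err)
	}
	return os.NewFile(uintptr(fds[0]), name+"-p"), os.NewFile(uintptr(fds[1]), name+"-c"), nil
}

func main() {
	parent, child, err := newSockPair("console")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer parent.Close()
	defer child.Close()

	if _, err := child.Write([]byte("hello")); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	buf := make([]byte, 16)
	n, err := parent.Read(buf)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("parent received %q\n", buf[:n])
}
```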

The run process places the child socket in the ExtraFiles field of the exec.Cmd that starts the init process. It appends p.ConsoleSocket, which is the child socket, to ExtraFiles and passes the resulting file descriptor number to the init process in the _LIBCONTAINER_CONSOLE environment variable, as shown in this part of the newParentProcess function:

cmd.ExtraFiles = append(cmd.ExtraFiles, p.ExtraFiles...)
if p.ConsoleSocket != nil {
	cmd.ExtraFiles = append(cmd.ExtraFiles, p.ConsoleSocket)
	cmd.Env = append(cmd.Env,
		"_LIBCONTAINER_CONSOLE="+strconv.Itoa(stdioFdCount+len(cmd.ExtraFiles)-1),
	)
}

The init process opens /dev/ptmx and sends the resulting file descriptor over the child socket with SendRawFd:

// SendRawFd sends a specific file descriptor over the given AF_UNIX socket.
func SendRawFd(socket *os.File, msg string, fd uintptr) error {
	oob := unix.UnixRights(int(fd))
	return unix.Sendmsg(int(socket.Fd()), []byte(msg), oob, nil, 0)
}

The init process duplicates the slave onto standard input, standard output, and standard error in dupStdio:

// dupStdio opens the slavePath for the console and dups the fds to the current
// processes stdio, fd 0,1,2.
func dupStdio(slavePath string) error {
	fd, err := unix.Open(slavePath, unix.O_RDWR, 0)
	if err != nil {
		return &os.PathError{
			Op:   "open",
			Path: slavePath,
			Err:  err,
		}
	}
	for _, i := range []int{0, 1, 2} {
		if err := unix.Dup3(fd, i, 0); err != nil {
			return err
		}
	}
	return nil
}

The init process also makes the slave the controlling terminal through Setctty:

func Setctty() error {
	if err := unix.IoctlSetInt(0, unix.TIOCSCTTY, 0); err != nil {
		return err
	}
	return nil
}

Meanwhile, the run process receives the file descriptor of the master with recvmsg and monitors the master with epoll in recvtty. It copies the stream read from the master to its standard output and forwards its own standard input to the master:

go func() { _ = epoller.Wait() }()
go func() { _, _ = io.Copy(epollConsole, os.Stdin) }()
t.wg.Add(1)
go t.copyIO(os.Stdout, epollConsole)

Namespaces

A namespace is an abstraction that makes it appear to the processes within it that they have their own isolated instance of a global system resource. Processes in the same namespace share that instance, and changes made to it are not visible to processes in other namespaces. For instance, the same PID can be assigned to processes in two different PID namespaces.
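One way to observe which namespaces a process belongs to is to read the symbolic links under /proc/<pid>/ns; two processes are in the same namespace when the link targets are identical. The namespaceID helper below is a name made up for this sketch.

```go
package main

import (
	"fmt"
	"os"
)

// namespaceID returns the target of /proc/<pid>/ns/<kind>, a string of the
// form "pid:[<inode number>]" that identifies the namespace.
func namespaceID(pid, kind string) (string, error) {
	return os.Readlink("/proc/" + pid + "/ns/" + kind)
}

func main() {
	for _, kind := range []string{"pid", "mnt", "user", "uts", "net", "ipc"} {
		id, err := namespaceID("self", kind)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("%-4s %s\n", kind, id)
	}
}
```

Running this once on the host and once inside a container should print different identifiers for every namespace the container does not share with the host.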

clone, setns, and unshare are the system calls that change the namespaces of a process. clone creates a new process and, when given CLONE_NEW* flags, creates new namespaces and places the child process in them. setns moves the calling thread into an existing namespace. unshare creates a new namespace and moves the calling process into it.

The nsexec function, written in C, changes the namespaces of the init process. It is invoked before the Go runtime starts through the following snippet in nsenter.go:

//go:build linux && !gccgo

package nsenter

/*
#cgo CFLAGS: -Wall
extern void nsexec();
void __attribute__((constructor)) init(void) {
	nsexec();
}
*/
import "C"

The README.md of nsenter explains that the namespaces of the init process are managed in C, before the Go runtime starts, because the Go runtime may already be running more than one thread. setns switches only the calling thread’s namespace, so with multiple threads it would have to be called on every one of them.

nsexec calls clone twice to alter the namespaces of the init process. The following diagram illustrates the procedure of the nsexec function:

[Figure: nsenter]

Only the process created last executes the return statement and proceeds to the Go main function. The earlier init processes terminate without starting the Go runtime. nsexec uses clone so that the final init process is assigned PID 1 within the new PID namespace: unshare can create a new PID namespace, but it does not move the calling process into it, and only children created afterwards are placed there.

Passing the CLONE_PARENT flag to clone makes the parent of the new child the same as the parent of the caller. Since run is the parent of the leftmost init process in the diagram, the last created init process, which executes the return statement and continues into the Go main function, also has the run process as its parent. Consequently, the SIGCHLD of that init process is delivered to the run process, which waits for the signal.

The second init process in the diagram, the centrally positioned one created by the first clone, joins existing namespaces using setns or moves to newly created namespaces using unshare. /proc/<pid>/uid_map and /proc/<pid>/gid_map define the mapping of UIDs and GIDs between the namespace of the run process and that of the init process. For instance, processes in the run process’s namespace see a file created by the init process as owned by the corresponding user defined in the mapping.
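Each line of uid_map and gid_map holds three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range. The sketch below parses that format and translates an inside ID to its outside counterpart. idMap, parseIDMap, and toOutside are names introduced for this example only.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// idMap is one line of /proc/<pid>/uid_map or gid_map.
type idMap struct {
	inside, outside, length int64
}

// parseIDMap parses the whitespace-separated "inside outside length" lines.
func parseIDMap(content string) ([]idMap, error) {
	var maps []idMap
	for _, line := range strings.Split(strings.TrimSpace(content), "\n") {
		if strings.TrimSpace(line) == "" {
			continue
		}
		var m idMap
		if _, err := fmt.Sscanf(line, "%d %d %d", &m.inside, &m.outside, &m.length); err != nil {
			return nil, fmt.Errorf("parse %q: %w", line, err)
		}
		maps = append(maps, m)
	}
	return maps, nil
}

// toOutside translates an ID as seen inside the namespace to the ID seen
// outside it. The second result is false when the ID is not mapped.
func toOutside(maps []idMap, id int64) (int64, bool) {
	for _, m := range maps {
		if id >= m.inside && id < m.inside+m.length {
			return m.outside + (id - m.inside), true
		}
	}
	return 0, false
}

func main() {
	data, err := os.ReadFile("/proc/self/uid_map")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	maps, err := parseIDMap(string(data))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if out, ok := toOutside(maps, 0); ok {
		fmt.Println("UID 0 in this namespace maps to UID", out, "outside")
	} else {
		fmt.Println("UID 0 is not mapped in this namespace")
	}
}
```

With a mapping line of `0 1000 1`, a file created by UID 0 inside the namespace is seen as owned by UID 1000 from outside.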

Root Filesystem

pivot_root changes the root filesystem of the init process. The following commands illustrate pivot_root by making the top directory of rootfs the root filesystem:

cd rootfs && mkdir put_old
# run /bin/sh in a new namespace
unshare -mpfr /bin/sh
# the first argument of pivot_root must be a mount point
mount --bind $(pwd) $(pwd)
# Place the original root in put_old. Can be unmounted later
pivot_root $(pwd) $(pwd)/put_old
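The same steps can be expressed with Go's syscall package. This is a sketch that mirrors the shell commands above rather than runc's actual implementation: the pivotRoot helper is hypothetical, and it assumes the caller is already in a new mount namespace with enough privileges and with mount propagation set to private, as unshare -m arranges.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// pivotRoot makes rootfs the root filesystem of the calling process and
// detaches the previous root.
func pivotRoot(rootfs string) error {
	// pivot_root requires new_root to be a mount point, so bind-mount it onto itself.
	if err := syscall.Mount(rootfs, rootfs, "", syscall.MS_BIND|syscall.MS_REC, ""); err != nil {
		return fmt.Errorf("bind mount %s: %w", rootfs, err)
	}
	putOld := filepath.Join(rootfs, "put_old")
	if err := os.MkdirAll(putOld, 0o700); err != nil {
		return err
	}
	// The old root becomes visible at /put_old inside the new root.
	if err := syscall.PivotRoot(rootfs, putOld); err != nil {
		return fmt.Errorf("pivot_root: %w", err)
	}
	if err := os.Chdir("/"); err != nil {
		return err
	}
	// Lazily unmount the old root so the host filesystem is no longer reachable.
	if err := syscall.Unmount("/put_old", syscall.MNT_DETACH); err != nil {
		return fmt.Errorf("unmount old root: %w", err)
	}
	return os.Remove("/put_old")
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: pivot <rootfs>")
		os.Exit(2)
	}
	if err := pivotRoot(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("root filesystem switched")
}
```

Run outside a fresh mount namespace, the first mount call is expected to fail with a permission error for unprivileged users, which is a useful safeguard while experimenting.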

Differences Between Host Processes and Containers

The init process is created by the run process and transforms into a container. Containers can be seen as ordinary processes that are isolated by namespaces and pivot_root.
