Implementation of runc as of November 2024
December 8, 2024
runc is a container runtime with a command-line interface. Although runc was separated from Docker in 2015, Docker still uses runc because the Docker Engine is built on top of containerd, which incorporates runc. In this post, I will explain some Linux features used in the implementation of runc.
Using runc
runc doesn’t require any Docker images to start a container. It can use local directories as the root file systems for containers. For example, you can start a container with the following commands:
mkdir rootfs
# export busybox from Docker into the rootfs directory
docker export $(docker create busybox) | tar -C rootfs -xvf -
# generate a specification in the format of ./config.json
runc spec
# create the container and attach to the launched /bin/sh in the container
# (runc run requires a container ID; "mycontainer" is an arbitrary example name)
runc --root /tmp/runc run mycontainer
Before the container starts, the root directory, /, of the busybox image has been exported into the rootfs directory on the host.
runc then attaches to sh, which is defined as the entry point in config.json.
Implementation of runc run
run is one of the runc subcommands; it creates and then starts a container.
The following diagram shows the procedure of the runc run process.
run asynchronously executes the runc init command.
The invoked init process eventually calls execve and becomes the process specified by args in config.json, running as PID 1 in the container.
tty of a container
A pseudoterminal (pty) is a pair of virtual character devices that provide a bidirectional communication channel.
One is the master, which a process obtains by opening /dev/ptmx, and the other is the slave.
When a process opens /dev/ptmx, the corresponding slave is created in the /dev/pts directory.
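The relationship between /dev/ptmx and /dev/pts can be tried out with a minimal Go sketch. This is not runc's code: openPty is a hypothetical helper, and it assumes a Linux host where /dev/ptmx is accessible. It unlocks the slave and asks the kernel for its index with two ioctl requests.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// openPty (a hypothetical helper, not part of runc) opens a new pseudoterminal
// master through /dev/ptmx and returns it with the path of the matching slave.
func openPty() (*os.File, string, error) {
	master, err := os.OpenFile("/dev/ptmx", os.O_RDWR|syscall.O_NOCTTY, 0)
	if err != nil {
		return nil, "", err
	}
	fd := master.Fd()

	// Unlock the slave side, which is what unlockpt(3) does.
	var unlock int32
	if _, _, errno := syscall.Syscall(syscall.SYS_IOCTL, fd,
		syscall.TIOCSPTLCK, uintptr(unsafe.Pointer(&unlock))); errno != 0 {
		master.Close()
		return nil, "", errno
	}

	// Ask the kernel for the slave index N; the slave device is /dev/pts/N.
	var n uint32
	if _, _, errno := syscall.Syscall(syscall.SYS_IOCTL, fd,
		syscall.TIOCGPTN, uintptr(unsafe.Pointer(&n))); errno != 0 {
		master.Close()
		return nil, "", errno
	}
	return master, fmt.Sprintf("/dev/pts/%d", n), nil
}

func main() {
	master, slave, err := openPty()
	if err != nil {
		fmt.Println("could not open a pty:", err)
		return
	}
	defer master.Close()
	fmt.Println("slave device:", slave)
}
```

Each call to openPty yields a fresh master/slave pair, so the printed index changes from run to run.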
The init process creates a master and slave and sends the file descriptor of the master through a socket pair.
The run process calls setupIO to create the socket pair with socketpair:
parent, child, err := utils.NewSockPair("console")
if err != nil {
    return nil, err
}
The run process places the child socket in the ExtraFiles field of the Cmd that starts the init process. It binds p.ConsoleSocket to the child socket in ExtraFiles and passes the resulting file descriptor number in the _LIBCONTAINER_CONSOLE environment variable, as shown in this part of the newParentProcess function:
cmd.ExtraFiles = append(cmd.ExtraFiles, p.ExtraFiles...)
if p.ConsoleSocket != nil {
    cmd.ExtraFiles = append(cmd.ExtraFiles, p.ConsoleSocket)
    cmd.Env = append(cmd.Env,
        "_LIBCONTAINER_CONSOLE="+strconv.Itoa(stdioFdCount+len(cmd.ExtraFiles)-1),
    )
}
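The descriptor arithmetic above follows from how os/exec lays out a child's descriptors: entry i of ExtraFiles becomes descriptor 3+i in the child. The sketch below illustrates that rule outside runc. The DEMO_FD variable and the runChild helper are made up for this example, and it assumes a POSIX sh is on the PATH.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"os/exec"
	"strconv"
)

// stdioFdCount is the number of descriptors (stdin, stdout, stderr) that
// always precede ExtraFiles in the child process.
const stdioFdCount = 3

// runChild starts a shell with one extra pipe and tells it, through the
// hypothetical DEMO_FD variable, which descriptor number that pipe received.
func runChild() (string, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return "", err
	}
	defer r.Close()

	// The child writes to whatever descriptor DEMO_FD names.
	cmd := exec.Command("sh", "-c", "echo hello >&$DEMO_FD")
	cmd.ExtraFiles = append(cmd.ExtraFiles, w)
	cmd.Env = append(os.Environ(),
		"DEMO_FD="+strconv.Itoa(stdioFdCount+len(cmd.ExtraFiles)-1))

	err = cmd.Run()
	w.Close() // close the parent's copy so the reader sees EOF
	if err != nil {
		return "", err
	}
	out, err := io.ReadAll(r)
	return string(out), err
}

func main() {
	out, err := runChild()
	if err != nil {
		fmt.Println("child failed:", err)
		return
	}
	fmt.Printf("child wrote %q through the extra descriptor\n", out)
}
```

Passing the number in an environment variable spares the child from guessing which of its inherited descriptors is the one it needs.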
The init process opens /dev/ptmx and sends the resulting file descriptor over the child socket using SendRawFd:
// SendRawFd sends a specific file descriptor over the given AF_UNIX socket.
func SendRawFd(socket *os.File, msg string, fd uintptr) error {
    oob := unix.UnixRights(int(fd))
    return unix.Sendmsg(int(socket.Fd()), []byte(msg), oob, nil, 0)
}
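To see both ends of this exchange in one place, here is a small self-contained sketch that uses the standard syscall package rather than runc's helpers. The names sendFd, recvFd, and roundTrip are hypothetical. The sketch passes the write end of a pipe across a socket pair inside a single process and then writes through the received copy.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
	"syscall"
)

// sendFd sends fd over the AF_UNIX socket sock as SCM_RIGHTS ancillary data.
func sendFd(sock int, msg string, fd int) error {
	return syscall.Sendmsg(sock, []byte(msg), syscall.UnixRights(fd), nil, 0)
}

// recvFd receives exactly one descriptor from sock.
func recvFd(sock int) (int, error) {
	buf := make([]byte, 256)
	oob := make([]byte, syscall.CmsgSpace(4)) // room for one 32-bit descriptor
	_, oobn, _, _, err := syscall.Recvmsg(sock, buf, oob, 0)
	if err != nil {
		return -1, err
	}
	msgs, err := syscall.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return -1, err
	}
	if len(msgs) != 1 {
		return -1, errors.New("expected one control message")
	}
	fds, err := syscall.ParseUnixRights(&msgs[0])
	if err != nil {
		return -1, err
	}
	if len(fds) != 1 {
		return -1, errors.New("expected one descriptor")
	}
	return fds[0], nil
}

// roundTrip passes a pipe's write end across a socket pair, writes payload
// through the received copy, and returns what arrives at the read end.
func roundTrip(payload string) (string, error) {
	pair, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM|syscall.SOCK_CLOEXEC, 0)
	if err != nil {
		return "", err
	}
	defer syscall.Close(pair[0])
	defer syscall.Close(pair[1])

	r, w, err := os.Pipe()
	if err != nil {
		return "", err
	}
	defer r.Close()
	defer w.Close()

	if err := sendFd(pair[1], "pipe", int(w.Fd())); err != nil {
		return "", err
	}
	fd, err := recvFd(pair[0])
	if err != nil {
		return "", err
	}
	received := os.NewFile(uintptr(fd), "received")
	defer received.Close()
	if _, err := received.WriteString(payload); err != nil {
		return "", err
	}
	buf := make([]byte, len(payload))
	if _, err := io.ReadFull(r, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	got, err := roundTrip("hello")
	fmt.Println(got, err)
}
```

The received descriptor generally has a different number from the one that was sent, but it refers to the same open file, which is why the run process can drive a master that the init process opened.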
The init process duplicates the slave onto standard input, standard output, and standard error in the following dupStdio function:
// dupStdio opens the slavePath for the console and dups the fds to the current
// processes stdio, fd 0,1,2.
func dupStdio(slavePath string) error {
    fd, err := unix.Open(slavePath, unix.O_RDWR, 0)
    if err != nil {
        return &os.PathError{
            Op:   "open",
            Path: slavePath,
            Err:  err,
        }
    }
    for _, i := range []int{0, 1, 2} {
        if err := unix.Dup3(fd, i, 0); err != nil {
            return err
        }
    }
    return nil
}
The init process also makes the slave its controlling terminal through Setctty:
func Setctty() error {
    if err := unix.IoctlSetInt(0, unix.TIOCSCTTY, 0); err != nil {
        return err
    }
    return nil
}
Meanwhile, the run process monitors the master with epoll in recvtty.
It receives the master's file descriptor with recvmsg, copies the stream read from the master to standard output, and forwards the standard input of the run process to the master:
go func() { _ = epoller.Wait() }()
go func() { _, _ = io.Copy(epollConsole, os.Stdin) }()
t.wg.Add(1)
go t.copyIO(os.Stdout, epollConsole)
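The snippet above relies on a helper that wraps epoll. As a rough illustration of the underlying mechanism only, the following sketch uses the raw epoll system calls from the standard syscall package to wait until a descriptor becomes readable. waitReadable and pipeDemo are hypothetical names, and a pipe stands in for the pty master.

```go
package main

import (
	"fmt"
	"syscall"
)

// waitReadable reports whether fd becomes readable within timeoutMs
// milliseconds (0 polls once, -1 waits indefinitely).
func waitReadable(fd int, timeoutMs int) (bool, error) {
	epfd, err := syscall.EpollCreate1(syscall.EPOLL_CLOEXEC)
	if err != nil {
		return false, err
	}
	defer syscall.Close(epfd)

	// Register interest in "data available to read" on fd.
	ev := syscall.EpollEvent{Events: syscall.EPOLLIN, Fd: int32(fd)}
	if err := syscall.EpollCtl(epfd, syscall.EPOLL_CTL_ADD, fd, &ev); err != nil {
		return false, err
	}

	events := make([]syscall.EpollEvent, 1)
	for {
		n, err := syscall.EpollWait(epfd, events, timeoutMs)
		if err == syscall.EINTR {
			continue // interrupted by a signal; wait again
		}
		if err != nil {
			return false, err
		}
		return n == 1 && events[0].Events&syscall.EPOLLIN != 0, nil
	}
}

// pipeDemo checks readiness of a pipe before and after one byte is written.
func pipeDemo() (before, after bool, err error) {
	p := make([]int, 2)
	if err = syscall.Pipe2(p, syscall.O_CLOEXEC); err != nil {
		return
	}
	defer syscall.Close(p[0])
	defer syscall.Close(p[1])

	if before, err = waitReadable(p[0], 0); err != nil {
		return
	}
	if _, err = syscall.Write(p[1], []byte("x")); err != nil {
		return
	}
	after, err = waitReadable(p[0], 1000)
	return
}

func main() {
	before, after, err := pipeDemo()
	fmt.Println(before, after, err)
}
```

In this sketch the read end is reported as not ready while the pipe is empty and as ready once a byte has been written; a loop built on the same idea can copy console output only when there is something to read.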
Namespaces
A namespace is an abstraction that makes it appear to the processes within it that they have their own isolated instance of a global system resource. Processes in the same namespace share that instance, and changes made to it are invisible to processes in other namespaces. For instance, the same PID can be assigned to processes in two different PID namespaces.
clone, setns and unshare are the APIs that change the namespaces of processes.
clone creates a new process and, when given the appropriate CLONE_NEW* flags, forms new namespaces and places the child process within them.
setns moves the thread that has called setns into an existing namespace.
unshare creates a new namespace and moves the calling process into it.
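One way to observe which namespaces a process belongs to is to read the symbolic links under /proc/<pid>/ns; two processes are in the same namespace exactly when the corresponding links have the same target. The sketch below is only an inspection aid with a hypothetical nsID helper, and it assumes /proc is mounted. It does not call clone, setns, or unshare itself.

```go
package main

import (
	"fmt"
	"os"
)

// nsID returns the identifier of one namespace of a process, in the form
// "<kind>:[<inode>]". pid may be a number or "self".
func nsID(pid, kind string) (string, error) {
	return os.Readlink("/proc/" + pid + "/ns/" + kind)
}

func main() {
	// Print the namespaces of the current process. Running this on the host
	// and inside a container shows different identifiers for the namespaces
	// that the container does not share with the host.
	for _, kind := range []string{"pid", "mnt", "uts", "user"} {
		id, err := nsID("self", kind)
		if err != nil {
			fmt.Println(kind, "unavailable:", err)
			continue
		}
		fmt.Printf("%-4s %s\n", kind, id)
	}
}
```

These same /proc/<pid>/ns files are what setns operates on: it takes a file descriptor opened on one of them and moves the calling thread into that namespace.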
The nsexec function, written in C, changes the namespaces of the init process.
It is invoked before the Go runtime starts by the following snippet in nsenter.go:
//go:build linux && !gccgo

package nsenter

/*
#cgo CFLAGS: -Wall
extern void nsexec();
void __attribute__((constructor)) init(void) {
    nsexec();
}
*/
import "C"
The README.md of nsenter explains that managing the namespaces of the init process is implemented in C because the Go runtime may already be running more than one thread.
setns switches only the calling thread's namespace, so it would have to be called on every thread if there were several.
nsexec calls clone twice to alter the namespaces of the init process.
The following diagram illustrates the procedure of the nsexec function:
Only the process created last executes the return statement, which leads to the Go main function.
The preceding init processes terminate without starting the Go runtime.
nsexec executes clone so that the init process is assigned PID 1 within a new PID namespace.
unshare can create a new PID namespace, but it cannot change the PID namespace of the calling process; only children created afterwards are placed in the new namespace.
Passing the CLONE_PARENT flag to clone makes the parent of the new child init process the same as the parent of the caller.
Since run is the parent of the leftmost init process in the diagram, the last created init process, which executes the return statement and proceeds to the Go main function, also has the run process as its parent.
Consequently, the SIGCHLD of that init process is delivered to the run process, which waits for the signal.
The second init process in the diagram (the centrally positioned one, created by the first clone) joins existing namespaces using setns or moves into newly created namespaces using unshare.
The /proc/<pid>/uid_map and /proc/<pid>/gid_map files define the mapping of UIDs and GIDs between the namespace of the run process and that of the init process.
For instance, a file created by the init process appears, to processes in the run process's namespace, to be owned by the corresponding user defined in the mapping.
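Each line of uid_map or gid_map holds three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range. The following sketch shows how such a mapping translates an ID seen by the init process into the ID seen from the outer namespace. The idMap type, both helper functions, and the sample mapping are made up for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// idMap is one line of a uid_map or gid_map file.
type idMap struct {
	inside, outside, count int64
}

// parseIDMap parses the text of a uid_map or gid_map file.
func parseIDMap(text string) ([]idMap, error) {
	var maps []idMap
	for _, line := range strings.Split(text, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue // skip blank lines
		}
		if len(fields) != 3 {
			return nil, fmt.Errorf("malformed line %q", line)
		}
		var v [3]int64
		for i, f := range fields {
			n, err := strconv.ParseInt(f, 10, 64)
			if err != nil {
				return nil, err
			}
			v[i] = n
		}
		maps = append(maps, idMap{inside: v[0], outside: v[1], count: v[2]})
	}
	return maps, nil
}

// toOutside translates an ID inside the namespace to the outer namespace.
// The boolean is false when the ID is not covered by any range.
func toOutside(maps []idMap, id int64) (int64, bool) {
	for _, m := range maps {
		if id >= m.inside && id < m.inside+m.count {
			return m.outside + (id - m.inside), true
		}
	}
	return 0, false
}

func main() {
	// Sample mapping: root in the container is UID 1000 outside, and
	// container UIDs 1..65536 map to 100000..165535 outside.
	maps, err := parseIDMap("0 1000 1\n1 100000 65536\n")
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, id := range []int64{0, 5, 70000} {
		out, ok := toOutside(maps, id)
		fmt.Println(id, out, ok)
	}
}
```

With this sample mapping, a file created by root inside the container is seen as owned by UID 1000 from the outer namespace, while an ID outside every range has no counterpart there.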
Root Filesystem
pivot_root changes the root filesystem of the init process.
Below is a code snippet that runs pivot_root to make the top directory of rootfs the root filesystem:
cd rootfs && mkdir put_old
# run /bin/sh in a new namespace
unshare -mpfr /bin/sh
# the first argument of pivot_root must be a mount point
mount --bind $(pwd) $(pwd)
# Place the original root in put_old. Can be unmounted later
pivot_root $(pwd) $(pwd)/put_old
Differences Between Host Processes and Containers
The init process is created by the run process and then transforms into the container's process.
Containers can be seen as just processes that are isolated by namespaces and pivot_root.