Running systemd in a container

We have been following the topic of running systemd in containers for a long time. Back in 2014, our security engineer Daniel Walsh wrote an article called Running systemd within a Docker Container, and a couple of years later another one called Running systemd in a non-privileged container, in which he noted that the situation had not improved much. In particular, he wrote: “Unfortunately, two years later, if you google Docker systemd, the first thing that pops up is still that same old article. So it's time to change something.” We have also written before about the conflict between the developers of Docker and systemd.

In this article we will show what has changed since then and how Podman can help here.

There are many reasons for running systemd inside a container, such as:

Multiservice containers - many people want to move their multiservice applications out of virtual machines and run them in containers. It would of course be better to break such applications into microservices, but not everyone knows how to do that yet, or there is simply no time. So running such applications as services started by systemd from unit files makes perfect sense.
Systemd unit files - most applications running inside containers are built from code that previously ran on virtual or physical machines. These applications come with a unit file that was written for them and that knows how to start them. So it is better to start the services using the supported method rather than hacking together your own init service.
Systemd is a process manager - it manages services (shutting them down, restarting them, reaping zombie processes) better than any other tool.

There are also many reasons not to run systemd in containers. The main one is that systemd/journald controls the output of containers, whereas tools like Kubernetes or OpenShift expect containers to write their logs directly to stdout and stderr. So if you intend to manage containers through orchestration tools like these, you should think twice before using systemd-based containers. In addition, the developers of Docker and Moby have often been strongly opposed to using systemd in containers.

Enter Podman

We are pleased to report that things have finally started to move. The team responsible for running containers at Red Hat decided to develop its own container engine. It is called Podman, and it offers the same command line interface (CLI) as Docker. Almost all Docker commands can be used in exactly the same way with Podman. We often hold seminars, now called Change Docker to Podman, and the very first slide urges you to set: alias docker=podman.

Many do so.

Podman and I have nothing against systemd-based containers. After all, systemd is the most commonly used Linux init system, and not allowing it to work properly in containers means ignoring the way thousands of people are used to running containers.

Podman knows what systemd needs in order to work properly in a container. It needs things like tmpfs mounted on /run and /tmp. It likes to have the “container” environment enabled, and it expects write permission to its part of the cgroup directory and to the /var/log/journald folder.

When you start a container in which init or systemd is the first command, Podman automatically configures tmpfs and cgroups so that systemd starts without problems. To disable this automatic mode, use the --systemd=false option. Note that Podman uses systemd mode only when it sees that the command to be executed is systemd or init.

Here is an excerpt from the manual:

man podman run

...

--systemd=true|false

Running the container in systemd mode. Enabled by default.

If a systemd or init command is executed inside the container, Podman will configure the tmpfs mount points in the following directories:

/run, /run/lock, /tmp, /sys/fs/cgroup/systemd, /var/lib/journal

Also, SIGRTMIN+3 will be used as the default stop signal.

All this allows systemd to work in a confined container without any modifications.
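As a rough illustration of the rule described above, the following shell sketch decides whether a container command would trigger systemd mode. The function name wants_systemd_mode and the check on the last path component are assumptions made for illustration only. This is not Podman's actual implementation, which may match specific paths instead.

```shell
#!/bin/sh
# Illustrative only: approximates "systemd mode applies when the container
# command is systemd or init". The function name and the matching rule are
# hypothetical and are not taken from Podman's source code.
wants_systemd_mode() {
    # $1 is the first word of the container command, e.g. /sbin/init.
    # ${1##*/} strips everything up to the last slash.
    case "${1##*/}" in
        init|systemd) return 0 ;;
        *)            return 1 ;;
    esac
}

for cmd in /sbin/init /usr/lib/systemd/systemd /usr/sbin/httpd; do
    if wants_systemd_mode "$cmd"; then
        echo "$cmd: systemd mode (tmpfs mounts, SIGRTMIN+3 as stop signal)"
    else
        echo "$cmd: regular mode"
    fi
done
```

To opt out even when the command is init, pass --systemd=false to podman run, as the manual excerpt says.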

NOTE: systemd tries to write to the cgroup file system. However, SELinux does not allow containers to do this by default. To allow it, turn on the container_manage_cgroup boolean:

setsebool -P container_manage_cgroup true

Now let's see what a Dockerfile for running systemd in a container looks like when using Podman:

# cat Dockerfile
FROM fedora
RUN dnf -y install httpd; dnf clean all; systemctl enable httpd
EXPOSE 80
CMD [ "/sbin/init" ]

That's all.

Now build the container image:

 # podman build -t systemd .

Tell SELinux to allow systemd to modify the cgroup configuration:

 # setsebool -P container_manage_cgroup true

Many people forget this step, by the way. Fortunately, you only need to do it once, and the setting persists across reboots.
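Since this step is easy to forget, a small check can be put in front of it. The sketch below assumes the standard SELinux tools getsebool and setsebool are installed and that the script runs as root. The helper name bool_state is hypothetical.

```shell
#!/bin/sh
# Sketch: turn on container_manage_cgroup only if it is currently off.
# bool_state is a hypothetical helper that interprets one line of
# getsebool output, which has the form "name --> on" or "name --> off".
bool_state() {
    case "$1" in
        *"--> on")  echo on ;;
        *"--> off") echo off ;;
        *)          echo unknown ;;
    esac
}

if command -v getsebool >/dev/null 2>&1; then
    line=$(getsebool container_manage_cgroup 2>/dev/null || true)
    case "$(bool_state "$line")" in
        on)
            echo "container_manage_cgroup is already on" ;;
        off)
            echo "container_manage_cgroup is off, enabling it persistently"
            setsebool -P container_manage_cgroup true ;;
        *)
            echo "could not query container_manage_cgroup (is SELinux enabled?)" ;;
    esac
else
    echo "getsebool not found, SELinux tools are not installed on this host"
fi
```

Because of the -P flag the change is written to the policy store, so the script does nothing on later runs except report that the boolean is already on.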

Now just run the container:

 # podman run -ti -p 80:80 systemd
systemd 239 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid)
Detected virtualization container-other.
Detected architecture x86-64.
Welcome to Fedora 29 (Container Image)!
Set hostname to <1b51b684bc99>.
Failed to install release agent, ignoring: Read-only file system
File /usr/lib/systemd/system/systemd-journald.service:26 configures an IP firewall (IPAddressDeny=any), but the local system does not support BPF/cgroup based firewalling.
Proceeding WITHOUT firewalling in effect! (This warning is only shown for the first loaded unit using IP firewalling.)
[ OK ] Listening on initctl Compatibility Named Pipe.
[ OK ] Listening on Journal Socket (/dev/log).
[ OK ] Started Forward Password Requests to Wall Directory Watch.
[ OK ] Started Dispatch Password Requests to Console Directory Watch.
[ OK ] Reached target Slices.
…
[ OK ] Started The Apache HTTP Server.

That's it, the service is up and running:

 $ curl localhost
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
…
</html>

NOTE: Do not try this on Docker! There you still have to jump through hoops to launch a container like this through the daemon. (Additional fields and packages are required for this to work seamlessly in Docker, or the container has to be run as privileged. See the article for details.)

A couple more cool things about Podman and systemd

Podman works better than Docker in systemd unit files

If containers need to be started at system boot, you can simply put the appropriate Podman commands into a systemd unit file, and systemd will start the service and monitor it. Podman uses the standard fork-exec model. In other words, the container processes are children of the Podman process, so systemd can easily monitor them.
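As a minimal sketch of what such a unit file could look like, the script below writes one to a temporary directory. The unit name httpd-container.service, the container name httpd-systemd and the /usr/bin/podman path are assumptions made for illustration. The image name systemd is the one built earlier in this article.

```shell
#!/bin/sh
# Sketch: generate a hypothetical unit file that runs the "systemd" image
# with Podman. Names and paths are illustrative; adjust them to your host.
set -eu

unit_dir="${UNIT_DIR:-$(mktemp -d)}"
unit_file="$unit_dir/httpd-container.service"

cat > "$unit_file" <<'EOF'
[Unit]
Description=Apache in a systemd-based Podman container (example)
After=network-online.target
Wants=network-online.target

[Service]
# Remove a leftover container from a previous run; "-" ignores failures.
ExecStartPre=-/usr/bin/podman rm -f httpd-systemd
# podman run stays in the foreground, so systemd tracks it directly.
ExecStart=/usr/bin/podman run --name httpd-systemd --rm -p 80:80 systemd
ExecStop=/usr/bin/podman stop -t 10 httpd-systemd
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

echo "Wrote $unit_file"
# To try it as root, copy the file to /etc/systemd/system/, then run:
#   systemctl daemon-reload
#   systemctl enable --now httpd-container.service
```

Because podman run is started without -d, the Podman process itself is the service's main process, which is what lets systemd supervise it in the way described above.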

Docker uses a client-server model, and Docker CLI commands can also be placed directly in a unit file. However, once the Docker client has connected to the Docker daemon, it (the client) becomes just another process handling stdin and stdout. Systemd, in turn, has no idea of the connection between the Docker client and the container that runs under the control of the Docker daemon, and therefore, under this model, systemd fundamentally cannot monitor the service.

Systemd activation via socket

Podman handles socket activation correctly. Because Podman uses the fork-exec model, it can pass a socket on to its child container processes. Docker cannot do this, because it uses a client-server model.
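Socket activation works by systemd handing already-open sockets to the process it starts and announcing them through the LISTEN_PID and LISTEN_FDS environment variables, with the sockets beginning at file descriptor 3. The sketch below shows how a process at the receiving end could check for them. The function name passed_socket_count is hypothetical.

```shell
#!/bin/sh
# Sketch: report how many sockets systemd passed to this process.
# passed_socket_count is a hypothetical helper name.
passed_socket_count() {
    # systemd sets LISTEN_PID to the PID of the process it started and
    # LISTEN_FDS to the number of inherited sockets (starting at fd 3).
    if [ "${LISTEN_PID:-}" = "$$" ] && [ -n "${LISTEN_FDS:-}" ]; then
        echo "$LISTEN_FDS"
    else
        echo 0
    fi
}

count=$(passed_socket_count)
if [ "$count" -gt 0 ]; then
    echo "socket-activated: $count socket(s) inherited, the first one on fd 3"
else
    echo "not socket-activated, the service would have to open its own socket"
fi
```

The file descriptors only survive if every process between systemd and the service inherits them, which is why the fork-exec model matters here.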

The varlink service that Podman uses to let remote clients interact with containers is itself socket-activated. The cockpit-podman package, written in Node.js and part of the cockpit project, lets people interact with Podman containers through a web interface. The web daemon that runs cockpit-podman sends messages to a varlink socket that systemd is listening on. Systemd then activates the Podman program to receive the messages and start managing containers. Socket activation in systemd makes it possible to implement remote APIs without a constantly running daemon.

In addition, we are developing another client for Podman, called podman-remote, which implements the same Podman CLI but calls varlink to launch containers. Podman-remote can work over SSH sessions, which lets you interact securely with containers on different machines. Over time, we plan to use podman-remote to support macOS and Windows alongside Linux, so that developers on those platforms can run a Linux virtual machine with Podman varlink running and have the full impression that the containers are running on their local machine.

SD_NOTIFY

Systemd lets you delay the start of dependent services until the containerized service they need has started. Podman can forward the SD_NOTIFY socket to the containerized service so that the service can notify systemd that it is ready. Once again, Docker, with its client-server model, cannot do this.
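On the service side, readiness is reported over the socket named in the NOTIFY_SOCKET environment variable, for example with the systemd-notify tool. The sketch below shows a hypothetical entrypoint helper, notify_ready, under the assumption that systemd-notify is installed in the container image.

```shell
#!/bin/sh
# Sketch: tell systemd that the service is ready, if a notify socket was
# forwarded into the container. notify_ready is a hypothetical helper name.
notify_ready() {
    if [ -z "${NOTIFY_SOCKET:-}" ]; then
        echo "NOTIFY_SOCKET is not set, skipping readiness notification"
    elif command -v systemd-notify >/dev/null 2>&1; then
        systemd-notify --ready --status="service is up"
    else
        echo "systemd-notify not found, cannot report readiness"
    fi
}

# A real entrypoint would do its start-up work here, then report readiness.
notify_ready
```

On the host side, a unit that wraps such a container would typically use Type=notify, so that units ordered after it wait for the readiness message.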

Future plans

We plan to add the podman generate systemd CONTAINERID command, which will generate a systemd unit file for managing a specific container. This should work in both root and rootless modes for unprivileged containers. We have even seen a request to create an OCI-compatible systemd-nspawn runtime.

Conclusion

Running systemd in a container is an understandable need. Thanks to Podman, we finally have a container runtime that is not hostile to systemd but makes it easy to use.
