We have been following the topic of using systemd in containers for a long time. Back in 2014, our security engineer Daniel Walsh wrote an article
In this article, we will show what has changed since then and how Podman can help us in this matter.
There are many reasons to run systemd inside a container, such as:
- Multiservice containers - many people want to pull their multi-service applications out of virtual machines and run them in containers. It would be better, of course, to break such applications into microservices, but not everyone knows how to do this yet, or there is simply no time. So running these applications as services started by systemd from unit files makes perfect sense.
- Systemd unit files - most applications running inside containers are built from code that previously ran on virtual or physical machines. Such applications already have a unit file that was written for them and describes how they should be run. So it is still better to start services using supported methods rather than hacking together your own init service.
- Systemd is a process manager. It manages services (shutting them down, restarting them, reaping zombie processes) better than any other tool.
That being said, there are plenty of reasons not to run systemd in containers. The main one is that systemd/journald controls the output of containers, and tools like
Enter Podman
We are pleased to announce that the situation has finally moved forward. The team responsible for running containers at Red Hat decided to develop
Many do so.
Podman and I are in no way against systemd-based containers. After all, systemd is the most commonly used init system on Linux, and preventing it from working properly in containers means ignoring the way thousands of people are used to running containers.
Podman knows what needs to be done to make systemd work properly in a container. Systemd needs things like tmpfs mounted on /run and /tmp. It likes to have the "container" environment enabled, and it expects write permissions to its part of the cgroup directory and to the /var/log/journald folder.
When a container is started with init or systemd as its first command, Podman automatically configures tmpfs and cgroups so that systemd starts up smoothly. To disable this automatic mode, use the --systemd=false option. Note that Podman only uses systemd mode when it sees that the command to run is systemd or init.
Here is an excerpt from the manual:
man podman run
... --systemd=true|false
Run the container in systemd mode. Enabled by default.
If a systemd or init command is running inside a container, Podman will set up tmpfs mount points in the following directories:
/run, /run/lock, /tmp, /sys/fs/cgroup/systemd, /var/lib/journal
Also SIGRTMIN+3 will be used as the default stop signal.
All this allows systemd to run in a closed container without any modifications.
NOTE: systemd tries to write to the cgroup file system. However, SELinux prevents containers from doing this by default. To allow writing, enable the container_manage_cgroup boolean parameter:
setsebool -P container_manage_cgroup true
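To confirm that the tmpfs mounts listed in the manual excerpt are in place, one option is to inspect the mount table from inside the running container. The following is only a sketch: the helper names is_tmpfs and check_systemd_mounts are made up for this illustration, and the optional mounts-file argument exists purely so the helpers can be tried outside a container.

```shell
#!/bin/sh
# Sketch: report whether a path is mounted as tmpfs according to a mounts table.
# $1 = path to look up, $2 = mounts file (defaults to /proc/mounts).
# Both function names are hypothetical helpers, not part of Podman.
is_tmpfs() {
    awk -v path="$1" \
        '$2 == path && $3 == "tmpfs" { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

# Print one status line for each directory that the manual excerpt lists
# as a tmpfs mount point (the cgroup path is left out of this sketch).
check_systemd_mounts() {
    for dir in /run /run/lock /tmp /var/lib/journal; do
        if is_tmpfs "$dir" "${1:-/proc/mounts}"; then
            echo "$dir: tmpfs"
        else
            echo "$dir: not tmpfs"
        fi
    done
}
```

Calling check_systemd_mounts with no arguments inside the container reads /proc/mounts; the exact set of mount points may differ between Podman versions.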
Now see what the Dockerfile looks like for running systemd in a container using Podman:
# cat Dockerfile
FROM fedora
RUN dnf -y install httpd && dnf clean all && systemctl enable httpd
EXPOSE 80
CMD [ "/sbin/init" ]
That's all.
Now we build the container image:
# podman build -t systemd .
Tell SELinux to allow systemd to modify the Cgroups configuration:
# setsebool -P container_manage_cgroup true
Many people, by the way, forget this step. Fortunately, it only needs to be done once, and the setting persists across reboots.
Now we just start the container:
# podman run -ti -p 80:80 systemd
systemd 239 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid)
Detected virtualization container-other.
Detected architecture x86-64.
Welcome to Fedora 29 (Container Image)!
Set hostname to <1b51b684bc99>.
Failed to install release agent, ignoring: Read-only file system
File /usr/lib/systemd/system/systemd-journald.service:26 configures an IP firewall (IPAddressDeny=any), but the local system does not support BPF/cgroup based firewalling.
Proceeding WITHOUT firewalling in effect! (This warning is only shown for the first loaded unit using IP firewalling.)
[ OK ] Listening on initctl Compatibility Named Pipe.
[ OK ] Listening on Journal Socket (/dev/log).
[ OK ] Started Forward Password Requests to Wall Directory Watch.
[ OK ] Started Dispatch Password Requests to Console Directory Watch.
[ OK ] Reached target Slices.
...
[ OK ] Started The Apache HTTP Server.
That's it, the service is up and running:
$ curl localhost
<html xml_lang="en" lang="en">
...
</html>
NOTE: Don't try this on Docker! You still have to jump through hoops to run this kind of container through the daemon. (Additional fields and packages are required to make this work seamlessly in Docker, or the container has to be run as privileged. See details in
A couple more cool things about Podman and systemd
Podman works better than Docker in systemd unit files
If containers need to be started at system boot, then you can simply insert the appropriate Podman commands into the systemd unit file, which will start the service and monitor it. Podman uses the standard fork-exec model. In other words, container processes are children of the Podman process, so systemd can easily monitor them.
Docker uses a client-server model, and Docker CLI commands can also be placed directly in a unit file. However, once the Docker client connects to the Docker daemon, it (the client) becomes just another process handling stdin and stdout. Systemd, in turn, has no idea about the connection between the Docker client and the container that runs under the control of the Docker daemon, and therefore, within this model, systemd fundamentally cannot monitor the service.
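As an illustration of the Podman side of this comparison, here is a minimal sketch of a unit file that runs the image built earlier. The unit name httpd-container.service, the container name httpd-systemd, and the UNIT_DIR variable are assumptions made for this example; the script only writes the file to a scratch directory and does not install or start anything.

```shell
#!/bin/sh
# Sketch: generate a systemd unit that runs a Podman container in the foreground.
# Hypothetical names: httpd-container.service, httpd-systemd, UNIT_DIR.
# The image name "systemd" is the tag used with "podman build" earlier.
unit_dir="${UNIT_DIR:-$(mktemp -d)}"
unit_file="$unit_dir/httpd-container.service"

cat > "$unit_file" <<'EOF'
[Unit]
Description=Apache in a systemd-based Podman container (example)
After=network-online.target
Wants=network-online.target

[Service]
# Remove a leftover container with the same name; "-" ignores failure.
ExecStartPre=-/usr/bin/podman rm -f httpd-systemd
# podman run stays in the foreground, so systemd can supervise it directly.
ExecStart=/usr/bin/podman run --name httpd-systemd -p 80:80 systemd
ExecStop=/usr/bin/podman stop httpd-systemd
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

echo "Wrote $unit_file"
```

To try it, the generated file would be copied to /etc/systemd/system, followed by systemctl daemon-reload and systemctl enable --now httpd-container.service.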
Socket activation with systemd
Podman handles socket activation correctly. Because Podman uses the fork-exec model, it can forward the socket to its child container processes. Docker can't do that because it uses a client-server model.
The varlink service that Podman uses to let remote clients communicate with containers is itself socket-activated. The cockpit-podman package, written in Node.js and part of the Cockpit project, lets people interact with Podman containers through a web interface. The web daemon running cockpit-podman sends messages to the varlink socket that systemd is listening on. Systemd then starts Podman to receive the messages and begin managing containers. Socket activation eliminates the need for a constantly running daemon when implementing remote APIs.
We are also developing another Podman client, podman-remote, which implements the same Podman CLI but calls varlink to start containers. Podman-remote can run over SSH sessions, allowing you to interact securely with containers on different machines. Over time, we plan to use podman-remote to support macOS and Windows alongside Linux, so that developers on those platforms can run a Linux virtual machine with Podman varlink and feel as if the containers were running on their local machine.
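For context, the socket activation protocol looks like this from the receiving process's side: systemd hands over the listening sockets as already-open file descriptors starting at 3, and sets LISTEN_PID and LISTEN_FDS to say which process they are meant for and how many there are. The sketch below shows that check in shell; the function name activated_fd_count is made up for this example, and a real service would normally use sd_listen_fds() from libsystemd instead.

```shell
#!/bin/sh
# Sketch: print how many sockets systemd passed to this process (0 if none).
# activated_fd_count is a hypothetical helper name.
activated_fd_count() {
    # The descriptors are only meant for us if LISTEN_PID matches our own PID.
    if [ "${LISTEN_PID:-}" = "$$" ] && [ -n "${LISTEN_FDS:-}" ]; then
        echo "$LISTEN_FDS"
    else
        echo 0
    fi
}
```

When the count is non-zero, the inherited sockets are file descriptors 3 up to 3 + count - 1.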
SD_NOTIFY
Systemd lets you hold back dependent services until the containerized service they rely on has started. Podman can forward the SD_NOTIFY socket to the containerized service so that the service itself notifies systemd when it is ready. Docker, with its client-server model, again cannot do this.
Future plans
We plan to add a podman generate systemd CONTAINERID command that will generate a systemd unit file to manage a given container. This should work in both root and rootless modes for unprivileged containers. We have even seen a request for an OCI-compliant systemd-nspawn runtime.
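On the service side, readiness is reported by sending a message to the socket named in the NOTIFY_SOCKET environment variable. A minimal sketch of how a containerized service's start script might do this with the systemd-notify tool follows; the function name notify_ready and the status text are assumptions for this example, and the unit would need Type=notify for systemd to wait for the message.

```shell
#!/bin/sh
# Sketch: tell systemd the service is ready, if a notify socket was passed in.
# notify_ready is a hypothetical helper name.
notify_ready() {
    if [ -n "${NOTIFY_SOCKET:-}" ] && command -v systemd-notify >/dev/null 2>&1; then
        # Sends READY=1 plus a human-readable status line to systemd.
        systemd-notify --ready --status="service is ready"
    else
        echo "NOTIFY_SOCKET not set or systemd-notify missing; nothing sent"
    fi
}
```

Whether systemd accepts the message also depends on the unit's NotifyAccess= setting, since systemd-notify runs as a separate helper process.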
Conclusion
Running systemd in a container is an understandable need. And thanks to Podman, we finally have a container runtime that doesn't fight systemd but makes it easy to use.
Source: habr.com