Run systemd in a container

We have been following the topic of running systemd in containers for a long time. Back in 2014, our security engineer Daniel Walsh wrote the article Running systemd within a Docker Container, and a couple of years later another one, Running systemd in a non-privileged container, in which he noted that the situation had not improved much. In particular, he wrote that "unfortunately, two years later, if you google 'Docker systemd', the same old article of his still pops up first. So it's time for a change." We have also already written about the conflict between the Docker and systemd developers.

In this article, we will show what has changed since then and how Podman can help us in this matter.

There are many reasons to run systemd inside a container, such as:

  1. Multiservice containers - many people want to pull their multi-service applications out of virtual machines and run them in containers. It would be better, of course, to break such applications into microservices, but not everyone knows how to do that yet, or there is simply no time. So running these applications as services launched by systemd from unit files makes perfect sense.
  2. Systemd unit files - most applications running inside containers are built from code that previously ran on virtual or physical machines. These applications come with a unit file that was written for them and that describes how they should be started. So it is still better to start services using supported methods rather than hacking together your own init service.
  3. Systemd is a process manager - it manages services (shutting them down, restarting them, reaping zombie processes) better than any other tool.

That being said, there are plenty of reasons not to run systemd in containers. The main one is that systemd/journald controls the output of containers, while tools like Kubernetes and OpenShift expect containers to write logs directly to stdout and stderr. So if you are going to manage containers through orchestration tools like these, you should think carefully before using systemd-based containers. In addition, the Docker and Moby developers have often been strongly opposed to using systemd in containers.

Enter Podman

We are pleased to announce that the situation has finally moved forward. The team responsible for running containers at Red Hat decided to develop its own container engine. It is called Podman and offers the same command-line interface (CLI) as Docker. Almost all Docker commands can be used with Podman in the same way. We often hold seminars, now called Replacing Docker with Podman, whose very first slide tells you to write: alias docker=podman.

Many people do exactly that.

Podman and I are in no way against systemd-based containers. After all, systemd is the most commonly used init system on Linux, and preventing it from working properly in containers means ignoring the way thousands of people are used to running containers.

Podman knows what systemd needs in order to work properly in a container. It needs things like tmpfs mounted on /run and /tmp. It likes to have the "container" environment enabled, and it expects write access to its part of the cgroup directory and to the /var/log/journald directory.

When you start a container with init or systemd as the first command, Podman automatically configures tmpfs and cgroups so that systemd starts up without problems. To disable this automatic setup, use the --systemd=false option. Note that Podman only uses systemd mode when it sees that the command to run is systemd or init.

Here is an excerpt from the manual:

man podman run
...

--systemd=true|false

Running a container in systemd mode. Enabled by default.

If a systemd or init command is running inside a container, Podman will set up tmpfs mount points in the following directories:

/run, /run/lock, /tmp, /sys/fs/cgroup/systemd, /var/lib/journal

Also SIGRTMIN+3 will be used as the default stop signal.

All this allows systemd to run in a confined container without any modifications.
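One way to see whether these mounts are actually in place is to look at the mount table from inside the container. The following is only a sketch: the check_tmpfs helper and the container name mysystemd are made up for illustration, and the list of directories is simply copied from the manual excerpt above.

```shell
# Directories that the manual excerpt lists as tmpfs mount points.
expected="/run /run/lock /tmp /sys/fs/cgroup/systemd /var/lib/journal"

# check_tmpfs (hypothetical helper) reads /proc/mounts-style lines on stdin
# and prints each expected directory followed by "tmpfs" or "missing".
check_tmpfs() {
    mounts=$(cat)
    for dir in $expected; do
        # In /proc/mounts, field 2 is the mount point and field 3 the file system type.
        if printf '%s\n' "$mounts" |
            awk -v d="$dir" '$2 == d && $3 == "tmpfs" { found = 1 } END { exit !found }'
        then
            echo "$dir tmpfs"
        else
            echo "$dir missing"
        fi
    done
}

# Usage against a running container (the name "mysystemd" is an assumption):
#   podman exec mysystemd cat /proc/mounts | check_tmpfs
```

The helper only parses text, so it can be tried on any saved copy of a mount table before pointing it at a real container.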

NOTE: systemd tries to write to the cgroup file system. However, SELinux prevents containers from doing this by default. To allow writing, enable the container_manage_cgroup boolean parameter:

setsebool -P container_manage_cgroup true

Now see what the Dockerfile looks like for running systemd in a container using Podman:

# cat Dockerfile

FROM fedora

RUN dnf -y install httpd && dnf clean all && systemctl enable httpd

EXPOSE 80

CMD [ "/sbin/init" ]

That's all.

Now we build the container image:

# podman build -t systemd .

Tell SELinux to allow systemd to modify the Cgroups configuration:

# setsebool -P container_manage_cgroup true

Many people, by the way, forget this step. Fortunately, you only need to do it once, and the setting persists across reboots.
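Since this step is easy to forget, it may help to check the boolean before starting the container. The sketch below assumes that getsebool prints a line of the form "container_manage_cgroup --> off"; the needs_cgroup_boolean helper is a hypothetical name, not part of any tool.

```shell
# needs_cgroup_boolean (hypothetical helper) succeeds when the given
# getsebool output line reports the boolean as "off".
needs_cgroup_boolean() {
    case "$1" in
        *"--> off") return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on an SELinux host, as root:
#   if needs_cgroup_boolean "$(getsebool container_manage_cgroup)"; then
#       setsebool -P container_manage_cgroup true
#   fi
```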

Now we just start the container:

# podman run -ti -p 80:80 systemd

systemd 239 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid)

Detected virtualization container-other.

Detected architecture x86-64.

Welcome to Fedora 29 (Container Image)!

Set hostname to <1b51b684bc99>.

Failed to install release agent, ignoring: Read-only file system

File /usr/lib/systemd/system/systemd-journald.service:26 configures an IP firewall (IPAddressDeny=any), but the local system does not support BPF/cgroup based firewalling.

Proceeding WITHOUT firewalling in effect! (This warning is only shown for the first loaded unit using IP firewalling.)

[  OK ] Listening on initctl Compatibility Named Pipe.

[  OK ] Listening on Journal Socket (/dev/log).

[  OK ] Started Forward Password Requests to Wall Directory Watch.

[  OK ] Started Dispatch Password Requests to Console Directory Watch.

[  OK ] Reached target Slices.

…

[  OK ] Started The Apache HTTP Server.

That's it, the service is up and running:

$ curl localhost

<html  xml_lang="en" lang="en">

…

</html>

NOTE: Don't try this on Docker! You still have to jump through hoops to run this kind of container through the daemon. (Additional fields and packages are required to make this work seamlessly in Docker, or the container has to be run as privileged. See the earlier article for details.)

A couple more cool things about Podman and systemd

Podman works better than Docker in systemd unit files

If containers need to be started at system boot, then you can simply insert the appropriate Podman commands into the systemd unit file, which will start the service and monitor it. Podman uses the standard fork-exec model. In other words, container processes are children of the Podman process, so systemd can easily monitor them.

Docker uses a client-server model, and Docker CLI commands can also be placed directly in a unit file. However, once the Docker client connects to the Docker daemon, it (the client) becomes just another process handling stdin and stdout. Systemd, in turn, has no idea about the connection between the Docker client and the container, which runs under the control of the Docker daemon, and therefore systemd cannot monitor the service at all in this model.
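For the Podman case, such a unit file might look roughly like the sketch below. It is not taken from the Podman documentation: the unit name httpd-container.service and the container name httpd-systemd are made up, and the image name systemd is the tag used in the build step earlier. The script writes the unit into a scratch directory by default, so it can be tried without root.

```shell
# Where to write the unit. Defaults to a scratch directory; on a real host
# this would typically be /etc/systemd/system (set UNIT_DIR to override).
unit_dir="${UNIT_DIR:-$(mktemp -d)}"
unit_file="$unit_dir/httpd-container.service"   # hypothetical unit name

cat > "$unit_file" <<'EOF'
[Unit]
Description=httpd in a systemd-based Podman container (example)
Wants=network-online.target
After=network-online.target

[Service]
# Remove a leftover container from a previous run; "-" ignores failure.
ExecStartPre=-/usr/bin/podman rm -f httpd-systemd
# podman stays in the foreground, so systemd tracks it as the main process.
ExecStart=/usr/bin/podman run --name httpd-systemd -p 80:80 systemd
ExecStop=/usr/bin/podman stop httpd-systemd
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

echo "wrote $unit_file"
```

To use it on a real host, one would copy the file to /etc/systemd/system, run systemctl daemon-reload, and then systemctl enable --now httpd-container.service.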

Socket activation with systemd

Podman handles socket activation correctly. Because Podman uses the fork-exec model, it can forward the socket to its child container processes. Docker can't do that because it uses a client-server model.

The varlink service that Podman uses to allow remote clients to communicate with containers is itself socket activated. The cockpit-podman package, written in Node.js and part of the cockpit project, allows people to interact with Podman containers through a web interface. The web daemon running cockpit-podman sends messages to the varlink socket that systemd is listening on. Systemd then activates the Podman program to receive the messages and start managing containers. Socket activation in systemd eliminates the need for a constantly running daemon when implementing remote APIs.

We are also developing another Podman client called podman-remote, which implements the same Podman CLI but calls varlink to start containers. Podman-remote can run on top of SSH sessions, allowing you to securely interact with containers on different machines. Over time, we plan to use podman-remote to support macOS and Windows along with Linux, so that developers on those platforms can run a Linux virtual machine with Podman varlink and feel as though the containers are running on their local machine.

SD_NOTIFY

Systemd allows you to delay the start of auxiliary services until the containerized service they depend on has started. Podman can forward the SD_NOTIFY socket to the containerized service so that the service can notify systemd when it is ready to work. And again, Docker, with its client-server model, cannot do this.
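On the service side, readiness is reported by writing to the socket named in the NOTIFY_SOCKET environment variable, for example with the systemd-notify tool. The following is a minimal sketch of what a containerized service's start script could do, assuming systemd-notify is installed in the image; the notify_ready function name is hypothetical.

```shell
# notify_ready (hypothetical helper) reports readiness to systemd if a
# notification socket was passed in, and does nothing harmful otherwise.
notify_ready() {
    if [ -z "${NOTIFY_SOCKET:-}" ]; then
        echo "no notify socket available; skipping readiness notification"
        return 0
    fi
    # Sends READY=1 to the socket named in NOTIFY_SOCKET.
    systemd-notify --ready
}

# Usage in a start script, once the service can accept requests:
#   notify_ready
```

On the host, the unit that starts the container would then presumably use Type=notify, so that units ordered after it wait for this message.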

Future plans

We plan to add a podman generate systemd CONTAINERID command that will generate a systemd unit file to manage a given container. This should work in both root and rootless modes for unprivileged containers. We have even seen a request for an OCI-compliant systemd-nspawn runtime.

Conclusion

Running systemd in a container is an understandable need. And thanks to Podman, we finally have a container runtime that doesn't antagonize systemd, but makes it easy to use.

Source: habr.com