What is Docker: a brief history and basic abstractions

On August 10, Slurm launched a Docker video course in which we cover it end to end: from basic abstractions to network settings.

In this article, we will talk about the history of Docker and its main abstractions: Image, CLI, Dockerfile. The lecture is designed for beginners, so it is unlikely to be of interest to experienced users. There will be no gore and no deep dives - just the very basics.


What is Docker

Let's look at the definition of Docker from Wikipedia.

Docker is software for automating the deployment and management of applications in environments that support containerization.

This definition doesn't make things much clearer. It is especially unclear what "environments that support containerization" means. To figure that out, let's go back in time and start with the era I loosely call the "monolithic era".

Monolithic era

The monolithic era is the early 2000s, when all applications were monolithic, with a pile of dependencies, and development took a long time. At the same time there were not that many servers: we knew them all by name and kept an eye on each of them. There is a well-known comparison:

Pets vs cattle. In the monolithic era we treated our servers like pets: we looked after them, pampered them, blew away every speck of dust. And for better resource utilization we used virtualization: a server was sliced into several virtual machines, which provided environment isolation.

Hypervisor-based virtualization systems

Everyone has probably heard of virtualization systems: VMware, VirtualBox, Hyper-V, QEMU/KVM, and so on. They provide application isolation and resource management, but they also have downsides. Virtualization requires a hypervisor, and a hypervisor costs resources. And the virtual machine itself is usually a whole colossus: a heavy image containing an operating system, Nginx, Apache, and perhaps MySQL. The image is large and a virtual machine is inconvenient to work with, so working with VMs can be slow. To solve this problem, kernel-level virtualization systems were created.

Kernel virtualization systems

Kernel-level virtualization is provided by systems such as OpenVZ, systemd-nspawn, and LXC. A striking example of this kind of virtualization is LXC (Linux Containers).

LXC is an operating system-level virtualization system for running multiple isolated instances of the Linux operating system on a single host. LXC does not use virtual machines, but creates a virtual environment with its own process space and network stack.

In essence, LXC creates containers. What is the difference between virtual machines and containers?


Containers are not a reliable way to isolate processes: kernel-level virtualization systems keep turning up vulnerabilities that allow escaping from the container to the host. So if you really need to isolate something, a virtual machine is the better choice.

The differences between virtualization and containerization come down to three approaches: hardware hypervisors, hypervisors running on top of an OS, and containers.


Hardware hypervisors are great if you really need to isolate something, because they can isolate at the level of memory pages and processors.

There are also hypervisors that run as a regular program, and there are containers, which is what we will talk about next. Containerization systems have no hypervisor; instead there is a Container Engine that creates and manages containers. It is far more lightweight: because it works directly with the kernel, the overhead is small or nonexistent.

What is used for containerization at the kernel level

The main technologies that allow you to create a container isolated from other processes are Namespaces and Control Groups.

Namespaces: PID, Networking, Mount and User. There are more, but for ease of understanding, we will focus on these.

The PID namespace isolates process IDs. When we create a PID namespace and place a process into it, that process gets PID 1. Normally on a system, PID 1 is systemd or init; likewise, a process placed into a new PID namespace gets PID 1 inside it.

The network namespace lets you isolate the network stack and place your own interfaces inside it. The mount namespace isolates the file system (mount points), and the user namespace isolates users and groups.
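Here is a minimal sketch (not from the course; it assumes the util-linux unshare tool is available) of what a PID namespace does in practice:

sudo unshare --fork --pid --mount-proc bash   # start a shell in a new PID namespace
ps aux                                        # inside, bash shows up as PID 1 and the host's processes are invisible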

Control Groups: memory, CPU, block I/O (IOPS), network; there are about 12 controllers in total. They are also called cgroups ("c-groups").

Control Groups manage the resources for a container. Through Control Groups, we can say that the container should not consume more than a certain amount of resources.
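As a hedged sketch (the image name and limit values are illustrative), this is how a cgroup limit is requested through Docker:

docker run -d --name limited --memory 256m --cpus 0.5 nginx
docker inspect --format '{{.HostConfig.Memory}} {{.HostConfig.NanoCpus}}' limited   # check the limits that were applied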

In order for containerization to fully work, additional technologies are used: Capabilities, Copy-on-write and others.

Capabilities are how we tell a process what it may and may not do. At the kernel level these are just bit masks with many flags. For example, the root user has the full set and can do anything. A time server only needs to change the system time: it gets the CAP_SYS_TIME capability and nothing else. With capabilities you can flexibly restrict processes and thereby protect yourself.
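A minimal sketch (the alpine image and the commands are illustrative) of what that bit mask looks like and how Docker can shrink it:

docker run --rm alpine grep CapEff /proc/self/status
docker run --rm --cap-drop ALL --cap-add NET_BIND_SERVICE alpine grep CapEff /proc/self/status
# The second CapEff bitmap is much smaller: only the capability we explicitly added back remains.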

Copy-on-write lets Docker work with images efficiently: a container gets a thin writable layer on top of read-only image layers instead of a full copy of the image.

At the time of writing, Docker has compatibility issues with cgroups v2, so this article focuses on cgroups v1.

But back to history.

When virtualization systems at the kernel level appeared, they began to be actively used. The overhead on the hypervisor is gone, but some problems remain:

  • large images: an OS, libraries, and a pile of assorted software get pushed into the same OpenVZ instance, so the image still ends up fairly big;
  • there is no proper packaging and delivery standard, so the dependency problem remains: two pieces of code may need the same library in different versions and conflict with each other.

To solve all these problems, the next era has come.

The era of containers

When the Era of containers came, the philosophy of working with them changed:

  • One process, one container.
  • We deliver all the dependencies needed by the process to its container. This requires cutting monoliths into microservices.
  • The smaller the image, the better - there are fewer possible vulnerabilities, it rolls out faster, and so on.
  • Instances become ephemeral.

Remember I talked about pets vs cattle? Previously, instances were like pets; now they have become like cattle. There used to be a monolith, one application. Now it is 100 microservices and 100 containers, and some containers may have 2-3 replicas. Controlling each individual container matters less; what interests us is the availability of the service itself, the thing this set of containers does. That changes the approach to monitoring.

In 2014-2015, Docker flourished - the technology that we will talk about now.

Docker has changed the philosophy and standardized the packaging of the application. Using Docker, we can package the application, send it to the repository, download it from there, and deploy it.

We put everything we need in the Docker container, so the problem of dependencies is solved. Docker guarantees reproducibility. I think many people have encountered irreproducibility: everything works for you, you push to production, it stops working there. With Docker, this problem goes away. If your Docker container launches and does what it needs to do, it will most likely launch in production and do the same there.

Digression about overhead

There is a lot of debate about overhead. Some people think Docker adds no extra load, since it uses the Linux kernel and the kernel mechanisms needed for containerization. As in, "if you call Docker overhead, then the Linux kernel is overhead too."

On the other hand, if you dig deeper, there are indeed a few things in Docker that could, with a stretch, be called overhead.

The first is the PID namespace. When we put a process into a namespace, it gets PID 1 there, but the same process also has another PID in the host namespace, outside the container. For example, we launch Nginx in a container and it becomes PID 1 (the master process), while on the host it has, say, PID 12623. How much of an overhead this is, is hard to say.
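A minimal sketch (the container name is illustrative) showing the same process under two PIDs:

docker run -d --name app nginx
docker inspect --format '{{.State.Pid}}' app   # the nginx master's PID as the host sees it
docker exec app cat /proc/1/comm               # prints "nginx": inside the container it is PID 1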

The second thing is cgroups. Take the memory cgroup, that is, the ability to limit a container's memory. When it is enabled, memory accounting kicks in: the kernel has to track how many pages are allocated to this container and how many are still available to it. This is probably overhead, but I have not seen rigorous studies of how it affects performance, and I have not noticed an application suddenly losing a lot of performance just because it runs in Docker.

And one more note about performance. Some kernel parameters are inherited by the container from the host, network parameters in particular. So if you want to run something high-performance in Docker, something that uses the network heavily, you will at a minimum need to tune those parameters - nf_conntrack, for example.
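A hedged sketch (the values are illustrative, not recommendations) of the kind of tuning meant here:

docker run -d --sysctl net.core.somaxconn=4096 nginx    # set a namespaced network sysctl for a single container
sudo sysctl -w net.netfilter.nf_conntrack_max=262144    # the conntrack table size is typically raised on the host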

About the concept of Docker

Docker is made up of several components:

  1. Docker daemon - the Container Engine itself; it runs containers.
  2. Docker CLI - a tool for managing Docker.
  3. Dockerfile - instructions for building an image.
  4. Image - the image from which containers are rolled out.
  5. Container - what gets launched from the image.
  6. Docker registry - an image repository.

Schematically, the interaction works as follows.


The Docker daemon runs on the Docker host and launches containers. The client sends it commands: build an image, pull an image, run a container; the daemon executes them, reaching out to the registry when needed. The Docker client can talk to the daemon either locally (over a Unix socket) or via TCP from a remote host.

Let's go through each component.

The Docker daemon is the server part and runs on the host machine: it downloads images and launches containers from them, creates networks between containers, and collects logs. When we say "build an image", that is also the daemon's job.

The Docker CLI is the client part of Docker, a console utility for working with the daemon. To repeat: it can work not only locally but also over the network.
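A minimal sketch (the host name is illustrative; port 2375 is the conventional unencrypted port, and in practice you would secure this with TLS) of pointing the CLI at a remote daemon:

export DOCKER_HOST=tcp://build-server.example.com:2375
docker ps   # now lists containers on the remote host instead of the local one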

Basic commands:

docker ps - show the containers that are currently running on the docker host.
docker images - show images downloaded locally.
docker search <> - search for an image in the registry.
docker pull <> - download the image from the registry to the machine.
docker build <> - build an image.
docker run <> - run a container.
docker rm <> - remove a container.
docker logs <> - show container logs.
docker start/stop/restart <> - start, stop, or restart a container.

If you master these commands and use them confidently, consider that you are about 70% of the way to knowing Docker at the user level.
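A minimal sketch (names and ports are illustrative) of how these commands fit together in a typical first session:

docker pull nginx                             # download the image from the registry
docker run -d --name web -p 8080:80 nginx     # start a container from it
docker ps                                     # see it running
docker logs web                               # look at its logs
docker stop web                               # stop it
docker rm web                                 # and remove it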

Dockerfile - instructions for creating an image. Almost every instruction creates a new layer. Let's look at an example.


A Dockerfile has the instruction on the left and its arguments on the right. Each instruction written in the Dockerfile creates a new layer in the image. Even just scanning the left-hand column you can roughly tell what is going on: create a directory for us (one layer), make it the working directory (another layer), and so on. This layer cake makes life easier: if I create another Dockerfile and change only the last line - running something other than "python" "main.py", or installing dependencies from a different file - the previous layers are reused from the cache. A sketch of such a Dockerfile follows below.
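As an illustration only (the base image and file names are assumptions, not taken from the course), a small Dockerfile for a Python app might look like this:

FROM python:3.11-slim                  # base image layer
WORKDIR /app                           # create and switch to the working directory (a layer)
COPY requirements.txt .                # copy only the dependency list first (a layer)
RUN pip install -r requirements.txt    # install dependencies (a layer that caches well)
COPY . .                               # copy the application code (a layer)
CMD ["python", "main.py"]              # the command the container runs on start

If only the bottom lines change, only those layers are rebuilt; everything above comes from the cache.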

Image - the packaging a container is launched from. If you look at Docker from the point of view of a package manager (as if we were working with deb or rpm packages), then an image is essentially an rpm package: with yum we can install an application, remove it, find it in a repository, download it. It is about the same here: containers are launched from images, images are stored in a Docker registry (the analogue of a yum repository), and each image has a SHA-256 digest, a name, and a tag.

An image is built according to the instructions in the Dockerfile. Each Dockerfile instruction creates a new layer, and layers can be reused.
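A minimal sketch (the nginx image is just an example) of looking at an image's name, tag, digest, and layers:

docker pull nginx:latest
docker images nginx                                        # name, tag, image ID, size
docker history nginx:latest                                # one line per layer, with the instruction that created it
docker inspect --format '{{.RepoDigests}}' nginx:latest    # the SHA-256 digest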

Docker registry - a repository of Docker images. By analogy with OS package repositories, Docker has a standard public registry, Docker Hub, but you can also run your own repository, your own Docker registry.
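A hedged sketch (the registry address and tags are illustrative) of running a private registry and pushing an image to it:

docker run -d -p 5000:5000 --name registry registry:2     # a minimal private registry
docker tag nginx:latest localhost:5000/nginx:latest       # retag the image for that registry
docker push localhost:5000/nginx:latest
docker pull localhost:5000/nginx:latest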

Container - what gets launched from the image: we build the image according to the Dockerfile instructions, then launch a container from it. The container is isolated from other containers and must contain everything the application needs to run. At the same time, one container is one process. Sometimes you have to run two processes, but that somewhat contradicts the Docker ideology.

The requirement "one container - one process" is associated with the PID Namespace. When a process with PID 1 starts in the Namespace, if it suddenly dies, then the entire container also dies. If two processes are running there: one lives, and the second one has died, then the container will continue to live anyway. But this is about Best Practices, we will talk about them in other materials.

You can learn more about the features and the full program of the course at the link: "Docker video course".

Author: Marcel Ibraev, certified Kubernetes administrator, practicing engineer at Southbridge, speaker and developer of Slurm courses.

Source: habr.com
