Seccomp in Kubernetes: 7 things to know about from the very beginning

Translator's note: This is a translation of an article by a senior application security engineer at the British company ASOS.com. It opens a series of publications on improving security in Kubernetes with seccomp. If readers like the introduction, we will follow the author and continue with his future posts on the topic.

This article is the first in a series of posts on how to create SecDevOps-style seccomp profiles without resorting to magic and sorcery. In the first part, I will cover the basics and internal details of the implementation of seccomp in Kubernetes.

The Kubernetes ecosystem offers a wide variety of ways to secure and isolate containers. This article is about Secure Computing Mode, also known as seccomp. Its essence is to filter the system calls that containers are allowed to execute.

Why is it important? A container is just a process running on a specific machine, and it shares the kernel with the other applications there. If containers could make any system call, malware would very soon take advantage of this to bypass container isolation and affect other applications: intercept information, change system settings, and so on.

Seccomp profiles define which system calls are allowed or denied. The container runtime activates them during container startup so that the kernel can enforce them. Using such profiles reduces the attack surface and limits the damage if any program inside the container (that is, your dependencies, or their dependencies) starts doing something it is not allowed to do.

Getting to grips with the basics

The base seccomp profile includes three elements: defaultAction, architectures (or archMap) and syscalls:

{
    "defaultAction": "SCMP_ACT_ERRNO",
    "architectures": [
        "SCMP_ARCH_X86_64",
        "SCMP_ARCH_X86",
        "SCMP_ARCH_X32"
    ],
    "syscalls": [
        {
            "names": [
                "arch_prctl",
                "sched_yield",
                "futex",
                "write",
                "mmap",
                "exit_group",
                "madvise",
                "rt_sigprocmask",
                "getpid",
                "gettid",
                "tgkill",
                "rt_sigaction",
                "read",
                "getpgrp"
            ],
            "action": "SCMP_ACT_ALLOW"
        }
    ]
}

(medium-basic-seccomp.json)

defaultAction defines the fate of any system call not listed in the syscalls section. To keep things simple, let's focus on the two main values that will be used:

  • SCMP_ACT_ERRNO - blocks the execution of a system call,
  • SCMP_ACT_ALLOW - allows.

The architectures section lists the target architectures. This matters because the filter itself, applied at the kernel level, works with system call identifiers rather than the names given in the profile; the container runtime maps names to identifiers before use. System calls can have completely different IDs depending on the architecture. For example, the recvfrom system call (used to receive information from a socket) has ID 45 on x86-64 and ID 517 under the x32 ABI. Syscall tables for the x86 and x86-64 architectures list all of them.
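To make the name-to-ID mapping concrete, here is a minimal Go sketch. The table is a hand-written illustration with a few well-known syscall numbers, and the resolve helper is hypothetical; a real runtime relies on libseccomp for this lookup.

```go
package main

import "fmt"

// syscallIDs is a tiny, hand-written excerpt of the kernel syscall tables.
// It is an illustration only; the real mapping is provided by libseccomp.
var syscallIDs = map[string]map[string]int{
	"SCMP_ARCH_X86_64": {"read": 0, "write": 1, "getpid": 39, "exit_group": 231},
	"SCMP_ARCH_X86":    {"read": 3, "write": 4, "getpid": 20, "exit_group": 252},
}

// resolve (hypothetical helper) returns the ID of a syscall name
// for the given architecture, or false when it is not in the table.
func resolve(arch, name string) (int, bool) {
	id, ok := syscallIDs[arch][name]
	return id, ok
}

func main() {
	for _, arch := range []string{"SCMP_ARCH_X86_64", "SCMP_ARCH_X86"} {
		id, _ := resolve(arch, "write")
		fmt.Printf("write on %s = %d\n", arch, id)
	}
}
```

The same name resolves to a different number on each architecture, which is why a profile has to declare the architectures it targets.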

The syscalls section lists system calls and says what to do with them. For example, you can create a whitelist by setting defaultAction to SCMP_ACT_ERRNO and assigning SCMP_ACT_ALLOW to the calls in the syscalls section. That way you allow only the calls listed in syscalls and block all others. For a blacklist, swap the values of defaultAction and the rule actions.
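The whitelist logic can be sketched in Go as follows. This is a simplified model, not how the kernel or the runtime implements filtering: the Profile types, the actionFor helper and the shortened sample profile are assumptions made for illustration, and argument-level rules are ignored.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Profile mirrors the three elements of a basic seccomp profile.
type Profile struct {
	DefaultAction string        `json:"defaultAction"`
	Architectures []string      `json:"architectures"`
	Syscalls      []SyscallRule `json:"syscalls"`
}

// SyscallRule assigns one action to a group of system call names.
type SyscallRule struct {
	Names  []string `json:"names"`
	Action string   `json:"action"`
}

// actionFor returns the action of the first rule that names the call,
// falling back to defaultAction when no rule mentions it.
func actionFor(p Profile, name string) string {
	for _, rule := range p.Syscalls {
		for _, n := range rule.Names {
			if n == name {
				return rule.Action
			}
		}
	}
	return p.DefaultAction
}

// whitelist is a shortened sample profile used only for this sketch.
const whitelist = `{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {"names": ["read", "write", "exit_group"], "action": "SCMP_ACT_ALLOW"}
  ]
}`

// mustParse decodes a profile document and panics on malformed JSON.
func mustParse(doc string) Profile {
	var p Profile
	if err := json.Unmarshal([]byte(doc), &p); err != nil {
		panic(err)
	}
	return p
}

func main() {
	p := mustParse(whitelist)
	for _, name := range []string{"write", "ptrace"} {
		fmt.Printf("%s -> %s\n", name, actionFor(p, name))
	}
}
```

With this sample, write resolves to SCMP_ACT_ALLOW and ptrace falls through to the default SCMP_ACT_ERRNO.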

Now a few words about nuances that are not so obvious. Note that the recommendations below assume that you are deploying line-of-business applications in Kubernetes and that it matters to you that they run with the least privileges.

1. AllowPrivilegeEscalation=false

The container's securityContext has an AllowPrivilegeEscalation option. If it is set to false, containers start with the no_new_privs bit on. The meaning of this bit is obvious from its name: it prevents the container from starting new processes with more privileges than it has itself.

A side effect of leaving this option at true (the default) is that the container runtime applies the seccomp profile at the very beginning of the startup process. Thus, all system calls needed to start the runtime's internal processes (e.g. setting user/group IDs, dropping some capabilities) must be allowed in the profile.

A container that merely runs echo hi then needs the following permissions:

{
    "defaultAction": "SCMP_ACT_ERRNO",
    "architectures": [
        "SCMP_ARCH_X86_64",
        "SCMP_ARCH_X86",
        "SCMP_ARCH_X32"
    ],
    "syscalls": [
        {
            "names": [
                "arch_prctl",
                "brk",
                "capget",
                "capset",
                "chdir",
                "close",
                "execve",
                "exit_group",
                "fstat",
                "fstatfs",
                "futex",
                "getdents64",
                "getppid",
                "lstat",
                "mprotect",
                "nanosleep",
                "newfstatat",
                "openat",
                "prctl",
                "read",
                "rt_sigaction",
                "statfs",
                "setgid",
                "setgroups",
                "setuid",
                "stat",
                "uname",
                "write"
            ],
            "action": "SCMP_ACT_ALLOW"
        }
    ]
}

(hi-pod-seccomp.json)

... instead of these:

{
    "defaultAction": "SCMP_ACT_ERRNO",
    "architectures": [
        "SCMP_ARCH_X86_64",
        "SCMP_ARCH_X86",
        "SCMP_ARCH_X32"
    ],
    "syscalls": [
        {
            "names": [
                "arch_prctl",
                "brk",
                "close",
                "execve",
                "exit_group",
                "futex",
                "mprotect",
                "nanosleep",
                "stat",
                "write"
            ],
            "action": "SCMP_ACT_ALLOW"
        }
    ]
}

(hi-container-seccomp.json)
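To see exactly what the runtime adds, the two lists above can be compared. The following Go sketch copies both lists from the profiles and prints the calls that appear only in hi-pod-seccomp.json; the difference helper is a hypothetical name used for illustration.

```go
package main

import "fmt"

// difference returns the elements of a that are not present in b,
// preserving the order of a.
func difference(a, b []string) []string {
	inB := make(map[string]bool, len(b))
	for _, s := range b {
		inB[s] = true
	}
	var out []string
	for _, s := range a {
		if !inB[s] {
			out = append(out, s)
		}
	}
	return out
}

// Syscall names copied from hi-pod-seccomp.json.
var podProfile = []string{
	"arch_prctl", "brk", "capget", "capset", "chdir", "close", "execve",
	"exit_group", "fstat", "fstatfs", "futex", "getdents64", "getppid",
	"lstat", "mprotect", "nanosleep", "newfstatat", "openat", "prctl",
	"read", "rt_sigaction", "statfs", "setgid", "setgroups", "setuid",
	"stat", "uname", "write",
}

// Syscall names copied from hi-container-seccomp.json.
var containerProfile = []string{
	"arch_prctl", "brk", "close", "execve", "exit_group", "futex",
	"mprotect", "nanosleep", "stat", "write",
}

func main() {
	extra := difference(podProfile, containerProfile)
	fmt.Printf("%d extra calls: %v\n", len(extra), extra)
}
```

The difference is 18 system calls, among them capset, setgid, setgroups and setuid, none of which the echo hi process itself needs.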

But again, why is this a problem? Personally, I would avoid whitelisting the following system calls (unless they are really needed): capset, set_tid_address, setgid, setgroups and setuid. However, the real difficulty is that by allowing calls for processes over which you have absolutely no control, you tie your profiles to the container runtime implementation. In other words, one day you may find that after a container runtime update (by you or, more likely, by your cloud provider) your containers suddenly stop running.

Tip # 1: Run containers with AllowPrivilegeEscalation=false. This will reduce the size of seccomp profiles and make them less sensitive to container runtime changes.

2. Setting seccomp profiles at the container level

The seccomp profile can be set at the pod level:

annotations:
  seccomp.security.alpha.kubernetes.io/pod: "localhost/profile.json"

... or at the container level:

annotations:
  container.seccomp.security.alpha.kubernetes.io/<container-name>: "localhost/profile.json"

Note that the above syntax will change when seccomp in Kubernetes becomes GA (this is expected in the next Kubernetes release, 1.18; translator's note).
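As a rough sketch of how container-level annotations could be generated, the Go snippet below builds one annotation per container. It assumes the alpha annotation prefix container.seccomp.security.alpha.kubernetes.io/; the containerAnnotations helper, the container name api and the file api-profile.json are hypothetical.

```go
package main

import "fmt"

// containerPrefix is the alpha annotation prefix for per-container profiles.
const containerPrefix = "container.seccomp.security.alpha.kubernetes.io/"

// containerAnnotations (hypothetical helper) maps container names to
// profile file names and returns the matching pod annotations, each
// pointing at a profile stored on the node ("localhost/...").
func containerAnnotations(profiles map[string]string) map[string]string {
	out := make(map[string]string, len(profiles))
	for container, file := range profiles {
		out[containerPrefix+container] = "localhost/" + file
	}
	return out
}

func main() {
	// "api" and "api-profile.json" are made-up names for illustration.
	ann := containerAnnotations(map[string]string{"api": "api-profile.json"})
	for key, value := range ann {
		fmt.Printf("%s: %q\n", key, value)
	}
}
```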

Few people know that Kubernetes has always had a bug that causes seccomp profiles to be applied to the pause container. The runtime partially compensates for this shortcoming, but the container does not disappear from pods, because it is used to set up their infrastructure.

The problem is that this container always starts with AllowPrivilegeEscalation=true, which leads to the problems described in section 1, and this cannot be changed.

By using seccomp profiles at the container level, you avoid this pitfall and can create a profile tailored to a specific container. This will be necessary until the developers fix the bug and the new version (maybe 1.18?) becomes available to everyone.

Tip # 2: Set seccomp profiles at the container level.

In practice, this rule usually serves as a universal answer to the question: "Why does my seccomp profile work with docker run but not after being deployed to a Kubernetes cluster?".

3. Only use runtime/default as a last resort

Kubernetes has two options for built-in profiles: runtime/default and docker/default. Both are implemented by the container runtime, not by Kubernetes. Therefore, they may differ depending on the runtime used and its version.

In other words, after a change of runtime the container may have access to a different set of system calls, which it may or may not use. Most runtimes use the Docker implementation. If you want to use this profile, make sure it suits you.

  The docker/default profile has been deprecated since Kubernetes 1.11, so avoid using it.

In my opinion, the runtime/default profile is perfectly suited to the purpose for which it was created: protecting users from the risks of running docker run on their machines. However, for business applications running in Kubernetes clusters, I would argue that such a profile is too open and that developers should concentrate on creating profiles for their applications (or types of applications).

Tip # 3: Create seccomp profiles for specific applications. If that is not possible, create profiles for application types, for example a broader profile that covers all Golang web API applications. Only use runtime/default as a last resort.

In future posts, I'll show you how to create SecDevOps-style seccomp profiles, automate, and test them in pipelines. In other words, you will have no excuse not to switch to profiles for specific applications.

4. Unconfined is NOT an option

The first Kubernetes security audit found that seccomp is disabled by default. This means that unless you set a PodSecurityPolicy that enables it in the cluster, all pods with no seccomp profile defined will run with seccomp=unconfined.

Operating in this mode means that a whole layer of isolation that protects the cluster is lost. This approach is not recommended by security experts.

Tip # 4: No container in the cluster should run in seccomp=unconfined, especially in production environments.
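A check for this rule might look like the Go sketch below. It works on plain maps rather than on the Kubernetes API types, and it assumes the alpha annotation keys and that a container-level annotation takes precedence over the pod-level one; unconfinedContainers and the sample names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

const (
	podKey          = "seccomp.security.alpha.kubernetes.io/pod"
	containerPrefix = "container.seccomp.security.alpha.kubernetes.io/"
)

// unconfinedContainers (hypothetical helper) returns the containers that
// would run without a seccomp profile: those with no applicable annotation
// or with an explicit "unconfined" value.
func unconfinedContainers(annotations map[string]string, containers []string) []string {
	var out []string
	for _, c := range containers {
		profile, ok := annotations[containerPrefix+c]
		if !ok {
			// Assumption: fall back to the pod-level annotation.
			profile = annotations[podKey]
		}
		if profile == "" || profile == "unconfined" {
			out = append(out, c)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Made-up pod with two containers; only "api" has a profile.
	annotations := map[string]string{
		containerPrefix + "api": "localhost/api-profile.json",
	}
	fmt.Println(unconfinedContainers(annotations, []string{"api", "sidecar"}))
}
```

Running it on the sample pod reports sidecar, the only container left without a profile.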

5. "Audit Mode"

This point is not unique to Kubernetes, but it still falls into the “things to know before you start” category.

It just so happens that creating seccomp profiles has always been a tricky business and relied heavily on trial and error. The fact is that users simply do not have the opportunity to test them in production environments without the risk of “dropping” the application.

Starting with Linux kernel 4.14, it became possible to run parts of a profile in audit mode: information about the system calls is written to syslog, but the calls are not blocked. This mode is activated with the SCMP_ACT_LOG action:

SCMP_ACT_LOG: seccomp will not affect the operation of the thread making the system call if it does not match any of the rules in the filter, but information about the system call will be logged.

Here is a typical strategy for using this feature:

  1. Allow system calls that are needed.
  2. Block system calls that are known not to be useful.
  3. Record information about all other calls in the log.

A simplified example looks like this:

{
    "defaultAction": "SCMP_ACT_LOG",
    "architectures": [
        "SCMP_ARCH_X86_64",
        "SCMP_ARCH_X86",
        "SCMP_ARCH_X32"
    ],
    "syscalls": [
        {
            "names": [
                "arch_prctl",
                "sched_yield",
                "futex",
                "write",
                "mmap",
                "exit_group",
                "madvise",
                "rt_sigprocmask",
                "getpid",
                "gettid",
                "tgkill",
                "rt_sigaction",
                "read",
                "getpgrp"
            ],
            "action": "SCMP_ACT_ALLOW"
        },
        {
            "names": [
                "add_key",
                "keyctl",
                "ptrace"
            ],
            "action": "SCMP_ACT_ERRNO"
        }
    ]
}

(medium-mixed-seccomp.json)
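A mixed profile of this shape can also be generated rather than written by hand. The Go sketch below is one possible way to do it, with hypothetical type and function names: it takes an allow list and a deny list and emits a profile whose default action is SCMP_ACT_LOG.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rule assigns one action to a group of system call names.
type rule struct {
	Names  []string `json:"names"`
	Action string   `json:"action"`
}

// profile holds the three elements of a basic seccomp profile.
type profile struct {
	DefaultAction string   `json:"defaultAction"`
	Architectures []string `json:"architectures"`
	Syscalls      []rule   `json:"syscalls"`
}

// auditProfile (hypothetical helper) allows the calls in allow, blocks the
// calls in deny, and logs everything else without blocking it.
func auditProfile(allow, deny []string) profile {
	return profile{
		DefaultAction: "SCMP_ACT_LOG",
		Architectures: []string{"SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"},
		Syscalls: []rule{
			{Names: allow, Action: "SCMP_ACT_ALLOW"},
			{Names: deny, Action: "SCMP_ACT_ERRNO"},
		},
	}
}

func main() {
	p := auditProfile(
		[]string{"read", "write", "exit_group"},
		[]string{"add_key", "keyctl", "ptrace"},
	)
	out, err := json.MarshalIndent(p, "", "    ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```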

But remember that you must block all calls that are known not to be used and that could harm the cluster. A good basis for compiling such a list is the official Docker documentation, which explains in detail which system calls are blocked in the default profile and why.

However, there is one catch. Although SCMP_ACT_LOG has been supported by the Linux kernel since the end of 2017, it reached the Kubernetes ecosystem only relatively recently. Therefore, to use this method you need a Linux kernel 4.14 or later and runC v1.0.0-rc9 or later.
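The kernel requirement can be checked up front. Below is a minimal Go sketch that compares a kernel release string, such as the output of uname -r, against 4.14. The supportsSeccompLog name is hypothetical, and the check looks only at the version number, so it does not account for vendor kernels with backported features.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// supportsSeccompLog (hypothetical helper) reports whether a kernel release
// string such as "4.19.0-6-amd64" denotes version 4.14 or later.
func supportsSeccompLog(release string) bool {
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return false
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return false
	}
	// The minor part may carry a suffix ("14-rc1"); keep leading digits only.
	minorStr := parts[1]
	end := 0
	for end < len(minorStr) && minorStr[end] >= '0' && minorStr[end] <= '9' {
		end++
	}
	minor, err := strconv.Atoi(minorStr[:end])
	if err != nil {
		return false
	}
	return major > 4 || (major == 4 && minor >= 14)
}

func main() {
	for _, release := range []string{"4.9.0-11-amd64", "4.14.154", "5.4.0"} {
		fmt.Printf("%s: %v\n", release, supportsSeccompLog(release))
	}
}
```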

Tip # 5: An audit mode profile for testing in production can be created by combining blacklisting and whitelisting, and logging all exceptions.

6. Whitelist

Whitelisting requires extra work because you have to identify every call an application might need, but this approach adds a lot of security:

It is highly recommended to use the whitelisting approach as it is simpler and more reliable. The blacklist will need to be updated whenever a potentially dangerous system call (or a dangerous flag/option if blacklisted) is added. In addition, it is often possible to change the representation of a parameter without changing its essence and thereby bypass the blacklist restrictions.

For Go applications, I have developed a special tool that accompanies the application and collects all calls made at runtime. For example, for the following application:

package main

import "fmt"

func main() {
	fmt.Println("test")
}

… run gosystract as follows:

go install github.com/pjbgf/gosystract
gosystract --template='{{- range . }}{{printf "\"%s\",\n" .Name}}{{- end}}' application-path

... and get the following result:

"sched_yield",
"futex",
"write",
"mmap",
"exit_group",
"madvise",
"rt_sigprocmask",
"getpid",
"gettid",
"tgkill",
"rt_sigaction",
"read",
"getpgrp",
"arch_prctl",

This is just an example for now; details about the tool will follow.
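Output in this form is easy to post-process. As an illustration only, and not part of gosystract itself, the Go sketch below turns such lines into a sorted, de-duplicated list of names ready to go into a profile's names array; parseNames is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// parseNames (hypothetical helper) extracts syscall names from lines of the
// form `"name",` and returns them sorted and without duplicates.
func parseNames(output string) []string {
	seen := map[string]bool{}
	var names []string
	for _, line := range strings.Split(output, "\n") {
		// Strip surrounding whitespace, quotes and the trailing comma.
		name := strings.Trim(strings.TrimSpace(line), `",`)
		if name == "" || seen[name] {
			continue
		}
		seen[name] = true
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	output := "\"sched_yield\",\n\"futex\",\n\"write\",\n\"futex\",\n"
	fmt.Println(parseNames(output))
}
```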

Tip # 6: Allow only the calls you really need and block all others.

7. Lay the right foundations (or prepare for unexpected behavior)

The kernel will enforce the profile no matter what you write in it, even if it is not exactly what you want. For example, if you block calls such as exit or exit_group, the container will not be able to shut down properly, and even a simple command like echo hi will hang indefinitely. As a result, you will get high CPU usage in the cluster.

In such cases the strace utility can come to the rescue. It will show what the problem may be:

sudo strace -c -p 9331

Make sure the profiles contain all the system calls that the application needs at run time.

Tip # 7: Pay attention to details and make sure that all necessary system calls are included in the white list.
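A simple sanity check can catch the most obvious omissions before a profile is deployed. The Go sketch below verifies that a whitelist contains a set of required calls. The required list here holds only the two calls mentioned above and is an assumption; which calls are truly required depends on the application.

```go
package main

import "fmt"

// mustAllow lists calls this sketch treats as required for a clean
// shutdown. It is an assumption based on the exit/exit_group example.
var mustAllow = []string{"exit", "exit_group"}

// missingCalls (hypothetical helper) returns the required calls that are
// absent from the allowed list, in the order given by required.
func missingCalls(allowed, required []string) []string {
	have := make(map[string]bool, len(allowed))
	for _, name := range allowed {
		have[name] = true
	}
	var missing []string
	for _, name := range required {
		if !have[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	allowed := []string{"write", "futex", "exit_group"}
	fmt.Println(missingCalls(allowed, mustAllow)) // prints [exit]
}
```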

This concludes the first part of a series of articles on using seccomp in Kubernetes in the spirit of SecDevOps. In the following parts, we will talk about why this is important and how to automate the process.

Source: habr.com